ZIM: Zero-Shot Image Matting for Anything
Beomyoung Kim, Chanyong Shin, Joonhyun Jeong, Hyungsik Jung, Se-Yun Lee, Sewhan Chun, Dong-Hyun Hwang, Joonsang Yu<br>
<sub>NAVER Cloud, ImageVision</sub><br />

Introduction
The recent segmentation foundation model, Segment Anything Model (SAM), exhibits strong zero-shot segmentation capabilities, but it falls short in generating fine-grained precise masks. To address this limitation, we propose a novel zero-shot image matting model, called ZIM, with two key contributions: First, we develop a label converter that transforms segmentation labels into detailed matte labels, constructing the new SA1B-Matte dataset without costly manual annotations. Training SAM with this dataset enables it to generate precise matte masks while maintaining its zero-shot capability. Second, we design the zero-shot matting model equipped with a hierarchical pixel decoder to enhance mask representation, along with a prompt-aware masked attention mechanism to improve performance by enabling the model to focus on regions specified by visual prompts. We evaluate ZIM using the newly introduced MicroMat-3K test set, which contains high-quality micro-level matte labels. Experimental results show that ZIM outperforms existing methods in fine-grained mask generation and zero-shot generalization. Furthermore, we demonstrate the versatility of ZIM in various downstream tasks requiring precise masks, such as image inpainting and 3D NeRF. Our contributions provide a robust foundation for advancing zero-shot matting and its downstream applications across a wide range of computer vision tasks.

Updates
- 2025.07.24: ZIM has been accepted to ICCV 2025 as a Highlight Paper!
- 2024.11.04: Official ZIM code released
Installation
Install the required packages with one of the commands below:
```bash
pip install zim_anything
```
or
```bash
git clone https://github.com/naver-ai/ZIM.git
cd ZIM; pip install -e .
```
To enable GPU acceleration, install the onnxruntime-gpu package that matches your environment (CUDA and cuDNN versions), following the instructions in the onnxruntime installation docs.
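As a quick sanity check (illustrative only, not part of the ZIM package), you can confirm that onnxruntime sees the CUDA provider after installation:
```python
import onnxruntime as ort

# If onnxruntime-gpu is set up correctly, 'CUDAExecutionProvider'
# should appear in this list.
print(ort.get_available_providers())
```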
Demo
We provide a Gradio demo in demo/gradio_demo.py. You can run the model demo locally with:
```bash
python demo/gradio_demo.py
```
In addition, we provide a Gradio demo in demo/gradio_demo_comparison.py to qualitatively compare ZIM with SAM:
```bash
python demo/gradio_demo_comparison.py
```
Getting Started
Once installation is complete, you can use our model in just a few lines, as shown below. ZimPredictor is compatible with SamPredictor, providing the same methods such as set_image() and predict().
```python
import torch

from zim_anything import zim_model_registry, ZimPredictor

backbone = "vit_l"
ckpt_p = "results/zim_vit_l_2092"

model = zim_model_registry[backbone](checkpoint=ckpt_p)
if torch.cuda.is_available():
    model.cuda()

predictor = ZimPredictor(model)
predictor.set_image(<image>)
masks, _, _ = predictor.predict(<input_prompts>)
```
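Continuing from the snippet above: since ZimPredictor mirrors the SamPredictor interface, concrete prompts can presumably be passed the same way as in SAM. The keyword arguments below follow SAM's predict() signature, which we assume carries over; the coordinates are arbitrary examples:
```python
import numpy as np

# A single positive point prompt at pixel (x, y);
# labels use 1 for positive and 0 for negative clicks.
point_coords = np.array([[512, 512]])
point_labels = np.array([1])
masks, _, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
)

# Alternatively, a box prompt in [x1, y1, x2, y2] format:
box = np.array([100, 100, 900, 900])
masks, _, _ = predictor.predict(box=box)
```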
We also provide code for generating masks for an entire image and visualizing them:
```python
import torch

from zim_anything import zim_model_registry, ZimAutomaticMaskGenerator
from zim_anything.utils import show_mat_anns

backbone = "vit_l"
ckpt_p = "results/zim_vit_l_2092"

model = zim_model_registry[backbone](checkpoint=ckpt_p)
if torch.cuda.is_available():
    model.cuda()

mask_generator = ZimAutomaticMaskGenerator(model)
masks = mask_generator.generate(<image>)   # Automatically generated masks
masks_vis = show_mat_anns(<image>, masks)  # Visualize masks
```
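Putting the pieces together, a minimal end-to-end sketch might look like the following. The image file name, the cv2-based loading, and the saving step are our own illustrative additions, and we assume show_mat_anns returns a drawable RGB array:
```python
import cv2
import torch

from zim_anything import zim_model_registry, ZimAutomaticMaskGenerator
from zim_anything.utils import show_mat_anns

# Load an image as an RGB NumPy array (the input format SAM-style
# predictors expect; assumed to hold for ZIM as well).
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

model = zim_model_registry["vit_l"](checkpoint="results/zim_vit_l_2092")
if torch.cuda.is_available():
    model.cuda()

masks = ZimAutomaticMaskGenerator(model).generate(image)

# show_mat_anns is assumed to return an RGB uint8 array;
# convert back to BGR for OpenCV's imwrite.
vis = show_mat_anns(image, masks)
cv2.imwrite("example_masks.png", cv2.cvtColor(vis, cv2.COLOR_RGB2BGR))
```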
Additionally, masks can be generated for images from the command line:
```bash
bash script/run_amg.sh
```
We provide pretrained weights of ZIM.

| MODEL ZOO | Link |
| :------: | :------: |
| zim_vit_b | download |
| zim_vit_l | download |
Dataset Preparation
1) MicroMat-3K Dataset
We introduce a new test set, MicroMat-3K, for evaluating zero-shot interactive matting models. It consists of 3,000 high-resolution images paired with micro-level matte labels, providing a comprehensive benchmark for testing matting models at different levels of detail.
The MicroMat-3K dataset can be downloaded here or from Hugging Face.
1-1) Dataset structure
The dataset structure should be as follows:
```
└── /path/to/dataset/MicroMat3K
    ├── img
    │   ├── 0001.png
    ├── matte
    │   ├── coarse
    │   │   ├── 0001.png
    │   └── fine
    │       ├── 0001.png
    ├── prompt
    │   ├── coarse
    │   │   ├── 0001_01.json
    │   └── fine
    │       ├── 0001_01.json
    └── seg
        ├── coarse
        │   ├── 0001.png
        └── fine
            ├── 0001.png
```
1-2) Prompt file configuration
Prompt file configuration should be as follows:
```
{
    "point": [[x1, y1, 1], [x2, y2, 0], ...],  # 1: positive, 0: negative prompt
    "bbox": [x1, y1, x2, y2]                   # [x_min, y_min, x_max, y_max] format
}
```
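As an illustration, a prompt file in this format can be loaded and converted into predictor-ready arrays as follows; the file path is a hypothetical example following the structure above:
```python
import json

import numpy as np

# Hypothetical path following the dataset structure above.
with open("/path/to/dataset/MicroMat3K/prompt/fine/0001_01.json") as f:
    prompt = json.load(f)

points = np.array(prompt["point"])   # shape (N, 3): x, y, label
point_coords = points[:, :2]         # (N, 2) pixel coordinates
point_labels = points[:, 2]          # (N,) 1: positive, 0: negative
box = np.array(prompt["bbox"])       # [x1, y1, x2, y2]
```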
Evaluation
We provide an evaluation script, which includes a comparison with SAM, in script/run_eval.sh. Make sure the dataset is prepared as described above.
First, modify data_root in script/run_eval.sh:
```bash
...
data_root="/path/to/dataset/"
...
```
Then, run the evaluation script:
```bash
bash script/run_eval.sh
```
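For intuition, below is a minimal sketch of two metrics commonly reported for matting quality, SAD and MSE, assuming pred and gt are alpha mattes given as float arrays in [0, 1]. This is illustrative only and not the official evaluation code, which lives in script/run_eval.sh:
```python
import numpy as np

def sad(pred: np.ndarray, gt: np.ndarray) -> float:
    """Sum of Absolute Differences, conventionally scaled by 1/1000."""
    return float(np.abs(pred - gt).sum() / 1000.0)

def mse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Squared Error over all pixels."""
    return float(np.square(pred - gt).mean())
```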
The evaluation results on the MicroMat-3K dataset are as follows:

[Evaluation results table image]
How To Cite
```bibtex
@article{kim2024zim,
  title={ZIM: Zero-Shot Image Matting for Anything},
  author={Kim, Beomyoung and Shin, Chanyong and Jeong, Joonhyun and Jung, Hyungsik and Lee, Se-Yun and Chun, Sewhan and Hwang, Dong-Hyun and Yu, Joonsang},
  journal={arXiv preprint arXiv:2411.00626},
  year={2024}
}
```
License
ZIM
Copyright (c) 2024-present NAVER Cloud Corp.
CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)