# Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion (MiVOS)

Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang

CVPR 2021

[arXiv] [Paper PDF] [Project Page] [Demo] [Papers with Code] [Supplementary Material]
Newer: check out our new work Cutie. It also includes an interactive GUI!
New: see the STCN branch for a better and faster version.

<sub><sup>Credit (left to right): DAVIS 2017, Academy of Historical Fencing, Modern History TV</sup></sub>
We manage the project using three different repositories (which are actually in the paper title). This is the main repo, see also Mask-Propagation and Scribble-to-Mask.
## Overall structure and capabilities
| | MiVOS | Mask-Propagation | Scribble-to-Mask |
| ------------- |:-------------:|:-----:|:-----:|
| DAVIS/YouTube semi-supervised evaluation | :x: | :heavy_check_mark: | :x: |
| DAVIS interactive evaluation | :heavy_check_mark: | :x: | :x: |
| User interaction GUI tool | :heavy_check_mark: | :x: | :x: |
| Dense Correspondences | :x: | :heavy_check_mark: | :x: |
| Train propagation module | :x: | :heavy_check_mark: | :x: |
| Train S2M (interaction) module | :x: | :x: | :heavy_check_mark: |
| Train fusion module | :heavy_check_mark: | :x: | :x: |
| Generate more synthetic data | :heavy_check_mark: | :x: | :x: |
## Framework

## Requirements
We used these packages/versions in the development of this project. It is likely that higher versions of the same packages will also work. This is not an exhaustive list -- other common Python packages (e.g., Pillow) are expected and not listed.
- PyTorch `1.7.1`
- torchvision `0.8.2`
- OpenCV `4.2.0`
- Cython
- progressbar
- davis-interactive (https://github.com/albertomontesg/davis-interactive)
- PyQt5 for the GUI
- networkx `2.4` for DAVIS
- gitpython for training
- gdown for downloading pretrained models
Refer to the official PyTorch guide for installing PyTorch/torchvision. The rest can be installed by:
```bash
pip install PyQt5 davisinteractive progressbar2 opencv-python networkx gitpython gdown Cython
```
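As a quick sanity check (our addition, not part of the README), you can confirm that the core packages import and report the expected versions:

```bash
# Optional check: the imports below correspond to the packages listed above.
python -c "import torch, torchvision, cv2, networkx; print(torch.__version__, torchvision.__version__, cv2.__version__, networkx.__version__)"
```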
## Quick start

### GUI
1. Run `python download_model.py` to get all the required models.
2. Run `python interactive_gui.py --video <path to video>` or `python interactive_gui.py --images <path to a folder of images>`. A video has been prepared for you at `example/example.mp4` (a combined example follows this list).
3. If you need to label more than one object, additionally specify `--num_objects <number_of_objects>`. See all the argument options with `python interactive_gui.py --help`.
4. There are instructions in the GUI. You can also watch the demo videos for some ideas.
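Putting the steps together on the bundled example video (the `--num_objects` value here is only an illustration):

```bash
# Fetch the pretrained models, then label two objects in the bundled example video.
python download_model.py
python interactive_gui.py --video example/example.mp4 --num_objects 2
```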
### DAVIS Interactive VOS

See `eval_interactive_davis.py`. If you have downloaded the datasets and pretrained models using our script, you only need to specify the output path, i.e., `python eval_interactive_davis.py --output [somewhere]`.
### DAVIS/YouTube Semi-supervised VOS
Go to this repo: Mask-Propagation.
## Main Results

### DAVIS/YouTube semi-supervised results

See Mask-Propagation.
### DAVIS Interactive Track

All results are generated using the unmodified official DAVIS interactive bot without saving masks (`--save_mask` not specified) and with an RTX 2080Ti. We follow the official protocol.
Precomputed results, with the JSON summary: [Google Drive] [OneDrive]

The table below is produced by `eval_interactive_davis.py`:
| Model | AUC-J&F | J&F @ 60s |
| --- |:--:|:---:|
| Baseline | 86.0 | 86.6 |
| (+) Top-k | 87.2 | 87.8 |
| (+) BL30K pretraining | 87.4 | 88.0 |
| (+) Learnable fusion | 87.6 | 88.2 |
| (+) Difference-aware fusion (full model) | 87.9 | 88.5 |
| Full model, without BL30K for propagation/fusion | 87.4 | 88.0 |
| Full model, STCN backbone | 88.4 | 88.8 |
## Pretrained models

`python download_model.py` should get you all the models that you need (`pip install gdown` required).
## Training

### Data preparation

Datasets should be arranged in the following layout. You can use `download_datasets.py` (same as the one in Mask-Propagation) to get the DAVIS dataset, and manually download and extract `fusion_data` ([OneDrive]) and BL30K.
```
├── BL30K
├── DAVIS
│   └── 2017
│       ├── test-dev
│       │   ├── Annotations
│       │   └── ...
│       └── trainval
│           ├── Annotations
│           └── ...
├── fusion_data
└── MiVOS
```
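As a sanity check of the layout above (a sketch of ours, assuming you run it from the directory containing `MiVOS`):

```bash
# Verify that the expected dataset directories from the tree above exist.
for d in BL30K DAVIS/2017/trainval/Annotations DAVIS/2017/test-dev/Annotations fusion_data; do
  [ -d "$d" ] && echo "OK       $d" || echo "MISSING  $d"
done
```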
### BL30K
BL30K is a synthetic dataset rendered using Blender with ShapeNet's data. We break the dataset into six segments, each with approximately 5K videos.
The videos are organized in a similar format to DAVIS and YouTubeVOS, so dataloaders for those datasets can be used directly. Each video is 160 frames long, and each frame has a resolution of 768×512. There are 3-5 objects per video, and each object has a random smooth trajectory -- we tried to optimize the trajectories greedily to minimize object intersection (not guaranteed), so occlusions are still possible (and happen a lot in practice). See `generation/blender/generate_yaml.py` for details.
We found that using about half of the data is sufficient to reach full performance (although we still used all of it), while using less than one-sixth (~5K videos) is insufficient.
#### Download

Download via https://doi.org/10.13012/B2IDB-1702934_V1. Note that each segment is about 115GB in size -- 700GB in total. You are going to need ~1TB of free disk space to run the script (including the extraction buffer).
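Before starting, it is worth confirming the free space (our suggestion, not from the README):

```bash
# ~700GB of tarballs plus the extraction buffer: look for roughly 1TB free.
df -h .
```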
MD5 checksums:

```
35312550b9a75467b60e3b2be2ceac81  BL30K_a.tar
269e2f9ad34766b5f73fa117166c1731  BL30K_b.tar
a3f7c2a62028d0cda555f484200127b9  BL30K_c.tar
e659ed7c4e51f4c06326855f4aba8109  BL30K_d.tar
d704e86c5a6a9e920e5e84996c2e0858  BL30K_e.tar
bf73914d2888ad642bc01be60523caf6  BL30K_f.tar
```
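To verify the downloads, the checksum list above can be fed directly to `md5sum` (this snippet is ours; it assumes the tarballs sit in the current directory):

```bash
# Check the downloaded tarballs against the published MD5 sums.
md5sum -c <<'EOF'
35312550b9a75467b60e3b2be2ceac81  BL30K_a.tar
269e2f9ad34766b5f73fa117166c1731  BL30K_b.tar
a3f7c2a62028d0cda555f484200127b9  BL30K_c.tar
e659ed7c4e51f4c06326855f4aba8109  BL30K_d.tar
d704e86c5a6a9e920e5e84996c2e0858  BL30K_e.tar
bf73914d2888ad642bc01be60523caf6  BL30K_f.tar
EOF
```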
#### Generation
- Download ShapeNet.
- Install Blender (we used 2.82).
- Download a set of background and texture images. We used this repo (we specified "non-commercial reuse" in the script); the lists of keywords are provided in `generation/blender/*.json`.
- Generate a list of configuration files (`generation/blender/generate_yaml.py`).
- Run rendering on the configurations. See here. (Not documented in detail; ask if you have a question.)
### Fusion data

We use the propagation module to run through some data and obtain real outputs to train the fusion module. See the script `generate_fusion.py`.
Or you can download pre-generated fusion data: [Google Drive] [OneDrive]
### Training commands

These commands train the fusion module only.

```bash
CUDA_VISIBLE_DEVICES=[a,b] OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port [cccc] --nproc_per_node=2 train.py --id [defg] --stage [h]
```
We implemented training with Distributed Data Parallel (DDP) on two 11GB GPUs. Replace `a,b` with the GPU ids, `cccc` with an unused port number, `defg` with a unique experiment identifier, and `h` with the training stage (0/1).
The model is trained progressively in stages (0: BL30K; 1: DAVIS). After each stage finishes, we start the next one by loading the trained weights. A pretrained propagation model is required to train the fusion module.
One concrete example is:
Pre-training on the BL30K dataset:

```bash
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 7550 --nproc_per_node=2 train.py --load_prop saves/propagation_model.pth --stage 0 --id retrain_s0
```

Main training:

```bash
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 7550 --nproc_per_node=2 train.py --load_prop saves/propagation_model.pth --stage 1 --id retrain_s012 --load_network [path_to_trained_s0.pth]
```
## Credit

- f-BRS: https://github.com/saic-vul/fbrs_interactive_segmentation
- ivs-demo: https://github.com/seoungwugoh/ivs-demo
- deeplab: https://github.com/VainF/DeepLabV3Plus-Pytorch
- STM: https://github.com/seoungwugoh/STM
- BlenderProc: https://github.com/DLR-RM/BlenderProc
## Citation

Please cite our paper if you find this repo useful!