# Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion (MiVOS)

Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang

CVPR 2021

[arXiv] [Paper PDF] [Project Page] [Demo] [Papers with Code] [Supplementary Material]
Newer: check out our new work Cutie. It also includes an interactive GUI!
New: see the STCN branch for a better and faster version.

<sub><sup>Credit (left to right): DAVIS 2017, Academy of Historical Fencing, Modern History TV</sup></sub>
We manage the project using three different repositories (which are actually in the paper title). This is the main repo, see also Mask-Propagation and Scribble-to-Mask.
## Overall structure and capabilities
| | MiVOS | Mask-Propagation | Scribble-to-Mask |
| ------------- |:-------------:|:-----:|:-----:|
| DAVIS/YouTube semi-supervised evaluation | :x: | :heavy_check_mark: | :x: |
| DAVIS interactive evaluation | :heavy_check_mark: | :x: | :x: |
| User interaction GUI tool | :heavy_check_mark: | :x: | :x: |
| Dense Correspondences | :x: | :heavy_check_mark: | :x: |
| Train propagation module | :x: | :heavy_check_mark: | :x: |
| Train S2M (interaction) module | :x: | :x: | :heavy_check_mark: |
| Train fusion module | :heavy_check_mark: | :x: | :x: |
| Generate more synthetic data | :heavy_check_mark: | :x: | :x: |
## Framework

## Requirements
We used these packages/versions in the development of this project. It is likely that higher versions of the same packages will also work. This is not an exhaustive list -- other common Python packages (e.g., Pillow) are expected and not listed.
- PyTorch `1.7.1`
- torchvision `0.8.2`
- OpenCV `4.2.0`
- Cython
- progressbar
- davis-interactive (https://github.com/albertomontesg/davis-interactive)
- PyQt5 for the GUI
- networkx `2.4` for DAVIS
- gitpython for training
- gdown for downloading pretrained models
Refer to the official PyTorch guide for installing PyTorch/torchvision. The rest can be installed by:
```bash
pip install PyQt5 davisinteractive progressbar2 opencv-python networkx gitpython gdown Cython
```
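As a quick sanity check (our addition, not part of the README), you can confirm that the core packages import and report the expected versions:

```bash
# Optional check: the imports below correspond to the packages listed above.
python -c "import torch, torchvision, cv2, networkx; print(torch.__version__, torchvision.__version__, cv2.__version__, networkx.__version__)"
```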
## Quick start

### GUI
1. Run `python download_model.py` to get all the required models.
2. Run `python interactive_gui.py --video <path to video>` or `python interactive_gui.py --images <path to a folder of images>`. A video has been prepared for you at `example/example.mp4` (a combined example follows this list).
3. If you need to label more than one object, additionally specify `--num_objects <number_of_objects>`. See all the argument options with `python interactive_gui.py --help`.
4. There are instructions in the GUI. You can also watch the demo videos for some ideas.
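Putting the steps together on the bundled example video (the `--num_objects` value here is only an illustration):

```bash
# Fetch the pretrained models, then label two objects in the bundled example video.
python download_model.py
python interactive_gui.py --video example/example.mp4 --num_objects 2
```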
### DAVIS Interactive VOS

See `eval_interactive_davis.py`. If you have downloaded the datasets and pretrained models using our script, you only need to specify the output path, i.e., `python eval_interactive_davis.py --output [somewhere]`.
### DAVIS/YouTube Semi-supervised VOS
Go to this repo: Mask-Propagation.
## Main Results

### DAVIS/YouTube semi-supervised results

See Mask-Propagation.
### DAVIS Interactive Track

All results are generated using the unmodified official DAVIS interactive bot without saving masks (`--save_mask` not specified) and with an RTX 2080Ti. We follow the official protocol.
Precomputed results, with the JSON summary: [Google Drive] [OneDrive]

The table below is produced by `eval_interactive_davis.py`:
| Model | AUC-J&F | J&F @ 60s |
| --- |:--:|:---:|
| Baseline | 86.0 | 86.6 |
| (+) Top-k | 87.2 | 87.8 |
| (+) BL30K pretraining | 87.4 | 88.0 |
| (+) Learnable fusion | 87.6 | 88.2 |
| (+) Difference-aware fusion (full model) | 87.9 | 88.5 |
| Full model, without BL30K for propagation/fusion | 87.4 | 88.0 |
| Full model, STCN backbone | 88.4 | 88.8 |
## Pretrained models

`python download_model.py` should get you all the models that you need (`pip install gdown` required).
## Training

### Data preparation

Datasets should be arranged in the following layout. You can use `download_datasets.py` (same as the one in Mask-Propagation) to get the DAVIS dataset, and manually download and extract `fusion_data` ([OneDrive]) and BL30K.
```
├── BL30K
├── DAVIS
│   └── 2017
│       ├── test-dev
│       │   ├── Annotations
│       │   └── ...
│       └── trainval
│           ├── Annotations
│           └── ...
├── fusion_data
└── MiVOS
```
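As a sanity check of the layout above (a sketch of ours, assuming you run it from the directory containing `MiVOS`):

```bash
# Verify that the expected dataset directories from the tree above exist.
for d in BL30K DAVIS/2017/trainval/Annotations DAVIS/2017/test-dev/Annotations fusion_data; do
  [ -d "$d" ] && echo "OK       $d" || echo "MISSING  $d"
done
```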
### BL30K
BL30K is a synthetic dataset rendered using Blender with ShapeNet's data. We break the dataset into six segments, each with approximately 5K videos.
The videos are organized in a similar format to DAVIS and YouTubeVOS, so dataloaders for those datasets can be used directly. Each video is 160 frames long, and each frame has a resolution of 768×512. There are 3-5 objects per video, and each object has a random smooth trajectory -- we tried to optimize the trajectories greedily to minimize object intersection (not guaranteed), so occlusions are still possible (and happen a lot in practice). See `generation/blender/generate_yaml.py` for details.
We found that using about half of the data is sufficient to reach full performance (although we still used all of it), while using less than one-sixth (~5K videos) is insufficient.
#### Download

Download via https://doi.org/10.13012/B2IDB-1702934_V1. Note that each segment is about 115GB in size -- 700GB in total. You are going to need ~1TB of free disk space to run the script (including the extraction buffer).
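Before starting, it is worth confirming the free space (our suggestion, not from the README):

```bash
# ~700GB of tarballs plus the extraction buffer: look for roughly 1TB free.
df -h .
```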
MD5 checksums:

```
35312550b9a75467b60e3b2be2ceac81  BL30K_a.tar
269e2f9ad34766b5f73fa117166c1731  BL30K_b.tar
a3f7c2a62028d0cda555f484200127b9  BL30K_c.tar
e659ed7c4e51f4c06326855f4aba8109  BL30K_d.tar
d704e86c5a6a9e920e5e84996c2e0858  BL30K_e.tar
bf73914d2888ad642bc01be60523caf6  BL30K_f.tar
```
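To verify the downloads, the checksum list above can be fed directly to `md5sum` (this snippet is ours; it assumes the tarballs sit in the current directory):

```bash
# Check the downloaded tarballs against the published MD5 sums.
md5sum -c <<'EOF'
35312550b9a75467b60e3b2be2ceac81  BL30K_a.tar
269e2f9ad34766b5f73fa117166c1731  BL30K_b.tar
a3f7c2a62028d0cda555f484200127b9  BL30K_c.tar
e659ed7c4e51f4c06326855f4aba8109  BL30K_d.tar
d704e86c5a6a9e920e5e84996c2e0858  BL30K_e.tar
bf73914d2888ad642bc01be60523caf6  BL30K_f.tar
EOF
```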
#### Generation
- Download ShapeNet.
- Install Blender (we used 2.82).
- Download a set of background and texture images. We used this repo (we specified "non-commercial reuse" in the script); the lists of keywords are provided in `generation/blender/*.json`.
- Generate a list of configuration files (`generation/blender/generate_yaml.py`).
- Run rendering on the configurations. See here. (Not documented in detail; ask if you have a question.)
### Fusion data

We use the propagation module to run through some data and obtain real outputs to train the fusion module. See the script `generate_fusion.py`.
Or you can download pre-generated fusion data: [Google Drive] [OneDrive]
### Training commands

These commands train the fusion module only.

```bash
CUDA_VISIBLE_DEVICES=[a,b] OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port [cccc] --nproc_per_node=2 train.py --id [defg] --stage [h]
```
We implemented training with Distributed Data Parallel (DDP) on two 11GB GPUs. Replace `a,b` with the GPU ids, `cccc` with an unused port number, `defg` with a unique experiment identifier, and `h` with the training stage (0/1).
The model is trained progressively in stages (0: BL30K; 1: DAVIS). After each stage finishes, we start the next one by loading the trained weights. A pretrained propagation model is required to train the fusion module.
One concrete example is:
Pre-training on the BL30K dataset:

```bash
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 7550 --nproc_per_node=2 train.py --load_prop saves/propagation_model.pth --stage 0 --id retrain_s0
```

Main training:

```bash
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 7550 --nproc_per_node=2 train.py --load_prop saves/propagation_model.pth --stage 1 --id retrain_s012 --load_network [path_to_trained_s0.pth]
```
## Credit

- f-BRS: https://github.com/saic-vul/fbrs_interactive_segmentation
- ivs-demo: https://github.com/seoungwugoh/ivs-demo
- deeplab: https://github.com/VainF/DeepLabV3Plus-Pytorch
- STM: https://github.com/seoungwugoh/STM
- BlenderProc: https://github.com/DLR-RM/BlenderProc
## Citation

Please cite our paper if you find this repo useful!