# SOLA

Official implementation of "Referring Video Object Segmentation via Language-aligned Track Selection".
## Environment Settings
```bash
conda create -n SOLA python=3.10
conda activate SOLA
pip install -r requirements.txt
```
As our work requires SAM2 and GroundingDINO, please follow the installation guides of each repository [SAM2, GroundingDINO]. Both repositories must be cloned into the `track_generation` directory:
```bash
cd track_generation

git clone https://github.com/facebookresearch/sam2.git
cd sam2
# (continue with the installation instructions in the SAM2 repository)
cd ..

git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
# (continue with the installation instructions in the GroundingDINO repository)
cd ../..
```
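Both installs are easy to get subtly wrong, so a quick importability check can save a debugging round. Below is a minimal sketch; the `check_deps` helper is ours, not part of the repo, and it assumes the SOLA conda environment is active:

```bash
# Hypothetical helper (not part of the SOLA repo): report whether the two
# dependencies are importable from the current environment.
check_deps() {
  local pkg
  for pkg in sam2 groundingdino; do
    if python -c "import $pkg" 2>/dev/null; then
      echo "$pkg: ok"
    else
      echo "$pkg: MISSING"
    fi
  done
}

check_deps
```

If either line reports `MISSING`, revisit the corresponding repository's install guide before proceeding.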
## Dataset Preparation
Download MeViS and Ref-Youtube-VOS into the `dataset` folder. The datasets should be organized as follows:
```
dataset/
├── mevis/
│   ├── train/
│   │   ├── JPEGImages/
│   │   ├── meta_expressions.json
│   │   └── mask_dict.json
│   ├── valid_u/
│   │   └── ...
│   └── valid/
│       └── ...
└── ref-ytbvos/
    ├── train/
    │   ├── Annotations/
    │   ├── JPEGImages/
    │   └── meta_expressions.json
    └── valid/
        └── ...
```
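Before running track generation, it is worth verifying that the layout matches the tree above. The `check_layout` helper below is a hypothetical sketch of ours (not part of the repo) that checks the required `train` entries:

```bash
# Hypothetical helper (not part of the SOLA repo): verify the dataset layout
# shown above before launching prompt/track generation.
check_layout() {
  local root="${1:-dataset}" missing=0 p
  for p in \
      mevis/train/JPEGImages \
      mevis/train/meta_expressions.json \
      mevis/train/mask_dict.json \
      ref-ytbvos/train/Annotations \
      ref-ytbvos/train/JPEGImages \
      ref-ytbvos/train/meta_expressions.json; do
    if [ ! -e "$root/$p" ]; then
      echo "missing: $root/$p"
      missing=1
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "layout OK"
  fi
}

check_layout dataset
```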
## Track Generation
For track generation, use the code inside the `track_generation` directory, which is assumed to contain both the SAM2 and GroundingDINO repositories. Each dataset and split requires both prompt generation and track generation; refer to the `scripts` directory for usage examples.
### MeViS (train / valid_u / valid)
```bash
# MeViS (train) - GT
CUDA_VISIBLE_DEVICES=0 python generate_tokens_GT_mevis.py --dataset mevis --data_type train --pid 0 --n_pids 1

# MeViS (train) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset mevis --data_type train --bin_size 4 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset mevis --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1

# MeViS (valid_u / valid) - GroundingDINO
CUDA_VISIBLE_DEVICES=0 python generate_prompts_gdino.py --dataset mevis --data_type valid_u --bin_size 4 --box_threshold 0.2 --text_threshold 0.25 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_gdino.py --dataset mevis --data_type valid_u --bin_size 4 --batch_size 4 --miou_thresh 0.7 --stability_score_thresh 0.85 --n_max_tracks 16 --pid 0 --n_pids 1

# MeViS (valid_u / valid) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset mevis --data_type valid_u --bin_size 0 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset mevis --data_type valid_u --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1
```
### Ref-Youtube-VOS (train / valid)
```bash
# Ref-Youtube-VOS (train) - GT
CUDA_VISIBLE_DEVICES=0 python generate_tokens_GT_ytbvos.py --dataset ref-ytbvos --data_type train --pid 0 --n_pid 1

# Ref-Youtube-VOS (train) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset ref-ytbvos --data_type train --bin_size 4 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset ref-ytbvos --data_type train --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1

# Ref-Youtube-VOS (valid) - GroundingDINO
CUDA_VISIBLE_DEVICES=0 python generate_prompts_gdino.py --dataset ref-ytbvos --data_type valid --bin_size 4 --box_threshold 0.2 --text_threshold 0.25 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_gdino.py --dataset ref-ytbvos --data_type valid --bin_size 4 --batch_size 4 --miou_thresh 0.7 --stability_score_thresh 0.85 --n_max_tracks 16 --pid 0 --n_pids 1

# Ref-Youtube-VOS (valid) - GRID
CUDA_VISIBLE_DEVICES=0 python generate_prompts_grid.py --dataset ref-ytbvos --data_type valid --bin_size 0 --pid 0 --n_pid 1
CUDA_VISIBLE_DEVICES=0 python generate_tokens_grid.py --dataset ref-ytbvos --data_type valid --bin_size 4 --batch_size 4 --miou_thresh 0.7 --n_max_tracks 64 --pid 0 --n_pids 1
```
These commands generate SAM2 object tokens and the corresponding masklets in the `sam2_tracks` directory.
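The `--pid` / `--n_pids` flags appear to be a worker index and worker count for sharding the videos across processes (our reading; not confirmed by this README). Under that assumption, generation could be parallelized with one worker per GPU, as in this sketch; the final call is a dry run that only prints the commands:

```bash
# Sketch (assumption): --pid / --n_pids shard the videos across n_pids
# workers, so one worker can be launched per GPU in parallel.
launch_workers() {
  n_gpus="$1"; runner="${2:-}"  # pass "echo" as runner for a dry run
  pid=0
  while [ "$pid" -lt "$n_gpus" ]; do
    $runner env "CUDA_VISIBLE_DEVICES=$pid" python generate_tokens_grid.py \
      --dataset mevis --data_type train --bin_size 4 --batch_size 4 \
      --miou_thresh 0.7 --n_max_tracks 64 \
      --pid "$pid" --n_pids "$n_gpus" &
    pid=$((pid + 1))
  done
  wait
}

# Dry run: print the four per-GPU commands instead of executing them.
launch_workers 4 echo
```

Drop the `echo` argument to actually launch the workers once the datasets and checkpoints are in place.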
## Track Selection
After generating SAM2 tracks and object tokens, train the model and run inference to obtain the final results. The `scripts` directory provides ready-to-use scripts.
```bash
# Training
sh train.sh mevis/default

# Evaluation
sh eval.sh mevis/default [epoch] --eval_pred_threshold [threshold]

# Inference
sh inference.sh mevis/default [epoch] --eval_pred_threshold [threshold]
```
To obtain zero-shot results:
```bash
# Zero-shot Evaluation
sh eval.sh mevis/zeroshot [epoch] --eval_pred_threshold [threshold]

# Zero-shot Inference
sh inference.sh mevis/zeroshot [epoch] --eval_pred_threshold [threshold]
```
## BibTeX
<p id="BibTeX"></p>

Please consider citing SOLA if it helps your research.
```bibtex
@article{kim2024referring,
  title={Referring Video Object Segmentation via Language-aligned Track Selection},
  author={Kim, Seongchan and Jin, Woojeong and Lim, Sangbeom and Yoon, Heeji and Choi, Hyunwook and Kim, Seungryong},
  journal={arXiv preprint arXiv:2412.01136},
  year={2024}
}
```
