ReferDINO

(ICCV 2025) ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations


<div align="center"> <h2>ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations</h2>

Tianming Liang¹ Kun-Yu Lin¹ Chaolei Tan¹ Jianguo Zhang² Wei-Shi Zheng¹ Jian-Fang Hu¹*

¹Sun Yat-sen University ²Southern University of Science and Technology

ICCV 2025

<h3 align="center"> <a href="https://isee-laboratory.github.io/ReferDINO/" target='_blank'>Project Page</a> | <a href="https://arxiv.org/abs/2501.14607" target='_blank'>Paper</a> | <a href="https://huggingface.co/spaces/liangtm/referdino" target='_blank'>Demo</a> </h3> </div>


📢 News

2025.8.09 Our demo is available on HuggingFace Space! Try here!
2025.8.09 Demo script is available.
2025.6.28 Swin-B checkpoints are released.
2025.6.27 All training and inference code for ReferDINO is released.
2025.6.25 ReferDINO is accepted to ICCV 2025! 🎉
2025.3.28 Our ReferDINO-Plus, an ensemble approach of ReferDINO and SAM2, achieved the 2nd place in PVUW challenge RVOS Track at CVPR 2025! 🎉 See our report for details!

👨‍💻 TODO

[X] Release demo code and online demo.
[X] Release model weights.
[X] Release training and inference code.

🔎 Framework

model

Environment Setup

We have tested our code in PyTorch 1.11 and 2.5.1, so either version should be compatible.

# Clone the repo
git clone https://github.com/iSEE-Laboratory/ReferDINO.git
cd ReferDINO

# [Optional] Create a clean Conda environment
conda create -n referdino python=3.10 -y
conda activate referdino

# Pytorch
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1  pytorch-cuda=11.8 -c pytorch -c nvidia

# MultiScaleDeformableAttention
cd models/GroundingDINO/ops
python setup.py build install
python test.py
cd ../../..

# Other dependencies
pip install -r requirements.txt

Download the pretrained GroundingDINO checkpoints as follows and put them in the directory pretrained.

wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth
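
If you prefer to script this step, the following is a minimal sketch that fetches the same two files into pretrained and skips files that are already present. The function name `download_missing` and its `fetch` parameter are hypothetical helpers introduced here for illustration; they are not part of the ReferDINO code base.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URLs copied from the wget commands above.
GROUNDINGDINO_URLS = [
    "https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth",
    "https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth",
]


def download_missing(urls, dest_dir="pretrained", fetch=urlretrieve):
    """Download each URL into dest_dir unless a file with that name already exists.

    Hypothetical helper: `fetch` is injectable so the logic can be tried
    without network access. Returns the list of newly fetched paths.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    fetched = []
    for url in urls:
        # Use the last path component of the URL as the local file name.
        target = dest / Path(urlparse(url).path).name
        if not target.exists():
            fetch(url, target)
            fetched.append(target)
    return fetched
```

Calling `download_missing(GROUNDINGDINO_URLS)` from the repository root should leave both checkpoints in pretrained; a second call downloads nothing.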

Try ReferDINO on your video

We provide a script to quickly apply ReferDINO to a given video and text description.

python demo_video.py <video_path> --text "a description for the target" -ckpt ckpt/ryt_mevis_swinb.pth

Data Preparation

Please refer to DATA.md for data preparation.

The directory structure is organized as follows.

ReferDINO/
├── configs
├── data
│   ├── coco
│   ├── a2d_sentences
│   ├── jhmdb_sentences
│   ├── mevis
│   ├── ref_davis
│   └── ref_youtube_vos
├── datasets
├── models
├── eval
├── tools
├── util
├── pretrained
├── ckpt
├── misc.py
├── pretrainer.py
├── trainer.py
└── main.py

Get Started

The results will be saved in output/{dataset}/{version}/. If you encounter OOM errors, please reduce the batch_size or the num_frames in the config file.

Pretrain Swin-B on the coco datasets with 8 GPUs. You can also specify the GPU indices explicitly, e.g. --gids 0 1 2 3.

python main.py -c configs/coco_swinb.yaml -rm pretrain -bs 12 -ng 8 --epochs 20 --version swinb --eval_off

Finetune on Refer-YouTube-VOS with the pretrained checkpoint.

python main.py -c configs/ytvos_swinb.yaml -rm train -bs 2 -ng 8 --version swinb -pw ckpt/coco_swinb.pth --eval_off
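
When launching several runs from a script, it can help to assemble these command lines programmatically. The sketch below is an assumption-laden illustration: `build_main_command` is a hypothetical helper, and it uses only the flags that appear in the two commands above (-c, -rm, -bs, -ng, --epochs, --version, -pw, --eval_off).

```python
import shlex


def build_main_command(config, run_mode, batch_size, num_gpus, version,
                       pretrained_weights=None, epochs=None, eval_off=True):
    """Build a `python main.py ...` argument list (hypothetical helper).

    Only flags shown in this README are emitted; the order follows the
    README commands so the result can be compared against them.
    """
    cmd = ["python", "main.py", "-c", config, "-rm", run_mode,
           "-bs", str(batch_size), "-ng", str(num_gpus)]
    if epochs is not None:
        cmd += ["--epochs", str(epochs)]
    cmd += ["--version", version]
    if pretrained_weights:
        cmd += ["-pw", pretrained_weights]
    if eval_off:
        cmd.append("--eval_off")
    return cmd


# Rebuild the pretraining and finetuning commands from this section.
pretrain = build_main_command("configs/coco_swinb.yaml", "pretrain", 12, 8,
                              "swinb", epochs=20)
finetune = build_main_command("configs/ytvos_swinb.yaml", "train", 2, 8,
                              "swinb", pretrained_weights="ckpt/coco_swinb.pth")
print(shlex.join(pretrain))
print(shlex.join(finetune))
```

The returned list can be passed directly to `subprocess.run` from the repository root.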

Inference on Refer-YouTube-VOS.

PYTHONPATH=. python eval/inference_ytvos.py -c configs/ytvos_swinb.yaml -ng 8 -ckpt ckpt/ryt_swinb.pth --version swinb

Inference on MeViS Valid Set.

PYTHONPATH=. python eval/inference_mevis.py --split valid -c configs/mevis_swinb.yaml -ng 8 -ckpt ckpt/mevis_swinb.pth --version swinb

Inference on A2D-Sentences (or JHMDB-Sentences).

PYTHONPATH=. python main.py -c configs/a2d_swinb.yaml -rm train -ng 8 --version swinb -ckpt ckpt/a2d_swinb.pth --eval

Model Zoo

We have released the following model weights on HuggingFace. Please download them and put them in the directory ckpt.

| Train Set                    | Backbone | Checkpoint          |
|------------------------------|:--------:|---------------------|
| coco                         | Swin-B   | coco_swinb.pth      |
| coco, ref-youtube-vos        | Swin-B   | ryt_swinb.pth       |
| coco, a2d-sentences          | Swin-B   | a2d_swinb.pth       |
| mevis                        | Swin-B   | mevis_swinb.pth     |
| coco, ref-youtube-vos, mevis | Swin-B   | ryt_mevis_swinb.pth |
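
To see which of these checkpoints are already in place before running inference, a small check along the following lines may be useful. This is only a sketch: `MODEL_ZOO` and `checkpoint_status` are hypothetical names, and the file names are taken from the table above.

```python
from pathlib import Path

# Checkpoint file name -> training sets, copied from the Model Zoo table.
MODEL_ZOO = {
    "coco_swinb.pth": ["coco"],
    "ryt_swinb.pth": ["coco", "ref-youtube-vos"],
    "a2d_swinb.pth": ["coco", "a2d-sentences"],
    "mevis_swinb.pth": ["mevis"],
    "ryt_mevis_swinb.pth": ["coco", "ref-youtube-vos", "mevis"],
}


def checkpoint_status(ckpt_dir="ckpt"):
    """Return {file name: bool} telling whether each checkpoint exists in ckpt_dir."""
    root = Path(ckpt_dir)
    return {name: (root / name).is_file() for name in MODEL_ZOO}


# Report the status relative to the current working directory.
for name, present in checkpoint_status().items():
    state = "found" if present else "missing"
    print(f"{name:22s} {state:8s} trained on: {', '.join(MODEL_ZOO[name])}")
```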

Acknowledgements

Our code is built upon ReferFormer, SOC and GroundingDINO. We sincerely appreciate these efforts.

Citation

If you find our work helpful for your research, please consider citing our paper.

@inproceedings{liang2025referdino,
    author    = {Liang, Tianming and Lin, Kun-Yu and Tan, Chaolei and Zhang, Jianguo and Zheng, Wei-Shi and Hu, Jian-Fang},
    title     = {ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {20009-20019}
}
