OnlineRefer
[ICCV 2023] OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation
Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, Jianbing Shen
Abstract
Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction. Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with text embedding for cross-modal understanding. They usually present that the offline pattern is necessary for RVOS, yet model limited temporal association within each clip. In this work, we break up the previous offline belief and propose a simple yet effective online model using explicit query propagation, named OnlineRefer. Specifically, our approach leverages target cues that gather semantic information and position prior to improve the accuracy and ease of referring predictions for the current frame. Furthermore, we generalize our online model into a semi-online framework to be compatible with video-based backbones. To show the effectiveness of our method, we evaluate it on four benchmarks, i.e., Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17, outperforming all other offline methods.
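The core idea above (explicit query propagation: the output queries of one frame, carrying semantic and position cues, initialize the queries of the next frame) can be illustrated with a minimal PyTorch sketch. This is a conceptual toy, not the official model: the module names, the dot-product attention stand-in for the decoder, and the linear query-update transform are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class QueryPropagation(nn.Module):
    """Toy sketch of explicit query propagation across frames (not the official code)."""

    def __init__(self, dim=256, num_queries=5):
        super().__init__()
        # learned object queries used only for the first frame
        self.init_queries = nn.Embedding(num_queries, dim)
        # placeholder for the transform that updates queries before the next frame
        self.update = nn.Linear(dim, dim)

    def forward(self, frame_feats):
        # frame_feats: list of per-frame feature maps, each of shape (num_tokens, dim)
        queries = self.init_queries.weight  # (num_queries, dim)
        outputs = []
        for feats in frame_feats:
            # stand-in for cross-modal decoding: attend queries to frame features
            attn = torch.softmax(queries @ feats.t() / feats.shape[-1] ** 0.5, dim=-1)
            # updated queries are propagated to the next frame instead of being reset
            queries = self.update(attn @ feats)
            outputs.append(queries)
        return outputs
```

In an offline model, `queries` would be reset to `self.init_queries.weight` for every clip; here they persist, which is what lets the model carry target identity through the video.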
Update
- (2023/07/18) OnlineRefer is accepted by ICCV2023. The online mode is released.
Setup
The main setup of our code follows ReferFormer.
Please refer to install.md for installation.
Please refer to data.md for data preparation.
Training and Evaluation
To train and evaluate our online model on Ref-Youtube-VOS with the ResNet-50 backbone, run:
sh ./scripts/online_ytvos_r50.sh
To train and evaluate our online model on Ref-Youtube-VOS with the Swin-L backbone, run:
sh ./scripts/online_ytvos_swinl.sh
To run inference on your own video sequence, run:
python inference_long_videos.py
Note: The models with ResNet-50 are trained on 8 NVIDIA 2080 Ti GPUs, and the models with Swin-L are trained on 8 NVIDIA Tesla V100 GPUs.
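Conceptually, inference on a long video runs frame by frame, passing the previous frame's queries forward instead of re-decoding whole clips. The sketch below shows only that calling pattern; `toy_model` and its `(frame, text_embed, queries)` signature are hypothetical stand-ins, not the repository's API.

```python
import torch

def run_online(model, frames, text_embed):
    """Stream frames through an online RVOS-style model; queries persist across frames."""
    queries = None  # first frame uses the model's learned initial queries
    masks = []
    for frame in frames:
        mask, queries = model(frame, text_embed, queries)  # hypothetical signature
        masks.append(mask)
    return masks

def toy_model(frame, text_embed, queries):
    # toy stand-in: returns an empty mask and updated queries to show the data flow
    q = torch.zeros(5, 16) if queries is None else queries + 1.0
    return torch.zeros(frame.shape[-2:]), q
```

Because only the queries are carried between frames, memory stays constant in video length, which is what makes arbitrarily long input sequences practical.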
Model Zoo
Ref-Youtube-VOS
To obtain test scores, please upload the prediction zip file to the competition server.
| Backbone | J&F | J | F | Pretrain | Model | Submission |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ResNet-50 | 57.3 | 55.6 | 58.9 | weight | model | link |
| Swin-L | 63.5 | 61.6 | 65.5 | weight | model | link |
| Video Swin-B | 62.9 | 61.0 | 64.7 | - | - | link |
Ref-DAVIS17
As described in the paper, we report the results of the model trained on Ref-Youtube-VOS without fine-tuning.
| Backbone | J&F | J | F | Model |
| :---: | :---: | :---: | :---: | :---: |
| ResNet-50 | 59.3 | 55.7 | 62.9 | model |
| Swin-L | 64.8 | 61.6 | 67.7 | model |
Citation
If you find OnlineRefer useful in your research, please consider citing:
@inproceedings{wu2023onlinerefer,
title={OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation},
author={Wu, Dongming and Wang, Tiancai and Zhang, Yuang and Zhang, Xiangyu and Shen, Jianbing},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={2761--2770},
year={2023}
}
Acknowledgement