OnlineRefer
[ICCV 2023] OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation
Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, Jianbing Shen
Abstract
Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction. Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with text embedding for cross-modal understanding. They usually present that the offline pattern is necessary for RVOS, yet model limited temporal association within each clip. In this work, we break up the previous offline belief and propose a simple yet effective online model using explicit query propagation, named OnlineRefer. Specifically, our approach leverages target cues that gather semantic information and position prior to improve the accuracy and ease of referring predictions for the current frame. Furthermore, we generalize our online model into a semi-online framework to be compatible with video-based backbones. To show the effectiveness of our method, we evaluate it on four benchmarks, i.e., Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17, outperforming all other offline methods.
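The core idea above (explicit query propagation: the output queries of one frame, carrying semantic and position cues, initialize the queries of the next frame) can be illustrated with a minimal PyTorch sketch. This is a conceptual toy, not the official model: the module names, the dot-product attention stand-in for the decoder, and the linear query-update transform are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class QueryPropagation(nn.Module):
    """Toy sketch of explicit query propagation across frames (not the official code)."""

    def __init__(self, dim=256, num_queries=5):
        super().__init__()
        # learned object queries used only for the first frame
        self.init_queries = nn.Embedding(num_queries, dim)
        # placeholder for the transform that updates queries before the next frame
        self.update = nn.Linear(dim, dim)

    def forward(self, frame_feats):
        # frame_feats: list of per-frame feature maps, each of shape (num_tokens, dim)
        queries = self.init_queries.weight  # (num_queries, dim)
        outputs = []
        for feats in frame_feats:
            # stand-in for cross-modal decoding: attend queries to frame features
            attn = torch.softmax(queries @ feats.t() / feats.shape[-1] ** 0.5, dim=-1)
            # updated queries are propagated to the next frame instead of being reset
            queries = self.update(attn @ feats)
            outputs.append(queries)
        return outputs
```

In an offline model, `queries` would be reset to `self.init_queries.weight` for every clip; here they persist, which is what lets the model carry target identity through the video.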
Update
- (2023/07/18) OnlineRefer is accepted by ICCV2023. The online mode is released.
Setup
The main setup of our code follows ReferFormer.
Please refer to install.md for installation.
Please refer to data.md for data preparation.
Training and Evaluation
To train and evaluate our online model on Ref-Youtube-VOS with the ResNet-50 backbone, run:
sh ./scripts/online_ytvos_r50.sh
To train and evaluate our online model on Ref-Youtube-VOS with the Swin-L backbone, run:
sh ./scripts/online_ytvos_swinl.sh
To run inference on your own video sequence, run:
python inference_long_videos.py
Note: The models with ResNet-50 are trained on 8 NVIDIA 2080 Ti GPUs, and the models with Swin-L are trained on 8 NVIDIA Tesla V100 GPUs.
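Conceptually, inference on a long video runs frame by frame, passing the previous frame's queries forward instead of re-decoding whole clips. The sketch below shows only that calling pattern; `toy_model` and its `(frame, text_embed, queries)` signature are hypothetical stand-ins, not the repository's API.

```python
import torch

def run_online(model, frames, text_embed):
    """Stream frames through an online RVOS-style model; queries persist across frames."""
    queries = None  # first frame uses the model's learned initial queries
    masks = []
    for frame in frames:
        mask, queries = model(frame, text_embed, queries)  # hypothetical signature
        masks.append(mask)
    return masks

def toy_model(frame, text_embed, queries):
    # toy stand-in: returns an empty mask and updated queries to show the data flow
    q = torch.zeros(5, 16) if queries is None else queries + 1.0
    return torch.zeros(frame.shape[-2:]), q
```

Because only the queries are carried between frames, memory stays constant in video length, which is what makes arbitrarily long input sequences practical.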
Model Zoo
Ref-Youtube-VOS
To obtain test scores, please upload the prediction zip file to the competition server.
| Backbone | J&F | J | F | Pretrain | Model | Submission |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ResNet-50 | 57.3 | 55.6 | 58.9 | weight | model | link |
| Swin-L | 63.5 | 61.6 | 65.5 | weight | model | link |
| Video Swin-B | 62.9 | 61.0 | 64.7 | - | - | link |
Ref-DAVIS17
As described in the paper, we report the results of the model trained on Ref-Youtube-VOS without fine-tuning.
| Backbone | J&F | J | F | Model |
| :---: | :---: | :---: | :---: | :---: |
| ResNet-50 | 59.3 | 55.7 | 62.9 | model |
| Swin-L | 64.8 | 61.6 | 67.7 | model |
Citation
If you find OnlineRefer useful in your research, please consider citing:
@inproceedings{wu2023onlinerefer,
title={OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation},
author={Wu, Dongming and Wang, Tiancai and Zhang, Yuang and Zhang, Xiangyu and Shen, Jianbing},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={2761--2770},
year={2023}
}
Acknowledgement