InterRVOS
Official implementation of "InterRVOS: Interaction-aware Referring Video Object Segmentation".
Woojeong Jin Seongchan Kim Jaeho Lee Seungryong Kim† <br> KAIST AI <br> †: Corresponding Author
ArXiv 2025
<a href="https://arxiv.org/abs/2506.02356"> <img src="https://img.shields.io/badge/arXiv-2506.02356-B31B1B?logo=arxiv&logoColor=white"> </a> <a href="https://cvlab-kaist.github.io/InterRVOS/"> <img src="https://img.shields.io/badge/Project_Page-Available-1E90FF"> </a> <a href="https://huggingface.co/wooj0216/ReVIOSa-4B"> <img src="https://img.shields.io/badge/🤗 Huggingface Models-Available-6A5ACD" > </a> <a href="https://huggingface.co/datasets/wooj0216/InterRVOS-127K"> <img src="https://img.shields.io/badge/Dataset-Available-20B2AA" > </a> </div>

📢 News
- [ ] Upcoming: InterRVOS-127K dataset and ReVIOSa checkpoints
- [ ] Upcoming: Data annotation pipeline
- [x] Released: Training code, inference & evaluation code
- [x] Released: InterRVOS on ArXiv and Project Page
🎯 Release Progress
- [x] Model checkpoints
- [x] InterRVOS-127K dataset (Training & Evaluation)
- [x] Data annotation pipeline code
- [x] Inference & evaluation code
- [x] Training code
Overview
This repository contains the code for the paper InterRVOS: Interaction-aware Referring Video Object Segmentation.
In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the <b>actor</b> and <b>target</b> objects separately, reflecting their asymmetric roles in an interaction. Please refer to the project page for detailed visualization results.
Model Download
‼️ We release the pretrained ReVIOSa-1B and ReVIOSa-4B models on Hugging Face 🤗: ReVIOSa-1B and ReVIOSa-4B
🚀 Quick Start
```python
import os

import torch
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer
path = "wooj0216/ReVIOSa-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Collect the video frames in temporal order
video_folder = "/PATH/TO/VIDEO_FOLDER"
image_names = sorted(os.listdir(video_folder))
images_paths = [os.path.join(video_folder, image_name) for image_name in image_names]

text_prompts = "<image>Please segment the child reaching out to man."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}

return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]
masks = return_dict['prediction_masks']
```
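The returned masks can be written out as binary PNGs for quick inspection. A minimal sketch, assuming `prediction_masks` is a list with one entry per object, each a `(num_frames, H, W)` array of 0/1 values (check the model card for the exact format):

```python
import os

import numpy as np
from PIL import Image


def save_masks(prediction_masks, out_dir):
    """Save each predicted object's per-frame masks as binary PNGs.

    Assumes prediction_masks is a list (one entry per object) of arrays
    shaped (num_frames, H, W) with values in {0, 1}; the exact output
    format is model-specific.
    """
    os.makedirs(out_dir, exist_ok=True)
    for obj_idx, obj_masks in enumerate(prediction_masks):
        obj_masks = np.asarray(obj_masks)
        for frame_idx, mask in enumerate(obj_masks):
            # Binarize and scale to 0/255 so the mask is viewable as an image
            img = Image.fromarray((mask > 0).astype(np.uint8) * 255)
            img.save(os.path.join(out_dir, f"obj{obj_idx}_frame{frame_idx:04d}.png"))
```

For example, `save_masks(masks, "./mask_outputs")` after the quick-start snippet writes one PNG per object per frame, which makes it easy to check the actor and target masks side by side.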
Dataset
‼️ We release our dataset InterRVOS-127K on Hugging Face 🤗: wooj0216/InterRVOS-127K
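The dataset can be fetched locally with the Hugging Face CLI; a minimal sketch (the local directory name is arbitrary):

```shell
pip install -U "huggingface_hub[cli]"
huggingface-cli download wooj0216/InterRVOS-127K \
    --repo-type dataset \
    --local-dir ./InterRVOS-127K
```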
Model Training & Inference
Instructions for training, inference, and evaluation are provided in ReVIOSa/README.md.
Data Annotation
Our automatic data-annotation pipeline is provided in the data_annotation directory.
Acknowledgement
This project is based on Sa2VA. Many thanks to the authors for their great work!
References
If you find this repository useful, please consider citing the following paper:
@misc{jin2025interrvosinteractionawarereferringvideo,
title={InterRVOS: Interaction-aware Referring Video Object Segmentation},
author={Woojeong Jin and Seongchan Kim and Jaeho Lee and Seungryong Kim},
year={2025},
eprint={2506.02356},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.02356},
}
