VistaDPO
[ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
Haojian Huang<sup>1*</sup>, Haodong Chen<sup>2*</sup>, Shengqiong Wu<sup>3</sup>, Meng Luo<sup>3</sup>, Jinlan Fu<sup>3</sup>, Xinya Du<sup>4</sup>, Hanwang Zhang<sup>5</sup>, Hao Fei<sup>3†</sup> <br><br> <sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author <br> <sup>1</sup>HKU, <sup>2</sup>HKUST, <sup>3</sup>NUS, <sup>4</sup>UTD, <sup>5</sup>NTU
<h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest updates. </h5>

<a href='https://arxiv.org/abs/2504.13122'><img src='https://img.shields.io/badge/arXiv-2504.13122-b31b1b.svg'></a> <a href='https://huggingface.co/datasets/Harold328/VistaDPO-7K'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20VistaDPO7K-Dataset-blue'></a>
</div>

Abstract
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination.
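The three hierarchical levels can be pictured as a weighted sum of standard DPO logistic losses, one per level. The sketch below is illustrative only: the function names, the `beta` value, and the uniform level weights are our assumptions, not the paper's exact formulation.

```python
import math

def dpo_term(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO logistic loss on one chosen/rejected pair:
    # -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def hierarchical_dpo_loss(instance, temporal, perceptive, weights=(1.0, 1.0, 1.0)):
    # Each level contributes its own preference term over (logp_w, logp_l,
    # ref_logp_w, ref_logp_l) tuples; the weights here are hypothetical.
    terms = (dpo_term(*instance), dpo_term(*temporal), dpo_term(*perceptive))
    return sum(w * t for w, t in zip(weights, terms))
```

The loss decreases as the policy assigns a larger margin to the chosen response at every level of the hierarchy.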
<table class="center"> <tr> <td><img src="assets/vistadpo.png"></td> </tr> </table>

🔥 Update
- [2025.05.01]: VistaDPO has been accepted to ICML 2025!
- [2025.04.18]: Released the VistaDPO paper.
- [2025.04.03]: Initialized this GitHub repository and released the training & inference code of VistaDPO on Video-LLaVA.
🧰 TODO
- [x] Release Paper.
- [ ] Release VistaDPO-7K.
- [ ] Release VistaDPO model weights.
- [ ] Release code of VistaDPO on PLLaVA.
📝 Data
Training data
We use our proposed VistaDPO-7K for training, which can be found on HuggingFace. In this repo, we provide a sample subset for reference under data.
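For orientation, each record in such a dataset pairs a chosen and a rejected answer with spatial-temporal grounding metadata. The field names and values below are hypothetical, for illustration only; consult the HuggingFace dataset card for the actual schema.

```python
import json

# Hypothetical VistaDPO-7K-style record; field names are illustrative,
# not the dataset's actual schema.
record = json.loads("""
{
  "video": "videos/example.mp4",
  "question": "What is the man doing?",
  "chosen": "He is riding a bicycle along the beach.",
  "rejected": "He is swimming in a pool.",
  "timestamps": [2.0, 5.5],
  "keyframes": [48, 132],
  "bboxes": [[0.21, 0.35, 0.62, 0.88]]
}
""")

start, end = record["timestamps"]     # grounded event window in seconds
x1, y1, x2, y2 = record["bboxes"][0]  # normalized object box on a keyframe
```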
Evaluation data
The evaluation datasets used in our work are listed below:
- Video Hallucination: VideoHallucer, EventHallusion.
- Video QA: MSVD, MSR-VTT, TGIF, ActivityNet, MVBench.
- Video Captioning: VideoChatGPT Bench.
🚀 Install
- Clone this repository and navigate to the source folder

```shell
cd VistaDPO
```
- Build Environment

```shell
echo "Creating conda environment"
conda create -n VistaDPO python=3.10
conda activate VistaDPO

echo "Installing dependencies"
pip install -r requirements.txt
```
📍 Inference

```python
import os

import torch

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from inference.inference_utils import ModelInference, decode2frame

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

video_path = "./data/videos/_GTwKEPmB-U_5183.mp4"
# CACHE_DIR = "/data/VistaDPO/cache"
model_path = "./checkpoints/VistaDPO"

# Load the VistaDPO checkpoint and build the inference wrapper
model_name = get_model_name_from_path(model_path)
tokenizer, model, processor, context_len = load_pretrained_model(
    model_path, model_base=None, device=device, model_name=model_name
)
inference_model = ModelInference(
    model=model, tokenizer=tokenizer, processor=processor, context_len=context_len
)

question = "What is the evident theme in the video?"

# Option 1: our pipeline — decode the video into frames first
frame_dir, _ = os.path.splitext(video_path)
decode2frame(video_path, frame_dir, verbose=True)
response = inference_model.generate(
    question=question,
    modal_path=frame_dir,
    temperature=0,
)
print(response)

# Option 2: decode on the fly with the decord backend
response = inference_model.generate(
    question=question,
    modal_path=video_path,
    temperature=0,
    video_decode_backend="decord",
)
print(response)
```
🚩 Training

For VistaDPO training, complete the setup above and run:

```shell
bash dpo_scripts/train_dpo.sh
```
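Conceptually, the temporal- and perceptive-level terms score only the grounded portion of the input: frames inside the annotated event window, or tokens tied to a grounded object. A minimal sketch of such masking, with hypothetical names and values (not the repo's actual implementation):

```python
def masked_logprob(per_item_logps, mask):
    # Sum log-probabilities where the 0/1 mask is on, e.g. only the frames
    # inside a grounded event window or only object-related response tokens.
    return sum(lp for lp, m in zip(per_item_logps, mask) if m)

# Hypothetical example: six frame-level log-probs, event spans frames 2-4.
frame_logps = [-1.2, -0.8, -0.3, -0.2, -0.4, -1.0]
event_mask = [0, 0, 1, 1, 1, 0]
event_logp = masked_logprob(frame_logps, event_mask)  # sums only frames 2-4
```

The masked log-probabilities would then feed the chosen/rejected margins at the corresponding hierarchy level.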
📝 Citation

Please consider citing our paper if our code and benchmark are useful:

```bibtex
@article{huang2025vistadpo,
  title={VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models},
  author={Huang, Haojian and Chen, Haodong and Wu, Shengqiong and Luo, Meng and Fu, Jinlan and Du, Xinya and Zhang, Hanwang and Fei, Hao},
  journal={arXiv preprint arXiv:2504.13122},
  year={2025}
}
```
🍗 Acknowledgement
VistaDPO is built on the codebases of Video-LLaVA and PLLaVA; we thank the developers of both.
📪 Contact
For any questions, feel free to email haojianhuang927@gmail.com or haroldchen328@gmail.com.