VistaDPO
[ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
Haojian Huang<sup>1*</sup>, Haodong Chen<sup>2*</sup>, Shengqiong Wu<sup>3</sup>, Meng Luo<sup>3</sup>, Jinlan Fu<sup>3</sup>, Xinya Du<sup>4</sup>, Hanwang Zhang<sup>5</sup>, Hao Fei<sup>3†</sup> <br><br> <sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author <br> <sup>1</sup>HKU, <sup>2</sup>HKUST, <sup>3</sup>NUS, <sup>4</sup>UTD, <sup>5</sup>NTU
<h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest updates. </h5>

<a href='https://arxiv.org/abs/2504.13122'><img src='https://img.shields.io/badge/arXiv-2504.13122-b31b1b.svg'></a> <a href='https://huggingface.co/datasets/Harold328/VistaDPO-7K'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20VistaDPO7K-Dataset-blue'></a>
</div>

Abstract
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination.
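The three hierarchical levels can be pictured as a weighted sum of standard DPO logistic losses, one per level. The sketch below is illustrative only: the function names, the `beta` value, and the uniform level weights are our assumptions, not the paper's exact formulation.

```python
import math

def dpo_term(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO logistic loss on one chosen/rejected pair:
    # -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def hierarchical_dpo_loss(instance, temporal, perceptive, weights=(1.0, 1.0, 1.0)):
    # Each level contributes its own preference term over (logp_w, logp_l,
    # ref_logp_w, ref_logp_l) tuples; the weights here are hypothetical.
    terms = (dpo_term(*instance), dpo_term(*temporal), dpo_term(*perceptive))
    return sum(w * t for w, t in zip(weights, terms))
```

The loss decreases as the policy assigns a larger margin to the chosen response at every level of the hierarchy.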
<table class="center"> <tr> <td><img src="assets/vistadpo.png"></td> </tr> </table>

🔥 Update
- [2025.05.01]: VistaDPO has been accepted to ICML 2025!
- [2025.04.18]: Released the VistaDPO paper.
- [2025.04.03]: Initialized this GitHub repository and released the training & inference code of VistaDPO on Video-LLaVA.
🧰 TODO
- [x] Release Paper.
- [ ] Release VistaDPO-7K.
- [ ] Release VistaDPO model weights.
- [ ] Release code of VistaDPO on PLLaVA.
📝 Data
Training data
We use our proposed VistaDPO-7K for training, which can be found on HuggingFace. In this repo, we provide a sample subset for reference under data.
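For orientation, each record in such a dataset pairs a chosen and a rejected answer with spatial-temporal grounding metadata. The field names and values below are hypothetical, for illustration only; consult the HuggingFace dataset card for the actual schema.

```python
import json

# Hypothetical VistaDPO-7K-style record; field names are illustrative,
# not the dataset's actual schema.
record = json.loads("""
{
  "video": "videos/example.mp4",
  "question": "What is the man doing?",
  "chosen": "He is riding a bicycle along the beach.",
  "rejected": "He is swimming in a pool.",
  "timestamps": [2.0, 5.5],
  "keyframes": [48, 132],
  "bboxes": [[0.21, 0.35, 0.62, 0.88]]
}
""")

start, end = record["timestamps"]     # grounded event window in seconds
x1, y1, x2, y2 = record["bboxes"][0]  # normalized object box on a keyframe
```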
Evaluation data
The evaluation datasets used in our work are listed below:
- Video Hallucination: VideoHallucer, EventHallusion.
- Video QA: MSVD, MSR-VTT, TGIF, ActivityNet, MVBench.
- Video Captioning: VideoChatGPT Bench.
🚀 Install
- Clone this repository and navigate to the source folder

```shell
cd VistaDPO
```
- Build Environment

```shell
echo "Creating conda environment"
conda create -n VistaDPO python=3.10
conda activate VistaDPO

echo "Installing dependencies"
pip install -r requirements.txt
```
📍 Inference

```python
import os

import torch

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from inference.inference_utils import ModelInference, decode2frame

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

video_path = "./data/videos/_GTwKEPmB-U_5183.mp4"
# CACHE_DIR = "/data/VistaDPO/cache"
model_path = "./checkpoints/VistaDPO"

# Load the VistaDPO checkpoint and build the inference wrapper
model_name = get_model_name_from_path(model_path)
tokenizer, model, processor, context_len = load_pretrained_model(
    model_path, model_base=None, device=device, model_name=model_name
)
inference_model = ModelInference(
    model=model, tokenizer=tokenizer, processor=processor, context_len=context_len
)

question = "What is the evident theme in the video?"

# Option 1: our pipeline — decode the video into frames first
frame_dir, _ = os.path.splitext(video_path)
decode2frame(video_path, frame_dir, verbose=True)
response = inference_model.generate(
    question=question,
    modal_path=frame_dir,
    temperature=0,
)
print(response)

# Option 2: decode on the fly with the decord backend
response = inference_model.generate(
    question=question,
    modal_path=video_path,
    temperature=0,
    video_decode_backend="decord",
)
print(response)
```
🚩 Training

For VistaDPO training, complete the setup above and run:

```shell
bash dpo_scripts/train_dpo.sh
```
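Conceptually, the temporal- and perceptive-level terms score only the grounded portion of the input: frames inside the annotated event window, or tokens tied to a grounded object. A minimal sketch of such masking, with hypothetical names and values (not the repo's actual implementation):

```python
def masked_logprob(per_item_logps, mask):
    # Sum log-probabilities where the 0/1 mask is on, e.g. only the frames
    # inside a grounded event window or only object-related response tokens.
    return sum(lp for lp, m in zip(per_item_logps, mask) if m)

# Hypothetical example: six frame-level log-probs, event spans frames 2-4.
frame_logps = [-1.2, -0.8, -0.3, -0.2, -0.4, -1.0]
event_mask = [0, 0, 1, 1, 1, 0]
event_logp = masked_logprob(frame_logps, event_mask)  # sums only frames 2-4
```

The masked log-probabilities would then feed the chosen/rejected margins at the corresponding hierarchy level.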
📝 Citation

Please consider citing our paper if our code and benchmark are useful:

```bibtex
@article{huang2025vistadpo,
  title={VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models},
  author={Huang, Haojian and Chen, Haodong and Wu, Shengqiong and Luo, Meng and Fu, Jinlan and Du, Xinya and Zhang, Hanwang and Fei, Hao},
  journal={arXiv preprint arXiv:2504.13122},
  year={2025}
}
```
🍗 Acknowledgement
VistaDPO is built on the codebases of Video-LLaVA and PLLaVA; we thank the developers of both.
📪 Contact
For any questions, feel free to email haojianhuang927@gmail.com or haroldchen328@gmail.com.