# TimeSuite

**[ICLR 2025]** TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, and Limin Wang
## :parrot: Introduction
This paper proposes TimeSuite, a collection of new designs that adapt existing short-form video MLLMs to long video understanding: a simple yet efficient framework for processing long video sequences, a high-quality video dataset for grounded tuning of MLLMs, and a carefully designed instruction-tuning task that explicitly incorporates grounding supervision into the traditional QA format.
- **State-of-the-art performance:** VideoChat-T delivers strong results on both long-form video question answering and temporal grounding.
- **Highly efficient architecture:** each video frame is encoded into just 3 tokens, giving exceptional inference speed; the FLOPs of VideoChat-T are only 5.1% of those of LLaVA-OneVision.
- **High-quality data:**
  - We introduce TimePro, a comprehensive dataset covering 9 task types with videos sourced from 15 different datasets.
  - We design a novel Temporal Grounded Caption fine-tuning task that effectively mitigates hallucination in MLLMs.
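The 3-tokens-per-frame figure makes the token budget easy to estimate. The sketch below does the back-of-envelope arithmetic; the frame count is illustrative, not a setting taken from the paper.

```shell
# Back-of-envelope video token budget at 3 tokens per frame.
# FRAMES=128 is an illustrative value, not a configuration from TimeSuite.
FRAMES=128
TOKENS_PER_FRAME=3
echo "$((FRAMES * TOKENS_PER_FRAME)) video tokens for ${FRAMES} frames"
```

Even a few hundred frames therefore stay well within a typical LLM context window.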

## :fire: Updates
- 2025.02.12 TimeSuite is open-sourced. We welcome everyone to try it out!
- 2025.01.23 TimeSuite has been accepted by ICLR 2025.
- 2024.10.25 The paper of TimeSuite has been uploaded to arXiv.
## Preparation

- Create a new environment and install the necessary dependencies:

```shell
conda create --name TimeSuite
conda activate TimeSuite
pip install -r requirements.txt
```

- Download the model and code of TimeSuite from https://huggingface.co/Lanxingxuan/TimeSuite to the `./download` folder. (Please note that you additionally need to download Mistral-7B-Instruct-v0.2 to `./download/parameters`.)
- Search for all instances of `/path_to_the_timesuite_root_folder` and replace them with the directory of the TimeSuite root folder.
- Search for all video dataset paths containing `s3://` and replace them with the corresponding video dataset paths on your server.
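One way to perform the placeholder replacement in bulk is with `grep` and `sed`. This is a sketch assuming GNU sed on Linux and that you run it from the repo root; it is not a script shipped with TimeSuite, and the file-extension filter is an assumption.

```shell
# Bulk-replace the placeholder path with the actual repo root.
# Assumes GNU sed (-i) and that the current directory is the TimeSuite root.
ROOT="$(pwd)"
grep -rl --include='*.py' --include='*.sh' '/path_to_the_timesuite_root_folder' . \
  | while read -r f; do
      sed -i "s|/path_to_the_timesuite_root_folder|${ROOT}|g" "$f"
    done
```

Using `|` as the `sed` delimiter avoids having to escape the slashes in the paths. The same pattern can be reused for the `s3://` dataset paths.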
## Inference & Demo

- Run `demo/demo.ipynb` to see the demo provided in the paper, or try out videos and questions of your choice.
- Run `eval/eval_qa_tasks.ipynb` to test the general QA performance of the model.
- To test the temporal grounding capability of TimeSuite, follow these two steps:

```shell
bash eval/test_grounding.sh
bash eval/get_grounding_result.sh
```
## Grounded Tuning

- Configure the video dataset paths in `configs/instruction_data.py`.
- Modify `scripts/videochat_mistral/config_LinearP.py` and `scripts/videochat_mistral/config_LinearProAda.py` to adjust the model training parameters.
- Run `bash scripts/videochat_mistral/run_7b_stage4.sh` to initiate fine-tuning of the model.
- To reproduce the fine-tuning results presented in the paper, you need to train the model in a two-stage manner. For detailed parameter settings, please refer to Appendix D of the paper.
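The two-stage schedule might be driven roughly as below. The config file names come from this repo, but the idea of invoking the same launcher once per config is an assumption based on this README; the authoritative settings are in Appendix D of the paper.

```shell
# Hedged sketch of a two-stage fine-tuning driver (assumed workflow,
# not an official TimeSuite script; see Appendix D for the real settings).
set -e
for CFG in config_LinearP.py config_LinearProAda.py; do
    echo "Fine-tuning stage with scripts/videochat_mistral/${CFG}"
    # bash scripts/videochat_mistral/run_7b_stage4.sh   # run on a real setup
done
```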
## TimePro Dataset

### Annotations
- All data used for fine-tuning is now open-sourced. Please visit https://huggingface.co/Lanxingxuan/TimeSuite/tree/main/datasets/TimePro to download.
### Videos

#### TimePro
- DiDeMo: https://github.com/LisaAnne/LocalizingMoments?tab=readme-ov-file#dataset
- QuerYD: https://www.robots.ox.ac.uk/~vgg/data/queryd/
- HiREST: https://github.com/j-min/HiREST
- ActivityNet: http://activity-net.org/download.html
- ViTT: https://github.com/google-research-datasets/Video-Timeline-Tags-ViTT
- YouCook2: http://youcook2.eecs.umich.edu/download
- TVSum: https://github.com/yalesong/tvsum
- SumMe: http://classif.ai/dataset/ethz-cvl-video-summe/
- COIN: https://github.com/coin-dataset/annotations
- YT-Temporal: https://rowanzellers.com/merlot/#data
- Internvid: https://github.com/OpenGVLab/InternVideo/blob/main/Data/InternVid/README_CN.md
- HowTo100M(CosMo): https://www.di.ens.fr/willow/research/howto100m/
#### Normal
- VideoChatGPT: https://github.com/mbzuai-oryx/Video-ChatGPT/tree/main/data
- VideoChat: https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data
- EgoQA: https://ego4d-data.org/
- STAR: https://bobbywu.com/STAR/
- MovieChat: https://huggingface.co/datasets/Enxin/MovieChat-1K_train
#### FT
- Charades-STA: https://github.com/jiyanggao/TALL#charades-sta-anno-download
- QVHighlight: https://github.com/jayleicn/moment_detr/blob/main/data/README.md
## :page_facing_up: Citation

If you find this project useful in your research, please consider citing:
```bibtex
@misc{zeng2024timesuite,
      title={TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning},
      author={Xiangyu Zeng and Kunchang Li and Chenting Wang and Xinhao Li and Tianxiang Jiang and Ziang Yan and Songze Li and Yansong Shi and Zhengrong Yue and Yi Wang and Yali Wang and Yu Qiao and Limin Wang},
      year={2024},
      eprint={2410.19702},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.19702},
}
```
## :dizzy: Acknowledgement

Thanks to the following open-source projects: