# UniTime
This repository provides the official PyTorch implementation of "Universal Video Temporal Grounding with Generative Multi-modal Large Language Models" (NeurIPS 2025).
🌐 Project Page · 📄 Paper · 🤗 Model
<div align="center"> <img src="./assets/teaser.png"> </div>

## 🔥 News
- [2025.10] Released the code for data construction, training, and evaluation.
- [2025.09] UniTime accepted to NeurIPS 2025!
- [2025.06] Released the inference code.
- [2025.06] Preprint available on arXiv.
## ⚙️ Installation

```shell
conda create -n UniTime python=3.10
conda activate UniTime
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
## 🚀 Quick Start

1. **Download Model Checkpoints**
   - Obtain the pretrained checkpoints from Qwen2-VL-7B and UniTime.
   - Set `model_local_path` to your local path for Qwen2-VL-7B, and `model_finetune_path` to your UniTime checkpoint.

2. **Prepare Input Data**
   - Create a JSON file for inference as `data/test.json`, and specify its path via the `data_path` argument.

3. **Run Inference**
   - Execute the following command to perform inference. The output results will be saved in the `results/` directory.

   ```shell
   export CUDA_VISIBLE_DEVICES=0
   python inference.py --model_local_path path_to_qwen2vl7B \
       --model_finetune_path ckpt/unitime \
       --data_path data/test.json \
       --output_dir ./results/test \
       --nf_short 128
   ```
## Data Preparation

1. Download the video and annotation files for each dataset from the corresponding source links.

2. Create the input file following the format below:

   ```json
   [
       {
           "qid": 0,
           "id": "3MSZA",
           "annos": [
               {
                   "query": "person turn a light on.",
                   "window": [[24.3, 30.4]]
               }
           ],
           "duration": 30.96,
           "video_path": "./videos/3MSZA.mp4",
           "mode": "mr"
       }
   ]
   ```

   Example construction code for Ego4D-NLQ can be found in `datasets/data_ego4d.py` (see the `load_data_to_dict()` function). Modify it as needed for other datasets.

3. (Optional) You may also download preprocessed annotations for each dataset from UniTime-Data.
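As a minimal sketch (not part of the repository), the following snippet writes a single-entry inference file in the format above; the field values simply mirror the `3MSZA` example:

```python
import json
import os

# One annotation entry in the inference format shown above
# (values mirror the "3MSZA" sample).
entry = {
    "qid": 0,
    "id": "3MSZA",
    "annos": [
        {"query": "person turn a light on.", "window": [[24.3, 30.4]]}
    ],
    "duration": 30.96,
    "video_path": "./videos/3MSZA.mp4",
    "mode": "mr",  # moment retrieval
}

# The inference script expects a JSON list of such entries.
os.makedirs("data", exist_ok=True)
with open("data/test.json", "w") as f:
    json.dump([entry], f, indent=4)
```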
## Training and Evaluation

Execute the following commands in sequence:

```shell
# Feature Extraction
bash scripts/feature.sh

# Training
bash scripts/train.sh

# Evaluation
bash scripts/eval.sh

# Metrics
python eval_metrics.py --res ./results/RUN_NAME/results.json
```
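For reference, moment-retrieval accuracy is commonly reported as Recall@1 at a temporal IoU threshold. The snippet below is a hedged sketch of that standard metric, not the repository's `eval_metrics.py`, whose exact computation may differ:

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] windows in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thresh=0.5):
    """Fraction of queries whose top-1 prediction overlaps GT above thresh."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(preds)

# Two queries: one near-correct prediction, one complete miss.
preds = [[24.0, 31.0], [5.0, 9.0]]
gts = [[24.3, 30.4], [50.0, 60.0]]
print(recall_at_iou(preds, gts, 0.5))  # prints 0.5
```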
Note: Modify the arguments marked with `ToModify` in the code according to the following definitions:
| Argument | Description |
|-|-|
| `path_to_qwen2vl7B` | Path to the Qwen2-VL-7B model directory |
| `path_to_feature_root` | Root directory containing features for all datasets |
| `path_to_video_root` | Root directory containing all video files |
| `path_to_train_data` | Path to the training set annotation file generated by `datasets/data_ego4d.py` |
| `path_to_val_data` | Path to the validation set annotation file generated by `datasets/data_ego4d.py` |
| `path_to_test_data` | Path to the test set annotation file generated by `datasets/data_ego4d.py` |
| `path_to_feature_folder` | Subfolder under `path_to_feature_root` for a specific dataset |
| `RUN_NAME` | Experiment identifier/name for this training run |
## Citation

If you use this code and data for your research or project, please cite:

```bibtex
@inproceedings{unitime2025,
    title={Universal Video Temporal Grounding with Generative Multi-modal Large Language Models},
    author={Li, Zeqian and Di, Shangzhe and Zhai, Zhonghua and Huang, Weilin and Wang, Yanfeng and Xie, Weidi},
    booktitle={NeurIPS},
    year={2025}
}
```
## Acknowledgements

This project builds upon several excellent open-source efforts.
## Contact
For questions, please contact: lzq0103@sjtu.edu.cn.