UniTime

This repository provides the official PyTorch implementation of "Universal Video Temporal Grounding with Generative Multi-modal Large Language Models" (NeurIPS 2025).

🌐 Project Page · 📄 Paper · 🤗 Model

<div align="center"> <img src="./assets/teaser.png"> </div>

🔥 News

  • [2025.10] Released the code for data construction, training, and evaluation.
  • [2025.09] UniTime accepted to NeurIPS 2025!
  • [2025.06] Released the inference code.
  • [2025.06] Preprint available on arXiv.

⚙️ Installation

conda create -n UniTime python=3.10
conda activate UniTime
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
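After installing, a quick sanity check can confirm the pinned packages resolved in the new environment. This snippet is not part of the repository; it is a minimal stdlib-only sketch:

```python
import importlib.util

def check_install(packages=("torch", "torchvision", "torchaudio")):
    """Report which of the pinned packages are importable in this env."""
    return {p: importlib.util.find_spec(p) is not None for p in packages}

if __name__ == "__main__":
    for pkg, ok in check_install().items():
        print(f"{pkg}: {'ok' if ok else 'MISSING'}")
```

If any package reports `MISSING`, re-run the `pip install` commands above inside the activated `UniTime` environment.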

🚀 Quick Start

  1. Download Model Checkpoints

    • Obtain the pretrained checkpoints from Qwen2-VL-7B and UniTime.
    • Set model_local_path to your local path for Qwen2-VL-7B, and model_finetune_path to your UniTime checkpoint.
  2. Prepare Input Data

    • Create a JSON file for inference as data/test.json, and specify its path via the data_path argument.
  3. Run Inference

    • Execute the following command to perform inference. Results are saved in the results/ directory.
    export CUDA_VISIBLE_DEVICES=0
    python inference.py --model_local_path path_to_qwen2vl7B \
         --model_finetune_path ckpt/unitime \
         --data_path data/test.json \
         --output_dir ./results/test \
         --nf_short 128
    

Data Preparation

  1. Download the video and annotation files for each dataset from the corresponding source links.

  2. Create the input file following the format below:

    [
        {
            "qid": 0, 
            "id": "3MSZA", 
            "annos": [
                {
                    "query": "person turn a light on.",
                    "window": [[24.3, 30.4]]
                }
            ],
            "duration": 30.96,
            "video_path": "./videos/3MSZA.mp4",
            "mode": "mr",
        }
    ]
    

    Example construction code for Ego4D-NLQ can be found in datasets/data_ego4d.py (see load_data_to_dict() function). Modify it as needed for other datasets.

  3. (Optional) You may also download preprocessed annotations for each dataset from UniTime-Data.
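A short script can emit records in the annotation format shown above. The sketch below is illustrative, in the spirit of load_data_to_dict(); the field names come from the example record, while the helper itself and its defaults are assumptions to adapt per dataset:

```python
import json

def build_entry(qid, video_id, query, window, duration,
                video_root="./videos", mode="mr"):
    """Assemble one inference record in the format shown above.

    `window` is a [start, end] pair in seconds; it is wrapped in an extra
    list because a query may have multiple ground-truth windows.
    """
    return {
        "qid": qid,
        "id": video_id,
        "annos": [{"query": query, "window": [list(window)]}],
        "duration": duration,
        "video_path": f"{video_root}/{video_id}.mp4",
        "mode": mode,
    }

if __name__ == "__main__":
    entry = build_entry(0, "3MSZA", "person turn a light on.",
                        (24.3, 30.4), 30.96)
    print(json.dumps([entry], indent=4))  # save this as data/test.json
```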

Training and Evaluation

Execute the following commands in sequence:

# Feature Extraction
bash scripts/feature.sh

# Training
bash scripts/train.sh

# Evaluation
bash scripts/eval.sh

# Metrics
python eval_metrics.py --res ./results/RUN_NAME/results.json

Note: Modify the arguments marked with ToModify in the code according to the following definitions:

| Argument | Description |
|----------|-------------|
| path_to_qwen2vl7B | Path to the Qwen2-VL-7B model directory |
| path_to_feature_root | Root directory containing features for all datasets |
| path_to_video_root | Root directory containing all video files |
| path_to_train_data | Path to the training-set annotation file generated by datasets/data_ego4d.py |
| path_to_val_data | Path to the validation-set annotation file generated by datasets/data_ego4d.py |
| path_to_test_data | Path to the test-set annotation file generated by datasets/data_ego4d.py |
| path_to_feature_folder | Subfolder under path_to_feature_root for a specific dataset |
| RUN_NAME | Experiment identifier/name for this training run |
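For reference, moment-retrieval results are conventionally scored with Recall@1 at temporal-IoU thresholds. The sketch below shows that standard metric; it is not the exact logic of eval_metrics.py:

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] windows in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= threshold
        for p, g in zip(predictions, ground_truths)
    )
    return hits / len(predictions)
```

Metrics are typically reported at several thresholds (e.g. IoU = 0.3, 0.5, 0.7), so the same predictions are passed through `recall_at_iou` once per threshold.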

Citation

If you use this code and data for your research or project, please cite:

@inproceedings{unitime2025,
    title={Universal Video Temporal Grounding with Generative Multi-modal Large Language Models},
    author={Li, Zeqian and Di, Shangzhe and Zhai, Zhonghua and Huang, Weilin and Wang, Yanfeng and Xie, Weidi},
    booktitle={NeurIPS},
    year={2025}
}

Acknowledgements

This project builds upon several excellent open-source efforts.

Contact

For questions, please contact: lzq0103@sjtu.edu.cn.
