
TempCompass

[ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou

Install / Use

/learn @llyx97/TempCompass
About this skill

Quality Score: 0/100
Supported Platforms: Universal

README

TempCompass: A benchmark to evaluate the temporal perception ability of Video LLMs

Paper: https://arxiv.org/abs/2403.00476 · Project Page: https://llyx97.github.io/tempcompass/ · 🤗 Leaderboard: https://huggingface.co/spaces/lyx97/TempCompass · 🤗 Dataset: https://huggingface.co/datasets/lmms-lab/TempCompass

Yuanxin Liu¹*, Shicheng Li¹*, Yi Liu¹, Yuxiang Wang¹, Shuhuai Ren¹, Lei Li², Sishuo Chen¹, Xu Sun¹, Lu Hou³

¹Peking University, ²The University of Hong Kong, ³Huawei Noah's Ark Lab (*Equal Contribution)

📢 News

[2024-10-30] 🎉🎉🎉 TempCompass is integrated into VLMEvalKit.

[2024-08-30] Results of Qwen2-VL, GPT-4o, MiniCPM-V-2.6, InternVL-2-8B, LLaVA-OneVision-Qwen-2-7B and InternLM-XComposer-2.5 are added to the leaderboard. GPT-4o establishes the new SoTA!

[2024-08-08] Results of LLaVA-Next-Video, VILA-1.5 and LongVA are added to the leaderboard.

[2024-07] 🎉🎉🎉 TempCompass is integrated into LMMs-Eval. See here for usage examples.

[2024-06-11] Result of Reka-core is added to the leaderboard.

[2024-05-25] TempCompass Leaderboard is available on HuggingFace Space 🤗.

[2024-05-16] 🎊🎊🎊 TempCompass is accepted at ACL 2024 Findings!

[2024-04-14] Evaluation result of Gemini-1.5-pro, the current SOTA Video LLM, is added.

[2024-03-23] The answer prompt is improved to better guide Video LLMs to follow the desired answer formats. The evaluation code now provides an option to disable the use of ChatGPT.

[2024-03-12] 🔥🔥🔥 The evaluation code is released now! Feel free to evaluate your own Video LLMs.

🏆 LeaderBoard

The up-to-date leaderboard is hosted on the Hugging Face Space linked above.

✨ Highlights

Diverse Temporal Aspects and Task Formats

  • TempCompass encompasses a diverse set of temporal aspects and task formats to comprehensively evaluate the temporal perception capability of Video LLMs.

Conflicting Videos

  • We construct conflicting videos to prevent the models from taking advantage of single-frame bias and language priors.

  • 🤔 Can your Video LLM correctly answer the following question for both videos?

    [Raw video (left) vs. conflicting video with reversed playback (right)]

    What is happening in the video?
    A. A person drops down the pineapple
    B. A person pushes forward the pineapple
    C. A person rotates the pineapple
    D. A person picks up the pineapple

🚀 Quick Start

To begin with, clone this repository and install some packages:

git clone https://github.com/llyx97/TempCompass.git
cd TempCompass
pip install -r requirements.txt

Data Preparation

1. Task Instructions

The task instructions can be found in questions/.

Task Instruction Generation Procedure:
  1. Generate Multi-Choice QA instructions (question_gen.py).

  2. Manually validate quality and rectify.

  3. Generate task instructions for Yes/No QA (question_gen_yes_no.py), Caption Matching (question_gen_caption_match.py) and Caption Generation (question_gen_captioning.py), based on manually rectified Multi-Choice QA instructions.

  4. Manually validate quality and rectify.

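For a quick look at what the benchmark covers, you can load one of the instruction files and count instructions per temporal aspect. A minimal sketch is below; the file name (multi-choice.json) and the {video_id: {aspect: [...]}} layout are assumptions for illustration, so check the actual files under questions/ after cloning:

# Minimal sketch for inspecting the task instructions.
# Assumption: a questions/multi-choice.json file organized as
# {video_id: {temporal_aspect: [instruction, ...]}}; verify against the repo.
import json
from collections import Counter

with open("questions/multi-choice.json", "r") as f:
    instructions = json.load(f)

aspect_counts = Counter()
for video_id, aspects in instructions.items():
    for aspect, items in aspects.items():
        aspect_counts[aspect] += len(items)

print(f"{len(instructions)} videos")
for aspect, num in aspect_counts.most_common():
    print(f"{aspect}: {num} instructions")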

2. Videos

All the processed videos can be downloaded from Google Drive or Hugging Face.

As an alternative, you can also download the raw videos and process them yourself:

Run the following commands. The videos will be saved to videos/.

cd utils
python download_video.py    # Download raw videos
python process_videos.py    # Construct conflicting videos

Note: If you encounter a MoviePy error when running the processing script, please refer to the corresponding issue in the GitHub repository.

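To make the idea of a conflicting video concrete, the sketch below reverses a clip with MoviePy so that, for example, picking up an object becomes dropping it. The file names are placeholders and the repo's utils/process_videos.py is the authoritative implementation (it covers more than simple reversal), so treat this purely as an illustration:

# Illustration only: build a "conflicting" counterpart by reversing a clip.
# File names are placeholders; see utils/process_videos.py for the real logic.
from moviepy.editor import VideoFileClip, vfx

clip = VideoFileClip("videos/1021488277.mp4")
reversed_clip = clip.fx(vfx.time_mirror)   # play the frames backwards
reversed_clip.write_videofile("videos/1021488277_reverse.mp4", audio=False)
clip.close()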

Run Inference

We use Video-LLaVA and Gemini as examples to illustrate how to conduct MLLM inference on our benchmark.

1. Video-LLaVA

Enter run_video_llava and install the environment as instructed.

Then run the following commands. The prediction results will be saved to predictions/video-llava/<task_type>.

# select <task_type> from multi-choice, yes_no, caption_matching, captioning
python inference_dataset.py --task_type <task_type>

2. Gemini

The inference script for gemini-1.5-pro is run_gemini.ipynb. It is recommended to run the script in Google Colab.
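
If your Video LLM is not covered by the provided scripts, the general pattern is the same: read the task instructions, prompt the model with each video and question, and save the answers under predictions/<model_name>/<task_type>.json so the evaluation scripts can consume them. The sketch below assumes the same {video_id: {aspect: [...]}} instruction layout as above and a placeholder generate(video_path, prompt) function; mirror the exact prediction format that inference_dataset.py produces:

# Hedged sketch of a custom inference loop. The instruction layout and the
# prediction format are assumptions -- match what inference_dataset.py writes.
import json
import os

def generate(video_path, prompt):
    # Placeholder: call your own Video LLM here and return its text answer.
    raise NotImplementedError

task_type = "multi-choice"
with open(f"questions/{task_type}.json", "r") as f:
    instructions = json.load(f)

predictions = {}
for video_id, aspects in instructions.items():
    video_path = f"videos/{video_id}.mp4"
    predictions[video_id] = {
        aspect: [
            {"question": item["question"],
             "prediction": generate(video_path, item["question"])}
            for item in items
        ]
        for aspect, items in aspects.items()
    }

os.makedirs("predictions/my-model", exist_ok=True)
with open(f"predictions/my-model/{task_type}.json", "w") as f:
    json.dump(predictions, f, indent=2)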

Run Evaluation

After obtaining the MLLM predictions, run the following commands to conduct automatic evaluation. Remember to set your own $OPENAI_API_KEY in utils/eval_utils.py.

  • Multi-Choice QA: python eval_multi_choice.py --video_llm video-llava

  • Yes/No QA: python eval_yes_no.py --video_llm video-llava

  • Caption Matching: python eval_caption_matching.py --video_llm video-llava

  • Caption Generation: python eval_captioning.py --video_llm video-llava

Tip👉: Except for Caption Generation, you can set --disable_llm when running the scripts, which disables ChatGPT-based evaluation (i.e., it relies entirely on rule-based evaluation). This is useful when you do not want to use the ChatGPT API and your MLLM follows the instructions well enough to produce answers in the required format.
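
To make the trade-off concrete: rule-based evaluation simply tries to extract the chosen option directly from the model's reply and compares it with the ground truth, deferring to ChatGPT only when no option can be extracted. The snippet below is a rough illustration of that idea, not the actual logic in utils/eval_utils.py:

# Rough illustration of rule-based answer matching for Multi-Choice QA.
# Not the repo's implementation; see utils/eval_utils.py for the real rules.
import re

def rule_based_match(prediction, ground_truth_letter):
    """Return True/False if an option letter can be extracted, else None."""
    match = re.search(r"\b([A-D])\b", prediction.strip())
    if match is None:
        return None    # no clear option letter -> defer to ChatGPT-based judging
    return match.group(1) == ground_truth_letter

print(rule_based_match("D. A person picks up the pineapple", "D"))   # True
print(rule_based_match("The person lifts the fruit upward.", "D"))   # None

Caption Generation produces free-form text with no fixed options to match against, which is why it always goes through the LLM-based evaluation.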

The results of each data point will be saved to auto_eval_results/video-llava/<task_type>.json and the overall results on each temporal aspect will be printed out as follows:

{'action': 76.0, 'direction': 35.2, 'speed': 35.6, 'order': 37.7, 'attribute_change': 41.0, 'avg': 45.6}
{'fine-grained action': 58.8, 'coarse-grained action': 90.3, 'object motion': 36.2, 'camera motion': 32.6, 'absolute speed': 47.6, 'relative speed': 28.0, 'order': 37.7, 'color & light change': 43.6, 'size & shape change': 39.4, 'combined change': 41.7, 'other change': 38.9}
Match Success Rate=100.0

LMMs-Eval Evaluation

Here we provide an example of how to evaluate LLaVA-Next-Video on TempCompass, using lmms-eval.

1. Clone the repo from LLaVA-NeXT and set up the environment

git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .

2. Run inference and evaluation in a single command

accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model llavavid \
    --model_args pretrained=lmms-lab/LLaVA-NeXT-Video-32B-Qwen,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \
    --tasks tempcompass \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_vid_32B \
    --output_path ./logs/

You can also evaluate the performance on each task (e.g., multi-choice) separately:

accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model llavavid \
    --model_args pretrained=lmms-lab/LLaVA-NeXT-Video-32B-Qwen,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \
    --tasks tempcompass_multi_choice \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_vid_32B \
    --output_path ./logs/

GitHub Stars: 131
Forks: 4
Category: Content
Languages: Python
Updated: 8d ago
Security Score: 85/100 (audited on Mar 26, 2026, no findings)