
SVBench

[ICLR'2025 Spotlight] Official repository for "SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding"


<div align="center">

SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

[🚀 Project Homepage] [📖 Paper] [🤗 HF Dataset] [🤖 HF Model] [🏆 Leaderboard] [🥇 Leaderboard Submission] [📚 中文解读]

</div>

This is the code repository of the paper "SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding", which aims to provide a comprehensive overview of the SVBench dataset.

News 🚀🚀🚀

  • 2025.11.24: 🔥 We have open-sourced the test set of SVBench! You can now evaluate your own models immediately!
  • 2025.04.01: 🔥 We have open-sourced StreamingChat! Start using it for streaming inference right away!
  • 2025.03.23: 🔥 Exciting news! You can now check out the real-time leaderboard and submit your LVLMs results in Google Form. We can't wait to see your amazing work!
  • 2025.03.16: 🔥 SVBench is now released! Check out the paper for detailed insights, and access the dataset on HuggingFace.

Overview

Illustration of temporal multi-turn dialogues. A temporal dialogue path represents a conversation within a video progressing over time. Our SVBench evaluates the capabilities of LVLMs in long-context streaming video understanding by constructing temporal dialogue paths to assess 9 critical skills.

[Figure: overview]

Abstract

Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in suitable evaluation regarding their applicability in the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain temporal reasoning throughout the entire duration of video streams. To address these limitations, we introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains specifically designed to thoroughly assess the capabilities of streaming video understanding of current LVLMs. We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs of 1,353 streaming videos, which includes generating QA chains that represent a series of consecutive multi-turn dialogues over video segments and constructing temporal linkages between successive QA chains. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding. We also construct a StreamingChat model, which significantly outperforms open-source LVLMs on our SVBench and achieves comparable performance on diverse vision-language benchmarks. We expect SVBench to advance the research of streaming video understanding by providing a comprehensive and in-depth analysis of current LVLMs. Our benchmark and model can be accessed at https://yzy-bupt.github.io/SVBench.

Getting Started

If you just want to use the SVBench dataset for training or evaluating your streaming model, feel free to directly visit [HF Dataset]. If you need to annotate your own streaming video dataset according to the SVBench pipeline, please continue reading the following content.

Installation

  1. Clone the repository (or click "Download file" to get an archive)
  2. Install Python dependencies
conda create -n SVBench -y python=3.8.18
conda activate SVBench
conda install -y -c pytorch pytorch=1.11.0 torchvision=0.12.0
pip install opencv-python==4.10.0.84

Data Preparation

  1. Directly download our filtered SVBench dataset (recommended):

(1) Download SVBench dataset from Hugging Face:

git clone https://huggingface.co/yzy666/SVBench

(2) Navigate to the dataset directory:

cd SVBench

(3) Concatenate the split files:

Use the cat command to concatenate all the split files into a single file. Assuming your split files are named from allVideos.part_aa to allVideos.part_ch, you can use the following command:

cat allVideos_tar_sep/allVideos.part_* > allVideo.tar.gz

(4) Verify the integrity of the file (optional):

Use the md5sum command to compute the checksum of the concatenated file and compare it with the provided checksum f5d08deb0d516c23caf8f1f6f0cda7d3:

md5sum allVideo.tar.gz

The output should look like this:

f5d08deb0d516c23caf8f1f6f0cda7d3  allVideo.tar.gz

If the checksum matches f5d08deb0d516c23caf8f1f6f0cda7d3, the file is intact and correct.

(5) Extract the concatenated file:

Use the tar command to extract the contents of allVideo.tar.gz:

tar -xzvf allVideo.tar.gz

After completing these steps, you should see the extracted video files in the current directory.
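Steps (3)–(5) can also be scripted in Python, which is convenient when automating the download on a cluster. The sketch below is a minimal, hedged equivalent of the `cat` / `md5sum` / `tar` commands above; the part-file pattern and paths mirror this README, and the expected checksum is the one given above.

```python
import hashlib
import tarfile
from pathlib import Path

# Checksum stated in this README for allVideo.tar.gz
EXPECTED_MD5 = "f5d08deb0d516c23caf8f1f6f0cda7d3"

def concatenate_parts(part_dir: str, pattern: str, out_path: str) -> None:
    """Concatenate split archive parts (sorted by name, like the shell
    glob in `cat allVideos_tar_sep/allVideos.part_*`) into one file."""
    parts = sorted(Path(part_dir).glob(pattern))
    with open(out_path, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through MD5 in chunks so large archives
    never have to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def extract(archive: str, dest: str = ".") -> None:
    """Equivalent of `tar -xzvf archive` into dest."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

Typical use: `concatenate_parts("allVideos_tar_sep", "allVideos.part_*", "allVideo.tar.gz")`, then compare `md5sum("allVideo.tar.gz")` against `EXPECTED_MD5` before calling `extract`.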

  2. Alternatively, download the source datasets from their official websites:

YT-Temporal-1B

Download the YT-Temporal-1B dataset following the instructions in the official web.

YouCook2

Download the YouCook2 dataset following the instructions in the official web.

MovieChat

Download the MovieChat dataset following the instructions in the official web.

Panda-70M

Download the Panda-70M dataset following the instructions in the official web.

Ego4D

Download the Ego4D dataset following the instructions in the official web.

ActivityNet

Download the ActivityNet dataset following the instructions in the official web.

SVBench

This section provides instructions for reproducing the annotation and evaluation of SVBench.

[Figure: framework]

1. Data Filtering

Run the following commands to obtain filtered videos.

Firstly, install Open-Sora and prepare a raw video dataset. A meta file with the dataset information is needed for data processing. To create a meta file from a folder, run (from the Data_Filtering/Open-Sora-main directory):

python -m tools.datasets.convert video /path_to_your_video_folder --output /path_to_save_your_meta.csv

Then run the following commands (also from Data_Filtering/Open-Sora-main) to get aesthetic scores and optical flow scores for your videos. Make sure the meta file has a 'path' column.

torchrun --standalone --nproc_per_node 8 -m tools.scoring.aesthetic.inference /path_to_your_meta.csv --bs 1024 --num_workers 16
torchrun --standalone --nproc_per_node 8 tools/scoring/optical_flow/inference.py /path_to_your_meta.csv

Using this information, you can filter the videos: retain only those containing 5 to 15 scenes, with an aesthetic score of 4 or above, and with optical flow scores within the range of 0.5 to 100.
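The filtering criteria above can be sketched as a small predicate. The field names (`num_scenes`, `aes`, `flow`) are hypothetical stand-ins for the columns merged from the scoring CSVs; only the thresholds come from this README.

```python
def keep_video(num_scenes: int, aesthetic: float, flow: float) -> bool:
    """Filtering criteria described above: 5-15 scenes,
    aesthetic score >= 4, optical flow score in [0.5, 100]."""
    return (5 <= num_scenes <= 15
            and aesthetic >= 4.0
            and 0.5 <= flow <= 100.0)

def filter_meta(rows):
    """rows: iterable of dicts with (hypothetical) keys
    'num_scenes', 'aes', 'flow' merged from the scoring outputs."""
    return [r for r in rows if keep_video(r["num_scenes"], r["aes"], r["flow"])]
```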

2. Scene Detection and Video Splitting

First you should have a meta file with column 'path' for the videos. Then run the following command:

python Data_Filtering/Open-Sora-main/tools/scene_cut/scene_detect.py --output /path_to_meta.csv

The output is {prefix}_timestamp.csv with column timestamp. Each cell in column timestamp is a list of tuples, with each tuple indicating the start and end timestamp of a scene (e.g., [('00:00:01.234', '00:00:02.345'), ('00:00:03.456', '00:00:04.567')]).
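Since each cell stores its scene list as a Python-literal string, downstream code has to parse it back. A minimal sketch, assuming exactly the cell format shown above ('HH:MM:SS.mmm' timestamp pairs):

```python
import ast

def parse_timestamp_cell(cell: str):
    """Parse one cell of the 'timestamp' column, e.g.
    "[('00:00:01.234', '00:00:02.345')]" -> [(1.234, 2.345)],
    converting each 'HH:MM:SS.mmm' string to seconds."""
    def to_seconds(ts: str) -> float:
        h, m, s = ts.split(":")
        return int(h) * 3600 + int(m) * 60 + float(s)
    # ast.literal_eval safely evaluates the list-of-tuples literal
    return [(to_seconds(a), to_seconds(b)) for a, b in ast.literal_eval(cell)]
```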

3. Video Frame Extracting

Extract one frame per second from each video by running the following command:

python extract_video_frame/extract_video_frame_1s.py --data_dir allVideo --output_dir allVideo_frame
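Assuming the script's `_1s` suffix means one frame per second, the index arithmetic could look like the sketch below (the actual decoding would go through `cv2.VideoCapture`; this helper only computes which frame indices to keep, and its name is ours, not the repository's).

```python
def one_frame_per_second_indices(fps: float, total_frames: int):
    """Frame indices to keep when sampling one frame per second
    from a video with the given fps and frame count."""
    seconds = int(total_frames / fps)  # whole seconds of video
    # Nearest frame to each 1-second mark, clamped to the last frame
    return [min(round(t * fps), total_frames - 1) for t in range(seconds)]
```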

4. Constructing QA Chains for Video Dialogues

Run the following command to construct QA chains for video dialogues.

'video_meta_with_timestamp' is the path to the meta file containing video paths and timestamps.

'video_frame_folder' is the path to the folder containing all extracted video frames.

'output_folder' is the path to the folder where the raw (not yet formatted) QA chains are saved.

python construct_data/construct_QA_chain.py --video_meta_with_timestamp /path_to_your_meta.csv --video_frame_folder /path_to_your_all_videos_frame_folder --output_folder /path_to_your_output_folder

Then run the following command to fix the format of the QA chains generated in the first step.

'not_processed_QA_chain_folder' is the path to the folder containing the unformatted QA chains.

'output_folder' is the path to the folder where the correctly formatted QA chains are saved.

python construct_data/process_QA_chain_format.py --not_processed_QA_chain_folder /path_to_your_not_processed_QA_chain_folder --output_folder /path_to_your_output_folder
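The repository does not document the target schema, but the reformatting step is essentially a normalizer. As a purely illustrative sketch, suppose the generator emits loose "Q:" / "A:" lines; converting them into structured turns might look like this (the schema and parsing rules here are our assumptions, not the repository's):

```python
def normalize_qa_chain(raw_text: str):
    """Hypothetical normalizer: pair up loose 'Q: ...' / 'A: ...'
    lines into a list of {'question', 'answer'} dialogue turns.
    The real script's input and output formats may differ."""
    turns, question = [], None
    for line in raw_text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            turns.append({"question": question, "answer": line[2:].strip()})
            question = None  # an answer closes the pending question
    return turns
```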

5. Implementing QA Quality Evaluation

Run the following command:

python evaluation/eval_QA_quality.py

6. Identifying Temporal Linkages

Run the following command to identify temporal linkages between successive QA chains.

'QA_chains_folder' is the path of folder you want to save QA chains annotated with ri
