MiraData

Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions"

MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Xuan Ju<sup>1*</sup>, Yiming Gao<sup>1*</sup>, Zhaoyang Zhang<sup>1*#</sup>, Ziyang Yuan<sup>1</sup>, Xintao Wang<sup>1</sup>, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan<sup>1</sup> <br> <sup>1</sup>ARC Lab, Tencent PCG <sup>2</sup>The Chinese University of Hong Kong <sup>*</sup>Equal Contribution <sup>#</sup>Project Lead


Introduction

Video datasets play a crucial role in training video generation models such as Sora. However, existing text-video datasets often fall short in handling long video sequences and capturing shot transitions. To address these limitations, we introduce MiraData, a video dataset designed specifically for long video generation tasks. Moreover, to better assess temporal consistency and motion intensity in video generation, we introduce MiraBench, which enhances existing benchmarks by adding 3D consistency and tracking-based motion strength metrics. You can find more details in our research paper.

Key Features of MiraData

Long Video Duration: Unlike previous datasets, where video clips are often very short (typically less than 20 seconds), MiraData focuses on uncut video segments averaging 72 seconds. This extended duration allows for more comprehensive modeling of video content.
Structured Captions: Each video in MiraData is accompanied by structured captions. These captions describe the video from several perspectives, enhancing the richness of the dataset. The average caption length is 318 words, ensuring a thorough representation of the video content.


Dataset

Meta Files

We release four versions of MiraData, containing 330K, 93K, 42K, and 9K samples.

The meta files for MiraData are provided on Google Drive and as a HuggingFace Dataset. Additionally, to give a quicker understanding of how the meta file is composed, we randomly sampled a set of 100 video clips, which can be accessed here. The meta file contains the following index information:

clip_id: video clip index, which is composed of {download_id}.{clip_id}
source: video download source and category
video_url: video source url
video_id: video id in the source website
width: video width
height: video height
fps: video fps used for extracting frames
seconds: duration of the video clip
timestamp: clip start and end timestamp in source video (used for cutting the video clip from its source video)
frame_number: frame number of the video clip
framestamp: clip start and end frame in source video
file_path: file path for storing the video clip
short_caption: a short overall caption
dense_caption: a dense overall caption
background_caption: caption of the video background
main_object_caption: caption of the main object in video
style_caption: caption of the video style
camera_caption: caption of the camera movement
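
As a rough illustration of how these fields fit together, the sketch below reads a meta file with Python's standard csv module and splits clip_id into its two parts. The sample row is made up, only a few columns are shown, and splitting clip_id on its last dot is our assumption about the layout rather than something the meta file guarantees.

```python
import csv
import io

# Hypothetical one-row meta file; real meta files contain all of the
# columns listed above and real values.
SAMPLE_META = """clip_id,fps,seconds,short_caption
000000000000.0,30.0,72.0,a made-up caption
"""


def parse_clip_id(clip_id):
    """Split '{download_id}.{clip_id}' on the last dot (assumed layout)."""
    download_id, clip_index = clip_id.rsplit(".", 1)
    return download_id, int(clip_index)


def load_meta(handle):
    """Read meta rows into dicts and convert the numeric columns."""
    rows = []
    for row in csv.DictReader(handle):
        row["fps"] = float(row["fps"])
        row["seconds"] = float(row["seconds"])
        row["download_id"], row["clip_index"] = parse_clip_id(row["clip_id"])
        rows.append(row)
    return rows


rows = load_meta(io.StringIO(SAMPLE_META))
# Approximate frame count implied by the clip duration and fps.
approx_frames = round(rows[0]["seconds"] * rows[0]["fps"])
```

To read a real meta file, pass `open(path, newline="", encoding="utf-8")` to `load_meta` instead of the in-memory sample.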

Download

To download the videos and split them into clips, start by downloading the meta files from Google Drive or HuggingFace Dataset. Once you have the meta files, you can use the following script to download the video samples:

python download_data.py --meta_csv {meta file} --download_start_id {the start of download id} --download_end_id {the end of download id} --raw_video_save_dir {the path for saving raw videos} --clip_video_save_dir {the path for saving cut video clips}
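
For large id ranges it can be convenient to run the script in batches. The sketch below only builds the argument lists, using the flags shown in the command above. The paths, id range, and batch size are placeholders, and treating --download_end_id as inclusive is an assumption that should be checked against download_data.py.

```python
def build_download_commands(meta_csv, first_id, last_id, batch_size,
                            raw_dir, clip_dir):
    """Return one argv list per batch of download ids.

    Assumption: --download_end_id is treated as inclusive here; check
    download_data.py before relying on the batch boundaries.
    """
    commands = []
    for start in range(first_id, last_id + 1, batch_size):
        end = min(start + batch_size - 1, last_id)
        commands.append([
            "python", "download_data.py",
            "--meta_csv", meta_csv,
            "--download_start_id", str(start),
            "--download_end_id", str(end),
            "--raw_video_save_dir", raw_dir,
            "--clip_video_save_dir", clip_dir,
        ])
    return commands


# Hypothetical paths and id range, for illustration only.
commands = build_download_commands(
    "meta.csv", 0, 249, 100, "raw_videos", "clip_videos")
```

Each list can then be passed to `subprocess.run(cmd, check=True)`, one batch at a time.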

<sup>We will remove video samples from our dataset / GitHub / project webpage if you request it. Please contact us with the request.</sup>

Collection and Annotation

To collect MiraData, we first manually selected YouTube channels covering different scenarios and included videos from HD-VILA-100M, Videvo, Pixabay, and Pexels. All videos in the selected channels were then downloaded and split using PySceneDetect. Next, we used multiple models to stitch the short clips together and filter out low-quality videos, and then selected the video clips with long durations. Finally, we captioned all video clips using GPT-4V.
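
The stitch-then-select idea can be pictured with a toy sketch. This is not the pipeline used to build MiraData: the `same_scene` predicate is a hypothetical stand-in for the models mentioned above, and the shot boundaries and the 40-second threshold are invented for illustration.

```python
def stitch_and_select(shots, same_scene, min_seconds):
    """Merge adjacent shots judged to belong together, then keep long ones.

    shots: list of (start_sec, end_sec) tuples in temporal order.
    same_scene: hypothetical predicate standing in for the models used
        in a real pipeline; it receives the clip built so far and the
        next shot.
    """
    merged = []
    for shot in shots:
        if merged and same_scene(merged[-1], shot):
            # Extend the previous clip so that it covers this shot too.
            merged[-1] = (merged[-1][0], shot[1])
        else:
            merged.append(shot)
    return [(s, e) for s, e in merged if e - s >= min_seconds]


# Toy stand-in: treat two shots as one scene when the gap is tiny.
def close_in_time(prev, cur):
    return cur[0] - prev[1] < 0.1


shots = [(0.0, 20.0), (20.0, 45.0), (50.0, 58.0), (58.0, 130.0)]
long_clips = stitch_and_select(shots, close_in_time, min_seconds=40.0)
```

Here the four shots collapse into two clips, and both pass the duration filter.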


Structured Captions

Each video in MiraData is accompanied by structured captions. These captions provide detailed descriptions from various perspectives, enhancing the richness of the dataset.

Six Types of Captions

Main Object Description: Describes the primary object or subject in the video, including its attributes, actions, positions, and movements throughout the video.
Background: Provides context about the environment or setting, including objects, location, weather, and time.
Style: Covers artistic style, visual and photographic aspects, such as realistic, cyberpunk, and cinematic style.
Camera Movement: Details any camera pans, zooms, or other movements.
Short Caption: A concise summary capturing the essence of the video, generated using the Panda-70M caption model.
Dense Caption: A more elaborate and detailed description that summarizes the above five types of captions.
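
One way to carry the six captions around in code is a small record type keyed by the caption column names from the meta file. The sketch below is only an illustration: the class name, the helper methods, and the sample row are hypothetical, and the labelled text layout produced by `as_prompt` is arbitrary.

```python
from dataclasses import dataclass, fields


@dataclass
class StructuredCaption:
    short_caption: str
    dense_caption: str
    main_object_caption: str
    background_caption: str
    style_caption: str
    camera_caption: str

    @classmethod
    def from_meta_row(cls, row):
        """Pick the six caption columns out of a meta-file row (a dict)."""
        return cls(**{f.name: row[f.name] for f in fields(cls)})

    def as_prompt(self):
        """Join the captions into one labelled text block."""
        return "\n".join(
            f"{f.name}: {getattr(self, f.name)}" for f in fields(self))


# Made-up row; real rows also carry the non-caption columns.
row = {
    "clip_id": "000000000000.0",
    "short_caption": "a made-up short caption",
    "dense_caption": "a made-up dense caption",
    "main_object_caption": "a made-up main object caption",
    "background_caption": "a made-up background caption",
    "style_caption": "a made-up style caption",
    "camera_caption": "a made-up camera caption",
}
caption = StructuredCaption.from_meta_row(row)
prompt = caption.as_prompt()
```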

Captions with GPT-4V

We tested existing open-source visual LLMs alongside GPT-4V and found that GPT-4V's captions are more accurate and more coherent in their understanding of temporal sequence.

To balance annotation cost and caption accuracy, we uniformly sample 8 frames from each video and arrange them into a 2x4 grid that forms one large image. We then use the Panda-70M caption model to annotate each video with a one-sentence caption, which serves as a hint about the main content, and insert it into our fine-tuned prompt. By feeding the fine-tuned prompt and the 2x4 grid image to GPT-4V, we can efficiently obtain captions for multiple dimensions in a single round of conversation. The specific prompt can be found in caption_gpt4v.py, and we welcome everyone to contribute more high-quality text-video data. :raised_hands:
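
The frame sampling and grid layout described above can be sketched without any image library. This is a minimal illustration, not the code in caption_gpt4v.py: taking the centre of each equal-length segment is just one way to sample uniformly, reading "2x4" as 2 rows by 4 columns is an assumption, and the 320x180 tile size is a placeholder.

```python
def uniform_frame_indices(num_frames, num_samples=8):
    """Pick num_samples indices spread evenly over [0, num_frames)."""
    # Take the centre of each of num_samples equal-length segments.
    return [int((i + 0.5) * num_frames / num_samples)
            for i in range(num_samples)]


def grid_offsets(frame_w, frame_h, rows=2, cols=4):
    """Top-left (x, y) paste position of each tile, row by row."""
    return [(c * frame_w, r * frame_h)
            for r in range(rows) for c in range(cols)]


# A 72-second clip at 30 fps has 2160 frames (illustrative numbers).
indices = uniform_frame_indices(2160)
offsets = grid_offsets(320, 180)
canvas_size = (4 * 320, 2 * 180)
```

With an image library of your choice, the frame at `indices[k]` would be resized to the tile size and pasted at `offsets[k]` on a canvas of `canvas_size`.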

<div style="display:inline-block" align=center> <img src="assets/words.png" width="350"/>  </div> <div style="display:inline-block" align=center> Text length statistics of short, dense, and structured captions.</div> <div style="display:inline-block" align=center> <img src="assets/short_caption.png" width="300"/> <img src="assets/dense_caption.png" width="300"/> </div> <div style="display:inline-block" align=center>Word cloud of short captions.                 Word cloud of dense captions.</div>

Benchmark - MiraBench

To evaluate long video generation, we design 17 evaluation metrics in MiraBench from 6 perspectives, including temporal consistency, temporal motion strength, 3D consistency, visual quality, text-video alignment, and distribution consistency. These metrics encompass most of the common evaluation standards used in previous video generation models and text-to-video benchmarks.

To evaluate generated videos, first set up the Python environment:

pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt

Then run the evaluation:

python calculate_score.py --meta_file data/evaluation_example/meta_generated.csv --frame_dir data/evaluation_example/frames_generated --gt_meta_file data/evaluation_example/meta_gt.csv --gt_frame_dir data/evaluation_example/frames_gt --output_folder data/evaluation_example/results --ckpt_path data/ckpt --device cuda

You can follow the example in data/evaluation_example to evaluate your own generated videos.

License Agreement

Please see LICENSE.

The MiraData dataset is available for informational purposes only. The copyright remains with the original owners of the videos.
All videos in the MiraData dataset were obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or the meaning of these videos.
You agree not to reproduce, duplicate, copy, sell, trade, resell or exp
