SpatialVID
[CVPR 2026] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
🎉NEWS
- [2026.02.21] 🎉 SpatialVID is accepted by CVPR 2026!
- [2025.10.11] 🐳 Docker support is now available, featuring a pre-configured environment with NVIDIA GPU-accelerated FFmpeg.
- [2025.09.29] 🚀 Depth data for the SpatialVID-HQ dataset is now officially available.
- [2025.09.24] 🤗 Raw metadata access is now available via a gated HuggingFace dataset to better support community research!!
- [2025.09.24] 🔭 Enhanced instructions for better camera control are updated.
- [2025.09.18] 🎆 SpatialVID dataset is now available on both HuggingFace and ModelScope.
- [2025.09.14] 📢 We have also uploaded the SpatialVID-HQ dataset to ModelScope, offering more diverse download options.
- [2025.09.11] 🔥 Our paper, code and SpatialVID-HQ dataset are released!
[✍️ Note] Each video clip is paired with a dedicated annotation folder, named after the video's ID. The folder contains five key files; details on these files can be found in Detailed Explanation of Annotation Files. A hypothetical layout is sketched below.
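As a rough illustration only, the layout might look like the sketch below. The file names here are placeholders, not the actual names (those are documented in Detailed Explanation of Annotation Files); the five content types follow the annotation list in the abstract:

```
annotations/
└── <video_id>/        # one folder per clip, named after the video's ID
    ├── poses.*        # placeholder: per-frame camera poses
    ├── depth.*        # placeholder: depth maps
    ├── masks.*        # placeholder: dynamic masks
    ├── caption.*      # placeholder: structured caption
    └── motion.*       # placeholder: serialized motion instructions
```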
Abstract
Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw videos, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
Preparation
This section describes how to set up the environment manually. For a simpler, containerized setup, please refer to the Docker Setup and Usage section.
Environment
- Necessary packages

  ```bash
  git clone --recursive https://github.com/NJU-3DV/SpatialVID.git
  cd SpatialVID
  conda create -n SpatialVID python=3.10.13
  conda activate SpatialVID
  pip install -r requirements/requirements.txt
  ```
- Packages needed for scoring

  ```bash
  pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
  pip install -r requirements/requirements_scoring.txt
  ```

  Ignore the warning about the `nvidia-nccl-cu12` and `numpy` versions; it is not a problem.

  For FFmpeg, please refer to `INSTALL.md` for detailed installation instructions. After installation, replace the `FFMPEG_PATH` variable in `scoring/motion/inference.py` and `utils/cut.py` with the actual path to your ffmpeg executable; the default is `/usr/local/bin/ffmpeg`.

  ⚠️ If your videos use the AV1 codec instead of H.264, install ffmpeg (already covered by our requirements script), then run the following so that your conda environment supports AV1:

  ```bash
  pip uninstall opencv-python
  conda install -c conda-forge opencv==4.11.0
  ```

  If your conda environment still cannot decode AV1, you can use the `--backend av` option in the scoring scripts to switch to PyAV as the video-reading backend. Note, however, that frame extraction with PyAV may introduce slight inaccuracies in frame positioning. A sanity-check sketch for these requirements follows this list.
- Packages needed for annotation

  ```bash
  pip install -r requirements/requirements_annotation.txt
  ```

  Compile the extensions for the camera tracking module:

  ```bash
  cd camera_pose_annotation/base
  python setup.py install
  ```
- [Optional] Packages needed for visualization

  ```bash
  pip install plotly
  pip install -e viser
  ```
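Before moving on, a minimal sanity-check sketch like the one below can catch the common pitfalls above (ffmpeg path, AV1 support, CUDA visibility for the extension build). It is not part of the repository's scripts, and the default `FFMPEG_PATH` value is an assumption to adjust:

```bash
# Minimal environment sanity check (not part of the repo; adapt as needed).

# 1. Confirm the ffmpeg binary pointed to by FFMPEG_PATH exists and can decode AV1.
FFMPEG_PATH=/usr/local/bin/ffmpeg   # repo default; change if yours differs
"$FFMPEG_PATH" -version | head -n 1
"$FFMPEG_PATH" -decoders | grep -i av1 || echo "WARNING: no AV1 decoder; consider --backend av"

# 2. Confirm the conda OpenCV build imports and was compiled with FFMPEG support.
python -c "import cv2; print(cv2.__version__, 'FFMPEG' in cv2.getBuildInformation())"

# 3. Confirm CUDA is visible to PyTorch before compiling the camera-tracking extensions.
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```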
Model Weights
Download the model weights used in our experiments:
```bash
bash scripts/download_checkpoints.sh
```
Or you can manually download the model weights from the following links and place them in the appropriate directories.
| Model               | File Name               | URL |
| ------------------- | ----------------------- | --- |
| Aesthetic Predictor | aesthetic               | 🔗  |
| MegaSAM             | megasam_final           | 🔗  |
| RAFT                | raft-things             | 🔗  |
| Depth Anything      | Depth-Anything-V2-Large | 🔗  |
| UniDepth            | unidepth-v2-vitl14      | 🔗  |
| SAM                 | sam2.1-hiera-large      | 🔗  |
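As a quick hedged check that the downloads landed where expected, a sketch along these lines can help. The `checkpoints` directory name is an assumption; match `CKPT_DIR` to wherever `download_checkpoints.sh` actually places the files:

```bash
# Hypothetical check: verify each expected weight file is present.
CKPT_DIR=checkpoints   # assumed location; adjust to match download_checkpoints.sh
for name in aesthetic megasam_final raft-things \
            Depth-Anything-V2-Large unidepth-v2-vitl14 sam2.1-hiera-large; do
  if find "$CKPT_DIR" -name "${name}*" 2>/dev/null | grep -q .; then
    echo "OK      $name"
  else
    echo "MISSING $name"
  fi
done
```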
Quick Start
The whole pipeline is illustrated in the figure below:
<p align="center">
  <img src="assets/pipeline.png" height=340>
</p>
- Scoring

  ```bash
  bash scripts/scoring.sh
  ```

  Inside the `scoring.sh` script, you need to set the following variables:

  - `ROOT_VIDEO`: the directory containing the input video files.
  - `OUTPUT_DIR`: the directory where the output files will be saved.
- Annotation

  ```bash
  bash scripts/annotation.sh
  ```

  Inside the `annotation.sh` script, you need to set the following variables:

  - `CSV`: the CSV file generated by the scoring script (default is `$OUTPUT_DIR/results.csv`).
  - `OUTPUT_DIR`: the directory where the output files will be saved.
- Caption

  ```bash
  bash scripts/caption.sh
  ```

  Inside the `caption.sh` script, you need to set the following variables:

  - `CSV`: the CSV file generated by the annotation script (default is `$OUTPUT_DIR/results.csv`).
  - `SRC_DIR`: the annotation output directory (default is the same as `OUTPUT_DIR` in the annotation step).
  - `OUTPUT_DIR`: the directory where the output files will be saved.

  An end-to-end sketch combining the three steps follows this list.
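Putting the three steps together, an end-to-end run might look like the sketch below. The paths are placeholders, and, as described above, the variables are set inside each script rather than in the shell environment:

```bash
# End-to-end sketch (placeholder paths; edit the variables inside each script first).

# 1. Scoring: in scripts/scoring.sh set, e.g.,
#    ROOT_VIDEO=/data/raw_videos and OUTPUT_DIR=/data/spatialvid/scoring
bash scripts/scoring.sh

# 2. Annotation: in scripts/annotation.sh set CSV to the scoring output
#    (default /data/spatialvid/scoring/results.csv) and a new OUTPUT_DIR,
#    e.g. /data/spatialvid/annotation
bash scripts/annotation.sh

# 3. Caption: in scripts/caption.sh set CSV to the annotation results.csv,
#    SRC_DIR to the annotation OUTPUT_DIR, and an OUTPUT_DIR for the captions
bash scripts/caption.sh
```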
