SpatialVID
[CVPR 2026] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
🎉NEWS
- [2026.02.21] 🎉 SpatialVID is accepted by CVPR 2026!
- [2025.10.11] 🐳 Docker support is now available, featuring a pre-configured environment with NVIDIA GPU-accelerated FFmpeg.
- [2025.09.29] 🚀 Depth data for the SpatialVID-HQ dataset is now officially available.
- [2025.09.24] 🤗 Raw metadata access is now available via a gated HuggingFace dataset to better support community research!!
- [2025.09.24] 🔭 Enhanced instructions for better camera control are updated.
- [2025.09.18] 🎆 SpatialVID dataset is now available on both HuggingFace and ModelScope.
- [2025.09.14] 📢 We have also uploaded the SpatialVID-HQ dataset to ModelScope, offering more diverse download options.
- [2025.09.11] 🔥 Our paper, code and SpatialVID-HQ dataset are released!
[✍️ Note] Each video clip is paired with a dedicated annotation folder, named after the video's ID. The folder contains five key files; details on these files can be found in Detailed Explanation of Annotation Files. A hypothetical layout is sketched below.
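As a rough illustration only, the layout might look like the sketch below. The file names here are placeholders, not the actual names (those are documented in Detailed Explanation of Annotation Files); the five content types follow the annotation list in the abstract:

```
annotations/
└── <video_id>/        # one folder per clip, named after the video's ID
    ├── poses.*        # placeholder: per-frame camera poses
    ├── depth.*        # placeholder: depth maps
    ├── masks.*        # placeholder: dynamic masks
    ├── caption.*      # placeholder: structured caption
    └── motion.*       # placeholder: serialized motion instructions
```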
Abstract
Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw videos, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
Preparation
This section describes how to set up the environment manually. For a simpler, containerized setup, please refer to the Docker Setup and Usage section.
Environment
- Necessary packages

  ```bash
  git clone --recursive https://github.com/NJU-3DV/SpatialVID.git
  cd SpatialVID
  conda create -n SpatialVID python=3.10.13
  conda activate SpatialVID
  pip install -r requirements/requirements.txt
  ```
- Packages needed for scoring

  ```bash
  pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
  pip install -r requirements/requirements_scoring.txt
  ```

  Ignore the warning about the `nvidia-nccl-cu12` and `numpy` versions; it is not a problem.

  For FFmpeg, please refer to `INSTALL.md` for detailed installation instructions. After installation, replace the `FFMPEG_PATH` variable in `scoring/motion/inference.py` and `utils/cut.py` with the actual path to your ffmpeg executable; the default is `/usr/local/bin/ffmpeg`.

  ⚠️ If your videos use the AV1 codec instead of H.264, install ffmpeg (already covered by our requirements script), then run the following so that your conda environment supports AV1:

  ```bash
  pip uninstall opencv-python
  conda install -c conda-forge opencv==4.11.0
  ```

  If your conda environment still cannot decode AV1, you can use the `--backend av` option in the scoring scripts to switch to PyAV as the video-reading backend. Note, however, that frame extraction with PyAV may introduce slight inaccuracies in frame positioning. A sanity-check sketch for these requirements follows this list.
- Packages needed for annotation

  ```bash
  pip install -r requirements/requirements_annotation.txt
  ```

  Compile the extensions for the camera tracking module:

  ```bash
  cd camera_pose_annotation/base
  python setup.py install
  ```
- [Optional] Packages needed for visualization

  ```bash
  pip install plotly
  pip install -e viser
  ```
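Before moving on, a minimal sanity-check sketch like the one below can catch the common pitfalls above (ffmpeg path, AV1 support, CUDA visibility for the extension build). It is not part of the repository's scripts, and the default `FFMPEG_PATH` value is an assumption to adjust:

```bash
# Minimal environment sanity check (not part of the repo; adapt as needed).

# 1. Confirm the ffmpeg binary pointed to by FFMPEG_PATH exists and can decode AV1.
FFMPEG_PATH=/usr/local/bin/ffmpeg   # repo default; change if yours differs
"$FFMPEG_PATH" -version | head -n 1
"$FFMPEG_PATH" -decoders | grep -i av1 || echo "WARNING: no AV1 decoder; consider --backend av"

# 2. Confirm the conda OpenCV build imports and was compiled with FFMPEG support.
python -c "import cv2; print(cv2.__version__, 'FFMPEG' in cv2.getBuildInformation())"

# 3. Confirm CUDA is visible to PyTorch before compiling the camera-tracking extensions.
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```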
Model Weights
Download the model weights used in our experiments:
```bash
bash scripts/download_checkpoints.sh
```
Or you can manually download the model weights from the following links and place them in the appropriate directories.
| Model               | File Name               | URL |
| ------------------- | ----------------------- | --- |
| Aesthetic Predictor | aesthetic               | 🔗  |
| MegaSAM             | megasam_final           | 🔗  |
| RAFT                | raft-things             | 🔗  |
| Depth Anything      | Depth-Anything-V2-Large | 🔗  |
| UniDepth            | unidepth-v2-vitl14      | 🔗  |
| SAM                 | sam2.1-hiera-large      | 🔗  |
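As a quick hedged check that the downloads landed where expected, a sketch along these lines can help. The `checkpoints` directory name is an assumption; match `CKPT_DIR` to wherever `download_checkpoints.sh` actually places the files:

```bash
# Hypothetical check: verify each expected weight file is present.
CKPT_DIR=checkpoints   # assumed location; adjust to match download_checkpoints.sh
for name in aesthetic megasam_final raft-things \
            Depth-Anything-V2-Large unidepth-v2-vitl14 sam2.1-hiera-large; do
  if find "$CKPT_DIR" -name "${name}*" 2>/dev/null | grep -q .; then
    echo "OK      $name"
  else
    echo "MISSING $name"
  fi
done
```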
Quick Start
The whole pipeline is illustrated in the figure below:
<p align="center">
  <img src="assets/pipeline.png" height=340>
</p>
- Scoring

  ```bash
  bash scripts/scoring.sh
  ```

  Inside the `scoring.sh` script, you need to set the following variables:

  - `ROOT_VIDEO`: the directory containing the input video files.
  - `OUTPUT_DIR`: the directory where the output files will be saved.
- Annotation

  ```bash
  bash scripts/annotation.sh
  ```

  Inside the `annotation.sh` script, you need to set the following variables:

  - `CSV`: the CSV file generated by the scoring script (default is `$OUTPUT_DIR/results.csv`).
  - `OUTPUT_DIR`: the directory where the output files will be saved.
- Caption

  ```bash
  bash scripts/caption.sh
  ```

  Inside the `caption.sh` script, you need to set the following variables:

  - `CSV`: the CSV file generated by the annotation script (default is `$OUTPUT_DIR/results.csv`).
  - `SRC_DIR`: the annotation output directory (default is the same as `OUTPUT_DIR` in the annotation step).
  - `OUTPUT_DIR`: the directory where the output files will be saved.

  An end-to-end sketch combining the three steps follows this list.
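Putting the three steps together, an end-to-end run might look like the sketch below. The paths are placeholders, and, as described above, the variables are set inside each script rather than in the shell environment:

```bash
# End-to-end sketch (placeholder paths; edit the variables inside each script first).

# 1. Scoring: in scripts/scoring.sh set, e.g.,
#    ROOT_VIDEO=/data/raw_videos and OUTPUT_DIR=/data/spatialvid/scoring
bash scripts/scoring.sh

# 2. Annotation: in scripts/annotation.sh set CSV to the scoring output
#    (default /data/spatialvid/scoring/results.csv) and a new OUTPUT_DIR,
#    e.g. /data/spatialvid/annotation
bash scripts/annotation.sh

# 3. Caption: in scripts/caption.sh set CSV to the annotation results.csv,
#    SRC_DIR to the annotation OUTPUT_DIR, and an OUTPUT_DIR for the captions
bash scripts/caption.sh
```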
