SPAR
From Flatland to Space (SPAR). Accepted to NeurIPS 2025 Datasets & Benchmarks. A large-scale dataset & benchmark for 3D spatial perception and reasoning in VLMs.
📰 News (2026-01-05): SPAR-Bench is now supported in EASI — see Evaluate via EASI.
📦 SPAR-7M
📌 Dataset Summary
<div align="center"> <img src="docs/resources/data_stats.png" width="50%"/> </div>

SPAR-7M is a large-scale vision-language dataset designed to study spatial perception and reasoning in complex 3D scenes. Built upon a novel 2D data generation pipeline, it translates 3D ground truth from richly annotated scenes into diverse, scalable spatial QA pairs. The dataset spans 33 task types, ranging from basic perception (e.g., depth, distance) to complex reasoning (e.g., spatial imagination, object relation inference), and supports single-view, multi-view, and video-based formats.
Unlike prior datasets, SPAR-7M focuses on spatial diversity and compositionality. It enables systematic evaluation across object-object and object-camera relations, and offers fine-grained control over QA type, view configuration, and cognitive levels.
🪄 Task Types
<div align="center"> <img src="docs/resources/task_vis.png" width="100%"/> </div>

SPAR-7M covers a wide range of spatial perception and understanding abilities, organized along multiple dimensions:
- Cognitive Level
  - Low-level (Perception): depth estimation, distance prediction, object location, etc.
  - Medium-level (P-2-R): view change inference, object matching, etc.
  - High-level (Reasoning): spatial imagination, navigation, multi-view relation inference, etc.
- Spatial Relation Type
  - Object–Object (OO): inferring spatial relationships between objects.
  - Object–Camera (OC): estimating object properties relative to the camera (e.g., position, distance, direction).
- Input Modality
  - Single-view: tasks using one image as input.
  - Multi-view: tasks requiring reasoning across 3–5 images.
  - Video: tasks derived from temporally coherent RGB sequences.
Each QA pair is grounded in precise 3D geometry, enabling reliable evaluation and training for spatial tasks.
📄 Data Format and Examples
Each QA sample consists of:
```json
{
  "id": "scene0261_00_16",
  "conversations": [
    {
      "from": "human",
      "value": "With the counter (red point) having a depth of 1.6 meters, determine the depth of towel (blue point) in the same frame. Calculate or judge based on the 3D center points of these objects. The unit is meter."
    },
    {
      "from": "gpt",
      "value": "towel's central depth is estimated to be about 1.5 meters."
    }
  ],
  "image": ["scene0261_00/image_color/543.jpg"],
  "type": "depth_prediction_oc",
  "depth": ["scene0261_00/image_depth/543.png"],
  "red_point": [[553, 397]],
  "blue_point": [[641, 838]]
}
```
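A record can be consumed with plain `json`. The sketch below parses the example above and pulls out the question, answer, and point annotations:

```python
import json

# The SPAR-7M QA record from the README example above.
sample = '''
{
  "id": "scene0261_00_16",
  "conversations": [
    {"from": "human",
     "value": "With the counter (red point) having a depth of 1.6 meters, determine the depth of towel (blue point) in the same frame."},
    {"from": "gpt",
     "value": "towel's central depth is estimated to be about 1.5 meters."}
  ],
  "image": ["scene0261_00/image_color/543.jpg"],
  "type": "depth_prediction_oc",
  "depth": ["scene0261_00/image_depth/543.png"],
  "red_point": [[553, 397]],
  "blue_point": [[641, 838]]
}
'''
record = json.loads(sample)
# Human turns carry the question; gpt turns carry the reference answer.
question = next(t["value"] for t in record["conversations"] if t["from"] == "human")
answer = next(t["value"] for t in record["conversations"] if t["from"] == "gpt")
print(record["type"], record["red_point"][0])
```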
We also provide metadata for all images, including:
- Camera intrinsics and extrinsics
- Depth maps
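With depth and intrinsics available, pixels can be lifted into camera coordinates. A minimal sketch, assuming a standard pinhole model; the values below are illustrative, not read from the dataset's `intrinsic/` files (whose exact layout may differ):

```python
# Lift a pixel (u, v) with metric depth into camera-frame coordinates
# using a pinhole camera model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Return camera-frame (x, y, z) for pixel (u, v) at metric depth depth_m."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# The counter's red point from the QA example, at its stated 1.6 m depth;
# fx, fy, cx, cy here are made-up example intrinsics.
x, y, z = backproject(553, 397, 1.6, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```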
📥 Download
We provide two versions of the SPAR-7M dataset:
| Version | Description |
|---------------------|-----------------------------------------------------------------------------|
| SPAR-7M | Clean and compact version, includes images, questions, answers, and labels. |
| SPAR-7M-RGBD | Full version with additional depths, camera intrinsics, and extrinsics. Ideal for 3D-aware training. |
You can download both versions from Hugging Face:
```shell
# Download SPAR-7M (default)
huggingface-cli download jasonzhango/SPAR-7M --repo-type dataset

# Download SPAR-7M-RGBD (with depth and camera parameters)
huggingface-cli download jasonzhango/SPAR-7M-RGBD --repo-type dataset
```
These datasets are split into multiple .tar.gz parts due to Hugging Face file size limits. After downloading all parts, run the following to extract:
```shell
# For SPAR-7M
cat spar-*.tar.gz | tar -xvzf -

# For SPAR-7M-RGBD
cat spar-rgbd-*.tar.gz | tar -xvzf -
```
Alternatively, if Hugging Face is not accessible, you can use the provided script:
```shell
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
export HF_ENDPOINT=https://hf-mirror.com
./hfd.sh jasonzhango/SPAR-7M --dataset
./hfd.sh jasonzhango/SPAR-7M-RGBD --dataset
```
The dataset directory structure is:
```
spar/
├── rxr/
├── scannet/
│   ├── images/
│   │   └── scene0000_00/
│   │       ├── image_color/
│   │       ├── video_color/
│   │       ├── image_depth/   # only in SPAR-7M-RGBD
│   │       ├── video_depth/   # only in SPAR-7M-RGBD
│   │       ├── pose/          # only in SPAR-7M-RGBD
│   │       ├── video_pose/    # only in SPAR-7M-RGBD
│   │       ├── intrinsic/     # only in SPAR-7M-RGBD
│   │       └── video_idx.txt
│   └── qa_jsonl/
│       ├── train/
│       │   ├── depth_prediction_oo/
│       │   │   ├── fill/
│       │   │   │   └── fill_76837.jsonl
│       │   │   ├── select/
│       │   │   └── sentence/
│       │   ├── obj_spatial_relation_oc/
│       │   └── spatial_imagination_oo_mv/
│       └── val/
├── scannetpp/
└── structured3d/
```
Each QA task (e.g., depth_prediction_oc, spatial_relation_oo_mv) is organized by task type, with subfolders for the different answer formats:
- `fill/`: numerical or descriptive answers
- `select/`: multiple choice
- `sentence/`: natural language answers
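The per-format layout makes it easy to tally how many samples exist for each answer format. A hypothetical sketch: it builds a tiny mock `qa_jsonl/train` tree so the example is self-contained; point `root` at the real dataset instead.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Build a tiny mock tree mirroring qa_jsonl/train/<task>/<format>/*.jsonl.
root = Path(tempfile.mkdtemp()) / "qa_jsonl" / "train"
for task, fmt, n in [("depth_prediction_oo", "fill", 2),
                     ("depth_prediction_oo", "select", 1)]:
    d = root / task / fmt
    d.mkdir(parents=True, exist_ok=True)
    with open(d / f"{fmt}_0.jsonl", "w") as f:
        for i in range(n):
            f.write(json.dumps({"id": f"{task}_{i}"}) + "\n")

# Count QA samples per answer format (each .jsonl line is one sample).
counts = Counter()
for path in root.rglob("*.jsonl"):
    fmt = path.parent.name  # answer-format folder: fill / select / sentence
    with open(path) as f:
        counts[fmt] += sum(1 for _ in f)
```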
🛠️ Generate Training Index Files
To train models on SPAR-7M or SPAR-7M-RGBD, we first convert raw .jsonl QA annotations into training index files in the InternVL-style data_json format.
We provide a script to automate this:
```shell
ln -s path-to-spar-7m ./
python datasets/generate_data_json.py
```
This script will:
- Recursively scan all `*.jsonl` files under the `spar/` directory
- Convert them into structured data_json entries
- Save the output files to the `data_jsons/` folder
By default, the script processes four sub-datasets:
```python
if __name__ == "__main__":
    dataset_list = [
        "rxr",
        "scannet",
        "scannetpp",
        "structured3d",
    ]
    for dataset in dataset_list:
        process_dataset(dataset)
```
You will find the generated training index files here:
```
data_jsons/
├── scannet_7799k.json     # Index for all SPAR-7M QA from ScanNet scenes
├── scannetpp_5941k.json   # Index for ScanNet++ scenes
├── ...
```
🔀 Mix Data for Pretraining
Once you've generated individual data_json files, you can use the provided script to mix them with customized ratios, both per-dataset and per QA type.
Run the script:
```shell
ln -s path-to-spar-7m ./
python datasets/mix_data.py
```
This script mixes the generated data_json files according to the configured per-dataset and per-QA-type ratios.
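The mixing idea can be sketched as follows. This is a hypothetical simplification, not the repo's actual `mix_data.py`; the `ratios` and index contents are made up for illustration:

```python
import random

# Per-dataset sampling ratios (hypothetical values).
ratios = {"scannet": 0.5, "scannetpp": 1.0}

# Mock per-dataset index entries standing in for the generated data_jsons.
indexes = {
    "scannet": [{"id": f"scannet_{i}"} for i in range(10)],
    "scannetpp": [{"id": f"scannetpp_{i}"} for i in range(4)],
}

random.seed(0)  # reproducible mixing
mixed = []
for name, entries in indexes.items():
    # Keep ratio * N entries from each dataset, sampled without replacement.
    k = int(len(entries) * ratios[name])
    mixed.extend(random.sample(entries, k))
random.shuffle(mixed)  # interleave datasets in the final training index
```

The same subsampling can be applied one level deeper, per QA type, before the per-dataset merge.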
