SPAR
From Flatland to Space (SPAR). Accepted to NeurIPS 2025 Datasets & Benchmarks. A large-scale dataset & benchmark for 3D spatial perception and reasoning in VLMs.
📰 News (2026-01-05): SPAR-Bench is now supported in EASI — see Evaluate via EASI.
📦 SPAR-7M
📌 Dataset Summary
<div align="center"> <img src="docs/resources/data_stats.png" width="50%"/> </div>

SPAR-7M is a large-scale vision-language dataset designed to study spatial perception and reasoning in complex 3D scenes. Built upon a novel 2D data generation pipeline, it translates 3D ground truth from richly annotated scenes into diverse, scalable spatial QA pairs. The dataset spans 33 task types, ranging from basic perception (e.g., depth, distance) to complex reasoning (e.g., spatial imagination, object relation inference), and supports single-view, multi-view, and video-based formats.
Unlike prior datasets, SPAR-7M focuses on spatial diversity and compositionality. It enables systematic evaluation across object-object and object-camera relations, and offers fine-grained control over QA type, view configuration, and cognitive levels.
🪄 Task Types
<div align="center"> <img src="docs/resources/task_vis.png" width="100%"/> </div>

SPAR-7M covers a wide range of spatial perception and understanding abilities, organized along multiple dimensions:
- Cognitive Level
  - Low-level (Perception): depth estimation, distance prediction, object location, etc.
  - Medium-level (P-2-R): view change inference, object matching, etc.
  - High-level (Reasoning): spatial imagination, navigation, multi-view relation inference, etc.
- Spatial Relation Type
  - Object–Object (OO): inferring spatial relationships between objects.
  - Object–Camera (OC): estimating object properties relative to the camera (e.g., position, distance, direction).
- Input Modality
  - Single-view: tasks using one image as input.
  - Multi-view: tasks requiring reasoning across 3–5 images.
  - Video: tasks derived from temporally coherent RGB sequences.
Each QA pair is grounded in precise 3D geometry, enabling reliable evaluation and training for spatial tasks.
📄 Data Format and Examples
Each QA sample consists of:
```json
{
  "id": "scene0261_00_16",
  "conversations": [
    {
      "from": "human",
      "value": "With the counter (red point) having a depth of 1.6 meters, determine the depth of towel (blue point) in the same frame. Calculate or judge based on the 3D center points of these objects. The unit is meter."
    },
    {
      "from": "gpt",
      "value": "towel's central depth is estimated to be about 1.5 meters."
    }
  ],
  "image": ["scene0261_00/image_color/543.jpg"],
  "type": "depth_prediction_oc",
  "depth": ["scene0261_00/image_depth/543.png"],
  "red_point": [[553, 397]],
  "blue_point": [[641, 838]]
}
```
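A record can be consumed with plain `json`. The sketch below parses the example above and pulls out the question, answer, and point annotations:

```python
import json

# The SPAR-7M QA record from the README example above.
sample = '''
{
  "id": "scene0261_00_16",
  "conversations": [
    {"from": "human",
     "value": "With the counter (red point) having a depth of 1.6 meters, determine the depth of towel (blue point) in the same frame."},
    {"from": "gpt",
     "value": "towel's central depth is estimated to be about 1.5 meters."}
  ],
  "image": ["scene0261_00/image_color/543.jpg"],
  "type": "depth_prediction_oc",
  "depth": ["scene0261_00/image_depth/543.png"],
  "red_point": [[553, 397]],
  "blue_point": [[641, 838]]
}
'''
record = json.loads(sample)
# Human turns carry the question; gpt turns carry the reference answer.
question = next(t["value"] for t in record["conversations"] if t["from"] == "human")
answer = next(t["value"] for t in record["conversations"] if t["from"] == "gpt")
print(record["type"], record["red_point"][0])
```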
We also provide metadata for all images, including:
- Camera intrinsics and extrinsics
- Depth maps
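With depth and intrinsics available, pixels can be lifted into camera coordinates. A minimal sketch, assuming a standard pinhole model; the values below are illustrative, not read from the dataset's `intrinsic/` files (whose exact layout may differ):

```python
# Lift a pixel (u, v) with metric depth into camera-frame coordinates
# using a pinhole camera model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Return camera-frame (x, y, z) for pixel (u, v) at metric depth depth_m."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# The counter's red point from the QA example, at its stated 1.6 m depth;
# fx, fy, cx, cy here are made-up example intrinsics.
x, y, z = backproject(553, 397, 1.6, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```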
📥 Download
We provide two versions of the SPAR-7M dataset:
| Version | Description |
|---------------------|-----------------------------------------------------------------------------|
| SPAR-7M | Clean and compact version, includes images, questions, answers, and labels. |
| SPAR-7M-RGBD | Full version with additional depths, camera intrinsics, and extrinsics. Ideal for 3D-aware training. |
You can download both versions from Hugging Face:
```shell
# Download SPAR-7M (default)
huggingface-cli download jasonzhango/SPAR-7M --repo-type dataset

# Download SPAR-7M-RGBD (with depth and camera parameters)
huggingface-cli download jasonzhango/SPAR-7M-RGBD --repo-type dataset
```
These datasets are split into multiple .tar.gz parts due to Hugging Face file size limits. After downloading all parts, run the following to extract:
```shell
# For SPAR-7M
cat spar-*.tar.gz | tar -xvzf -

# For SPAR-7M-RGBD
cat spar-rgbd-*.tar.gz | tar -xvzf -
```
Alternatively, if Hugging Face is not accessible, you can use the provided script:
```shell
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
export HF_ENDPOINT=https://hf-mirror.com
./hfd.sh jasonzhango/SPAR-7M --dataset
./hfd.sh jasonzhango/SPAR-7M-RGBD --dataset
```
The dataset directory structure is:
```
spar/
├── rxr/
├── scannet/
│   ├── images/
│   │   └── scene0000_00/
│   │       ├── image_color/
│   │       ├── video_color/
│   │       ├── image_depth/   # only in SPAR-7M-RGBD
│   │       ├── video_depth/   # only in SPAR-7M-RGBD
│   │       ├── pose/          # only in SPAR-7M-RGBD
│   │       ├── video_pose/    # only in SPAR-7M-RGBD
│   │       ├── intrinsic/     # only in SPAR-7M-RGBD
│   │       └── video_idx.txt
│   └── qa_jsonl/
│       ├── train/
│       │   ├── depth_prediction_oo/
│       │   │   ├── fill/
│       │   │   │   └── fill_76837.jsonl
│       │   │   ├── select/
│       │   │   └── sentence/
│       │   ├── obj_spatial_relation_oc/
│       │   └── spatial_imagination_oo_mv/
│       └── val/
├── scannetpp/
└── structured3d/
```
Each QA task (e.g., depth_prediction_oc, spatial_relation_oo_mv) is organized by task type, with subfolders for the different answer formats:
- `fill/`: numerical or descriptive answers
- `select/`: multiple choice
- `sentence/`: natural language answers
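The per-format layout makes it easy to tally how many samples exist for each answer format. A hypothetical sketch: it builds a tiny mock `qa_jsonl/train` tree so the example is self-contained; point `root` at the real dataset instead.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Build a tiny mock tree mirroring qa_jsonl/train/<task>/<format>/*.jsonl.
root = Path(tempfile.mkdtemp()) / "qa_jsonl" / "train"
for task, fmt, n in [("depth_prediction_oo", "fill", 2),
                     ("depth_prediction_oo", "select", 1)]:
    d = root / task / fmt
    d.mkdir(parents=True, exist_ok=True)
    with open(d / f"{fmt}_0.jsonl", "w") as f:
        for i in range(n):
            f.write(json.dumps({"id": f"{task}_{i}"}) + "\n")

# Count QA samples per answer format (each .jsonl line is one sample).
counts = Counter()
for path in root.rglob("*.jsonl"):
    fmt = path.parent.name  # answer-format folder: fill / select / sentence
    with open(path) as f:
        counts[fmt] += sum(1 for _ in f)
```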
🛠️ Generate Training Index Files
To train models on SPAR-7M or SPAR-7M-RGBD, we first convert raw .jsonl QA annotations into training index files in the InternVL-style data_json format.
We provide a script to automate this:
```shell
ln -s path-to-spar-7m ./
python datasets/generate_data_json.py
```
This script will:
- Recursively scan all `*.jsonl` files under the `spar/` directory
- Convert them into structured data_json entries
- Save the output files to the `data_jsons/` folder
By default, the script processes four sub-datasets:
```python
if __name__ == "__main__":
    dataset_list = [
        "rxr",
        "scannet",
        "scannetpp",
        "structured3d",
    ]
    for dataset in dataset_list:
        process_dataset(dataset)
```
You will find the generated training index files here:
```
data_jsons/
├── scannet_7799k.json     # Index for all SPAR-7M QA from ScanNet scenes
├── scannetpp_5941k.json   # Index for ScanNet++ scenes
├── ...
```
🔀 Mix Data for Pretraining
Once you've generated individual data_json files, you can use the provided script to mix them with customized ratios, both per-dataset and per QA type.
Run the script:
```shell
ln -s path-to-spar-7m ./
python datasets/mix_data.py
```
This script mixes the generated data_json files according to the configured per-dataset and per-QA-type ratios.
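The mixing idea can be sketched as follows. This is a hypothetical simplification, not the repo's actual `mix_data.py`; the `ratios` and index contents are made up for illustration:

```python
import random

# Per-dataset sampling ratios (hypothetical values).
ratios = {"scannet": 0.5, "scannetpp": 1.0}

# Mock per-dataset index entries standing in for the generated data_jsons.
indexes = {
    "scannet": [{"id": f"scannet_{i}"} for i in range(10)],
    "scannetpp": [{"id": f"scannetpp_{i}"} for i in range(4)],
}

random.seed(0)  # reproducible mixing
mixed = []
for name, entries in indexes.items():
    # Keep ratio * N entries from each dataset, sampled without replacement.
    k = int(len(entries) * ratios[name])
    mixed.extend(random.sample(entries, k))
random.shuffle(mixed)  # interleave datasets in the final training index
```

The same subsampling can be applied one level deeper, per QA type, before the per-dataset merge.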
