
SPAR

From Flatland to Space (SPAR). Accepted to NeurIPS 2025 Datasets & Benchmarks. A large-scale dataset & benchmark for 3D spatial perception and reasoning in VLMs.


<div align="center" style="font-family: charter;"> <h1><img src="docs/resources/s_logo.png" width="4%"/><i>From Flatland to Space</i>:<br/> Teaching Vision-Language Models to Perceive and Reason in 3D</h1> <a href="https://arxiv.org/abs/2503.22976" target="_blank"> <img alt="arXiv" src="https://img.shields.io/badge/arXiv-SPAR-red?logo=arxiv" height="20" /> </a> <a href="https://fudan-zvg.github.io/spar/" target="_blank"> <img alt="Website" src="https://img.shields.io/badge/🌎_Website-SPAR-blue.svg" height="20" /> </a><br/> <a href="https://huggingface.co/datasets/jasonzhango/SPAR-7M" target="_blank"> <img alt="HF Dataset: SPAR-7M" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Dataset-SPAR--7M-ffc107?color=ffc107&logoColor=white" height="20" /> </a> <a href="https://huggingface.co/datasets/jasonzhango/SPAR-Bench" target="_blank"> <img alt="HF Dataset: SPAR-Bench" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Benchmark-SPAR--Bench-ffc107?color=ffc107&logoColor=white" height="20" /> </a> <a href="https://neurips.cc/virtual/2025/poster/121742" target="_blank"> <img alt="NeurIPS 2025 Datasets & Benchmarks" src="https://img.shields.io/badge/NeurIPS%E2%80%9925-Datasets%20%26%20Benchmarks-6f42c1" height="20" /> </a> <div> <span>Jiahui Zhang</span><sup>1*</sup>, <span>Yurui Chen</span><sup>1*</sup>, <span>Yanpeng Zhou</span><sup>2*</sup>, <span>Yueming Xu</span><sup>1</sup>, <span>Ze Huang</span><sup>1</sup>, <span>Jilin Mei</span><sup>1</sup>, <span>Junhui Chen</span><sup>1</sup>, <span>Yu-Jie Yuan</span><sup>2</sup>, <span>Xinyue Cai</span><sup>2</sup>, <span>Guowei Huang</span><sup>2</sup>, <span>Xingyue Quan</span><sup>2</sup>, <span>Hang Xu</span><sup>2</sup>, <a href="https://lzrobots.github.io/" target="_blank">Li Zhang</a><sup>1</sup> </div> <div> <sup>1</sup>Fudan University&emsp; <sup>2</sup>Huawei Noah’s Ark Lab&emsp; </div> <img src="docs/resources/teaser.png" width="100%"/> <p align="justify"><i> Overview of our <strong>Spatial Perception And Reasoning (SPAR)</strong> dataset and benchmark. Our dataset is sourced from 4,500 scenes and comprises 33 spatial tasks spanning single-view, multi-view, and video settings. Our benchmark includes over 7,000 carefully curated high-quality samples to comprehensively evaluate the spatial perception and understanding capabilities of existing models.</i></p> </div>

📰 News (2026-01-05): SPAR-Bench is now supported in EASI — see Evaluate via EASI.

Contents

📦 SPAR-7M

📌 Dataset Summary

<div align="center"> <img src="docs/resources/data_stats.png" width="50%"/> </div>

SPAR-7M is a large-scale vision-language dataset designed to study spatial perception and reasoning in complex 3D scenes. Built upon a novel 2D data generation pipeline, it translates 3D ground-truth from richly annotated scenes into diverse, scalable spatial QA pairs. The dataset spans 33 task types, ranging from basic perception (e.g., depth, distance) to complex reasoning (e.g., spatial imagination, object relation inference), and supports single-view, multi-view, and video-based formats.

Unlike prior datasets, SPAR-7M focuses on spatial diversity and compositionality. It enables systematic evaluation across object-object and object-camera relations, and offers fine-grained control over QA type, view configuration, and cognitive levels.

🪄 Task Types

<div align="center"> <img src="docs/resources/task_vis.png" width="100%"/> </div>

SPAR-7M covers a wide range of spatial perception and understanding abilities, organized along multiple dimensions:

  • Cognitive Level

    • Low-level (Perception): Depth estimation, distance prediction, object location, etc.
    • Medium-level (Perception-to-Reasoning, P-2-R): View change inference, object matching, etc.
    • High-level (Reasoning): Spatial imagination, navigation, multi-view relation inference, etc.
  • Spatial Relation Type

    • Object–Object (OO): Inferring spatial relationships between objects.
    • Object–Camera (OC): Estimating object properties relative to the camera (e.g., position, distance, direction).
  • Input Modality

    • Single-view: Tasks using one image as input.
    • Multi-view: Tasks requiring reasoning across 3–5 images.
    • Video: Tasks derived from temporally coherent RGB sequences.

Each QA pair is grounded in precise 3D geometry, enabling reliable evaluation and training for spatial tasks.

📄 Data Format and Examples

Each QA sample consists of:

{
    "id": "scene0261_00_16", 
    "conversations": 
        [{
            "from": "human", 
            "value": "With the counter (red point) having a depth of 1.6 meters, determine the depth of towel (blue point) in the same frame.  Calculate or judge based on the 3D center points of these objects. The unit is meter."},
         {
            "from": "gpt", 
            "value": "towel's central depth is estimated to be about 1.5 meters."}], 
    "image": ["scene0261_00/image_color/543.jpg"], 
    "type": "depth_prediction_oc", 
    "depth": ["scene0261_00/image_depth/543.png"],
    "red_point": [[553, 397]], 
    "blue_point": [[641, 838]]
}
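A record in this format can be parsed with standard-library Python alone. The sketch below reconstructs the sample above as a single .jsonl line and pulls out the fields a data loader would typically need (the field names are exactly those shown in the example; how you resolve the image paths depends on where you extracted the dataset):

```python
import json

# One QA record in the format shown above; SPAR-7M .jsonl files store
# one such JSON object per line.
line = json.dumps({
    "id": "scene0261_00_16",
    "conversations": [
        {"from": "human",
         "value": "With the counter (red point) having a depth of 1.6 meters, "
                  "determine the depth of towel (blue point) in the same frame. "
                  "Calculate or judge based on the 3D center points of these "
                  "objects. The unit is meter."},
        {"from": "gpt",
         "value": "towel's central depth is estimated to be about 1.5 meters."},
    ],
    "image": ["scene0261_00/image_color/543.jpg"],
    "type": "depth_prediction_oc",
    "depth": ["scene0261_00/image_depth/543.png"],
    "red_point": [[553, 397]],
    "blue_point": [[641, 838]],
})

sample = json.loads(line)
question = sample["conversations"][0]["value"]  # the human prompt
answer = sample["conversations"][1]["value"]    # the ground-truth reply
images = sample["image"]                        # paths relative to the scene root
print(sample["type"], images[0])
```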

We also provide metadata for all images, including:

  • Camera intrinsics and extrinsics
  • Depths

📥 Download

We provide two versions of the SPAR-7M dataset:

| Version | Description |
|---------|-------------|
| SPAR-7M | Clean and compact version; includes images, questions, answers, and labels. |
| SPAR-7M-RGBD | Full version with additional depths, camera intrinsics, and extrinsics. Ideal for 3D-aware training. |

You can download both versions from Hugging Face:

# Download SPAR-7M (default)
huggingface-cli download jasonzhango/SPAR-7M --repo-type dataset

# Download SPAR-7M-RGBD (with depth and camera parameters)
huggingface-cli download jasonzhango/SPAR-7M-RGBD --repo-type dataset

These datasets are split into multiple .tar.gz parts due to Hugging Face file size limits. After downloading all parts, run the following to extract:

# For SPAR-7M
cat spar-*.tar.gz | tar -xvzf -

# For SPAR-7M-RGBD
cat spar-rgbd-*.tar.gz | tar -xvzf -

Alternatively, if Hugging Face is not directly accessible, you can download through the hf-mirror endpoint with the hfd script:

wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
export HF_ENDPOINT=https://hf-mirror.com

./hfd.sh jasonzhango/SPAR-7M --dataset
./hfd.sh jasonzhango/SPAR-7M-RGBD --dataset

The dataset directory structure is:

spar/
├── rxr/
├── scannet/
│   ├── images/
│   │   └── scene0000_00/
│   │       ├── image_color/
│   │       ├── video_color/
│   │       ├── image_depth/           # only in SPAR-7M-RGBD
│   │       ├── video_depth/           # only in SPAR-7M-RGBD
│   │       ├── pose/                  # only in SPAR-7M-RGBD
│   │       ├── video_pose/            # only in SPAR-7M-RGBD
│   │       ├── intrinsic/             # only in SPAR-7M-RGBD
│   │       └── video_idx.txt
│   └── qa_jsonl/
│       ├── train/
│       │   ├── depth_prediction_oo/
│       │   │   ├── fill/
│       │   │   │   └── fill_76837.jsonl
│       │   │   ├── select/
│       │   │   └── sentence/
│       │   ├── obj_spatial_relation_oc/
│       │   └── spatial_imagination_oo_mv/
│       └── val/
├── scannetpp/
└── structured3d/

Each QA task (e.g., depth_prediction_oc, spatial_relation_oo_mv, etc.) is organized by task type, with subfolders for different answer formats:

  • fill/ — numerical or descriptive answers
  • select/ — multiple choice
  • sentence/ — natural language answers
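Because the layout encodes both the task type and the answer format in the directory path, an index of all annotation files can be built with a simple recursive walk. The sketch below creates a miniature copy of the qa_jsonl layout in a temporary directory (so it runs without the real dataset; the task names are taken from the tree above) and groups the .jsonl files by (task_type, answer_format):

```python
import tempfile
from pathlib import Path

# Build a miniature stand-in for spar/scannet/qa_jsonl/train/ so the
# walk below is runnable without downloading the dataset.
root = Path(tempfile.mkdtemp()) / "spar" / "scannet" / "qa_jsonl" / "train"
for task, fmt in [("depth_prediction_oo", "fill"),
                  ("depth_prediction_oo", "select"),
                  ("obj_spatial_relation_oc", "sentence")]:
    d = root / task / fmt
    d.mkdir(parents=True, exist_ok=True)
    (d / f"{fmt}_0.jsonl").write_text('{"id": "demo"}\n')

# Index every annotation file by (task_type, answer_format); the path
# layout is .../train/<task_type>/<answer_format>/<file>.jsonl.
index = {}
for path in sorted(root.rglob("*.jsonl")):
    task_type, answer_format = path.parts[-3], path.parts[-2]
    index.setdefault((task_type, answer_format), []).append(path.name)

for key, files in sorted(index.items()):
    print(key, files)
```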

🛠️ Generate Training Index Files

To train models on SPAR-7M or SPAR-7M-RGBD, we first convert raw .jsonl QA annotations into training index files in the InternVL-style data_json format.

We provide a script to automate this:

ln -s path-to-spar-7m ./
python datasets/generate_data_json.py

This script will:

  • Recursively scan all *.jsonl files under the spar/ directory
  • Convert them into structured data_json entries
  • Save the output files to the data_jsons/ folder
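The authoritative schema is whatever datasets/generate_data_json.py emits; as a rough illustration of the conversion step, an InternVL-style entry typically keeps the id and conversations from the raw record and rewrites image paths so they resolve from the dataset root (the root prefix and field names below are assumptions, not the script's verified output):

```python
import json

# Hypothetical sketch: convert one raw SPAR .jsonl record into an
# InternVL-style data_json entry. The "scannet/images/" prefix is an
# assumption based on the directory layout shown earlier.
raw = {
    "id": "scene0261_00_16",
    "image": ["scene0261_00/image_color/543.jpg"],
    "conversations": [
        {"from": "human", "value": "How deep is the towel (blue point)?"},
        {"from": "gpt", "value": "About 1.5 meters."},
    ],
}

entry = {
    "id": raw["id"],
    # Prefix image paths with the sub-dataset's image root so a loader
    # can resolve them relative to spar/.
    "image": [f"scannet/images/{p}" for p in raw["image"]],
    "conversations": raw["conversations"],
}
print(json.dumps(entry, indent=2))
```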

By default, the script processes four sub-datasets:

if __name__ == "__main__":
    dataset_list = [
        "rxr",
        "scannet",
        "scannetpp",
        "structured3d",
    ]
    for dataset in dataset_list:
        process_dataset(dataset)

You will find the generated training index files here:

data_jsons/
├── scannet_7799k.json       # Index for all SPAR-7M QA from ScanNet scenes
├── scannetpp_5941k.json     # Index for ScanNet++ scenes
├── ...

🔀 Mix Data for Pretraining

Once you've generated individual data_json files, you can use the provided script to mix them with customized ratios, both per-dataset and per QA type.

Run the script:

ln -s path-to-spar-7m ./
python datasets/mix_data.py

This script merges the individual data_json files into a single mixed training index according to the configured per-dataset and per-QA-type ratios.
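The real mixing logic lives in datasets/mix_data.py; the sketch below only illustrates the general idea of ratio-based subsampling (the pools, ratios, and entry format here are made up for the example):

```python
import random

# Hypothetical sketch of per-dataset ratio mixing. mix_data.py defines
# the actual behavior; these pools and ratios are illustrative only.
random.seed(0)

pools = {
    "scannet":   [{"id": f"scannet_{i}"} for i in range(1000)],
    "scannetpp": [{"id": f"scannetpp_{i}"} for i in range(1000)],
}
ratios = {"scannet": 0.5, "scannetpp": 0.25}  # fraction of each pool to keep

mixed = []
for name, entries in pools.items():
    k = int(len(entries) * ratios[name])       # 500 and 250 here
    mixed.extend(random.sample(entries, k))    # subsample without replacement
random.shuffle(mixed)                          # interleave the datasets
print(len(mixed))  # 750
```

In practice the same idea extends to per-QA-type ratios by keying the pools on (dataset, task_type) instead of dataset alone.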
