# 🔮 Spaeing the Unseen

## Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
Haoyi Jiang<sup>1</sup>, Liu Liu<sup>2</sup>, Xinjie Wang<sup>2</sup>, Yonghao He<sup>3</sup>,<br> Wei Sui<sup>3</sup>, Zhizhong Su<sup>2</sup>, Wenyu Liu<sup>1</sup>, Xinggang Wang<sup>1</sup><br> <sup>1</sup>Huazhong University of Science & Technology, <sup>2</sup>Horizon Robotics, <sup>3</sup>D-Robotics
## Installation
Please clone this project with `--recursive`.

```shell
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install submodules/vggt
pip install -e submodules/lmms-eval
```
## Data Preparation
### 1. Pre-training
We utilize a combination of large-scale indoor scene datasets: ScanNet and ScanNet++.
### 2. Instruction Tuning
- Video-centric VSI-Bench: We fine-tune our model on the VSI-590K dataset.
- Image-based benchmarks: We use a composite training set aligned with VG-LLM.
Our processed annotations are available here. Please configure the local data and annotation paths in `data/__init__.py` before starting training.
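The README does not show what `data/__init__.py` expects, so the following is only a hypothetical sketch of what a path configuration for the datasets above might look like; the actual variable names, keys, and file layout in the repository will likely differ.

```python
# Hypothetical sketch of the path configuration expected in data/__init__.py.
# Variable names, dictionary keys, and paths are assumptions, not the repo's API.

# Local roots for the pre-training scene datasets (assumed layout).
DATA_ROOTS = {
    "scannet": "/data/scannet",
    "scannetpp": "/data/scannetpp",
}

# Annotation files for instruction tuning (assumed filenames).
ANNOTATION_PATHS = {
    "vsi_590k": "/data/annotations/vsi_590k.json",
}


def resolve(dataset: str) -> str:
    """Return the configured root for a dataset, failing loudly when missing."""
    try:
        return DATA_ROOTS[dataset]
    except KeyError:
        raise KeyError(
            f"Unknown dataset '{dataset}'; add its local path to DATA_ROOTS"
        ) from None
```

Failing loudly on an unconfigured dataset makes missing-path mistakes surface before a long training run starts, rather than partway through data loading.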
## Training
### 1. Spa3R Pre-training
To train the Predictive Spatial Field Modeling (PSFM) framework from scratch:
```shell
export PYTHONPATH=.
python scripts/train_spa3r.py
```
### 2. Spa3-VLM Instruction Tuning
Set the pre-trained Spa3R path in the script: `geometry_encoder_path=/path/to/spa3r.ckpt`

```shell
bash scripts/train_vlm_sft.sh
```
## Evaluation
We provide pre-trained weights on Hugging Face for Spa3R and for Spa3-VLM fine-tuned on VSI-590K.
To evaluate Spa3-VLM on spatial reasoning benchmarks:
```shell
bash scripts/eval_vlm.sh
```
## Citation
If you find our work helpful for your research, please consider starring this repository :star: and citing our work:
```bibtex
@article{Spa3R,
  title={Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning},
  author={Haoyi Jiang and Liu Liu and Xinjie Wang and Yonghao He and Wei Sui and Zhizhong Su and Wenyu Liu and Xinggang Wang},
  journal={arXiv preprint arXiv:2602.21186},
  year={2026}
}
```
## License
This project is released under the MIT License.
