# 🔮 Spaeing the Unseen

## Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
Haoyi Jiang<sup>1</sup>, Liu Liu<sup>2</sup>, Xinjie Wang<sup>2</sup>, Yonghao He<sup>3</sup>,<br> Wei Sui<sup>3</sup>, Zhizhong Su<sup>2</sup>, Wenyu Liu<sup>1</sup>, Xinggang Wang<sup>1</sup><br> <sup>1</sup>Huazhong University of Science & Technology, <sup>2</sup>Horizon Robotics, <sup>3</sup>D-Robotics
## Installation
Please clone this project with `--recursive`.

```shell
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install submodules/vggt
pip install -e submodules/lmms-eval
```
## Data Preparation
### 1. Pre-training
We utilize a combination of large-scale indoor scene datasets: ScanNet and ScanNet++.
### 2. Instruction Tuning
- Video-centric VSI-Bench: We fine-tune our model on the VSI-590K dataset.
- Image-based benchmarks: We use a composite training set aligned with VG-LLM.
Our processed annotations are available here. Please configure the local data and annotation paths in `data/__init__.py` before starting training.
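The README does not show what `data/__init__.py` expects, so the following is only a hypothetical sketch of what a path configuration for the datasets above might look like; the actual variable names, keys, and file layout in the repository will likely differ.

```python
# Hypothetical sketch of the path configuration expected in data/__init__.py.
# Variable names, dictionary keys, and paths are assumptions, not the repo's API.

# Local roots for the pre-training scene datasets (assumed layout).
DATA_ROOTS = {
    "scannet": "/data/scannet",
    "scannetpp": "/data/scannetpp",
}

# Annotation files for instruction tuning (assumed filenames).
ANNOTATION_PATHS = {
    "vsi_590k": "/data/annotations/vsi_590k.json",
}


def resolve(dataset: str) -> str:
    """Return the configured root for a dataset, failing loudly when missing."""
    try:
        return DATA_ROOTS[dataset]
    except KeyError:
        raise KeyError(
            f"Unknown dataset '{dataset}'; add its local path to DATA_ROOTS"
        ) from None
```

Failing loudly on an unconfigured dataset makes missing-path mistakes surface before a long training run starts, rather than partway through data loading.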
## Training
### 1. Spa3R Pre-training
To train the Predictive Spatial Field Modeling (PSFM) framework from scratch:
```shell
export PYTHONPATH=.
python scripts/train_spa3r.py
```
### 2. Spa3-VLM Instruction Tuning
Set the pre-trained Spa3R path in the script: `geometry_encoder_path=/path/to/spa3r.ckpt`

```shell
bash scripts/train_vlm_sft.sh
```
## Evaluation
We provide pre-trained weights on Hugging Face for Spa3R and for Spa3-VLM fine-tuned on VSI-590K.
To evaluate Spa3-VLM on spatial reasoning benchmarks:
```shell
bash scripts/eval_vlm.sh
```
## Citation
If you find our work helpful for your research, please consider starring this repository :star: and citing our work:
```bibtex
@article{Spa3R,
  title={Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning},
  author={Haoyi Jiang and Liu Liu and Xinjie Wang and Yonghao He and Wei Sui and Zhizhong Su and Wenyu Liu and Xinggang Wang},
  journal={arXiv preprint arXiv:2602.21186},
  year={2026}
}
```
## License
This project is released under the MIT License.
