SpatialLM
[NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling
<!-- markdownlint-disable first-line-h1 --> <!-- markdownlint-disable html --> <!-- markdownlint-disable no-duplicate-header --> <div align="center"> <img src="figures/logo_light.png#gh-light-mode-only" width="60%" alt="SpatialLM" /> <img src="figures/logo_dark.png#gh-dark-mode-only" width="60%" alt="SpatialLM" /> </div> <hr style="margin-top: 0; margin-bottom: 8px;"> <div align="center" style="margin-top: 0; padding-top: 0; line-height: 1;"> <a href="https://manycore-research.github.io/SpatialLM" target="_blank" style="margin: 2px;"><img alt="Project" src="https://img.shields.io/badge/🌐%20Website-SpatialLM-ffc107?color=42a5f5&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a> <a href="https://arxiv.org/abs/2506.07491" target="_blank" style="margin: 2px;"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-Techreport-b31b1b?logo=arxiv&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a> <a href="https://github.com/manycore-research/SpatialLM" target="_blank" style="margin: 2px;"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-SpatialLM-24292e?logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a> </div> <div align="center" style="line-height: 1;"> <a href="https://huggingface.co/manycore-research/SpatialLM1.1-Qwen-0.5B" target="_blank" style="margin: 2px;"><img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-SpatialLM-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a> <a href="https://huggingface.co/datasets/manycore-research/SpatialLM-Dataset" target="_blank" style="margin: 2px;"><img alt="Dataset" src="https://img.shields.io/badge/%F0%9F%A4%97%20Dataset-Dataset-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a> <a href="https://huggingface.co/datasets/manycore-research/SpatialLM-Testset" target="_blank" style="margin: 2px;"><img 
alt="Dataset" src="https://img.shields.io/badge/%F0%9F%A4%97%20Dataset-Testset-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a> </div>

## ✨ News
- [Sept, 2025] SpatialLM-Dataset is now available on Hugging Face.
- [Sept, 2025] SpatialLM accepted at NeurIPS 2025.
- [Jun, 2025] Added finetuning instructions in FINETUNE.md.
- [Jun, 2025] Check out our new models: SpatialLM1.1-Llama-1B and SpatialLM1.1-Qwen-0.5B, now available on Hugging Face. SpatialLM1.1 doubles the point cloud resolution, incorporates a more powerful point cloud encoder Sonata and supports detection with user-specified categories.
- [Jun, 2025] SpatialLM Technical Report is now on arXiv.
- [Mar, 2025] We're excited to release the SpatialLM-Llama-1B and SpatialLM-Qwen-0.5B on Hugging Face.
- [Mar, 2025] Initial release of SpatialLM!
## Introduction
SpatialLM is a 3D large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object bounding boxes with their semantic categories. Unlike previous methods that require specialized equipment for data collection, SpatialLM can handle point clouds from diverse sources such as monocular video sequences, RGBD images, and LiDAR sensors. This multimodal architecture effectively bridges the gap between unstructured 3D geometric data and structured 3D representations, offering high-level semantic understanding. It enhances spatial reasoning capabilities for applications in embodied robotics, autonomous navigation, and other complex 3D scene analysis tasks.
<div align="center"> <video src="https://github.com/user-attachments/assets/c0218d6a-f676-41f8-ae76-bba228866306" poster="figures/cover.png"> </video> <p><i>SpatialLM reconstructs 3D layout from a monocular RGB video with MASt3R-SLAM. Results aligned to video with GT cameras for visualization.</i></p> </div>

## SpatialLM Models
<div align="center">

| Model                  | Download       |
| :--------------------: | :------------: |
| SpatialLM1.1-Llama-1B  | 🤗 HuggingFace |
| SpatialLM1.1-Qwen-0.5B | 🤗 HuggingFace |
| SpatialLM1.0-Llama-1B  | 🤗 HuggingFace |
| SpatialLM1.0-Qwen-0.5B | 🤗 HuggingFace |

</div>

## Usage
### Installation
Tested with the following environment:

- Python 3.11
- PyTorch 2.4.1
- CUDA 12.4
```shell
# clone the repository
git clone https://github.com/manycore-research/SpatialLM.git
cd SpatialLM

# create a conda environment with CUDA 12.4
conda create -n spatiallm python=3.11
conda activate spatiallm
conda install -y -c nvidia/label/cuda-12.4.0 cuda-toolkit conda-forge::sparsehash

# install dependencies with poetry
pip install poetry && poetry config virtualenvs.create false --local
poetry install

# SpatialLM1.0 dependency (building the torchsparse wheel takes a while)
poe install-torchsparse
# SpatialLM1.1 dependency (building the flash-attn wheel takes a while)
poe install-sonata
```
### Inference
In the current version of SpatialLM, input point clouds are assumed to be axis-aligned, with the z-axis pointing up. This orientation is crucial for consistent spatial understanding and scene interpretation across datasets and applications. Example preprocessed point clouds, reconstructed from RGB videos using MASt3R-SLAM, are available in SpatialLM-Testset.
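If your reconstruction pipeline produces y-up point clouds (common for camera-centric SLAM output), rotate them to the expected z-up convention before inference. A minimal NumPy sketch — the +90° rotation about the x-axis is an assumption about your source convention, so adjust it to match your data:

```python
import numpy as np

def y_up_to_z_up(points: np.ndarray) -> np.ndarray:
    """Rotate an (N, 3) y-up point cloud to the z-up convention.

    Maps y -> z and z -> -y, i.e. a +90 degree rotation about the x-axis.
    """
    rot = np.array([
        [1.0, 0.0, 0.0],
        [0.0, 0.0, -1.0],
        [0.0, 1.0, 0.0],
    ])
    return points @ rot.T

# A point one unit "up" in a y-up frame ends up on the +z axis.
up_y = np.array([[0.0, 1.0, 0.0]])
print(y_up_to_z_up(up_y))  # [[0. 0. 1.]]
```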
Download an example point cloud:
```shell
huggingface-cli download manycore-research/SpatialLM-Testset pcd/scene0000_00.ply --repo-type dataset --local-dir .
```
Run inference:
```shell
python inference.py --point_cloud pcd/scene0000_00.ply --output scene0000_00.txt --model_path manycore-research/SpatialLM1.1-Qwen-0.5B
```
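The inference script writes the predicted layout as plain text, one entity per line. As an illustration only — the exact schema is defined in the repository, and both the `name=Type(args)` line shape and the sample line below are assumptions — entries of that shape could be split into name, type, and arguments like so:

```python
import re

# Assumed line shape: "<name>=<Type>(<comma-separated args>)".
ENTITY_RE = re.compile(r"^(?P<name>\w+)=(?P<type>\w+)\((?P<args>.*)\)$")

def parse_layout_line(line: str):
    """Split one layout line into (name, type, list of argument strings)."""
    match = ENTITY_RE.match(line.strip())
    if match is None:
        return None
    args = [a.strip() for a in match.group("args").split(",") if a.strip()]
    return match.group("name"), match.group("type"), args

# Hypothetical example line, not taken from a real prediction:
print(parse_layout_line("bbox_0=Bbox(bed,1.0,2.0,0.5)"))
# ('bbox_0', 'Bbox', ['bed', '1.0', '2.0', '0.5'])
```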
### Detection with user-specified categories
SpatialLM1.1 supports object detection conditioned on user-specified categories by leveraging the flexibility of LLMs.
SpatialLM1.1 offers three variants of structured indoor modeling tasks:
- Structured Reconstruction: detect walls, doors, windows, and object boxes.
- Layout Estimation: detect walls, doors, and windows.
- 3D Object Detection: detect object boxes only.
For tasks that include object box estimation, you can specify a subset of the 59 furniture categories, and the model will only predict objects within those specified categories. For example:
```shell
python inference.py --point_cloud pcd/scene0000_00.ply --output scene0000_00.txt --model_path manycore-research/SpatialLM1.1-Qwen-0.5B --detect_type object --category bed nightstand
```
### Visualization
Use Rerun to visualize the point cloud and the predicted structured 3D layout:
```shell
# Convert the predicted layout to Rerun format
python visualize.py --point_cloud pcd/scene0000_00.ply --layout scene0000_00.txt --save scene0000_00.rrd

# Visualize the point cloud and the predicted layout
rerun scene0000_00.rrd
```
### Evaluation
To evaluate the performance of SpatialLM, we provide the eval.py script, which reports the benchmark results on the SpatialLM-Testset shown in the Benchmark Results section below.
Download the testset:
```shell
huggingface-cli download manycore-research/SpatialLM-Testset --repo-type dataset --local-dir SpatialLM-Testset
```
Run evaluation:
```shell
# Run inference on the PLY point clouds in SpatialLM-Testset/pcd with the SpatialLM1.1-Qwen-0.5B model
python inference.py --point_cloud SpatialLM-Testset/pcd --output SpatialLM-Testset/pred --model_path manycore-research/SpatialLM1.1-Qwen-0.5B

# Evaluate the predicted layouts
python eval.py --metadata SpatialLM-Testset/test.csv --gt_dir SpatialLM-Testset/layout --pred_dir SpatialLM-Testset/pred --label_mapping SpatialLM-Testset/benchmark_categories.tsv
```
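Layout evaluation of this kind typically matches predicted boxes to ground truth by 3D IoU. As background only — eval.py implements its own metrics, and this simplified sketch handles axis-aligned boxes rather than the oriented boxes the model predicts — 3D IoU can be computed as:

```python
def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (min_xyz, max_xyz) tuples."""
    (amin, amax), (bmin, bmax) = box_a, box_b
    # Overlap along each axis; zero if the boxes are disjoint on that axis.
    overlap = [
        max(0.0, min(amax[i], bmax[i]) - max(amin[i], bmin[i]))
        for i in range(3)
    ]
    inter = overlap[0] * overlap[1] * overlap[2]
    vol_a = vol_b = 1.0
    for i in range(3):
        vol_a *= amax[i] - amin[i]
        vol_b *= bmax[i] - bmin[i]
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

# Two unit cubes overlapping in half of their volume along x:
a = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
b = ((0.5, 0.0, 0.0), (1.5, 1.0, 1.0))
print(iou_3d(a, b))  # 0.5 / 1.5 = 0.333...
```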
### Example using a custom video
We provide an example of how to use our model to estimate the scene layout from an RGB video with the newly released SLAM3R in EXAMPLE.md. The same steps also work with MASt3R-SLAM and other reconstruction methods.
## Finetune on Custom Data
For instructions on fine-tuning SpatialLM on your own data, please refer to FINETUNE.md. We provide an example using the ARKitScenes dataset.
## SpatialLM Dataset
The SpatialLM dataset is a large-scale, high-quality synthetic dataset created by professional 3D designers and used in real-world production. It contains point clouds from 12,328 diverse indoor scenes comprising 54,778 rooms, each paired with rich ground-truth 3D annotations. The dataset is a valuable additional resource for advancing research in indoor scene understanding, 3D perception, and related applications.
For access to photorealistic RGB/Depth/Normal/Semantic/Instance panoramic renderings and camera trajectories used to generate the SpatialLM point clouds, please refer to the SpatialGen project for more details.