SIU3R
[NeurIPS 2025 Spotlight] Official implementation of SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment
This repository is the official implementation of SIU3R.
SIU3R is a feed-forward method that achieves simultaneous 3D scene understanding and reconstruction from unposed images. In particular, SIU3R does not require feature alignment with 2D VFMs (e.g., CLIP, LSeg) to enable understanding, which unleashes its potential as a unified model for multiple 3D understanding tasks (i.e., semantic, instance, panoptic, and text-referred segmentation). Moreover, tailored designs for mutual benefit further boost SIU3R's performance by encouraging bi-directional promotion between reconstruction and understanding.
Demo video: https://github.com/user-attachments/assets/95034781-75e4-4317-ab34-a9ea4ed7a644
📰 News
- [2025-09-19] Our code is now released! 🎉
- [2025-09-18] Our paper is accepted by NeurIPS 2025 as a Spotlight paper! 🌟
- [2025-07-03] Our paper is available on arXiv! 🎉 Paper
🛠️ Installation
We recommend using uv to create a virtual environment for this project. The following instructions assume you have uv installed. Our code is tested with Python 3.10 and PyTorch 2.4.1 with CUDA 11.8.
To set up the environment, simply run the `uv sync` command.
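For example, a minimal setup from a fresh clone looks like this (the GitHub path is taken from this repository; uv creates the .venv directory by default):

```bash
git clone https://github.com/WU-CVGL/SIU3R.git
cd SIU3R

# Create the virtual environment and install dependencies from the lockfile
uv sync

# Activate the environment (or prefix each command with `uv run`)
source .venv/bin/activate
```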
⚡️ Inference
To run inference, you can download the pre-trained model from here and place it in the pretrained_weights directory.
Then, you can run the inference script:
```bash
python inference.py --image_path1 <path_to_image1> --image_path2 <path_to_image2> --output_path <output_directory> [--cx <cx_value>] [--cy <cy_value>] [--fx <fx_value>] [--fy <fy_value>]
```
An output.ply file will be generated in the specified output directory, containing the reconstructed Gaussian splats. The cx, cy, fx, and fy parameters are optional and can be used to specify the camera intrinsics; if not provided, default values will be used.
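For example, a minimal run on two frames (the paths below are placeholders):

```bash
# Paths are illustrative; any two overlapping, unposed RGB frames work
python inference.py --image_path1 examples/frame_000.png --image_path2 examples/frame_010.png --output_path outputs
```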
You can view the results in the online viewer by running:
```bash
python viewer.py --output_ply <output_directory/output.ply>
```
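To sanity-check the exported splats programmatically, here is a minimal sketch using the plyfile package (not a dependency of this repository, and the exact property names of the export are an assumption; Gaussian-splat PLYs typically store means plus opacity, scale, and rotation attributes):

```python
from plyfile import PlyData  # pip install plyfile (assumption: not bundled with this repo)

# Load the reconstructed Gaussian splats produced by inference.py
ply = PlyData.read("outputs/output.ply")
verts = ply["vertex"].data  # numpy structured array, one row per Gaussian

print(f"{len(verts)} Gaussians")
# Inspect the per-Gaussian attributes actually stored in the file
print(verts.dtype.names)
```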
📚 Dataset
We use the ScanNet dataset for training and evaluation. You can download the processed dataset from here and place it in the data directory. The dataset should have the following structure:
```
data/
├── scannet/
│   ├── train/
│   │   ├── scene0000_00
│   │   │   ├── color
│   │   │   ├── depth
│   │   │   ├── extrinsic
│   │   │   ├── instance
│   │   │   ├── intrinsic.txt
│   │   │   ├── iou.png
│   │   │   ├── iou.pt
│   │   │   ├── panoptic
│   │   │   └── semantic
│   │   └── ...
│   └── val/
│       ├── scene0011_00
│       │   ├── color
│       │   ├── depth
│       │   ├── extrinsic
│       │   ├── instance
│       │   ├── intrinsic.txt
│       │   ├── iou.png
│       │   ├── iou.pt
│       │   ├── panoptic
│       │   └── semantic
│       └── ...
├── train_refer_seg_data.json
├── val_pair.json
├── val_refer_pair.json
└── val_refer_seg_data.json
```
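As a quick sanity check of this layout, the sketch below walks the training split and verifies each scene's contents (directory names are taken from the tree above; that intrinsic.txt holds a whitespace-separated camera matrix is an assumption):

```python
from pathlib import Path

import numpy as np

data_root = Path("data/scannet")
# Per-scene directories expected according to the tree above
expected = ["color", "depth", "extrinsic", "instance", "panoptic", "semantic"]

for scene in sorted((data_root / "train").iterdir()):
    if not scene.is_dir():
        continue
    missing = [name for name in expected if not (scene / name).is_dir()]
    if missing:
        print(f"{scene.name}: missing {missing}")
    # Assumption: intrinsic.txt stores a whitespace-separated camera matrix
    K = np.loadtxt(scene / "intrinsic.txt")
    print(scene.name, K.shape)
```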
📝 Training
If you want to train the model, you should first download the pretrained MASt3R weights from here and our pretrained panoptic segmentation head weights from here, and put them in the pretrained_weights directory.
To train the model, you can use the following command:
```bash
python src/run.py experiment=siu3r_train
```
This will start the training process using the configuration specified in configs/main.yaml. You can modify the configuration file to adjust the training parameters, such as devices, learning rate, batch size, and number of epochs.
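Since run.py takes Hydra-style key=value arguments, individual parameters can likely also be overridden on the command line without editing the file; the key names below are hypothetical and should be checked against configs/main.yaml:

```bash
# Hypothetical override keys; verify the real names in configs/main.yaml
python src/run.py experiment=siu3r_train trainer.devices=4 data.batch_size=2
```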
📐 Evaluation
To evaluate the model, you can use the following command:
```bash
python src/run.py experiment=siu3r_test mode=test ckpt_path={your_ckpt_path}
```
This will start the evaluation process, which loads the ScanNet validation set and generates novel view synthesis (NVS) and segmentation results for the image pairs defined in val_pair.json. Afterwards, the evaluator computes the metrics and writes them to a JSON file.
📷 Camera Conventions
Our camera conventions are the same as pixelSplat's. The camera intrinsic matrices are normalized (the first row is divided by the image width, and the second row by the image height). The camera extrinsic matrices are OpenCV-style camera-to-world matrices (+X right, +Y down, +Z pointing into the screen).
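Concretely, converting pixel-space intrinsics into this normalized form looks like the sketch below (the helper and the 640x480 example values are illustrative, not part of the codebase):

```python
import numpy as np

def normalize_intrinsics(fx, fy, cx, cy, width, height):
    """Normalized intrinsics: first row divided by the image width,
    second row divided by the image height, as described above."""
    return np.array([
        [fx / width, 0.0, cx / width],
        [0.0, fy / height, cy / height],
        [0.0, 0.0, 1.0],
    ])

# Example: a 640x480 camera (focal lengths and principal point are illustrative)
K_norm = normalize_intrinsics(577.87, 577.87, 319.5, 239.5, 640, 480)
print(K_norm)
```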
📖 Citation
If you find our work useful, please consider citing our paper:
```bibtex
@misc{xu2025siu3r,
  title={SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment},
  author={Qi Xu and Dongxu Wei and Lingzhe Zhao and Wenpu Li and Zhangchi Huang and Shunping Ji and Peidong Liu},
  year={2025},
  eprint={2507.02705},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.02705},
}
```