SAR3D
Official repository for "SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE"
Yongwei Chen¹ • Yushi Lan¹ • Shangchen Zhou¹ • Tengfei Wang² • Xingang Pan¹
¹S-lab, Nanyang Technological University
²Shanghai Artificial Intelligence Laboratory
CVPR 2025
https://github.com/user-attachments/assets/badac244-f8ee-41c2-8129-b09cf6404b91
🌟 Features
- 🔄 Autoregressive Modeling
- ⚡️ Ultra-fast 3D Generation (<1s)
- 🔍 Detailed Understanding
🛠️ Installation & Usage
Prerequisites
We've tested SAR3D on the following environment:
<details open>
<summary><b>Ubuntu 20.04</b></summary>

- Python 3.9.16
- PyTorch 2.0.0
- CUDA 11.7
- NVIDIA A6000

</details>
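A quick way to check that your local setup roughly matches the tested configuration:

```bash
# confirm the GPU/driver are visible and check the installed PyTorch + CUDA build
nvidia-smi
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```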
Quick Start
- Clone the repository

```bash
git clone https://github.com/cyw-3d/SAR3D.git
cd SAR3D
```

- Set up environment

```bash
conda env create -f environment.yml
```
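Once the environment is created, activate it before running anything. The environment name below is an assumption; check the `name:` field in `environment.yml`:

```bash
# the environment name is an assumption -- check the `name:` field in environment.yml
conda activate sar3d
```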
- Download pretrained models 📥
The pretrained models will be automatically downloaded to the checkpoints folder during first run.
You can also manually download them from our model zoo:
| Model | Description | Link |
|-------|-------------|------|
| VQVAE | Base VQVAE model | vqvae-ckpt.pt |
| VQVAE | Flexicubes VQVAE model | vqvae-flexicubes-ckpt.pt |
| Generation | Image-conditioned model | image-condition-ckpt.pth |
| Generation | Text-conditioned model | text-condition-ckpt.pth |
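If you download the checkpoints manually, place them in the `checkpoints` folder where the scripts expect them. A minimal sketch (`<MODEL_ZOO_URL>` is a placeholder; substitute the actual links from the table above):

```bash
# <MODEL_ZOO_URL> is a placeholder -- substitute the links from the model zoo table
mkdir -p checkpoints
wget -O checkpoints/vqvae-ckpt.pt <MODEL_ZOO_URL>/vqvae-ckpt.pt
wget -O checkpoints/image-condition-ckpt.pth <MODEL_ZOO_URL>/image-condition-ckpt.pth
```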
- Run inference 🚀
To test the model on your own images:
- Place your test images in the `test_files/test_images` folder
- Run the inference script:

```bash
bash test_image.sh
```
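For example, if your own renders live in `~/my_renders` (a hypothetical path), the steps above boil down to:

```bash
# copy your own images into the folder the script reads from, then run inference
mkdir -p test_files/test_images
cp ~/my_renders/*.png test_files/test_images/
bash test_image.sh
```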
To test the model on your own text prompts:
- Place your test prompts in the `test_files/test_text.json` file
- Run the inference script:

```bash
bash test_text.sh
```
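The exact schema of `test_files/test_text.json` is defined by the repository's text-inference script; purely as an illustrative assumption, a flat list of prompt strings might look like this:

```bash
# illustrative only -- check the repo's text-inference script for the real schema
cat > test_files/test_text.json <<'EOF'
["a wooden rocking chair", "a small potted cactus"]
EOF
bash test_text.sh
```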
📚 Training
Dataset
The dataset is available for download at Hugging Face.
The dataset consists of 8 splits containing preprocessed data based on G-buffer Objaverse, including:
- Rendered images
- Depth maps
- Camera poses
- Text descriptions
- Normal maps
- Latent embeddings
The dataset covers over 170K unique 3D objects, augmented to more than 630K data pairs. A data.json file is provided that maps object IDs to their corresponding categories.
After downloading and unzipping the dataset, you should have the following structure:
```
/dataset-root/
├── 1/
├── 2/
├── ...
├── 8/
│   └── 0/
│       ├── raw_image.png
│       ├── depth_alpha.jpg
│       ├── c.npy
│       ├── caption_3dtopia.txt
│       ├── normal.png
│       ├── ...
│       └── image_dino_embedding_lrm.npy
└── dataset.json
```
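After unzipping, a quick way to sanity-check the layout against the structure above (`/dataset-root/` is the placeholder path used here):

```bash
# list the contents of one object folder and count object folders in a split
ls /dataset-root/8/0/
find /dataset-root/1 -maxdepth 1 -mindepth 1 -type d | wc -l
```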
Training Commands
The following scripts allow you to train both image-conditioned and text-conditioned models using the dataset stored in the specified `<DATA_DIR>` location.

For image-conditioned model training:

```bash
bash train_image.sh <MODEL_DEPTH> <BATCH_SIZE> <GPU_NUM> <VQVAE_PATH> <OUT_DIR> <DATA_DIR>
```

For text-conditioned model training:

```bash
bash train_text.sh <MODEL_DEPTH> <BATCH_SIZE> <GPU_NUM> <VQVAE_PATH> <OUT_DIR> <DATA_DIR>
```

For VQVAE training:

```bash
bash train_VQVAE.sh <DATA_DIR> <GPU_NUM> <BATCH_SIZE> <OUT_DIR>
```
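As a concrete example, training the image-conditioned model might look like the sketch below; the depth, batch size, GPU count, and output directory are illustrative values, not recommended settings:

```bash
# illustrative values only -- choose MODEL_DEPTH/BATCH_SIZE/GPU_NUM to fit your hardware
bash train_image.sh 24 8 4 checkpoints/vqvae-ckpt.pt outputs/image_cond /path/to/dataset
```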
📋 Roadmap
- [x] Inference and Training Code for Image-conditioned Generation
- [x] Dataset Release
- [x] Inference Code for Text-conditioned Generation
- [x] Training Code for Text-conditioned Generation
- [x] VQVAE training code
- [x] Code for Understanding
📝 Citation
If you find this work useful for your research, please cite our paper:
```bibtex
@inproceedings{chen2024sar3d,
    title={SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE},
    author={Chen, Yongwei and Lan, Yushi and Zhou, Shangchen and Wang, Tengfei and Pan, Xingang},
    booktitle={CVPR},
    year={2025}
}
```
