SAR3D
Official repository for "SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE"
Yongwei Chen¹ • Yushi Lan¹ • Shangchen Zhou¹ • Tengfei Wang² • Xingang Pan¹
¹S-lab, Nanyang Technological University
²Shanghai Artificial Intelligence Laboratory
CVPR 2025
https://github.com/user-attachments/assets/badac244-f8ee-41c2-8129-b09cf6404b91
🌟 Features
- 🔄 Autoregressive Modeling
- ⚡️ Ultra-fast 3D Generation (<1s)
- 🔍 Detailed Understanding
🛠️ Installation & Usage
Prerequisites
We've tested SAR3D on the following environment:
<details open>
<summary><b>Ubuntu 20.04</b></summary>

- Python 3.9.16
- PyTorch 2.0.0
- CUDA 11.7
- NVIDIA A6000

</details>
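A quick way to check that your local setup roughly matches the tested configuration:

```bash
# confirm the GPU/driver are visible and check the installed PyTorch + CUDA build
nvidia-smi
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```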
Quick Start
- Clone the repository

```bash
git clone https://github.com/cyw-3d/SAR3D.git
cd SAR3D
```

- Set up environment

```bash
conda env create -f environment.yml
```
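Once the environment is created, activate it before running anything. The environment name below is an assumption; check the `name:` field in `environment.yml`:

```bash
# the environment name is an assumption -- check the `name:` field in environment.yml
conda activate sar3d
```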
- Download pretrained models 📥
The pretrained models will be automatically downloaded to the checkpoints folder during first run.
You can also manually download them from our model zoo:
| Model | Description | Link |
|-------|-------------|------|
| VQVAE | Base VQVAE model | vqvae-ckpt.pt |
| VQVAE | Flexicubes VQVAE model | vqvae-flexicubes-ckpt.pt |
| Generation | Image-conditioned model | image-condition-ckpt.pth |
| Generation | Text-conditioned model | text-condition-ckpt.pth |
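If you download the checkpoints manually, place them in the `checkpoints` folder where the scripts expect them. A minimal sketch (`<MODEL_ZOO_URL>` is a placeholder; substitute the actual links from the table above):

```bash
# <MODEL_ZOO_URL> is a placeholder -- substitute the links from the model zoo table
mkdir -p checkpoints
wget -O checkpoints/vqvae-ckpt.pt <MODEL_ZOO_URL>/vqvae-ckpt.pt
wget -O checkpoints/image-condition-ckpt.pth <MODEL_ZOO_URL>/image-condition-ckpt.pth
```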
- Run inference 🚀
To test the model on your own images:
- Place your test images in the `test_files/test_images` folder
- Run the inference script:

```bash
bash test_image.sh
```
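For example, if your own renders live in `~/my_renders` (a hypothetical path), the steps above boil down to:

```bash
# copy your own images into the folder the script reads from, then run inference
mkdir -p test_files/test_images
cp ~/my_renders/*.png test_files/test_images/
bash test_image.sh
```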
To test the model on your own text prompts:
- Place your test prompts in the `test_files/test_text.json` file
- Run the inference script:

```bash
bash test_text.sh
```
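The exact schema of `test_files/test_text.json` is defined by the repository's text-inference script; purely as an illustrative assumption, a flat list of prompt strings might look like this:

```bash
# illustrative only -- check the repo's text-inference script for the real schema
cat > test_files/test_text.json <<'EOF'
["a wooden rocking chair", "a small potted cactus"]
EOF
bash test_text.sh
```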
📚 Training
Dataset
The dataset is available for download at Hugging Face.
The dataset consists of 8 splits containing preprocessed data based on G-buffer Objaverse, including:
- Rendered images
- Depth maps
- Camera poses
- Text descriptions
- Normal maps
- Latent embeddings
The dataset covers over 170K unique 3D objects, augmented to more than 630K data pairs. A data.json file is provided that maps object IDs to their corresponding categories.
After downloading and unzipping the dataset, you should have the following structure:
```
/dataset-root/
├── 1/
├── 2/
├── ...
├── 8/
│   └── 0/
│       ├── raw_image.png
│       ├── depth_alpha.jpg
│       ├── c.npy
│       ├── caption_3dtopia.txt
│       ├── normal.png
│       ├── ...
│       └── image_dino_embedding_lrm.npy
└── dataset.json
```
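After unzipping, a quick way to sanity-check the layout against the structure above (`/dataset-root/` is the placeholder path used here):

```bash
# list the contents of one object folder and count object folders in a split
ls /dataset-root/8/0/
find /dataset-root/1 -maxdepth 1 -mindepth 1 -type d | wc -l
```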
Training Commands
The following scripts allow you to train both image-conditioned and text-conditioned models using the dataset stored in the specified `<DATA_DIR>` location.

For image-conditioned model training:

```bash
bash train_image.sh <MODEL_DEPTH> <BATCH_SIZE> <GPU_NUM> <VQVAE_PATH> <OUT_DIR> <DATA_DIR>
```

For text-conditioned model training:

```bash
bash train_text.sh <MODEL_DEPTH> <BATCH_SIZE> <GPU_NUM> <VQVAE_PATH> <OUT_DIR> <DATA_DIR>
```

For VQVAE training:

```bash
bash train_VQVAE.sh <DATA_DIR> <GPU_NUM> <BATCH_SIZE> <OUT_DIR>
```
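As a concrete example, training the image-conditioned model might look like the sketch below; the depth, batch size, GPU count, and output directory are illustrative values, not recommended settings:

```bash
# illustrative values only -- choose MODEL_DEPTH/BATCH_SIZE/GPU_NUM to fit your hardware
bash train_image.sh 24 8 4 checkpoints/vqvae-ckpt.pt outputs/image_cond /path/to/dataset
```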
📋 Roadmap
- [x] Inference and Training Code for Image-conditioned Generation
- [x] Dataset Release
- [x] Inference Code for Text-conditioned Generation
- [x] Training Code for Text-conditioned Generation
- [x] VQVAE training code
- [x] Code for Understanding
📝 Citation
If you find this work useful for your research, please cite our paper:
```bibtex
@inproceedings{chen2024sar3d,
    title={SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE},
    author={Chen, Yongwei and Lan, Yushi and Zhou, Shangchen and Wang, Tengfei and Pan, Xingang},
    booktitle={CVPR},
    year={2025}
}
```
