# DiffSplat

[ICLR 2025] Official implementation of "DiffSplat: Repurposing Image Diffusion Models for Scalable 3D Gaussian Splat Generation".
<h4 align="center">DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu
<p> <img width="144" src="./assets/_demo/1.gif"> <img width="144" src="./assets/_demo/2.gif"> <img width="144" src="./assets/_demo/3.gif"> <img width="144" src="./assets/_demo/4.gif"> <img width="144" src="./assets/_demo/5.gif"> </p> <p> <img width="144" src="./assets/_demo/6.gif"> <img width="144" src="./assets/_demo/7.gif"> <img width="144" src="./assets/_demo/8.gif"> <img width="144" src="./assets/_demo/9.gif"> <img width="144" src="./assets/_demo/10.gif"> </p> <p> <img width="730" src="./assets/_demo/overview.png"> </p> </h4>

This repository contains the official implementation of the paper: DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation, which is accepted to ICLR 2025. DiffSplat is a generative framework that synthesizes 3D Gaussian Splats from text prompts & single-view images in 1~2 seconds. It is fine-tuned directly from a pretrained text-to-image diffusion model.
Feel free to contact me (chenguolin@stu.pku.edu.cn) or open an issue if you have any questions or suggestions.
🔥 See Also
You may also be interested in our other works:
- [CVPR 2026] Diff4Splat: a generative model for 4D dynamic scenes from a single-view image.
- [CVPR 2026] MoVieS: a feed-forward model for 4D dynamic reconstruction from monocular videos.
- [NeurIPS 2025] PartCrafter: a 3D-native DiT that can directly generate 3D objects in multiple parts.
📢 News
- 2025-03-06: Training instructions for DiffSplat and ControlNet are provided.
- 2025-02-11: Training instructions for GSRecon and GSVAE are provided.
- 2025-02-02: Inference instructions (text-conditioned & image-conditioned & controlnet) are provided.
- 2025-01-29: The source code and pretrained models are released. Happy 🐍 Chinese New Year 🎆!
- 2025-01-22: DiffSplat is accepted to ICLR 2025.
📋 TODO
- [x] Provide detailed instructions for inference.
- [x] Provide detailed instructions for GSRecon & GSVAE training.
- [x] Provide detailed instructions for DiffSplat training.
🔧 Installation
You may need to modify the specific version of torch in settings/setup.sh according to your CUDA version.
There are no restrictions on the torch version; feel free to use your preferred one.
```bash
git clone https://github.com/chenguolin/DiffSplat.git
cd DiffSplat
bash settings/setup.sh
```
📊 Dataset
- We use G-Objaverse with about 265K 3D objects and 10.6M rendered images (265K x 40 views, including RGB, normal and depth maps) for `GSRecon` and `GSVAE` training. Its subset with about 83K 3D objects, provided by LGM, is used for `DiffSplat` training. Text descriptions are provided by the latest version of Cap3D (i.e., refined by DiffuRank).
- We find that filtering is crucial for the generation quality of `DiffSplat`, and a larger dataset is beneficial for the performance of `GSRecon` and `GSVAE`.
- We store the dataset on an internal HDFS cluster in this project. Thus, the training code can NOT be run directly on your local machine. Please implement your own dataloading logic by referring to our provided dataset & dataloader code.
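Your own dataloading logic can start from a minimal per-object multi-view dataset. The sketch below is a hypothetical example only: the class name `MultiViewDataset` and the on-disk layout `<root>/<object_id>/<view:05d>_{rgb,normal,depth}.png` are assumptions for illustration, not this repo's actual format; follow the provided dataset & dataloader code for the real fields.

```python
import tempfile
from pathlib import Path

class MultiViewDataset:
    """Sketch of per-object multi-view loading.

    Assumed (hypothetical) layout:
        <root>/<object_id>/<view:05d>_rgb.png, _normal.png, _depth.png
    """
    def __init__(self, root, num_views=40):
        self.root = Path(root)
        self.num_views = num_views
        # One sub-directory per 3D object
        self.object_ids = sorted(p.name for p in self.root.iterdir() if p.is_dir())

    def __len__(self):
        return len(self.object_ids)

    def __getitem__(self, idx):
        obj_dir = self.root / self.object_ids[idx]
        views = range(self.num_views)
        # Return per-view file paths; real code would load and stack tensors here
        return {
            "rgb":    [obj_dir / f"{v:05d}_rgb.png" for v in views],
            "normal": [obj_dir / f"{v:05d}_normal.png" for v in views],
            "depth":  [obj_dir / f"{v:05d}_depth.png" for v in views],
        }

# Smoke test with an empty placeholder layout
root = Path(tempfile.mkdtemp())
(root / "obj_000").mkdir()
dataset = MultiViewDataset(root, num_views=4)
sample = dataset[0]
```

Wrapping this in a `torch.utils.data.Dataset` is then a matter of loading the listed images in `__getitem__`.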
🚀 Usage
📷 Camera Conventions
The camera and world coordinate systems in this project are both defined in the OpenGL convention, i.e., X: right, Y: up, Z: backward. The camera is located at (0, 0, 1.4) in the world coordinate system, and the camera looks at the origin (0, 0, 0).
Please refer to kiuikit camera doc for visualizations of the camera and world coordinate systems.
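Under these conventions, the default camera pose reduces to a standard look-at matrix. The NumPy sketch below (`look_at_opengl` is an illustrative helper, not a function in this repo) builds the camera-to-world matrix for a camera at (0, 0, 1.4) looking at the origin:

```python
import numpy as np

def look_at_opengl(eye, target, up=(0.0, 1.0, 0.0)):
    """Camera-to-world matrix in the OpenGL convention (X: right, Y: up, Z: backward)."""
    eye, target, up = (np.asarray(v, dtype=np.float64) for v in (eye, target, up))
    z = eye - target              # backward axis: camera looks along -Z
    z /= np.linalg.norm(z)
    x = np.cross(up, z)           # right axis
    x /= np.linalg.norm(x)
    y = np.cross(z, x)            # true up axis
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = x, y, z, eye
    return c2w

# Default camera in this project: at (0, 0, 1.4), looking at the origin
c2w = look_at_opengl(eye=(0.0, 0.0, 1.4), target=(0.0, 0.0, 0.0))
```

For this default pose the rotation is the identity and the translation is simply (0, 0, 1.4).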
🤗 Pretrained Models
All pretrained models are available at HuggingFace🤗.
| Model Name | Fine-tuned From | #Param. | Link | Note |
|-------------------------------|---------------------|-------------|----------|----------|
| GSRecon | From scratch | 42M | gsrecon_gobj265k_cnp_even4 | Feed-forward reconstruction of per-pixel 3DGS from 4-view (RGB, normal, coordinate) maps |
| GSVAE (SD) | SD1.5 VAE | 84M | gsvae_gobj265k_sd | |
| GSVAE (SDXL) | SDXL fp16 VAE | 84M | gsvae_gobj265k_sdxl_fp16 | fp16-fixed SDXL VAE is more robust |
| GSVAE (SD3) | SD3 VAE | 84M | gsvae_gobj265k_sd3 | |
| DiffSplat (SD1.5) | SD1.5 | 0.86B | Text-cond: gsdiff_gobj83k_sd15__render<br>Image-cond: gsdiff_gobj83k_sd15_image__render | Best efficiency |
| DiffSplat (PixArt-Sigma) | PixArt-Sigma | 0.61B | Text-cond: gsdiff_gobj83k_pas_fp16__render<br>Image-cond: gsdiff_gobj83k_pas_fp16_image__render | Best trade-off |
| DiffSplat (SD3.5m) | SD3.5 medium | 2.24B | Text-cond: gsdiff_gobj83k_sd35m__render<br>Image-cond: gsdiff_gobj83k_sd35m_image__render | Best performance |
| DiffSplat ControlNet (SD1.5) | From scratch | 361M | Depth: gsdiff_gobj83k_sd15__render__depth<br>Normal: gsdiff_gobj83k_sd15__render__normal<br>Canny: gsdiff_gobj83k_sd15__render__canny | |
| (Optional) ElevEst | dinov2_vitb14_reg | 86M | elevest_gobj265k_b_C25 | Single-view image elevation estimation |
⚡ Inference
0. Download Pretrained Models
Note that:
- Pretrained weights will be downloaded from HuggingFace and stored in `./out`.
- Other pretrained models (such as CLIP, T5, image VAE, etc.) will be downloaded automatically and stored in your HuggingFace cache directory.
- If you have trouble reaching the HuggingFace Hub, try setting the environment variable `export HF_ENDPOINT=https://hf-mirror.com`.
- `GSRecon` pretrained weights are NOT actually used during inference; only its rendering function is used for visualization.
```bash
python3 download_ckpt.py --model_type [MODEL_TYPE] [--image_cond]
# `MODEL_TYPE`: choose from "sd15", "pas", "sd35m", "depth", "normal", "canny", "elevest"
# `--image_cond`: add this flag for downloading image-conditioned models
```
For example, to download the text-conditioned SD1.5-based DiffSplat:
```bash
python3 download_ckpt.py --model_type sd15
```
To download the image-conditioned PixArt-Sigma-based DiffSplat:
```bash
python3 download_ckpt.py --model_type pas --image_cond
```
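To fetch several checkpoints in one go, the calls above can be scripted. A minimal sketch that only builds the command lines (`download_commands` is a hypothetical helper; run each command via `subprocess.run`):

```python
MODEL_TYPES = ["sd15", "pas", "sd35m"]  # DiffSplat backbones from the table above

def download_commands(model_types=MODEL_TYPES, image_cond=False):
    """Build `download_ckpt.py` invocations, one per model type."""
    cmds = []
    for mt in model_types:
        cmd = ["python3", "download_ckpt.py", "--model_type", mt]
        if image_cond:
            cmd.append("--image_cond")  # fetch the image-conditioned variant
        cmds.append(cmd)
    return cmds

cmds = download_commands(image_cond=True)
```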
1. Text-conditioned 3D Object Generation
Note that:
- Model differences may not be significant for simple text prompts. We recommend `DiffSplat (SD1.5)` for better efficiency, `DiffSplat (SD3.5m)` for better performance, and `DiffSplat (PixArt-Sigma)` for a better trade-off.
- By default, `export HF_HOME=~/.cache/huggingface` and `export TORCH_HOME=~/.cache/torch`. You can change these paths in `scripts/infer.sh`. SD3-related models require a HuggingFace token for downloading, which is expected to be stored in `HF_HOME`.
- Outputs will be stored in `./out/<MODEL_NAME>/inference`.
- The prompt is specified by `--prompt` (e.g.
