# DiffSplat

[ICLR 2025] Official implementation of "DiffSplat: Repurposing Image Diffusion Models for Scalable 3D Gaussian Splat Generation".
<h4 align="center">DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu
<p> <img width="144" src="./assets/_demo/1.gif"> <img width="144" src="./assets/_demo/2.gif"> <img width="144" src="./assets/_demo/3.gif"> <img width="144" src="./assets/_demo/4.gif"> <img width="144" src="./assets/_demo/5.gif"> </p> <p> <img width="144" src="./assets/_demo/6.gif"> <img width="144" src="./assets/_demo/7.gif"> <img width="144" src="./assets/_demo/8.gif"> <img width="144" src="./assets/_demo/9.gif"> <img width="144" src="./assets/_demo/10.gif"> </p> <p> <img width="730" src="./assets/_demo/overview.png"> </p> </h4>

This repository contains the official implementation of the paper: DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation, which is accepted to ICLR 2025. DiffSplat is a generative framework that synthesizes 3D Gaussian Splats from text prompts & single-view images in 1~2 seconds. It is fine-tuned directly from a pretrained text-to-image diffusion model.
Feel free to contact me (chenguolin@stu.pku.edu.cn) or open an issue if you have any questions or suggestions.
🔥 See Also
You may also be interested in our other works:
- [CVPR 2026] Diff4Splat: a generative model for 4D dynamic scenes from a single-view image.
- [CVPR 2026] MoVieS: a feed-forward model for 4D dynamic reconstruction from monocular videos.
- [NeurIPS 2025] PartCrafter: a 3D-native DiT that can directly generate 3D objects in multiple parts.
📢 News
- 2025-03-06: Training instructions for DiffSplat and ControlNet are provided.
- 2025-02-11: Training instructions for GSRecon and GSVAE are provided.
- 2025-02-02: Inference instructions (text-conditioned & image-conditioned & controlnet) are provided.
- 2025-01-29: The source code and pretrained models are released. Happy 🐍 Chinese New Year 🎆!
- 2025-01-22: DiffSplat is accepted to ICLR 2025.
📋 TODO
- [x] Provide detailed instructions for inference.
- [x] Provide detailed instructions for GSRecon & GSVAE training.
- [x] Provide detailed instructions for DiffSplat training.
🔧 Installation
You may need to modify the specific version of torch in settings/setup.sh according to your CUDA version.
There are no restrictions on the torch version; feel free to use your preferred one.
```bash
git clone https://github.com/chenguolin/DiffSplat.git
cd DiffSplat
bash settings/setup.sh
```
📊 Dataset
- We use G-Objaverse with about 265K 3D objects and 10.6M rendered images (265K x 40 views, including RGB, normal and depth maps) for `GSRecon` and `GSVAE` training. Its subset with about 83K 3D objects, provided by LGM, is used for `DiffSplat` training. Text descriptions are provided by the latest version of Cap3D (i.e., refined by DiffuRank).
- We find that filtering is crucial for the generation quality of `DiffSplat`, and a larger dataset is beneficial for the performance of `GSRecon` and `GSVAE`.
- We store the dataset on an internal HDFS cluster in this project. Thus, the training code can NOT be run directly on your local machine. Please implement your own dataloading logic by referring to our provided dataset & dataloader code.
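Your own dataloading logic can start from a minimal per-object multi-view dataset. The sketch below is a hypothetical example only: the class name `MultiViewDataset` and the on-disk layout `<root>/<object_id>/<view:05d>_{rgb,normal,depth}.png` are assumptions for illustration, not this repo's actual format; follow the provided dataset & dataloader code for the real fields.

```python
import tempfile
from pathlib import Path

class MultiViewDataset:
    """Sketch of per-object multi-view loading.

    Assumed (hypothetical) layout:
        <root>/<object_id>/<view:05d>_rgb.png, _normal.png, _depth.png
    """
    def __init__(self, root, num_views=40):
        self.root = Path(root)
        self.num_views = num_views
        # One sub-directory per 3D object
        self.object_ids = sorted(p.name for p in self.root.iterdir() if p.is_dir())

    def __len__(self):
        return len(self.object_ids)

    def __getitem__(self, idx):
        obj_dir = self.root / self.object_ids[idx]
        views = range(self.num_views)
        # Return per-view file paths; real code would load and stack tensors here
        return {
            "rgb":    [obj_dir / f"{v:05d}_rgb.png" for v in views],
            "normal": [obj_dir / f"{v:05d}_normal.png" for v in views],
            "depth":  [obj_dir / f"{v:05d}_depth.png" for v in views],
        }

# Smoke test with an empty placeholder layout
root = Path(tempfile.mkdtemp())
(root / "obj_000").mkdir()
dataset = MultiViewDataset(root, num_views=4)
sample = dataset[0]
```

Wrapping this in a `torch.utils.data.Dataset` is then a matter of loading the listed images in `__getitem__`.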
🚀 Usage
📷 Camera Conventions
The camera and world coordinate systems in this project are both defined in the OpenGL convention, i.e., X: right, Y: up, Z: backward. The camera is located at (0, 0, 1.4) in the world coordinate system, and the camera looks at the origin (0, 0, 0).
Please refer to kiuikit camera doc for visualizations of the camera and world coordinate systems.
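Under these conventions, the default camera pose reduces to a standard look-at matrix. The NumPy sketch below (`look_at_opengl` is an illustrative helper, not a function in this repo) builds the camera-to-world matrix for a camera at (0, 0, 1.4) looking at the origin:

```python
import numpy as np

def look_at_opengl(eye, target, up=(0.0, 1.0, 0.0)):
    """Camera-to-world matrix in the OpenGL convention (X: right, Y: up, Z: backward)."""
    eye, target, up = (np.asarray(v, dtype=np.float64) for v in (eye, target, up))
    z = eye - target              # backward axis: camera looks along -Z
    z /= np.linalg.norm(z)
    x = np.cross(up, z)           # right axis
    x /= np.linalg.norm(x)
    y = np.cross(z, x)            # true up axis
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = x, y, z, eye
    return c2w

# Default camera in this project: at (0, 0, 1.4), looking at the origin
c2w = look_at_opengl(eye=(0.0, 0.0, 1.4), target=(0.0, 0.0, 0.0))
```

For this default pose the rotation is the identity and the translation is simply (0, 0, 1.4).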
🤗 Pretrained Models
All pretrained models are available at HuggingFace🤗.
| Model Name | Fine-tuned From | #Param. | Link | Note |
|-------------------------------|---------------------|-------------|----------|----------|
| GSRecon | From scratch | 42M | gsrecon_gobj265k_cnp_even4 | Feed-forward reconstruction of per-pixel 3DGS from 4-view (RGB, normal, coordinate) maps |
| GSVAE (SD) | SD1.5 VAE | 84M | gsvae_gobj265k_sd | |
| GSVAE (SDXL) | SDXL fp16 VAE | 84M | gsvae_gobj265k_sdxl_fp16 | fp16-fixed SDXL VAE is more robust |
| GSVAE (SD3) | SD3 VAE | 84M | gsvae_gobj265k_sd3 | |
| DiffSplat (SD1.5) | SD1.5 | 0.86B | Text-cond: gsdiff_gobj83k_sd15__render<br>Image-cond: gsdiff_gobj83k_sd15_image__render | Best efficiency |
| DiffSplat (PixArt-Sigma) | PixArt-Sigma | 0.61B | Text-cond: gsdiff_gobj83k_pas_fp16__render<br>Image-cond: gsdiff_gobj83k_pas_fp16_image__render | Best trade-off |
| DiffSplat (SD3.5m) | SD3.5 medium | 2.24B | Text-cond: gsdiff_gobj83k_sd35m__render<br>Image-cond: gsdiff_gobj83k_sd35m_image__render | Best performance |
| DiffSplat ControlNet (SD1.5) | From scratch | 361M | Depth: gsdiff_gobj83k_sd15__render__depth<br>Normal: gsdiff_gobj83k_sd15__render__normal<br>Canny: gsdiff_gobj83k_sd15__render__canny | |
| (Optional) ElevEst | dinov2_vitb14_reg | 86M | elevest_gobj265k_b_C25 | Single-view image elevation estimation |
⚡ Inference
0. Download Pretrained Models
Note that:
- Pretrained weights will be downloaded from HuggingFace and stored in `./out`.
- Other pretrained models (such as CLIP, T5, image VAE, etc.) will be downloaded automatically and stored in your HuggingFace cache directory.
- If you have trouble reaching the HuggingFace Hub, try setting the environment variable `export HF_ENDPOINT=https://hf-mirror.com`.
- `GSRecon` pretrained weights are NOT actually used during inference; only its rendering function is used for visualization.
```bash
python3 download_ckpt.py --model_type [MODEL_TYPE] [--image_cond]
# `MODEL_TYPE`: choose from "sd15", "pas", "sd35m", "depth", "normal", "canny", "elevest"
# `--image_cond`: add this flag for downloading image-conditioned models
```
For example, to download the text-conditioned SD1.5-based DiffSplat:
```bash
python3 download_ckpt.py --model_type sd15
```
To download the image-conditioned PixArt-Sigma-based DiffSplat:
```bash
python3 download_ckpt.py --model_type pas --image_cond
```
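To fetch several checkpoints in one go, the calls above can be scripted. A minimal sketch that only builds the command lines (`download_commands` is a hypothetical helper; run each command via `subprocess.run`):

```python
MODEL_TYPES = ["sd15", "pas", "sd35m"]  # DiffSplat backbones from the table above

def download_commands(model_types=MODEL_TYPES, image_cond=False):
    """Build `download_ckpt.py` invocations, one per model type."""
    cmds = []
    for mt in model_types:
        cmd = ["python3", "download_ckpt.py", "--model_type", mt]
        if image_cond:
            cmd.append("--image_cond")  # fetch the image-conditioned variant
        cmds.append(cmd)
    return cmds

cmds = download_commands(image_cond=True)
```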
1. Text-conditioned 3D Object Generation
Note that:
- Model differences may not be significant for simple text prompts. We recommend `DiffSplat (SD1.5)` for better efficiency, `DiffSplat (SD3.5m)` for better performance, and `DiffSplat (PixArt-Sigma)` for a better trade-off.
- By default, `export HF_HOME=~/.cache/huggingface` and `export TORCH_HOME=~/.cache/torch`. You can change these paths in `scripts/infer.sh`. SD3-related models require a HuggingFace token for downloading, which is expected to be stored in `HF_HOME`.
- Outputs will be stored in `./out/<MODEL_NAME>/inference`.
- The prompt is specified by `--prompt` (e.g.
