D2O
Official implementation for Diffusion Models Are Innate One-Step Generators
📢 News: Our revised paper has been accepted to ICML 2025! This repository is now linked to the latest version of the paper under the new name D2O. The underlying codebase has not changed, so feel free to explore and use it!
D2O - Official PyTorch implementation
Revisiting Diffusion Models: From Generative Pre-training to One-Step Generation
Original version on arXiv: [arXiv:2405.20750]
Accepted version (ICML 2025) on arXiv: [arXiv:2506.09376]
Bowen Zheng, Tianming Yang
Diffusion distillation is a widely used technique to reduce the sampling cost of diffusion models, yet it often requires extensive training, and the student performance tends to be degraded. Recent studies show that incorporating a GAN objective may alleviate these issues, yet the underlying mechanism remains unclear. In this work, we first identify a key limitation of distillation: mismatched step sizes and parameter numbers between the teacher and the student model lead them to converge to different local minima, rendering direct imitation suboptimal. We further demonstrate that a standalone GAN objective, without relying on a distillation loss, overcomes this limitation and is sufficient to convert diffusion models into efficient one-step generators. Based on this finding, we propose that diffusion training may be viewed as a form of generative pre-training, equipping models with capabilities that can be unlocked through lightweight GAN fine-tuning. Supporting this view, we create a one-step generation model by fine-tuning a pre-trained model with 85% of parameters frozen, achieving strong performance with only 0.2M images and near-SOTA results with 5M images. We further present a frequency-domain analysis that may explain the one-step generative capability gained in diffusion training. Overall, our work provides a new perspective for diffusion training, highlighting its role as a powerful generative pre-training process, which can be the basis for building efficient one-step generation models.
Related Repositories
The references for computing FID are from EDM, and a large portion of the code in this repo is based on EDM and StyleGAN2-ADA.
Prepare Environments
conda create -n D2O python=3.9
Use conda instead of pip to install TensorFlow; otherwise, the GPU driver will not be found.
conda install tensorflow-gpu
Manually install torch to avoid conflicts.
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
For training and validation
pip install -r requirements.txt
If you are only interested in validation or inference:
pip install -r requirements_validation.txt
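Before training, it may help to confirm that the pinned packages were actually installed. The following is a minimal sketch and not part of the repository: `check_pins` is a hypothetical helper that compares installed versions against the pins from the `pip install torch==...` command above, using only the standard library.

```python
from importlib import metadata

# Pins copied from the pip command above; adjust if you install other builds.
EXPECTED = {
    "torch": "1.12.1+cu116",
    "torchvision": "0.13.1+cu116",
    "torchaudio": "0.12.1",
}


def check_pins(expected, get_version=metadata.version):
    """Return a list of human-readable mismatches (empty if all pins match)."""
    problems = []
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (wanted {wanted})")
            continue
        # A pin without a local tag (no "+cu116") matches any build of that release.
        matches = found == wanted or (
            "+" not in wanted and found.split("+")[0] == wanted
        )
        if not matches:
            problems.append(f"{name}: found {found}, wanted {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_pins(EXPECTED) or ["all pinned packages match"]:
        print(line)
```

Run it inside the activated `D2O` conda environment; an empty problem list only confirms the versions, not that the GPU driver is visible.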
Training
NOTE: To release the training code as quickly as possible, we’ve modified and removed a large portion of irrelevant or experimental code for clarity. We will continuously check the results but there might be some potential issues. If you have any problems, please open an issue, and we will reply ASAP.
Download Diffusion Checkpoints
Follow EDM's guidance and put the checkpoints into D2O/pretrained, e.g., D2O/pretrained/edm-afhqv2-64x64-uncond-ve.pkl.
Prepare the datasets
Follow EDM's guidance and put the datasets into D2O/datasets, e.g., D2O/datasets/cifar10-32x32.zip.
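A missing checkpoint or dataset archive may only surface after a training run has been launched, so a quick layout check can save time. The sketch below is not part of the repository; `missing_files` is a hypothetical helper, and the two paths are just the examples named above. Edit the list to match the checkpoints and datasets you actually downloaded.

```python
from pathlib import Path

# Example paths taken from the instructions above, relative to the D2O root.
REQUIRED = [
    "pretrained/edm-afhqv2-64x64-uncond-ve.pkl",
    "datasets/cifar10-32x32.zip",
]


def missing_files(root, relative_paths):
    """Return the relative paths that do not exist as files under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_files(".", REQUIRED)  # run from inside the D2O directory
    print("missing: " + ", ".join(missing) if missing else "all files found")
```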
Running
All the training commands are available in train.sh. You may modify the parameters according to your needs.
sh train.sh
Validation
Download Checkpoints
Download the checkpoints from Google Drive and put them into D2O/all_ckpt, e.g., D2O/all_ckpt/cifar_uncond_D2O_i.pkl.
FID
sh validation_fid.sh
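For reference, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to feature statistics of real and generated images. The standard formula is reproduced below, where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the feature mean and covariance of the reference and generated sets. The feature extractor and sample count used by `validation_fid.sh` are not stated here, so check the script for those details.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```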
Inception Score & Precision/Recall
It is strongly recommended NOT to run this on an A100, where it is extremely slow for unknown reasons.
NOTE: There is a trade-off between FID and the other metrics during training; the released checkpoints are the ones with the lowest FID.
For computing precision/recall on ImageNet 64x64, download the reference batches from guided-diffusion and put them into results/imagenet.
Then run:
cd evaluations
sh validate_is_pr
CLIP-FID Metrics
Potential data leakage in FID when using a discriminator pre-trained on ImageNet has been a concern (The Role of ImageNet Classes in Fréchet Inception Distance). We provide CLIP-FID in the table below. Our method consistently shows superior or competitive performance with significantly less training data. The result indicates that the superior performance is not due to information leakage but to the pre-trained ability gained in diffusion training.
| Dataset | Model | CLIP-FID | FID | Training Images | |
|-|-|-|-|-|-|
| CIFAR10 | EDM | 0.53 | 1.98 | | |
| | CD | 1.26 | 4.10 | ~100M | 1 |
| | SiD | 0.65 | 1.92 | ~400M | 1 |
| | D2O-F | 0.66 | 1.56 | ~5M | |
| FFHQ | EDM | 1.18 | 2.39 | | |
| | SiD | 0.80 | 1.55 | ~500M | 1 |
| | D2O-F | 0.81 | 0.83 | ~9M | |
| AFHQv2 | EDM | 0.40 | 1.96 | | |
| | SiD | 0.32 | 1.62 | ~300M | 1 |
| | D2O-F | 0.18 | 1.24 | ~7M | |
| ImageNet | EDM | 0.82 | 2.64 | | |
| | CD | 2.93 | 6.87 | ~1000M | 1 |
| | SiD | 0.75 | 1.52 | ~930M | 1 |
| | D2O-F | 0.51 | 1.13 | ~6M | |
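To make the "significantly less training data" claim concrete, the ratios can be computed directly from the training-image column. This is a small illustrative calculation that uses only the approximate SiD and D2O-F figures from the table (in millions of images); the variable and function names are chosen here for illustration.

```python
# Approximate training images (millions) for SiD and D2O-F, copied from the table above.
TRAINING_IMAGES_M = {
    "CIFAR10": {"SiD": 400, "D2O-F": 5},
    "FFHQ": {"SiD": 500, "D2O-F": 9},
    "AFHQv2": {"SiD": 300, "D2O-F": 7},
    "ImageNet": {"SiD": 930, "D2O-F": 6},
}


def data_ratio(counts, baseline="SiD", ours="D2O-F"):
    """How many times more training images the baseline used."""
    return counts[baseline] / counts[ours]


for dataset, counts in TRAINING_IMAGES_M.items():
    print(f"{dataset}: SiD used about {data_ratio(counts):.0f}x the images of D2O-F")
```

Under these approximate figures, SiD used roughly 43 to 155 times as many training images as D2O-F, depending on the dataset.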
Citation
If you find our work useful, please consider citing it:
@misc{zheng2025revisitingdiffusionmodelsgenerative,
      title={Revisiting Diffusion Models: From Generative Pre-training to One-Step Generation},
      author={Bowen Zheng and Tianming Yang},
      year={2025},
      eprint={2506.09376},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.09376},
}
