<img width="2533" height="1688" alt="image" src="https://github.com/user-attachments/assets/24b1596b-2cd7-4c82-b056-889df7c0c231" /> Images generated by SR-DiT-B/1 (140M parameter diffusion model) with 400k training steps

SpeedrunDiT (SR-DiT): Speedrunning ImageNet Diffusion

This repository contains the reference implementation for SR-DiT (Speedrun Diffusion Transformer), a framework that combines representation alignment (REG-style), token routing (SPRINT), architectural improvements, and training modifications on top of a SiT-B/1 backbone with the INVAE tokenizer.

Links:

Code: https://github.com/SwayStar123/SpeedrunDiT
Checkpoints: https://huggingface.co/SwayStar123/SpeedrunDiT/tree/main
W&B runs: https://wandb.ai/kagaku-ai/REG/
Ablations (branches): https://github.com/SwayStar123/REG

Highlights

ImageNet-256 (400K iters, no CFG): FID 3.49, KDD 0.319, 140M params, sampling at NFE=250
ImageNet-512 (400K iters, no CFG): FID 4.23, KDD 0.306, sampling at NFE=250

SR-DiT builds on top of a strong baseline (REG + INVAE) and then progressively adds:

Semantic latent space via E2E-INVAE
SPRINT token routing
RMSNorm, RoPE, QK normalization, value residual learning
Contrastive Flow Matching (CFM)
Time shifting and balanced label sampling (for evaluation)

Repository layout

train.py: training loop (Accelerate)
generate.py: multi-GPU sampling to .png and .npz
evaluations/evaluator.py: computes FID/sFID/IS/Precision/Recall from .npz
preprocessing/dataset_tools.py: ImageNet preprocessing + INVAE encoding
train.sh, eval.sh: example scripts used for our runs

Setup

Create an environment (python 3.11) and install dependencies:

pip install -r requirements.txt

Dataset

Training expects a directory (passed via --data-dir) containing:

dataset/
  images/            # preprocessed ImageNet images (256x256 or 512x512)
  vae-in/            # INVAE latents (.npy) + dataset.json labels

Follow the preprocessing guide in preprocessing/README.md. The minimal flow is:

# 1) Convert raw ImageNet to resized/cropped PNG dataset
python preprocessing/dataset_tools.py convert --source /path/to/imagenet/train \
  --dest dataset/images --resolution=256x256 --transform=center-crop-dhariwal

# 2) Encode images to INVAE latents
python preprocessing/dataset_tools.py encode --source dataset/images \
  --dest dataset/vae-in

Preprocessed dataset is also uploaded here:

https://huggingface.co/datasets/SwayStar123/repa-imagenet-256/blob/main/dataset.zip
https://huggingface.co/datasets/SwayStar123/repa-imagenet-256/blob/main/vae-in.zip

You must first unzip the dataset.zip file, and then unzip the vae-in.zip inside the newly created dataset folder

Training

An example command is provided in train.sh:

bash train.sh

Key arguments:

--model: use SiT-B/1 for the SR-DiT-B/1 configuration
--data-dir: directory containing images/ and vae-in/
--qk-norm: enables QK normalization
--cfm-coeff, --cfm-weighting: CFM settings
--time-shifting, --shift-base: time shifting for training

Checkpoints are written to:

exps/<exp-name>/checkpoints/<step>.pt

Sampling and evaluation

eval.sh runs sampling (generate.py) and then computes metrics (evaluations/evaluator.py).

bash eval.sh

Notes:

generate.py currently supports --mode sde (the ode branch is not implemented).
For metric computation, download the matching reference batch listed in evaluations/README.md.
Balanced label sampling can be enabled via --balanced-sampling when generating samples.

Citation

If you use this repository, please cite SR-DiT:

@misc{bhanded2025speedrundit,
  title         = {Speedrunning ImageNet Diffusion},
  author        = {Bhanded, Swayam},
  year          = {2025},
  eprint        = {2512.12386},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2512.12386},
}

Contact

Please open a GitHub issue for any questions or issues.

Acknowledgements

This codebase builds upon:

REG / REPA
SiT
DINOv2
ADM evaluations
NVLabs edm2 preprocessing utilities

We gratefully acknowledge support from WayfarerLabs (Open World Labs) for sponsoring compute resources used in this work.

SpeedrunDiT

Install / Use

README