SmoothCache
Implementation of SmoothCache, a project aimed at speeding up Diffusion Transformer (DiT) based GenAI models with error-guided caching.
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
<div align="center" style="line-height: 1;"> <a href="https://arxiv.org/pdf/2411.10510" target="_blank"><img src="https://img.shields.io/badge/ArXiv-Paper-b5212f.svg?logo=arxiv" height="22px"></a> <a href="https://github.com/Roblox/SmoothCache/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-green" alt="License"></a> </div>
Figure 1. Accelerating Diffusion Transformer inference across multiple modalities with 50 DDIM Steps on DiT-XL-256x256, 100 DPM-Solver++(3M) SDE steps for a 10s audio sample (spectrogram shown) on Stable Audio Open, 30 Rectified Flow steps on Open-Sora 480p 2s videos
Updates
Release v0.1
SmoothCache now supports generating cache schedules using a zero-intrusion external helper. See run_calibration.py to find out how it generates a schedule compatible with the HuggingFace Diffusers DiTPipeline, without requiring any changes to the Diffusers implementation!
Introduction
We introduce SmoothCache, a straightforward acceleration technique for DiT architecture models that is training-free, flexible, and performant. By leveraging layer-wise representation error, our method identifies redundancies in the diffusion process and generates a static caching scheme that reuses output feature maps, reducing the need for computationally expensive operations. This solution works across different models and modalities, can be easily dropped into existing Diffusion Transformer pipelines, can be stacked on different solvers, and requires no additional training or datasets. SmoothCache consistently outperforms various solvers designed to accelerate the diffusion process, while matching or surpassing the performance of existing modality-specific caching techniques.
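As a rough illustration of the idea described above (not the project's actual calibration code), the sketch below turns the recorded outputs of a single layer into a static compute/reuse schedule. The names `relative_l1` and `build_schedule`, the `max_skip` limit, the 1 = compute / 0 = reuse encoding, and the exact error formula are all assumptions made for this example.

```python
from typing import List, Sequence


def relative_l1(current: Sequence[float], cached: Sequence[float]) -> float:
    """Relative L1 distance between a fresh layer output and a cached one."""
    num = sum(abs(c - k) for c, k in zip(current, cached))
    den = sum(abs(c) for c in current)
    # Treat an all-zero output as "cannot be compared", which forces a recompute.
    return num / den if den > 0 else float("inf")


def build_schedule(
    outputs: Sequence[Sequence[float]],  # one flattened layer output per timestep
    threshold: float,                    # hypothetical error threshold
    max_skip: int,                       # hypothetical cap on consecutive reuse steps
) -> List[int]:
    """Return one entry per timestep: 1 = compute the layer, 0 = reuse the cache."""
    if not outputs:
        return []
    schedule = [1]  # the first step always has to be computed
    last_computed = 0
    for t in range(1, len(outputs)):
        gap = t - last_computed
        error = relative_l1(outputs[t], outputs[last_computed])
        if gap <= max_skip and error < threshold:
            schedule.append(0)  # cached output is still close enough
        else:
            schedule.append(1)  # error too large or cache too old: recompute
            last_computed = t
    return schedule


# Toy calibration data: the layer output drifts slowly, then jumps at step 3.
toy_outputs = [[1.0, 1.0], [1.1, 1.0], [1.2, 1.0], [2.0, 2.0], [2.1, 2.0]]
toy_schedule = build_schedule(toy_outputs, threshold=0.35, max_skip=3)
```

Because the schedule is computed once from calibration data, no error measurement is needed at inference time; the sampler only looks up whether each step computes or reuses.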

Quick Start
Install
pip install dit-smoothcache
Usage - Inference
Inspired by DeepCache, we have implemented drop-in SmoothCache helper classes that apply easily to the Hugging Face Diffusers DiTPipeline and to the original DiT implementation.
Generally, only 3 additional lines need to be added to the original sampler scripts:
from SmoothCache import <DESIREDCacheHelper>  # e.g. DiffuserCacheHelper or DiTCacheHelper
cache_helper = <DESIREDCacheHelper>(<MODEL_HANDLER>, schedule=schedule)
cache_helper.enable()
# Original sampler code.
cache_helper.disable()
Usage example with the Hugging Face Diffusers DiTPipeline:
import json
import torch
from diffusers import DiTPipeline, DPMSolverMultistepScheduler
# Import SmoothCacheHelper
from SmoothCache import DiffuserCacheHelper
# Load the DiT pipeline and scheduler
pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
# Initialize the DiffuserCacheHelper with the model
with open("smoothcache_schedules/50-N-3-threshold-0.35.json", "r") as f:
    schedule = json.load(f)
cache_helper = DiffuserCacheHelper(pipe.transformer, schedule=schedule)
# Enable the caching helper
cache_helper.enable()
# Prepare the input
words = ["Labrador retriever"]
class_ids = pipe.get_label_ids(words)
# Generate images with the pipeline
generator = torch.manual_seed(33)
image = pipe(class_labels=class_ids, num_inference_steps=50, generator=generator).images[0]
# Restore the original forward method and disable the helper
# disable() should be paired up with enable()
cache_helper.disable()
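Since disable() should always be paired with enable(), one option is to wrap the pair in a context manager so the original forward method is restored even if sampling raises. This is a minimal sketch that relies only on the enable() and disable() methods shown above; `smoothcache_enabled` is a hypothetical name and not part of the SmoothCache package.

```python
from contextlib import contextmanager


@contextmanager
def smoothcache_enabled(cache_helper):
    """Enable a cache helper for the duration of a `with` block.

    `cache_helper` is any object exposing enable() and disable(),
    such as the helpers in the examples above.
    """
    cache_helper.enable()
    try:
        yield cache_helper
    finally:
        # Runs on normal exit and on exceptions, so the pairing is never broken.
        cache_helper.disable()


# Usage with the pipeline from the example above:
# with smoothcache_enabled(cache_helper):
#     image = pipe(class_labels=class_ids, num_inference_steps=50,
#                  generator=generator).images[0]
```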
Usage example with the original DiT implementation:
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
from torchvision.utils import save_image
from diffusion import create_diffusion
from diffusers.models import AutoencoderKL
from download import find_model
from models import DiT_models
import argparse
from SmoothCache import DiTCacheHelper # Import DiTCacheHelper
import json
# Setup PyTorch (`args` comes from the argparse setup in the original DiT sample.py, omitted here):
torch.manual_seed(args.seed)
torch.set_grad_enabled(False)
device = "cuda" if torch.cuda.is_available() else "cpu"
if args.ckpt is None:
    assert (
        args.model == "DiT-XL/2"
    ), "Only DiT-XL/2 models are available for auto-download."
    assert args.image_size in [256, 512]
    assert args.num_classes == 1000
# Load model:
latent_size = args.image_size // 8
model = DiT_models[args.model](
    input_size=latent_size, num_classes=args.num_classes
).to(device)
ckpt_path = args.ckpt or f"DiT-XL-2-{args.image_size}x{args.image_size}.pt"
state_dict = find_model(ckpt_path)
model.load_state_dict(state_dict)
model.eval() # important!
with open("smoothcache_schedules/50-N-3-threshold-0.35.json", "r") as f:
    schedule = json.load(f)
cache_helper = DiTCacheHelper(model, schedule=schedule)
# number of timesteps should be consistent with provided schedules
diffusion = create_diffusion(str(len(schedule[cache_helper.components_to_wrap[0]])))
# Enable the caching helper
cache_helper.enable()
# Sample images (`z`, `model_kwargs`, and `vae` are set up as in the original DiT sample.py, omitted here):
samples = diffusion.p_sample_loop(
    model.forward_with_cfg,
    z.shape,
    z,
    clip_denoised=False,
    model_kwargs=model_kwargs,
    progress=True,
    device=device,
)
samples, _ = samples.chunk(2, dim=0) # Remove null class samples
samples = vae.decode(samples / 0.18215).sample
# Disable the caching helper after sampling
cache_helper.disable()
# Save and display images:
save_image(samples, "sample.png", nrow=4, normalize=True, value_range=(-1, 1))
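The example above reads the number of timesteps from the schedule itself, because the sampler and the schedule must agree on the step count. The sketch below makes that check explicit before sampling. It assumes the schedule JSON is a dictionary mapping each wrapped component name to a per-step list, as the indexing `schedule[cache_helper.components_to_wrap[0]]` suggests; `schedule_num_steps` is a hypothetical helper, not part of SmoothCache.

```python
from typing import Dict, Sequence


def schedule_num_steps(schedule: Dict[str, Sequence[int]]) -> int:
    """Return the step count shared by all components, or raise if they differ."""
    lengths = {name: len(steps) for name, steps in schedule.items()}
    if not lengths:
        raise ValueError("schedule is empty")
    unique = set(lengths.values())
    if len(unique) != 1:
        raise ValueError(f"components disagree on step count: {lengths}")
    return unique.pop()


# Usage with a schedule loaded via json.load(f), as in the examples above:
# num_steps = schedule_num_steps(schedule)
# diffusion = create_diffusion(str(num_steps))
```

Checking this up front gives a clear error message instead of a failure partway through sampling.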
Usage - Cache Schedule Generation
See run_calibration.py, which generates a schedule for the self-attention module (attn1) of the Diffusers BasicTransformerBlock.
Note that only self-attention, and not cross-attention, is enabled in the stock config of Diffusers DiT module. We leave this behavior as-is for the purpose of minimal intrusion.
We welcome all contributions aimed at expanding SmoothCache's model coverage and module coverage.
Visualization
256x256 Image Generation Task

Evaluation
Image Generation with DiT-XL/2-256x256

Video Generation with OpenSora

Audio Generation with Stable Audio Open

License
SmoothCache is licensed under the Apache-2.0 license.
Bibtex
@InProceedings{Liu_2025_CVPR,
author = {Liu, Joseph and Geddes, Joshua and Guo, Ziyu and Jiang, Haomiao and Nandwana, Mahesh Kumar},
title = {SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
month = {June},
year = {2025},
pages = {3229-3238}
}