# SFD

**[CVPR 2026] Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion**
## ✨ Highlights
- We propose Semantic-First Diffusion (SFD), a novel latent diffusion paradigm that performs asynchronous denoising on semantic and texture latents, allowing semantics to denoise earlier and subsequently guide texture generation.
- SFD achieves a state-of-the-art FID of 1.04 on ImageNet 256×256 generation.
- SFD converges approximately 100× and 33.3× faster in training than DiT and LightningDiT, respectively.
## 🚩 Overview
<p align="center"> <img src="images/teaser_v5.png" width="90%"> </p>

Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, in which high-level semantic structure emerges slightly earlier than fine-grained texture. This suggests that the earlier-formed semantics can benefit texture generation by providing a semantic anchor. However, existing methods denoise semantic and texture latents synchronously, overlooking this natural ordering.
We propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD constructs composite latents by combining compact semantic representations from a pretrained visual encoder (via a Semantic VAE) with texture latents, and performs asynchronous denoising with separate noise schedules, so that semantics denoise earlier and guide texture refinement. Denoising proceeds in three phases:

- **Stage I – Semantic initialization:** semantic latents denoise first;
- **Stage II – Asynchronous generation:** semantics and textures denoise jointly but asynchronously, with semantics ahead of textures;
- **Stage III – Texture completion:** only textures continue refining.

After denoising, only the texture latent is decoded into the final image.
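Conceptually, the three phases can be sketched in a few lines. This is a minimal illustration under our own assumptions (a shared normalized time axis, a fixed semantic lead Δt, progress clipped to [0, 1]), not the repository's implementation:

```python
# Minimal sketch (our illustration, not the official code): semantics lead
# textures by a fixed offset delta_t on a normalized time axis, and the
# global progress s runs over [0, 1 + delta_t].

def async_progress(s: float, delta_t: float = 0.3) -> tuple:
    """Return (semantic, texture) denoising progress at global step s."""
    t_sem = max(0.0, min(1.0, s))            # semantics: active on [0, 1]
    t_tex = max(0.0, min(1.0, s - delta_t))  # textures:  active on [delta_t, 1 + delta_t]
    return t_sem, t_tex

def stage(s: float, delta_t: float = 0.3) -> str:
    """Label the three phases described above."""
    if s < delta_t:
        return "I: semantic initialization"   # only semantics denoise
    if s < 1.0:
        return "II: asynchronous generation"  # both denoise, semantics ahead
    return "III: texture completion"          # only textures keep refining
```

With Δt = 0.3, semantics complete their first 30% alone (Stage I), textures then join until semantics finish (Stage II), and textures complete the remaining 30% on their own (Stage III).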
On ImageNet 256×256, SFD demonstrates both superior quality and remarkable convergence acceleration. SFD achieves state-of-the-art FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while exhibiting approximately 100× and 33.3× faster training convergence compared to DiT and LightningDiT, respectively. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling.
## 🗞️ News
- [2026.02.21] SFD is accepted by CVPR 2026 🎉🎉🎉
- [2025.12.21] Released training code for the Semantic VAE and the diffusion model (SFD).
- [2025.12.05] Released inference code and pre-trained model weights of SFD on ImageNet 256×256.
## 🛠️ To-Do List
- [x] Inference code and model weights
- [x] Training code of Semantic VAE and diffusion model (SFD)
## 🧾 Results
Explicitly leading semantics ahead of textures by a moderate offset (Δt = 0.3) strikes the best balance between early semantic stabilization and semantic–texture collaboration, effectively harmonizing their joint modeling.
<p align="center"> <img src="images/fid_vs_delta_t.png" width="55%"> </p>

- On ImageNet 256×256, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL).
- 100× and 33.3× faster training convergence compared to DiT and LightningDiT, respectively.
## 🎯 Inference with Pre-Trained Model Weights
### 1. Prepare Environments
```shell
conda create -n sfd python=3.10.12
conda activate sfd
pip install -r requirements.txt
pip install numpy==1.24.3 protobuf==3.20.0
pip install piqa

## guided-diffusion evaluation environment
git clone https://github.com/openai/guided-diffusion.git
pip install tensorflow==2.8.0
# Patch np.bool (removed in NumPy >= 1.24); otherwise you will hit:
# "AttributeError: module 'numpy' has no attribute 'bool'".
sed -i 's/dtype=np\.bool)/dtype=np.bool_)/g' guided-diffusion/evaluations/evaluator.py
```
### 2. Prepare Model Weights
```shell
# Prepare the decoder of SD-VAE
mkdir -p outputs/model_weights/va-vae-imagenet256-experimental-variants
wget https://huggingface.co/hustvl/va-vae-imagenet256-experimental-variants/resolve/main/ldm-imagenet256-f16d32-50ep.ckpt \
--no-check-certificate -O outputs/model_weights/va-vae-imagenet256-experimental-variants/ldm-imagenet256-f16d32-50ep.ckpt

# Prepare evaluation batches of ImageNet 256x256 from guided-diffusion
mkdir -p outputs/ADM_npz
wget https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz -O outputs/ADM_npz/VIRTUAL_imagenet256_labeled.npz

# Download files from huggingface
mkdir temp
mkdir -p outputs/dataset/imagenet1k-latents
mkdir -p outputs/train

# Prepare latent statistics
huggingface-cli download SFD-Project/SFD --include "imagenet1k-latents/*" --local-dir temp
mv temp/imagenet1k-latents/* outputs/dataset/imagenet1k-latents/

# Prepare the autoguidance model
huggingface-cli download SFD-Project/SFD --include "model_weights/sfd_autoguidance_b/*" --local-dir temp
mv temp/model_weights/sfd_autoguidance_b outputs/train/

# Prepare the XL model (675M)
huggingface-cli download SFD-Project/SFD --include "model_weights/sfd_xl/*" --local-dir temp
mv temp/model_weights/sfd_xl outputs/train/

# Prepare the XXL model (1.0B)
huggingface-cli download SFD-Project/SFD --include "model_weights/sfd_1p0/*" --local-dir temp
mv temp/model_weights/sfd_1p0 outputs/train/

rm -rf temp

# Alternatively, download the checkpoints directly from
# https://huggingface.co/SFD-Project/SFD and move the files under
# model_weights/ of SFD-Project/SFD into outputs/train/.
```
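As a quick sanity check after the downloads, the snippet below (our own convenience helper, not part of the repo) reports whether each expected path exists:

```shell
# Report "ok" or "MISSING" for each path produced by the steps above.
check_path() {
  for f in "$@"; do
    if [ -e "$f" ]; then echo "ok: $f"; else echo "MISSING: $f"; fi
  done
}

check_path \
  outputs/model_weights/va-vae-imagenet256-experimental-variants/ldm-imagenet256-f16d32-50ep.ckpt \
  outputs/ADM_npz/VIRTUAL_imagenet256_labeled.npz \
  outputs/dataset/imagenet1k-latents \
  outputs/train/sfd_autoguidance_b \
  outputs/train/sfd_xl \
  outputs/train/sfd_1p0
```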
### 3. Inference
#### Inference demo
```shell
PRECISION=bf16 bash run_fast_inference.sh $INFERENCE_CONFIG

# Take the XL model (675M) as an example:
CFG_SCALE="1.5" \
AUTOGUIDANCE_MODEL_SIZE="b" \
AUTOGUIDANCE_CKPT_ITER="70" \
PRECISION=bf16 bash run_fast_inference.sh configs/sfd/lightningdit_xl/inference_4m_autoguidance_demo.yaml
```
Images will be saved to `demo_images/demo_samples.png`, e.g. the following one:
<p align="center"> <img src="demo_images/demo_samples.png" width="90%"> </p>

#### Inference 50K samples
To run inference without AutoGuidance:
```shell
# w/o AutoGuidance
FID_NUM=50000 \
GPUS_PER_NODE=$GPU_NUM PRECISION=bf16 bash run_inference.sh \
$INFERENCE_CONFIG

# Take the XL model (675M) as an example:
FID_NUM=50000 \
GPUS_PER_NODE=8 PRECISION=bf16 bash run_inference.sh \
configs/sfd/lightningdit_xl/inference_4m.yaml
```
More inference configs can be found in `configs/sfd/lightningdit_xl` and `configs/sfd/lightningdit_1p0`, corresponding to the XL (675M) and XXL (1.0B) models, respectively.
To run inference with AutoGuidance:
```shell
# w/ AutoGuidance
CFG_SCALE="$GUIDANCE_SCALE" \
AUTOGUIDANCE_MODEL_SIZE="b" \
AUTOGUIDANCE_CKPT_ITER="$GUIDANCE_ITER" \
FID_NUM=50000 \
GPUS_PER_NODE=$GPU_NUM PRECISION=bf16 bash run_inference.sh \
$INFERENCE_CONFIG

# Take the XL model (675M) as an example:
CFG_SCALE="1.5" \
AUTOGUIDANCE_MODEL_SIZE="b" \
AUTOGUIDANCE_CKPT_ITER="70" \
FID_NUM=50000 \
GPUS_PER_NODE=8 PRECISION=bf16 bash run_inference.sh \
configs/sfd/lightningdit_xl/inference_4m_autoguidance.yaml
```
With AutoGuidance, the detailed parameters for each configuration are shown in the following table:
| Model | Epochs | Params | Degraded Model | Iterations | Guidance Scale |
|-------|--------|--------|----------------|------------|----------------|
| LightningDiT-XL | 80 | 675M | LightningDiT-B | 70K | 1.6 |
| LightningDiT-XL | 800 | 675M | LightningDiT-B | 70K | 1.5 |
| LightningDiT-XXL | 80 | 1.0B | LightningDiT-B | 60K | 1.5 |
| LightningDiT-XXL | 800 | 1.0B | LightningDiT-B | 120K | 1.5 |
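For scripting, the table can be folded into a small lookup helper that produces the environment variables `run_inference.sh` expects. The dictionary and helper name here are our own illustration; the values come from the table:

```python
# Illustrative lookup of AutoGuidance settings (names are ours, values
# are from the table; the degraded model is LightningDiT-B in all rows).
AUTOGUIDANCE_SETTINGS = {
    # (model, epochs): (guidance scale, degraded-model checkpoint iteration in K)
    ("xl", 80):   (1.6, 70),
    ("xl", 800):  (1.5, 70),
    ("xxl", 80):  (1.5, 60),
    ("xxl", 800): (1.5, 120),
}

def guidance_env(model: str, epochs: int) -> dict:
    """Build the environment variables used by the commands above."""
    scale, ckpt_iter = AUTOGUIDANCE_SETTINGS[(model, epochs)]
    return {
        "CFG_SCALE": str(scale),
        "AUTOGUIDANCE_MODEL_SIZE": "b",
        "AUTOGUIDANCE_CKPT_ITER": str(ckpt_iter),
        "FID_NUM": "50000",
    }
```

For example, `guidance_env("xl", 800)` reproduces the `CFG_SCALE="1.5"`, `AUTOGUIDANCE_CKPT_ITER="70"` settings shown in the XL command above.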
### 4. Evaluation
```shell
# Get final scores via guided-diffusion's evaluation tools
bash run_eval_via_guided_diffusion.sh $OUTPUT_IMAGES_DIR

# e.g.,
bash run_eval_via_guided_diffusion.sh outputs/train/sfd_xl/lightningdit-xl-1-ckpt-4000000-dopri5-250-balanced
```
Note that our models were trained and evaluated on 16 NPUs (consistent with the results reported in our paper). When testing on 8 A100 GPUs, we observed minor performance variations. The detailed results are presented below:
**Without AutoGuidance**

| Model | Epochs | #Params | FID (NPU) | FID (GPU) |
|-------|--------|---------|-----------|-----------|
| SFD-XL | 80 | 675M | 3.43 | 3.50 |
| SFD-XL | 800 | 675M | 2.54 | 2.66 |
| SFD-XXL | 80 | 1.0B | 2.84 | 2.92 |
| SFD-XXL | 800 | 1.0B | 2.38 | 2.36 |
**With AutoGuidance**

| Model | Epochs | #Params | FID (NPU) | FID (GPU) |
|-------|--------|---------|-----------|-----------|
| SFD-XL | 80 | 675M | 1.30 | 1.29 |
| SFD-XL | 800 |
