Diagdistill

Implementation of “Streaming Autoregressive Video Generation via Diagonal Distillation” (ICLR 2026)

<p align="center"> <img src="assets/logo.jpg" alt="Diagonal Distillation logo" width="380"/> </p> <p align="center"> <h1 align="center">STREAMING AUTOREGRESSIVE VIDEO GENERATION VIA DIAGONAL DISTILLATION</h1> </p> <p align="center"> <p align="center"> <a href="https://brandon-liu-jx.github.io/">Jinxiu Liu</a><sup>1</sup> · <a href="">Xuanming Liu</a><sup>2</sup> · <a href="https://kfmei.com/">Kangfu Mei</a><sup>3</sup> · <a href="https://ydwen.github.io/">Yandong Wen</a><sup>2</sup> · <a href="https://faculty.ucmerced.edu/mhyang/">Ming-Hsuan Yang</a><sup>4</sup> · <a href="https://wyliu.com/">Weiyang Liu</a><sup>5</sup> <br/> <sub><sup>1</sup>South China University of Technology</sub> <sub><sup>2</sup>Westlake University</sub> <sub><sup>3</sup>Johns Hopkins University</sub> <sub><sup>4</sup>University of California, Merced</sub> <sub><sup>5</sup>The Chinese University of Hong Kong</sub> </p> <h3 align="center"><a href="https://arxiv.org/abs/2603.09488">Paper</a> | <a href="https://spherelab.ai/diagdistill">Website</a></h3> </p>

We propose Diagonal Distillation, a new method for making high-quality video generation much faster. Current methods are either too slow for streaming or produce videos with degraded motion and errors that accumulate over time.


https://github.com/user-attachments/assets/97536e89-b784-45ec-980c-e1318cfda185

✨ Highlights

1️⃣ Diagonal Distillation achieves comparable quality to the full-step model while significantly reducing latency. The method yields a 1.88× speedup on 5-second short video generation on a single H100 GPU.

<p align="center"> <img src="assets/speed_cropped (8)_page-0001.jpg" style="border-radius: 15px"> </p>

2️⃣ Diagonal Denoising with Diagonal Forcing and Progressive Step Reduction. We illustrate our method with five denoising steps for the first chunk, gradually reduced to two steps by Chunk 7. For chunks with k ≥ 4, we use a fixed two-step denoising process, reusing the Key-Value (KV) cache from the final noisy frame of the preceding chunk. This design preserves temporal coherence while minimizing latency; the corresponding pseudo-code is provided in the appendix.

<p align="center"> <img src="assets/dia_cropped (7)_page-0001.jpg" width=800 style="border-radius: 15px"> </p>
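The progressive schedule above can be sketched as follows. All names (`steps_for_chunk`, `sample_noise`, `denoise_step`) are hypothetical and not the repo's actual API; the exact per-chunk step counts are an assumption consistent with the description (five steps for the first chunk, a fixed two steps from chunk 4 onward):

```python
# A minimal sketch (hypothetical names, not the repo's actual API) of
# progressive step reduction: five denoising steps for chunk 1, shrinking
# to a fixed two steps for chunks k >= 4, with the KV cache from the
# preceding chunk's final noisy frame carried forward.

def steps_for_chunk(k: int) -> int:
    """Denoising steps for chunk k (1-indexed): 5, 4, 3, then 2 forever."""
    return max(6 - k, 2)

def generate_stream(model, num_chunks: int):
    """Chunk-by-chunk streaming generation with a carried KV cache."""
    kv_cache = None  # reused across chunk boundaries for temporal coherence
    chunks = []
    for k in range(1, num_chunks + 1):
        x = model.sample_noise()  # fresh noise for this chunk
        for step in range(steps_for_chunk(k)):
            # denoise_step is assumed to return the refined latent and an
            # updated KV cache; later chunks attend to the cache left by
            # the previous chunk's final noisy frame.
            x, kv_cache = model.denoise_step(x, step, kv_cache)
        chunks.append(x)
    return chunks
```

Because only the first three chunks pay for extra steps, the amortized cost per chunk approaches two denoising passes as the stream grows, which is where the reported latency reduction comes from.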

3️⃣ Comparative visualization of temporal training strategies for autoregressive video generation using Causal DiT. Four panels illustrate: (a) Teacher Forcing (green boxes for ground-truth frames), (b) Diffusion Forcing (red boxes for noisy latents), (c) Self Forcing (red boxes for the model’s own predictions), and (d) Diagonal Forcing (Ours) (mixed green/red boxes in diagonal patterns). Each row represents sequential frame generation, with arrows indicating causal dependencies. The diagonal pattern in (d) highlights the core innovation: blending clean past frames with recent model-generated ones to align the training and inference distributions. This comparison underscores how Diagonal Forcing bridges the gaps in robustness and coherence seen in baseline methods.

<p align="center"> <img src="assets/new_cropped (1)_page-0001.jpg" style="border-radius: 15px"> </p>
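As a rough sketch of how such a diagonal training context could be assembled — the function name, the `n_recent` window size, and the list-based framing are all illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch of the panel (d) idea: the conditioning window for frame t
# uses clean ground-truth frames for the distant past (green boxes) and the
# model's own predictions for the most recent frames (red boxes), so the
# training-time context resembles what the model sees at inference.

def diagonal_context(gt_frames, pred_frames, t, n_recent=2):
    """Context for generating frame t: clean frames up to t - n_recent,
    followed by the model's own predictions for the last n_recent frames."""
    n_recent = min(n_recent, t)                  # early frames have short history
    clean_part = gt_frames[: t - n_recent]       # green: ground truth
    self_part = pred_frames[t - n_recent : t]    # red: own predictions
    return clean_part + self_part
```

Under this framing, Teacher Forcing corresponds to `n_recent=0` (all clean) and Self Forcing to `n_recent=t` (all self-generated); Diagonal Forcing sits in between.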

Installation

Create a conda environment and install dependencies:

conda create -n dia python=3.10 -y
conda activate dia
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Quick Start

Download checkpoints

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir wan_models/Wan2.1-T2V-1.3B
huggingface-cli download Efficient-Large-Model/LongLive-1.3B --local-dir ./longlive_models

For Stage-1 initialization (init_ckpt / generator_ckpt), you can use either:

# Option A: Self-Forcing ODE init
huggingface-cli download gdhe17/Self-Forcing checkpoints/ode_init.pt --local-dir .

# Option B: Causal-Forcing init ckpt
huggingface-cli download zhuhz22/Causal-Forcing chunkwise/causal_ode.pt --local-dir checkpoints

Then set generator_ckpt in configs/exp_stage1_all4_odeinit.yaml to one of:

  • checkpoints/ode_init.pt
  • checkpoints/chunkwise/causal_ode.pt
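For example, the relevant entry would look like the excerpt below (only `generator_ckpt` is named in this README; treat the layout around it as an assumption about the config file):

```yaml
# configs/exp_stage1_all4_odeinit.yaml (excerpt)
# Point generator_ckpt at whichever init checkpoint you downloaded:
generator_ckpt: checkpoints/ode_init.pt
# generator_ckpt: checkpoints/chunkwise/causal_ode.pt
```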
<!-- Note: * **Our model works better with long, detailed prompts** since it's trained with such prompts. We will integrate prompt extension into the codebase (similar to [Wan2.1](https://github.com/Wan-Video/Wan2.1/tree/main?tab=readme-ov-file#2-using-prompt-extention)) in the future. For now, it is recommended to use third-party LLMs (such as GPT-4o) to extend your prompt before providing to the model. * You may want to adjust FPS so it plays smoothly on your device. * The speed can be improved by enabling `torch.compile`, [TAEHV-VAE](https://github.com/madebyollin/taehv/), or using FP8 Linear layers, although the latter two options may sacrifice quality. It is recommended to use `torch.compile` if possible and enable TAEHV-VAE if further speedup is needed. -->

Training

Diagonal Distillation Training

bash train_two_stage_ode_then_diag.sh

Training in the current codebase is a two-stage pipeline:

  • Stage 1 (exp_stage1_all4_odeinit): Initialize from either checkpoints/ode_init.pt (Self-Forcing ODE init) or checkpoints/chunkwise/causal_ode.pt (Causal-Forcing init), then run base distillation training to obtain a stable stage-1 checkpoint.
  • Stage 2 (exp_stage2_diag_from_stage1): Resume from the Stage-1 checkpoint (default: checkpoint_model_001000/model.pt, i.e., Stage-1 1000-step checkpoint) and continue training with diagonal-denoising settings for better later-chunk temporal quality.

Stage 1 is optional. Stage 2 can directly load longlive_models/LongLive-1.3B/models/longlive_base.pt as the initialization checkpoint.

Inference

Use a checkpoint produced by training (for example Stage-2 or Stage-1 output) by setting generator_ckpt in configs/diadistill_inference.yaml, then run:

bash inference.sh

Acknowledgements

This codebase is built on top of the open-source implementation of LongLive by yukang2017 and the Wan2.1 repo.

Citation

If you find this codebase useful for your research, please cite our paper:

  @InProceedings{liu2026diagdistill,
      title={Streaming Autoregressive Video Generation via Diagonal Distillation},
      author={Liu, Jinxiu and Liu, Xuanming and Mei, Kangfu and Wen, Yandong and Yang, Ming-Hsuan and Liu, Weiyang},
      booktitle={ICLR},
      year={2026}
  }