
<div align="center"> <h1>LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding</h1> </div> <p align="center"> <a href="https://arxiv.org/abs/2512.16229">📄 Paper</a> • <a href="https://SJTU-DENG-Lab.github.io/blogs/lopa/">📝 Blog</a> • <a href="https://github.com/SJTU-DENG-Lab/Diffulex">🚀 Engine</a> • <a href="https://huggingface.co/SJTU-Deng-Lab/D2F_Dream_Instruct_7B_Lora/tree/main">🤗 D2F_Dream_Instruct_7B_Lora</a> • <a href="https://huggingface.co/SJTU-Deng-Lab/D2F_DiffuCoder_Instruct_7B_Lora/tree/main">🤗 D2F_DiffuCoder_Instruct_7B_Lora</a> </p> <hr>

https://github.com/user-attachments/assets/6fb2c8e9-23f9-4025-bda3-14ee7b839c9b

Lookahead Parallel Decoding (LoPA) is a training-free, plug-and-play algorithm designed to break the parallelism bottleneck in Diffusion Large Language Models (dLLMs). Building on the observation that parallelism is highly sensitive to the Token Filling Order (TFO), LoPA actively searches for TFOs that maximize future confidence.

Key features of LoPA include:

  • Massive Speedup: Increases the Tokens Per Forward pass (TPF) of D2F-Dream to 10.1 on GSM8K and of D2F-DiffuCoder to 8.3 on HumanEval+.
  • High Throughput: Achieves a single-sample throughput of 1073.9 tokens/s under multi-GPU deployment using a specialized Branch Parallel (BP) inference system.
  • Training-Free: Works out-of-the-box with existing confidence-driven dLLMs (like D2F and Dream) without requiring weight updates.
<p align="center"> <img src="docs/assets/img/figure1.png" width="100%" alt="Throughput performance">

<small style="color: gray;">Figure 1. Throughput performance of LoPA under guaranteed inference speed. LoPA accelerates the single-sample throughput for D2F-Dream to up to 1073.9 and 856.5 tokens/s on MBPP and GSM8K respectively, significantly outperforming baselines.</small>

</p>

🔥 News

  • Dec 22, 2025: We released the code and paper for LoPA-Dist-NV!
  • Dec 18, 2025: We released the code and paper for LoPA!
  • Dec 2025: LoPA achieves >1000 tokens/s on Ascend 910C hardware.

🔮 Future Work

  • Diffulex: We are working on a new inference framework for dLLMs that is flexible and easy to extend. Diffulex supports multiple decoding strategies, including D2F, BlockDiffusion, and Fast-dLLM-v2, and will be released soon. You can find the code at https://github.com/SJTU-DENG-Lab/Diffulex.

  • LoPA-SDAR: We will explore adapting LoPA to SDAR and other confidence-driven diffusion language models to further demonstrate its generalizability and effectiveness across diverse model architectures.

🤔 How It Works

Standard dLLM decoding greedily fills tokens with the highest current confidence, which often leads to suboptimal paths that restrict future parallelism. LoPA solves this by "looking ahead":

  1. Anchor Branch: Maintains the standard confidence-driven path.
  2. Lookahead Branches: Spawns parallel branches exploring alternative high-confidence Token Filling Orders (TFOs).
  3. Parallel Verification: Verifies all branches in a single forward pass and selects the one with the highest Branch Confidence (potential for future parallelism).
<p align="center"> <img src="docs/assets/img/figure3.png" width="100%" alt="Overview of LoPA">

<small style="color: gray;">Figure 2. Overview of Lookahead Parallel Decoding (LoPA). In each iteration, LoPA generates an anchor branch alongside multiple lookahead branches by independently sampling high-confidence positions. A branch confidence verification mechanism then evaluates all branches in parallel to select the optimal path.</small>

</p>
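The three steps above can be sketched in a few lines of Python. This is a toy illustration, not the LoPA implementation: `toy_confidence` is a hypothetical stand-in for a dLLM forward pass, `branch_confidence` is assumed here to be the mean confidence over the positions that remain masked, and each step fills a fixed number of positions `k`, whereas a real decoder would typically fill every position above a confidence threshold. All function and parameter names are illustrative.

```python
import random
from typing import Callable, List, Optional, Set


def toy_confidence(filled: Set[int], length: int = 8) -> List[float]:
    """Hypothetical stand-in for a dLLM forward pass.

    A masked position becomes more confident when its neighbours are filled.
    """
    scores = []
    for i in range(length):
        if i in filled:
            scores.append(1.0)
        else:
            neighbours = sum((j in filled) for j in (i - 1, i + 1))
            scores.append(min(0.99, 0.3 + 0.3 * neighbours))
    return scores


def branch_confidence(filled: Set[int], conf_fn: Callable[[Set[int]], List[float]]) -> float:
    """Assumed definition: mean confidence over positions still masked."""
    conf = conf_fn(filled)
    masked = [c for i, c in enumerate(conf) if i not in filled]
    return sum(masked) / len(masked) if masked else 1.0


def lopa_step(
    filled: Set[int],
    conf_fn: Callable[[Set[int]], List[float]],
    k: int = 2,
    num_lookahead: int = 3,
    pool: int = 4,
    rng: Optional[random.Random] = None,
) -> Set[int]:
    """One toy decoding step: build branches, score them, keep the best."""
    rng = rng or random.Random(0)
    conf = conf_fn(filled)
    # Masked positions, most confident first.
    masked = sorted((i for i in range(len(conf)) if i not in filled),
                    key=lambda i: -conf[i])
    if not masked:
        return set(filled)
    k = min(k, len(masked))

    # 1. Anchor branch: the standard greedy choice (top-k confidence).
    branches = [set(filled) | set(masked[:k])]

    # 2. Lookahead branches: alternative filling orders sampled from a
    #    pool of high-confidence positions.
    candidates = masked[:max(pool, k)]
    for _ in range(num_lookahead):
        branches.append(set(filled) | set(rng.sample(candidates, k)))

    # 3. Verification: score every branch and keep the best one. The toy
    #    calls conf_fn once per branch; LoPA verifies all branches in a
    #    single forward pass.
    return max(branches, key=lambda b: branch_confidence(b, conf_fn))


if __name__ == "__main__":
    state: Set[int] = set()
    steps = 0
    while len(state) < 8:
        state = lopa_step(state, toy_confidence)
        steps += 1
    print(f"filled {len(state)} positions in {steps} steps")
```

Because `max` returns the first of several equally scored branches, the anchor branch wins ties, so in this toy the selected branch never scores lower than the greedy baseline.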

📊 Performance Highlights

LoPA demonstrates significant improvements in Tokens Per Forward pass (TPF) and overall throughput across mathematical reasoning and code generation tasks. It establishes a clear, controllable speed-accuracy trade-off.
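As a rough illustration of what TPF means for decoding cost, the sketch below computes TPF from a token count and a forward-pass count, then estimates how many forward passes a generation needs at a given TPF. The helper names and the 256-token generation length are assumptions for illustration; the TPF values 3.1 and 10.1 are the D2F-Dream GSM8K figures reported in this README for vanilla decoding and LoPA. The estimate counts forward passes only and ignores the extra compute each pass spends on lookahead branches.

```python
import math


def tokens_per_forward(tokens_generated: int, forward_passes: int) -> float:
    """TPF: average number of tokens committed per model forward pass."""
    if forward_passes <= 0:
        raise ValueError("forward_passes must be positive")
    return tokens_generated / forward_passes


def forward_passes_needed(tokens: int, tpf: float) -> int:
    """Estimated forward passes to generate `tokens` at an average TPF."""
    if tpf <= 0:
        raise ValueError("tpf must be positive")
    return math.ceil(tokens / tpf)


if __name__ == "__main__":
    length = 256  # hypothetical generation length
    for name, tpf in [("D2F-Dream vanilla", 3.1), ("D2F-Dream + LoPA", 10.1)]:
        print(name, forward_passes_needed(length, tpf))
```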

<p align="center"> <img src="docs/assets/img/figure4.png" width="100%" alt="Scaling Curves">

<small style="color: gray;">Figure 3. Scaling Curves of LoPA. LoPA scales the TPF for D2F-Dream and D2F-DiffuCoder to up to 10.1 and 8.3 on GSM8K and HumanEval+ respectively, with comparable performance.</small>

</p> <p align="center"> <img src="docs/assets/img/figure2.png" width="100%" alt="Scaling Analysis">

<small style="color: gray;">Figure 4. Scaling analysis of LoPA on D2F-Dream with varying branch counts. The results illustrate that LoPA effectively scales the TPF of D2F to a peak exceeding 10, thereby significantly reducing the total number of decoding steps.</small>

</p>

Accuracy-Preserving Parallelism

<div align="center">
<strong>Table 1. Accuracy-preserving parallelism scaling of Dream on multiple benchmarks.</strong>
<table style="width:100%; text-align: center; border-collapse: collapse;">
<thead>
<tr style="background-color: #f2f2f2;">
<th rowspan="2" style="border: 1px solid #ddd; padding: 8px;">Model</th>
<th rowspan="2" style="border: 1px solid #ddd; padding: 8px;">Decoding algo</th>
<th colspan="2" style="border: 1px solid #ddd; padding: 8px;">MBPP 3-shot</th>
<th colspan="2" style="border: 1px solid #ddd; padding: 8px;">Math 4-shot</th>
<th colspan="2" style="border: 1px solid #ddd; padding: 8px;">HumanEval 0-shot</th>
<th colspan="2" style="border: 1px solid #ddd; padding: 8px;">GSM8K 4-shot</th>
</tr>
<tr style="background-color: #f2f2f2;">
<th style="border: 1px solid #ddd; padding: 8px;">TPF</th>
<th style="border: 1px solid #ddd; padding: 8px;">Score</th>
<th style="border: 1px solid #ddd; padding: 8px;">TPF</th>
<th style="border: 1px solid #ddd; padding: 8px;">Score</th>
<th style="border: 1px solid #ddd; padding: 8px;">TPF</th>
<th style="border: 1px solid #ddd; padding: 8px;">Score</th>
<th style="border: 1px solid #ddd; padding: 8px;">TPF</th>
<th style="border: 1px solid #ddd; padding: 8px;">Score</th>
</tr>
</thead>
<tbody>
<tr>
<td style="border: 1px solid #ddd; padding: 8px;">Dream</td>
<td style="border: 1px solid #ddd; padding: 8px;">Vanilla</td>
<td style="border: 1px solid #ddd; padding: 8px;">1.0</td>
<td style="border: 1px solid #ddd; padding: 8px;"><b>56.2</b></td>
<td style="border: 1px solid #ddd; padding: 8px;">1.0</td>
<td style="border: 1px solid #ddd; padding: 8px;">33.7</td>
<td style="border: 1px solid #ddd; padding: 8px;">1.0</td>
<td style="border: 1px solid #ddd; padding: 8px;">55.5</td>
<td style="border: 1px solid #ddd; padding: 8px;">1.0</td>
<td style="border: 1px solid #ddd; padding: 8px;">72.6</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 8px;">Dream</td>
<td style="border: 1px solid #ddd; padding: 8px;">Fast-dLLM</td>
<td style="border: 1px solid #ddd; padding: 8px;">1.9</td>
<td style="border: 1px solid #ddd; padding: 8px;">55.6</td>
<td style="border: 1px solid #ddd; padding: 8px;">1.9</td>
<td style="border: 1px solid #ddd; padding: 8px;"><b>37.6</b></td>
<td style="border: 1px solid #ddd; padding: 8px;">1.8</td>
<td style="border: 1px solid #ddd; padding: 8px;">55.5</td>
<td style="border: 1px solid #ddd; padding: 8px;">2.1</td>
<td style="border: 1px solid #ddd; padding: 8px;">72.6</td>
</tr>
<tr>
<td style="border: 1px solid #ddd; padding: 8px;">Dream</td>
<td style="border: 1px solid #ddd; padding: 8px;">LoPA</td>
<td style="border: 1px solid #ddd; padding: 8px;">3.3</td>
<td style="border: 1px solid #ddd; padding: 8px;">54.8</td>
<td style="border: 1px solid #ddd; padding: 8px;">3.4</td>
<td style="border: 1px solid #ddd; padding: 8px;">37.0</td>
<td style="border: 1px solid #ddd; padding: 8px;">2.9</td>
<td style="border: 1px solid #ddd; padding: 8px;">53.0</td>
<td style="border: 1px solid #ddd; padding: 8px;">3.1</td>
<td style="border: 1px solid #ddd; padding: 8px;">73.3</td>
</tr>
<tr style="background-color: #fafafa;">
<td style="border: 1px solid #ddd; padding: 8px;">D2F-Dream</td>
<td style="border: 1px solid #ddd; padding: 8px;">Vanilla</td>
<td style="border: 1px solid #ddd; padding: 8px;">2.3</td>
<td style="border: 1px solid #ddd; padding: 8px;">53.8</td>
<td style="border: 1px solid #ddd; padding: 8px;">2.6</td>
<td style="border: 1px solid #ddd; padding: 8px;">36.8</td>
<td style="border: 1px solid #ddd; padding: 8px;">2.5</td>
<td style="border: 1px solid #ddd; padding: 8px;"><b>56.1</b></td>
<td style="border: 1px solid #ddd; padding: 8px;">3.1</td>
<td style="border: 1px solid #ddd; padding: 8px;"><b>78.5</b></td>
</tr>
<tr style="background-color: #e6f7ff;">
<td style="border: 1px solid #ddd; padding: 8px;">D2F-Dream</td>
<td style="border: 1px solid #ddd; padding: 8px;">LoPA (Ours)</td>
<td style="border: 1px solid #ddd; padding: 8px;"><b>5.4</b></td>
<td style="border: 1px solid #ddd; padding: 8px;">56.0</td>
<td style="border: 1px solid #ddd; padding: 8px;"><b>8.0</b></td>
<td style="border: 1px solid #ddd; padding: 8px;">35.2</td>
<td style="border: 1px solid #ddd; padding: 8px;"><b>6.3</b></td>
<td style="border: 1px solid #ddd; padding: 8px;"><b>56.1</b></td>
<td style="border: 1px solid #ddd; padding: 8px;"><b>10.1</b></td>
<td style="border: 1px solid #ddd; padding: 8px;">73.8</td>
</tr>
</tbody>
</table>
</div>
<div align="center"> <strong>Table 2. Accuracy-preserving parallelism scaling of DiffuCoder.</strong> <table style="width:100%; text-align: center; border-collapse: collapse;"> <thead> <tr style="background-color: #f2f2f2;"> <th rowspan="2" style="border: 1px solid #ddd; p
