ParallelBench: Understanding the Tradeoffs of Parallel Decoding in Diffusion LLMs [ICLR 2026]
<p align="center"> <img src = "docs/img/banner.png" width="70%" height="auto"> </p>
<p align="center"> <a href="https://scholar.google.com/citations?user=Q-ARWkwAAAAJ&hl=eh" target="_blank">Wonjun Kang</a><sup>*1,5</sup>, <a href="https://kevingalim.com" target="_blank">Kevin Galim</a><sup>*1</sup>, <a href="https://scholar.google.com/citations?user=IXJcR1gAAAAJ&hl=en" target="_blank">Seunghyuk Oh</a><sup>*1</sup>, <a href="https://scholar.google.com/citations?user=XJXKp60AAAAJ&hl=en" target="_blank">Minjae Lee</a><sup>1</sup>, <a href="https://yzeng58.github.io/zyc_cv/" target="_blank">Yuchen Zeng</a><sup>2,3</sup>, <a href="https://scholar.google.com/citations?user=jkXzD7YAAAAJ&hl=en" target="_blank">Shuibai Zhang</a><sup>2</sup>,<br> <a href="https://scholar.google.com/citations?user=si-368wAAAAJ&hl=en" target="_blank">Coleman Hooper</a><sup>4</sup>, <a href="https://yuezhouhu.github.io/" target="_blank">Yuezhou Hu</a><sup>4</sup>, <a href="https://scholar.google.com/citations?user=Oyy8aDMAAAAJ&hl=en" target="_blank">Hyung Il Koo</a><sup>1</sup>, <a href="https://ece.snu.ac.kr/en/research-faculty/faculty/fulltime?md=view&profid=p041" target="_blank">Nam Ik Cho</a><sup>5</sup>, <a href="https://kangwooklee.com/aboutme/" target="_blank">Kangwook Lee</a><sup>2,6,7</sup> </p>
<p align="center"> <sup>1</sup>FuriosaAI, <sup>2</sup>UW-Madison, <sup>3</sup>Microsoft Research, <sup>4</sup>UC Berkeley,<br> <sup>5</sup>Seoul National University, <sup>6</sup>KRAFTON, <sup>7</sup>Ludo Robotics </p>
<p align="center"> <a href="https://parallelbench.github.io/"><img alt="Project" src="https://img.shields.io/static/v1?label=Project&message=Github&color=blue&logo=github-pages"></a> <a href="https://arxiv.org/abs/2510.04767"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-2510.04767-b31b1b.svg"></a> </p>

🔔 Updates
- Jan 25, 2026: Paper accepted at ICLR 2026! 🎉
- Oct 6, 2025: ParallelBench release!
🌍 Papers Using ParallelBench
The following works have evaluated their methods using ParallelBench. Check out how they tackle the speed-quality trade-off of parallel decoding!
- Enabling Approximate Joint Sampling in Diffusion LMs
- Corrective Diffusion Language Models
- Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs
- Polestar-Cache: Reconciling Parallel Decoding and Accuracy in Diffusion LLMs via Token Drift-Aware KV Cache Recalibration
🗺️ Roadmap
We are currently working to support new models and implement advanced unmasking methods. If you are conducting dLLM research and would like to contribute new models or methods, please open an issue.
New Models
- [ ] Fast-dLLM v2
- [ ] LLaDA-MoE, LLaDA2.x
- [ ] SDAR
- [ ] TraDo
Advanced Unmasking Methods
- [x] WINO
- [x] DUS
- [ ] APD
- [x] SlowFast Sampling
- [ ] EB-Sampler
- [x] KLASS
- [ ] Uncode (formerly PC-Sampler)
🔎 Overview
<p align="center"> <img src = "docs/img/teaser.png" width="100%" height="auto"> </p>

Diffusion LLMs (dLLMs) promise faster generation via parallel decoding. However, this speed often comes at the cost of quality, as they ignore token dependencies, an issue that existing benchmarks do not sufficiently capture. To address this issue, we introduce ParallelBench, the first benchmark designed to rigorously test this trade-off through realistic tasks that humans and autoregressive (AR) LLMs can easily solve, but which cause dLLMs to collapse as parallelism grows. We release ParallelBench to drive research towards truly efficient dLLMs that can overcome this challenge.
Features
- Information-Theoretic Analysis: Error bounds on parallel decoding for tasks with inter-token dependencies, showing accuracy degradation as parallelism grows.
- Quantitative Case Studies: Synthetic list operations (Copy, Replace, Shuffle) with closed-form accuracy formulas that pin down where parallel decoding breaks.
- 17 Benchmark Tasks: Three categories (Waiting Line, Text Writing, Puzzles) that humans and AR LLMs solve easily but expose quality drops in dLLMs under parallel decoding.
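To build intuition for why ignoring token dependencies hurts, here is a toy sketch (our own illustration, not the paper's exact formulas): in a Shuffle-style task, every permutation of n items is a valid answer, so each output position's marginal distribution is uniform over the items. If a fully parallel step samples every position independently from that marginal, the output is a valid (repetition-free) permutation only with probability n!/nⁿ, which collapses rapidly as n grows.

```python
import math
import random

def parallel_shuffle_accuracy(n, trials=100_000, seed=0):
    """Monte-Carlo estimate: sample each of n output positions independently
    from the uniform marginal (as one-step parallel decoding would) and
    measure how often the result is a valid permutation (no repeats)."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        draw = [rng.randrange(n) for _ in range(n)]  # independent per-position samples
        ok += len(set(draw)) == n                    # valid iff all items are distinct
    return ok / trials

# Closed form for this toy setting: n! / n^n (= 0.09375 for n = 4)
exact = math.factorial(4) / 4**4
estimate = parallel_shuffle_accuracy(4)
```

Already at n = 4 fewer than 10% of fully parallel samples are valid, and the rate shrinks super-exponentially with n.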
📐 Key Concepts
ParallelBench measures how quality degrades as parallelism increases in dLLMs. The key variable is tokens per step (TPS) — the number of tokens generated in parallel at each denoising step.
| Tokens per step | Meaning |
| :-: | --- |
| 1 | One-by-one decoding (equivalent to AR) |
| k | k tokens decoded in parallel per step |
| max_tokens | Fully parallel (one-step generation) |
ParallelBench evaluates model + unmasking method combinations. The same model can yield very different quality-speed trade-offs depending on which unmasking method is used.
The benchmark score is PBx — the maximum TPS at which a given combination still achieves at least x% average accuracy across all tasks. For example, PB80 = 8 means the combination can decode up to 8 tokens in parallel while maintaining ≥ 80% accuracy. Higher PBx values indicate better quality preservation under parallel decoding.
For methods with deterministic TPS (top-k family), PBx is the measured TPS value. For methods with variable TPS (threshold, factor, etc.), PBx is computed via linear interpolation between adjacent (TPS, accuracy) points to find the exact TPS where accuracy crosses the threshold.
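The interpolation described above can be sketched as follows. This is a hypothetical helper mirroring the description, not the repo's actual implementation; `points` is a list of measured (TPS, average accuracy %) pairs for one model + unmasking-method combination:

```python
def pbx(points, x):
    """Return the maximum TPS at which accuracy stays >= x%, using linear
    interpolation between adjacent measured (TPS, accuracy) points."""
    pts = sorted(points)  # sort by increasing tokens-per-step (TPS)
    if pts[0][1] < x:
        return None  # the threshold is never met, even at the lowest TPS
    for (t0, a0), (t1, a1) in zip(pts, pts[1:]):
        if a1 < x <= a0:
            # linearly interpolate the TPS where accuracy crosses x%
            return t0 + (t1 - t0) * (a0 - x) / (a0 - a1)
    return float(pts[-1][0])  # accuracy stays >= x% at every measured TPS
```

For the deterministic top-k family the measured points land exactly on integer TPS values, so the crossing coincides with a measurement: `pbx([(1, 95), (4, 90), (8, 80), (16, 60)], 80)` gives `8.0`, matching the PB80 = 8 example above.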
⚙️ Setup
1. Prerequisites
- NVIDIA GPU with CUDA >= 11.8
2. Clone
```bash
git clone --recurse-submodules https://github.com/furiosa-ai/ParallelBench.git
cd ParallelBench
```
3. Install
We use uv for faster package installation. The following command will install all dependencies including Python packages, PyTorch, vLLM, and JDK 17 (for grammar-based evaluation metrics).
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env  # Reload PATH to use uv

# Install all dependencies (Python + Java)
make install
```
Note: JDK 17 is installed locally via the `install-jdk` Python package, so no `sudo` is required. If you already have Java installed, the script will skip the installation.
4. Running the `pb` CLI

The `pb` command is available through the virtual environment. Use either method:

```bash
# Option 1: Run directly via uv (no activation needed)
uv run pb <command>

# Option 2: Activate the virtual environment first
source .venv/bin/activate
pb <command>
```
⚡ Quickstart
```bash
# Browse tasks (no GPU required)
uv run pb browse                               # List all available tasks
uv run pb browse waiting_line/copy             # View samples from a specific task
uv run pb browse waiting_line/copy --index 3   # View a specific sample by index

# Run evaluation on a single task
uv run pb eval --model parallelbench_llada \
    --model_args model_path=GSAI-ML/LLaDA-1.5 \
    --gen_kwargs k=32,unmasking=random \
    --tasks parallelbench_waiting_line_copy \
    --include_path parallelbench/tasks \
    --batch_size 1
```
🎯 Evaluation Coverage
Tasks
| Category | Task | CLI task name |
| --- | --- | --- |
| Waiting Line (10) | Copy | parallelbench_waiting_line_copy |
| | Insert (index) | parallelbench_waiting_line_insert_index |
| | Insert (random) | parallelbench_waiting_line_insert_random |
| | Remove (index) | parallelbench_waiting_line_remove_index |
| | Remove (random) | parallelbench_waiting_line_remove_random |
| | Replace (index) | parallelbench_waiting_line_replace_index |
| | Replace (random) | parallelbench_waiting_line_replace_random |
| | Reverse | parallelbench_waiting_line_reverse |
| | Shuffle | parallelbench_waiting_line_shuffle |
| | Sort | parallelbench_waiting_line_sort |
| Text Writing (5) | Paraphrasing | parallelbench_text_writing_paraphrasing |
| | Summarization | parallelbench_text_writing_summarization |
| | Words to Sentence (easy) | parallelbench_text_writing_words_to_sentence_easy |
| | Words to Sentence (medium) | parallelbench_text_writing_words_to_sentence_medium |
| | Words to Sentence (hard) | parallelbench_text_writing_words_to_sentence_hard |
| Puzzles (2) | Latin Square (4x4) | parallelbench_puzzles_latin_square_n4 |
| | Sudoku (4x4) | parallelbench_puzzles_sudoku_n4 |
Models
For additional models and unmasking methods, please refer to the Roadmap section.
| Model family | CLI wrapper (--model) | Example model_path |
| --- | --- | --- |
| LLaDA | parallelbench_llada | GSAI-ML/LLaDA-1.5 |
| Dream, DiffuCoder | parallelbench_dream | Dream-org/Dream-v0-Instruct-7B |
| ~~SDAR, TraDo~~ | ~~parallelbench_trado~~ | Disabled (under investigation) |
| SEDD | parallelbench_sedd | louaaron/sedd-medium |
| AR baselines (vLLM) | parallelbench_ar | meta-llama/Llama-3.1-8B-Instruct |
| API models | parallelbench_api | Haiku, Mercury (requires .env keys) |
Adding your own model? See the step-by-step guide and the example in `parallelbench/models/local/example/`.
Unmasking Methods
| Strategy | Type | CLI value | Description |
| -------- | ---- | --------- | ----------- |
| Random | Top-k (static) | random | Unmasks k tokens at randomly chosen positions each step |