# ParallelSpeculativeDecoding

**[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length**
## News 🔥
- [2025/10] We released nano-PEARL, an implementation of PEARL on top of nano-vllm. Check it out!
- [2025/02] We released a new version of the PEARL paper.
- [2025/01] PEARL was accepted to ICLR 2025.
**TL;DR:** We introduce PEARL (Parallel spEculative decoding with Adaptive dRaft Length) to further reduce the inference latency of Large Language Models (LLMs). PEARL is a parallel inference framework built on speculative decoding that uses two strategies, pre-verify and post-verify, to achieve an adaptive draft length.
## Demo

## Overview of PEARL
Our PEARL framework consists of a draft model, a target model, and two strategies for decoding tokens. The framework switches between the two strategies according to the verification results of the previous decoding step.
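As a rough illustration of the draft-then-verify loop that PEARL builds on, here is a minimal, sequential Python sketch. The two toy "models" are deterministic stand-ins invented for this example, and the function names are ours, not the repo's; real PEARL runs the draft and target models in parallel processes, so the pre-verify and post-verify strategies are only marked in comments here.

```python
def draft_model(ctx):
    # Toy stand-in for the small draft model: purely deterministic.
    return (ctx[-1] + 1) % 50

def target_model(ctx):
    # Toy stand-in for the large target model: agrees with the draft
    # except when the next token would be a multiple of 10.
    nxt = (ctx[-1] + 1) % 50
    return nxt if nxt % 10 else (nxt + 5) % 50

def speculative_step(tokens, gamma=5):
    """One draft-then-verify step; returns (new_tokens, n_generated)."""
    # Draft phase: propose gamma tokens autoregressively with the draft
    # model. (PEARL's pre-verify has the target verify the FIRST draft
    # token while the remaining gamma - 1 tokens are still being drafted.)
    ctx = list(tokens)
    draft = []
    for _ in range(gamma):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # Verify phase: the target checks the proposals (a single parallel
    # forward pass in practice) and the longest agreeing prefix is kept.
    # (PEARL's post-verify lets the draft model keep drafting during this
    # phase, assuming all gamma tokens will be accepted -- which is what
    # makes the effective draft length adaptive.)
    accepted = []
    ctx = list(tokens)
    for t in draft:
        expect = target_model(ctx)
        if t != expect:
            accepted.append(expect)  # target's correction replaces the miss
            break
        accepted.append(t)
        ctx.append(t)
    return tokens + accepted, len(accepted)
```

The point of PEARL is that the two phases above need not run back to back: pre-verify and post-verify let drafting and verification overlap, so neither model sits idle.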
<center> <img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="https://s2.loli.net/2024/08/13/aoCAybN7S2KWsXd.png" width = "100%" alt=""/> <br> <div style="color:orange; border-bottom: 1px solid #d9d9d9; display: inline-block; color: #999; padding: 2px;"> Figure 3. Overview of PEARL. PEARL achieves parallelism by adaptively using pre-verify and post-verify. </div> </center>

## Preparation
Follow the instructions below to prepare for reproducing the results in the paper.
- Install dependencies: run `sh install.sh` to install all necessary packages. This uses uv for fast installation.
- Activate the environment: after installation, run `source .venv/bin/activate`.
- Configure paths: update `src/util.py` (lines 31-38 and line 49) with your model paths and data paths.
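For orientation, the path configuration you edit in `src/util.py` typically has a shape like the following. This is a hypothetical illustration only; the actual variable names in the repo may differ, so check lines 31-38 and line 49 in your own checkout.

```python
# Hypothetical illustration -- the real variable names in src/util.py
# may differ; edit lines 31-38 and line 49 in your own checkout.
MODEL_PATHS = {
    "codellama-7b": "/your/path/to/CodeLlama-7b-Instruct-hf",    # draft model
    "codellama-70b": "/your/path/to/CodeLlama-70b-Instruct-hf",  # target model
}
DATA_PATH = "/your/path/to/humaneval.jsonl"  # benchmark data
```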
## Reproduction
All running scripts are provided, including scripts for auto-regressive decoding, vanilla speculative decoding, parallel speculative decoding, comparisons, ablation studies, and case studies. These scripts can be executed directly for reproduction. For example:
```shell
sh scripts/run_para_sd.sh
```
## Examples
You can try this code with a simple command:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 2 benchmark/eval_humaneval.py --eval_mode para_sd --gamma 5 -n 1 -e H_PSD_codellama_7_70b --draft_model codellama-7b --target_model codellama-70b --max_tokens 1024 --temp 0
```
## With UI
We provide a simple web interface, which you can use by running the following command.
<span style="color:lightblue">Currently, `applications.py` is just a test demo for visualization, and the code still contains bugs. We strongly recommend referring to `benchmark/eval_mt_bench.py` instead. Running this demo with the UI REQUIRES enabling the buttons `Use PEARL` and `Highlight the tokens generated by PEARL`.</span>
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 2 applications.py --eval_mode para_sd --gamma 5 -n 1 -e applications --draft_model codellama-7b --target_model codellama-70b --max_tokens 1024 --temp 0
```
## FAQ
- **AttributeError: 'list' object has no attribute 'get_seq_length'.**

  In recent transformers versions (>= 4.49.0), `past_key_values` is a `DynamicCache` object instead of a tuple. Hence you should change the failing line of code from `past_key_values[0][0].shape[2]` to `past_key_values.get_seq_length()`. We have fixed some of these bugs in the code; if you find any others, feel free to raise an issue.
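A small version-agnostic helper can paper over this difference; this is a sketch, and `past_length` is our own name rather than a function in the repo.

```python
def past_length(past_key_values):
    """Sequence length stored in the KV cache, across transformers versions.

    transformers >= 4.49.0 passes a Cache object (e.g. DynamicCache) that
    exposes get_seq_length(); older versions pass a tuple of per-layer
    (key, value) tensors, where the length sits at key.shape[2].
    """
    if hasattr(past_key_values, "get_seq_length"):
        return past_key_values.get_seq_length()
    return past_key_values[0][0].shape[2]
```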
- **Unexpected generations, such as meaningless text.**

  This issue may be caused by precision overflow. You can add `.to(torch.float32)` to the offending tensor to solve it (e.g., at line 187 of `src/engine.py`).
- **Performance on Qwen-series models.**

  We briefly tested the speedup of PEARL with Qwen 2.5 7B & 72B and found that PEARL achieves over a 2.5× speedup there as well. Additional experiments are warmly welcomed!
- **Other details.**

  Please refer to the <a href="https://zhuanlan.zhihu.com/p/716769091"><b>Zhihu</b></a> blog (in Chinese).
## Citation

If you find our work useful in your research, please cite our paper:
```bibtex
@inproceedings{liu2025pearl,
  title={{PEARL}: Parallel Speculative Decoding with Adaptive Draft Length},
  author={Tianyu Liu and Yun Li and Qitan Lv and Kai Liu and Jianchen Zhu and Winston Hu and Xiao Sun},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=QOXrVMiHGK}
}

@misc{liu2025pearlparallelspeculativedecoding,
  title={PEARL: Parallel Speculative Decoding with Adaptive Draft Length},
  author={Tianyu Liu and Yun Li and Qitan Lv and Kai Liu and Jianchen Zhu and Winston Hu and Xiao Sun},
  year={2025},
  eprint={2408.11850},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2408.11850}
}
```
