JacobiForcing
Jacobi Forcing: Fast and Accurate Diffusion-style Decoding
Jacobi Forcing is a new training technique that converts LLMs into native causal parallel decoders. It keeps the causal AR backbone and fixes the AR-to-diffusion mismatch by training the model to handle noisy future blocks along its own Jacobi decoding trajectories.
Jacobi Forcing yields an AR model which behaves like a diffusion-style decoder—decoding multiple tokens per pass, but still from left to right—with up to $4.5\times$ higher tokens-per-forward and $4\times$ wall-clock speedup on coding and math tasks, while retaining near-AR generation quality.
<p align="center"> <picture> <img src="assets/ar_example_demo.gif" width="45%" alt="AR example demo (left)" /> <img src="assets/jacobi_forcing_example_demo.gif" width="45%" alt="Jacobi Forcing example demo (right)" /> </picture> <br/> <i>Demo of an average speedup of more than 4x (181.8 TPS vs. 39.81 TPS) for the Jacobi Forcing model compared with the AR baseline (Qwen2.5-Coder-7B-Instruct) on coding sessions.</i> </p>Try the chatbot yourself with:
# modify the script to use your local path
streamlit run applications/jacobi_model_chat.py
Contents
Introduction
Why faster decoding?
AR decoding is high-quality but serial: one forward pass per token. Diffusion language models can decode many tokens in parallel, but typically require non-causal objectives and often break KV-cache-friendly serving.
<p align="center"> <img src="assets/decoding_comparison.gif" width="90%" alt="decoding comparison" /> <br/> <i>fig1: Side-by-side comparison between Jacobi forcing decoding and text diffusion decoding, where Jacobi forcing decoding comes with more efficient KV cache reuse and is trained to generate higher quality drafts over a long horizon.</i> </p>Jacobi Forcing bridges this gap by training an AR model to behave like a diffusion-style decoder while staying causal:
- Causal, left-to-right generation with KV-cache reuse
- Parallel token updates within a block of size $n$ (via Jacobi decoding), with training that speeds up convergence to the fixed point
- Multiblock decoding and rejection recycling to exploit higher-quality drafts with higher GPU utilization
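To make the parallel-update mechanics concrete, here is a minimal, self-contained sketch of Jacobi decoding on a toy deterministic "model" (the `next_token` function below is a stand-in for greedy argmax over a causal LM's logits, not the repo's implementation). A block of $n$ draft tokens is refined in parallel until it reaches the fixed point, which provably matches what sequential greedy AR decoding would produce:

```python
def next_token(prefix):
    # Deterministic toy "model": next token = (sum of prefix) % 7.
    # Stands in for argmax over a causal LM's next-token logits.
    return sum(prefix) % 7

def jacobi_decode_block(prefix, n, max_iters=64):
    """Refine an n-token draft in parallel until it reaches the fixed point."""
    block = [0] * n  # arbitrary initial draft
    for _ in range(max_iters):
        # One "forward pass" re-scores every draft position in parallel,
        # each conditioned on the prefix plus the current draft before it.
        new_block = [next_token(prefix + block[:i]) for i in range(n)]
        if new_block == block:  # fixed point reached
            break
        block = new_block
    return block

def ar_decode_block(prefix, n):
    """Reference: plain sequential greedy decoding, one token per step."""
    out = list(prefix)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prefix):]
```

After at most $n$ iterations the draft is guaranteed to equal the AR output (position $i$ becomes correct once all positions before it are), which is why Jacobi decoding never changes greedy results; training shortens how many iterations convergence takes in practice.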
Installation
<p align="justify"> <i>This section uses path placeholders for demonstration; adjust them to match your repo structure.</i> </p>- Environment setup:
conda create -n jacobi_forcing python=3.12 -y
conda activate jacobi_forcing
- Clone this repository and build from source:
git clone https://github.com/hao-ai-lab/JacobiForcing.git
cd JacobiForcing
- Install dependencies:
pip install -r requirements.txt
Model Weights
Base Models
| Size | Domain | HuggingFace Repo |
| ---- | ------ | -------------------------------- |
| 7B | Code | Qwen/Qwen2.5-Coder-7B-Instruct |
| 7B | Math | Qwen/Qwen2.5-Math-7B-Instruct |
Jacobi Forcing Models
| Size | Domain | Data | HuggingFace Repo |
| ---- | ------ | ------ | ------------------------ |
| 7B | Code | OpenCodeInstruct | JacobiForcing_Coder_7B_v1 |
| 7B | Math | OpenThoughts2 (math split) | JacobiForcing_Math_7B_v1 |
Usage
Training
Jacobi Forcing training involves the following steps:
Prepare training data
Choice A: download existing data from Huggingface.
git lfs clone https://huggingface.co/datasets/JacobiForcing/OpenCodeInstruct_training_data_n32w16
Choice B
- step 1: Collect Jacobi trajectories from a base AR model (intermediate states + fixed-point state for all $n$-token blocks).
# generate trajectories using customized models
bash generate_trajectory/generation/generate_trajectory_opencodeinstruct_greedy.sh
- step 2: pack training sequences and map the noise schedule onto them.
python3 generate_trajectory/data/2_prepare_efficient_cllm_training_data_progressive_noise_window.py \
--input_path {trajectory_data_path} \
--output_path {output_training_seq_path} \
--n_token_seq_length {block_size} \
--window_size {window_size} \
--min_noisy_ratio 0 \
--max_noisy_ratio 1.0 \
--strategy "progressive"
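The flags above can be read as follows (a hedged sketch, not the repo's code): under the "progressive" strategy, the noise ratio ramps from `min_noisy_ratio` to `max_noisy_ratio` across the `window_size` future positions, so nearer blocks are cleaner and farther blocks noisier. The function name here is hypothetical; the real mapping lives in the preprocessing script:

```python
def progressive_noise_ratios(window_size, min_noisy_ratio=0.0, max_noisy_ratio=1.0):
    """Linearly ramp the noise ratio across a window of future positions.

    Hypothetical illustration of the `--strategy "progressive"` flag:
    position 0 gets min_noisy_ratio, the last position gets max_noisy_ratio.
    """
    if window_size == 1:
        return [max_noisy_ratio]
    step = (max_noisy_ratio - min_noisy_ratio) / (window_size - 1)
    return [min_noisy_ratio + i * step for i in range(window_size)]
```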
Note: if the target model is not Qwen2.5, first modify generate_trajectory/generation/generate_trajectory_opencodeinstruct_greedy.sh to customize the model path, trajectory data destination, and input data path (you can download our length-bucketed input data from this link for code and this link for math).
Then adapt from the script generate_trajectory/generation/qwen2_modeling_jacobi_forcing_greedy.py to make your target model compatible.
Noise-conditioned training over long horizons
cd JacobiForcing
bash scripts/train/train_jacobi_forcing_coder_n32.sh
<p align="center">
<img src="assets/noisy_context_attention_mask.jpeg" width="50%" alt="noise context training" />
<br/>
<i>fig4: Jacobi Forcing uses the attention implementation shown above. It allows logits from clean blocks and noisy blocks to be computed in a single forward pass to calculate the progressive consistency loss and the AR loss.</i>
</p>
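One plausible way to picture such a mask (a hedged sketch; the actual layout in the repo may differ): the sequence is a clean segment followed by a noisy copy of one block, scored in the same forward pass. Clean tokens use a plain causal mask; each noisy token sees the clean prefix before its block plus earlier tokens inside its own noisy copy:

```python
def jacobi_forcing_mask(n_clean, n_noisy, noisy_prefix_len):
    """Build a boolean attention mask (True = query row may attend to key col).

    Assumed layout for illustration only: [clean tokens | noisy block copy].
    Clean tokens are standard causal; noisy tokens attend to the first
    `noisy_prefix_len` clean tokens plus causally within the noisy copy.
    """
    size = n_clean + n_noisy
    mask = [[False] * size for _ in range(size)]
    for q in range(n_clean):            # clean segment: standard causal
        for k in range(q + 1):
            mask[q][k] = True
    for q in range(n_noisy):            # noisy segment
        row = n_clean + q
        for k in range(noisy_prefix_len):   # sees its clean prefix
            mask[row][k] = True
        for k in range(q + 1):              # causal within the noisy copy
            mask[row][n_clean + k] = True
    return mask
```

Scoring clean and noisy positions under one mask is what lets a single forward pass feed both the AR loss (clean rows) and the consistency loss (noisy rows).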
Inference
Inference Engine
A lightweight, self-contained inference engine lives in inference_engine/. It supports autoregressive and Jacobi decoding (greedy & non-greedy) with FlashAttention, a paged KV cache, CUDA graph capture, and tensor parallelism, implemented on top of nano-vLLM. On a single GPU the engine reaches 800–1000 tokens/second with Jacobi Forcing models.
# greedy Jacobi correctness test
python inference_engine/tests/test_jacobi_decoding_greedy.py --model-path /path/to/model
# non-greedy distribution similarity test
python inference_engine/tests/test_jacobi_decoding_nongreedy.py --model-path /path/to/model
Multiblock Decoding
Jacobi Forcing decoding typically exposes knobs like:
- `n`: block size (tokens updated in parallel)
- `pool_size`: rejection recycling verification budget
- `K`: block count (maximum blocks "in flight")
- `r`: activation ratio
Recommended starting point (from our grid search):
n=64, K=2, pool_size=4, r=0.85
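As a usage sketch, these knobs map naturally onto a small config object (the class and field names below are hypothetical, chosen to mirror the knob names above, not a real repo API):

```python
from dataclasses import dataclass

@dataclass
class JacobiDecodeConfig:
    """Hypothetical decoding config mirroring the recommended grid-search point."""
    n: int = 64          # block size: tokens updated in parallel
    K: int = 2           # block count: maximum blocks "in flight"
    pool_size: int = 4   # rejection-recycling verification budget
    r: float = 0.85      # activation ratio

cfg = JacobiDecodeConfig()  # defaults = recommended starting point
```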
To run comprehensive grid search profiling for TPS speedup and TPF across different settings, run:
cd JacobiForcing
bash scripts/inference/scanning_hyperparameter_jacobi_decoding_mr.sh
To run a specific decoding setting with multiblock decoding and rejection recycling, run:
# vanilla Jacobi decoding
python3 JacobiForcing/jacobi_forcing_inference_humaneval.py
# with multiblock decoding and rejection recycling
python3 JacobiForcing/jacobi_forcing_inference_MR_humaneval.py
Evaluation
Generation Quality Evaluation
We evaluate the baseline models and the Jacobi Forcing models on HumanEval, MBPP, GSM8K, and MATH, following the settings in evalchemy.
Performance Comparison
| Task | Method | Family | Speedup | TPF | TPS | Acc / Solve $\uparrow$ |
| ---- | ------ | ------ | ------- | --- | --- | ---------------------- |
| HumanEval | AR | AR | $1.00\times$ | 1.0 | 41.3 | 87.8% |
| | D2F | dLLM | $1.8\times$ | 2.5 | 73.2 | 54.3% |
| | Fast-dLLM | dLLM | $1.
