JacobiForcing
Jacobi Forcing: Fast and Accurate Diffusion-style Decoding
Jacobi Forcing is a new training technique that converts LLMs into native causal parallel decoders. It keeps the causal AR backbone and fixes the AR-to-diffusion mismatch by training the model to handle noisy future blocks along its own Jacobi decoding trajectories.
Jacobi Forcing yields an AR model which behaves like a diffusion-style decoder—decoding multiple tokens per pass, but still from left to right—with up to $4.5\times$ higher tokens-per-forward and $4\times$ wall-clock speedup on coding and math tasks, while retaining near-AR generation quality.
<p align="center"> <picture> <img src="assets/ar_example_demo.gif" width="45%" alt="AR example demo (left)" /> <img src="assets/jacobi_forcing_example_demo.gif" width="45%" alt="Jacobi Forcing example demo (right)" /> </picture> <br/> <i>Demo of an average speedup of more than 4x (181.8 TPS vs. 39.81 TPS) for the Jacobi Forcing model compared with the AR baseline (Qwen2.5-Coder-7B-Instruct) on coding sessions.</i> </p>Try the chatbot yourself with:
# modify the script to use your local path
streamlit run applications/jacobi_model_chat.py
Contents
Introduction
Why faster decoding?
AR decoding is high-quality but serial: one forward pass per token. Diffusion language models can decode many tokens in parallel, but typically require non-causal objectives and often break KV-cache-friendly serving.
<p align="center"> <img src="assets/decoding_comparison.gif" width="90%" alt="decoding comparison" /> <br/> <i>fig1: Side-by-side comparison between Jacobi forcing decoding and text diffusion decoding, where Jacobi forcing decoding comes with more efficient KV cache reuse and is trained to generate higher quality drafts over a long horizon.</i> </p>Jacobi Forcing bridges this gap by training an AR model to behave like a diffusion-style decoder while staying causal:
- Causal, left-to-right generation with KV-cache reuse
- Parallel token updates within a block of size $n$ (via Jacobi decoding), with training that speeds up convergence to the fixed point
- Multiblock decoding and rejection recycling to exploit higher-quality drafts with higher GPU utilization
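To make the parallel-update mechanics concrete, here is a minimal, self-contained sketch of Jacobi decoding on a toy deterministic "model" (the `next_token` function below is a stand-in for greedy argmax over a causal LM's logits, not the repo's implementation). A block of $n$ draft tokens is refined in parallel until it reaches the fixed point, which provably matches what sequential greedy AR decoding would produce:

```python
def next_token(prefix):
    # Deterministic toy "model": next token = (sum of prefix) % 7.
    # Stands in for argmax over a causal LM's next-token logits.
    return sum(prefix) % 7

def jacobi_decode_block(prefix, n, max_iters=64):
    """Refine an n-token draft in parallel until it reaches the fixed point."""
    block = [0] * n  # arbitrary initial draft
    for _ in range(max_iters):
        # One "forward pass" re-scores every draft position in parallel,
        # each conditioned on the prefix plus the current draft before it.
        new_block = [next_token(prefix + block[:i]) for i in range(n)]
        if new_block == block:  # fixed point reached
            break
        block = new_block
    return block

def ar_decode_block(prefix, n):
    """Reference: plain sequential greedy decoding, one token per step."""
    out = list(prefix)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prefix):]
```

After at most $n$ iterations the draft is guaranteed to equal the AR output (position $i$ becomes correct once all positions before it are), which is why Jacobi decoding never changes greedy results; training shortens how many iterations convergence takes in practice.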
Installation
<p align="justify"> <i>This section uses path placeholders for demonstration; adjust them to match your repo structure.</i> </p>- Environment setup:
conda create -n jacobi_forcing python=3.12 -y
conda activate jacobi_forcing
- Clone this repository and build from source:
git clone https://github.com/hao-ai-lab/JacobiForcing.git
cd JacobiForcing
- Install dependencies:
pip install -r requirements.txt
Model Weights
Base Models
| Size | Domain | HuggingFace Repo |
| ---- | ------ | -------------------------------- |
| 7B | Code | Qwen/Qwen2.5-Coder-7B-Instruct |
| 7B | Math | Qwen/Qwen2.5-Math-7B-Instruct |
Jacobi Forcing Models
| Size | Domain | Data | HuggingFace Repo |
| ---- | ------ | ------ | ------------------------ |
| 7B | Code | OpenCodeInstruct | JacobiForcing_Coder_7B_v1 |
| 7B | Math | OpenThoughts2 (math split) | JacobiForcing_Math_7B_v1 |
Usage
Training
Jacobi Forcing training involves the following steps:
Prepare training data
Choice A: download existing data from Huggingface.
git lfs clone https://huggingface.co/datasets/JacobiForcing/OpenCodeInstruct_training_data_n32w16
Choice B
- step 1: Collect Jacobi trajectories from a base AR model (intermediate states + fixed-point state for all $n$-token blocks).
# generate trajectories using customized models
bash generate_trajectory/generation/generate_trajectory_opencodeinstruct_greedy.sh
- step 2: pack training sequences and map the noise schedule onto them.
python3 generate_trajectory/data/2_prepare_efficient_cllm_training_data_progressive_noise_window.py \
--input_path {trajectory_data_path} \
--output_path {output_training_seq_path} \
--n_token_seq_length {block_size} \
--window_size {window_size} \
--min_noisy_ratio 0 \
--max_noisy_ratio 1.0 \
--strategy "progressive"
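The flags above can be read as follows (a hedged sketch, not the repo's code): under the "progressive" strategy, the noise ratio ramps from `min_noisy_ratio` to `max_noisy_ratio` across the `window_size` future positions, so nearer blocks are cleaner and farther blocks noisier. The function name here is hypothetical; the real mapping lives in the preprocessing script:

```python
def progressive_noise_ratios(window_size, min_noisy_ratio=0.0, max_noisy_ratio=1.0):
    """Linearly ramp the noise ratio across a window of future positions.

    Hypothetical illustration of the `--strategy "progressive"` flag:
    position 0 gets min_noisy_ratio, the last position gets max_noisy_ratio.
    """
    if window_size == 1:
        return [max_noisy_ratio]
    step = (max_noisy_ratio - min_noisy_ratio) / (window_size - 1)
    return [min_noisy_ratio + i * step for i in range(window_size)]
```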
Note: if the target model is not Qwen2.5, first modify generate_trajectory/generation/generate_trajectory_opencodeinstruct_greedy.sh to customize the model path, trajectory data destination, and input data path (you can download our length-bucketed input data from this link for code and this link for math).
Then adapt from the script generate_trajectory/generation/qwen2_modeling_jacobi_forcing_greedy.py to make your target model compatible.
Noise-conditioned training over long horizons
cd JacobiForcing
bash scripts/train/train_jacobi_forcing_coder_n32.sh
<p align="center">
<img src="assets/noisy_context_attention_mask.jpeg" width="50%" alt="noise context training" />
<br/>
<i>fig4: Jacobi Forcing uses the attention implementation shown above. It allows logits from clean blocks and noisy blocks to be computed in a single forward pass to calculate the progressive consistency loss and the AR loss.</i>
</p>
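One plausible way to picture such a mask (a hedged sketch; the actual layout in the repo may differ): the sequence is a clean segment followed by a noisy copy of one block, scored in the same forward pass. Clean tokens use a plain causal mask; each noisy token sees the clean prefix before its block plus earlier tokens inside its own noisy copy:

```python
def jacobi_forcing_mask(n_clean, n_noisy, noisy_prefix_len):
    """Build a boolean attention mask (True = query row may attend to key col).

    Assumed layout for illustration only: [clean tokens | noisy block copy].
    Clean tokens are standard causal; noisy tokens attend to the first
    `noisy_prefix_len` clean tokens plus causally within the noisy copy.
    """
    size = n_clean + n_noisy
    mask = [[False] * size for _ in range(size)]
    for q in range(n_clean):            # clean segment: standard causal
        for k in range(q + 1):
            mask[q][k] = True
    for q in range(n_noisy):            # noisy segment
        row = n_clean + q
        for k in range(noisy_prefix_len):   # sees its clean prefix
            mask[row][k] = True
        for k in range(q + 1):              # causal within the noisy copy
            mask[row][n_clean + k] = True
    return mask
```

Scoring clean and noisy positions under one mask is what lets a single forward pass feed both the AR loss (clean rows) and the consistency loss (noisy rows).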
Inference
Inference Engine
A lightweight, self-contained inference engine lives in inference_engine/. It supports autoregressive and Jacobi decoding (greedy & non-greedy) with FlashAttention, a paged KV cache, CUDA graph capture, and tensor parallelism, implemented on top of nano-vLLM. On a single GPU the engine reaches 800–1000 tokens/second with Jacobi Forcing models.
# greedy Jacobi correctness test
python inference_engine/tests/test_jacobi_decoding_greedy.py --model-path /path/to/model
# non-greedy distribution similarity test
python inference_engine/tests/test_jacobi_decoding_nongreedy.py --model-path /path/to/model
Multiblock Decoding
Jacobi Forcing decoding typically exposes knobs like:
- `n`: block size (tokens updated in parallel)
- `pool_size`: rejection recycling verification budget
- `K`: block count (maximum blocks "in flight")
- `r`: activation ratio
Recommended starting point (from our grid search):
n=64, K=2, pool_size=4, r=0.85
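As a usage sketch, these knobs map naturally onto a small config object (the class and field names below are hypothetical, chosen to mirror the knob names above, not a real repo API):

```python
from dataclasses import dataclass

@dataclass
class JacobiDecodeConfig:
    """Hypothetical decoding config mirroring the recommended grid-search point."""
    n: int = 64          # block size: tokens updated in parallel
    K: int = 2           # block count: maximum blocks "in flight"
    pool_size: int = 4   # rejection-recycling verification budget
    r: float = 0.85      # activation ratio

cfg = JacobiDecodeConfig()  # defaults = recommended starting point
```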
To run comprehensive grid search profiling for TPS speedup and TPF across different settings, run:
cd JacobiForcing
bash scripts/inference/scanning_hyperparameter_jacobi_decoding_mr.sh
To run a specific decoding setting with multiblock decoding and rejection recycling, run:
# vanilla Jacobi decoding
python3 JacobiForcing/jacobi_forcing_inference_humaneval.py
# with multiblock decoding and rejection recycling
python3 JacobiForcing/jacobi_forcing_inference_MR_humaneval.py
Evaluation
Generation Quality Evaluation
We evaluate the baseline models and the Jacobi Forcing models on HumanEval, MBPP, GSM8K, and MATH, following the settings in evalchemy.
Performance Comparison
| Task | Method | Family | Speedup | TPF | TPS | Acc / Solve $\uparrow$ |
| ---- | ------ | ------ | ------- | --- | --- | ---------------------- |
| HumanEval | AR | AR | $1.00\times$ | 1.0 | 41.3 | 87.8% |
| | D2F | dLLM | $1.8\times$ | 2.5 | 73.2 | 54.3% |
| | Fast-dLLM | dLLM | $1.
