</div>

We introduce SDAR (Synergy of Diffusion and AutoRegression), a large-scale diffusion language model that unites the complementary strengths of autoregressive and discrete diffusion modeling. By merging the training efficiency of autoregressive methods with the highly parallel decoding ability of diffusion models, SDAR delivers performance competitive with state-of-the-art open-source AR models. It sets a new standard as the most powerful diffusion-based language model to date—particularly excelling as a generalist model with strong specialist capabilities.

Highlights:

🚀 Low-Cost AR-to-BlockDiffusion
⚡ 2-4× Faster Inference
🧠 Advanced performance on science reasoning benchmarks (e.g., GPQA and ChemBench)

SDAR is still in an early experimental stage. We are actively developing a more systematic approach and warmly welcome collaborations in this direction.

🔥 News

[2025-10-29] We have open-sourced our downstream task fine-tuning framework, powered by LlamaFactory. It provides a powerful and user-friendly toolkit for adapting SDAR to your specific needs 🛠️.
[2025-10-10] We've implemented an industrial-grade inference solution for SDAR models on the lmdeploy framework, providing robust and efficient deployment infrastructure for production environments 🚀.
[2025-09-09] We’ve open-sourced the weights for models with various block sizes. Alongside our default model (block size=4), you can now find models with block sizes of 8, 16, 32, and 64 on Hugging Face 🤗.
[2025-08-18] We’ve open-sourced the weights for our SDAR-30B-A3B-Sci model — now available on Hugging Face 🤗.
[2025-08-13] We’ve released the inference code for SDAR models, including a built-in script and a third-party inference engine JetEngine 🚀.
[2025-07-20] We’ve open-sourced the weights for our 1.7B, 4B, 8B dense models, along with our 30B MoE model — now available on Hugging Face 🤗.

📑 Contents

SDAR: A Synergistic Diffusion–AutoRegression Paradigm for Scalable Sequence Generation

⚙️ Usage

Training

For detailed instructions on how to fine-tune the model on your own dataset, please refer to the guide in the training directory: training/README.md.

Inference

transformers>=4.52.4

1. Using the built-in inference script

python generate.py \
  --model_dir=JetLM/SDAR-1.7B-Chat \
  --trust_remote_code

2. Using the prepared inference engine JetEngine (for batch inference and production-level speedup)

JetEngine is a lightweight inference engine for the SDAR series, built on nano-vllm. It supports both dense and MoE models as well as tensor-parallel distributed inference, and delivers substantial acceleration over the naive implementation.

In our benchmark, we tested the 4B SDAR model with block size 4 (basic acceleration setting) and batch size 128:

On NVIDIA A800, JetEngine reached 1800+ tokens/second.
On NVIDIA H200, JetEngine achieved 3700+ tokens/second using FlashAttention-2 + Triton kernels.

This demonstrates that JetEngine can unlock production-level throughput for SDAR models, making it ideal for both research-scale batch inference and real-world deployment scenarios.

pip install flash-attn --no-build-isolation  # Install FlashAttention-2
git clone https://github.com/JetAstra/SDAR.git
cd SDAR
git submodule update --init --recursive
cd third_party/JetEngine
pip install .

The following example shows how to quickly load a model with JetEngine and run a prompt end-to-end.

import os
from jetengine import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = os.path.expanduser("/path/to/your/model")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Initialize the LLM
llm = LLM(
    model_path,
    enforce_eager=True,
    tensor_parallel_size=1,
    mask_token_id=151669,   # Optional: only needed for masked/diffusion models
    block_length=4
)

# Set sampling/generation parameters
sampling_params = SamplingParams(
    temperature=1.0,
    topk=0,
    topp=1.0,
    max_tokens=256,
    remasking_strategy="low_confidence_dynamic",
    block_length=4,
    denoising_steps=4,
    dynamic_threshold=0.9
)

# Prepare a simple chat-style prompt
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain what reinforcement learning is in simple terms."}],
    tokenize=False,
    add_generation_prompt=True
)

# Generate text
outputs = llm.generate_streaming([prompt], sampling_params)

3. Using the prepared inference engine LMDeploy (for batch inference and production-level speedup)

from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig, ChatTemplateConfig
from lmdeploy.pytorch.tools.utils import Timer, visualize_pipe_out


if __name__ == '__main__':
    model_path = 'JetLM/SDAR-8B-Chat'

    prompts = [
        [dict(role="user", content="Given the function $f(x) = \\frac{4x^2 - 4x + 4}{x^2 + 2x + 4}$, where $x \\in \\mathbb{R}$, determine its minimum value.\nPlease reason step by step, and put your final answer within \\boxed{}.\n")],
        [dict(role="user", content="If the domain of the function $\\log x^2$ is $x < a$ or $x > b$, for some $a$ and $b$, find $a + b$.\nPlease reason step by step, and put your final answer within \\boxed{}.\n")],
        [dict(role="user", content="Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.\nRemember to put your final answer within \\boxed{}.\n")],
        [dict(role="user", content="Find the number of ordered pairs $(x,y)$, where both $x$ and $y$ are integers between $-100$ and $100$, inclusive, such that $12x^{2}-xy-6y^{2}=0$.\nRemember to put your final answer within \\boxed{}.\n")],
    ]

    backend_config = PytorchEngineConfig(
        tp=1,
        dtype="float16",
        max_prefill_token_num=4096,
        cache_max_entry_count=0.8,
        dllm_block_length=4,
        dllm_denoising_steps=4,
        dllm_unmasking_strategy="low_confidence_dynamic",
        dllm_confidence_threshold=0.9,
    )
    pipe = pipeline(model_path, backend_config=backend_config)

    gen_config = GenerationConfig(
        top_p=0.95,
        top_k=50,
        temperature=1.0,
        do_sample=False, # greedy decoding
        max_new_tokens=4096,
    )

    outputs = pipe(prompts, gen_config=gen_config)
    # A batch of prompts yields one response per prompt, so print each in turn
    for output in outputs:
        print(output.text)

📊 Preliminary Experiments

Part I: Scaling the Qwen3 Series with SDAR for General (Non-Reasoning) Tasks

Training Setup

We start from Qwen3-1.7B-Base, Qwen3-4B-Base, Qwen3-8B-Base, and Qwen3-30B-A3B-Base.
Each model undergoes continued pretraining on 50B tokens of relatively low-quality open-source data (~0.14% of the roughly 36T tokens used to pretrain Qwen3), followed by supervised fine-tuning on 4B tokens.
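As a quick sanity check on the ~0.14% figure, the snippet below does the arithmetic, taking Qwen3's pretraining corpus to be roughly 36T tokens as reported in the Qwen3 Technical Report. The constant and function names are illustrative only and do not come from the SDAR codebase.

```python
# Assumption: ~36T pretraining tokens, taken from the Qwen3 Technical Report.
QWEN3_PRETRAIN_TOKENS = 36e12
# Figures quoted in the training setup above.
SDAR_CONTINUED_PRETRAIN_TOKENS = 50e9
SDAR_SFT_TOKENS = 4e9


def share_of_base_budget(extra_tokens, base_tokens=QWEN3_PRETRAIN_TOKENS):
    """Return extra_tokens as a fraction of the assumed base pretraining budget."""
    return extra_tokens / base_tokens


cpt_share = share_of_base_budget(SDAR_CONTINUED_PRETRAIN_TOKENS)
print(f"continued pretraining: {cpt_share:.2%} of the assumed base budget")
```

Under this assumption the continued-pretraining stage is a small fraction of the original pretraining cost, which is the sense in which the AR-to-BlockDiffusion conversion is described as low-cost.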

The default model maintains a block size of 4 throughout its entire training process. For block size scaling, we use a block size of 4 during the continued pretraining phase, and directly increase it to the target block size (e.g., 8, 16, 32, or 64) during the SFT phase.
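To make the role of the block size concrete, here is a minimal sketch of a block-causal attention pattern. It assumes the usual block-diffusion layout: tokens attend bidirectionally within their own block and causally to all earlier blocks. The masks SDAR actually uses during training may differ, and `block_causal_mask` is a hypothetical helper, not part of this repository.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout: full attention inside a block, causal across blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = block_causal_mask(seq_len=8, block_size=4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# The first four rows print "xxxx....", the last four print "xxxxxxxx".
```

With `block_size=1` this pattern reduces to the ordinary causal mask of an autoregressive model, so a larger block size trades strict left-to-right ordering for more positions that can be decoded in parallel.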

SDAR training: SDAR-1.7B-Chat / SDAR-4B-Chat / SDAR-8B-Chat / SDAR-30B-A3B-Chat.
AR training: Qwen3-1.7B-AR-SFT / Qwen3-30B-AR-SFT.

Evaluation Setup

Decoding
- SDAR family: greedy decoding with block_length = 4, denoising_steps = 4.
- AR baselines: greedy decoding.
Base model sources
- Results for Qwen3-1.7B-Base / Qwen3-30B-Base are taken from the Qwen3 Technical Report.

Experiments of Performance

Table 1. Overall performance across general benchmarks. [Figure: Benchmark results]

[!NOTE]

SDAR-1.7B-Chat is on par with Qwen3-1.7B-AR-SFT across most benchmarks.

SDAR-30B-A3B-Chat performs comparably to Qwen3-30B-AR-SFT.

Experiments of Efficiency

We compare SDAR-30B-A3B-Chat and Qwen3-30B-AR-SFT under static and dynamic decoding:

Static: each decoding step emits a fixed number of tokens, independent of confidence.
Dynamic: within a block, once the confidence exceeds a threshold $\theta$, the decoder generates multiple tokens
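The difference between the two modes can be sketched for a single denoising step inside one block. This is an illustrative approximation rather than the SDAR or JetEngine implementation: the function names are hypothetical, ranking candidates by confidence in the static case is an assumption, and so is the fallback that reveals the single most confident token when nothing clears the threshold.

```python
def static_unmask(confidences, masked, tokens_per_step):
    """Reveal a fixed number of masked positions per step.

    The count never depends on confidence; picking the most confident
    positions first is an assumption made for this sketch.
    """
    ranked = sorted(masked, key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:tokens_per_step])


def dynamic_unmask(confidences, masked, threshold):
    """Reveal every masked position whose confidence exceeds the threshold.

    Assumed fallback: if none qualifies, reveal the single most confident
    position so that each step still makes progress.
    """
    chosen = [i for i in masked if confidences[i] > threshold]
    if not chosen:
        chosen = [max(masked, key=lambda i: confidences[i])]
    return sorted(chosen)


# One block of four masked positions with made-up confidences.
confidences = [0.97, 0.55, 0.93, 0.40]
masked = [0, 1, 2, 3]

print(static_unmask(confidences, masked, tokens_per_step=1))  # [0]
print(dynamic_unmask(confidences, masked, threshold=0.9))     # [0, 2]
```

Here the threshold of 0.9 mirrors the `dynamic_threshold=0.9` setting used in the JetEngine example. When the model is confident about several positions at once, the dynamic rule finishes a block in fewer steps than the static one, which is where the extra speedup would come from.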
