<!-- Icon and title --> <h1 align="center"> <img src="./assets/lr_logo2.png" width="100" alt="lightreasoner-logo" /> <br> 💡 LightReasoner: Can <strong><em>SMALL</em></strong> Language Models Teach <strong><em>LARGE</em></strong> Language Models Reasoning? </h1> <!-- Authors --> <h3 align="center"> <a href="https://scholar.google.com/citations?user=BGT3Gb8AAAAJ&hl=en" target="_blank"> Jingyuan Wang</a> · <a href="https://scholar.google.com/citations?user=k6yAt6IAAAAJ&hl=en&oi=sra" target="_blank"> Yankai Chen</a> · <a href="https://scholar.google.com/citations?user=__9uvQkAAAAJ&hl=en" target="_blank"> Zhonghang Li</a> · <a href="https://scholar.google.com/citations?user=Zkv9FqwAAAAJ&hl=en" target="_blank"> Chao Huang</a> </h3> <p align="center"> <img src="./assets/welcome.png" width="500" alt="Welcome banner"/> </p> <!-- Quick links --> <div align="center">

arXiv 🤗 Paper License Baselines 🤗 Models <br>

<a href="https://github.com/HKUDS/LightReasoner/stargazers"><img src="https://img.shields.io/github/stars/HKUDS/LightReasoner?color=00d9ff&style=for-the-badge&logo=github&logoColor=white&labelColor=1a1a2e&label=Stars" alt="GitHub stars"> </a>

<p> <a href="README.md"><img src="https://img.shields.io/badge/🇺🇸English-1a1a2e?style=for-the-badge"></a> <a href="README-zh.md"><img src="https://img.shields.io/badge/🇨🇳中文版-1a1a2e?style=for-the-badge"></a> </p>

<a href="./Communication.md"><img src="https://img.shields.io/badge/💬Feishu-Group-07c160?style=for-the-badge&logoColor=white&labelColor=1a1a2e"></a> <a href="./Communication.md"><img src="https://img.shields.io/badge/WeChat-Group-07c160?style=for-the-badge&logo=wechat&logoColor=white&labelColor=1a1a2e"></a>

</div>
<p align="center"> <img src="./assets/lr_bars.png" width="800" /> <br> <em><strong>Figure 1: LightReasoner delivers superior performance with remarkable token efficiency</strong> - achieving consistent improvements in zero-shot pass@1 accuracy while dramatically reducing computational overhead by 90% in total time, 80% in sampled problems, and 99% in tuned tokens compared to traditional SFT.</em> </p>

💡 Key Insight:

This efficiency breakthrough shows that strategic token selection, not exhaustive training, is what unlocks the latent reasoning potential of LLMs: training smarter, not harder, is the path to scalable AI improvement.


🎉 News

  • [x] [2025/10/14] 🚀 New Release: LRsamples, a set of pre-collected LightReasoner training samples ready for immediate fine-tuning. This dataset enables direct model training without running the full sampling pipeline, streamlining reproduction and accelerating downstream research.
  • [x] [2025/10/14] 🚀 New Release: LightReasoner Enhanced Models now available on 🤗 Hugging Face Hub. Ready-to-use models fine-tuned with our efficient reasoning enhancement approach for immediate deployment and experimentation.
  • [x] [2025/10/12] 🚀 New Release: Core implementation with Qwen2.5-Math and DeepSeek-R1 models.

⚡ TL;DR

✨ LightReasoner ✨ flips the script on AI training — small language models (SLMs) don’t just learn from large ones (LLMs); they can actually teach LLMs better and faster!

🔥 The Challenge:

Supervised Fine-Tuning (SFT) struggles with three core bottlenecks:

  • 📊 Data-Intensive: Relies on human-labeled or rejection-sampled datasets.

  • ⚖️ Uniform Learning: Trains all tokens equally, even though only a small portion truly matter.

  • 🔗 Ground-Truth Dependency: Hinders adaptability to new domains and reasoning formats.

🔍 Key Insight:

We allocate 90% of compute to what models already know, while under-investing in the critical 10% that truly drives breakthroughs.

📈 LightReasoner: Better and Faster

Tested across 7 benchmarks × 5 models

🚀 Performance Gains

LightReasoner consistently boosts reasoning accuracy across multiple datasets:

  • 📈 Qwen2.5-Math-1.5B: +28.1% on GSM8K, +25.1% on MATH, +7.2% on SVAMP, +11.7% on ASDIV

  • 📈 DeepSeek-R1-Distill-Qwen-1.5B: +4.3% on GSM8K, +6.0% on MATH, +17.4% on OlympiadBench

  • 📈 Qwen2.5-Math-7B: +10.4% on GSM8K, +6.0% on MATH, +9.3% on SVAMP, +7.9% on ASDIV

  • 📈 Qwen2.5-Math-1.5B-Instruct: +1.9% on GSM8K, +2.6% on Minerva Math

  • 🌍 Strong generalization: Trained only on GSM8K, yet improves across 7 benchmarks

⚡ Efficiency Breakthrough

Taking Qwen2.5-Math-1.5B as an example, LightReasoner achieves dramatic efficiency gains compared with SFT:

  • ⏱️ 90% less total time: 4 hours → 0.5 hours

  • 🧾 80% fewer sampled problems: 3,952 → 1,000 problems

  • 🔢 99% fewer tuned tokens: 1.77M → 20K tokens

🌟 Key Features

  • 🎯 SLM–LLM Teaching:

    Counterintuitively uses smaller “amateur” models to identify critical reasoning moments where stronger “expert” models should focus their learning.

  • ⚡ Extreme Token Efficiency:

    Achieves 99% fewer tuned tokens than SFT by selectively optimizing high-impact reasoning steps instead of training uniformly on full trajectories.

  • 🔄 Three-Stage Lightweight Framework:

    (1) Critical step selection via Expert–Amateur KL divergence detection

    (2) Contrastive supervision capturing expert-amateur behavioral differentials

    (3) Self-distillation for internalizing expert strengths

  • 📈 KL-Guided Learning:

    Leverages behavioral divergence between expert and amateur predictions to pinpoint reasoning bottlenecks, all without requiring ground-truth labels.

  • 🧠 Expertise Over Scale:

    Demonstrates that domain expertise gaps, rather than model size, drive effective contrast — even same-sized models with different knowledge can generate powerful teaching signals.
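The KL-guided step selection described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: the distributions are toy probability lists over a shared vocabulary, the function names are ours, and `beta` stands in for the paper's β threshold.

```python
import math

def kl_divergence(p_expert, p_amateur):
    """D_KL(pi_E || pi_A) for two probability lists over a shared vocabulary."""
    return sum(pe * math.log(pe / pa)
               for pe, pa in zip(p_expert, p_amateur) if pe > 0.0)

def informative_steps(expert_dists, amateur_dists, beta=0.5):
    """Stage (1): keep the indices of reasoning steps whose Expert-Amateur
    KL divergence exceeds the threshold beta."""
    return [i for i, (pe, pa) in enumerate(zip(expert_dists, amateur_dists))
            if kl_divergence(pe, pa) > beta]
```

A step where the Expert is confident but the Amateur is not yields a large divergence and is retained; steps where both models agree carry little teaching signal and are discarded, which is where the token savings come from.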


🧩 LightReasoner Framework

<p align="center"> <img src="./assets/lr_new.png" width="800" /> <br> <em> <strong>Figure 2: Overview of the LightReasoner framework.</strong> (1) Sampling Stage: Expert and Amateur models generate distributions π<sub>E</sub> and π<sub>A</sub>. Informative step selection retains steps with D<sub>KL</sub>(π<sub>E</sub> ∥ π<sub>A</sub>) > β, and contrastive supervision constructs soft labels v<sub>C</sub> capturing the Expert's advantage through Expert–Amateur contrast. (2) Fine-tuning Stage: The Expert model is enhanced by minimizing the KL divergence between its output and v<sub>C</sub>. </em> </p>
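The contrastive supervision and fine-tuning objective from the figure can be sketched as follows. The renormalized Expert/Amateur ratio used for the soft label v<sub>C</sub> here is an illustrative stand-in, not the paper's exact construction, and the direction of the KL loss is an assumption on our part:

```python
import math

def contrastive_soft_label(p_expert, p_amateur):
    """Soft label v_C emphasizing tokens where the Expert outperforms the
    Amateur. Renormalized-ratio form; an illustrative stand-in for the
    paper's exact Expert-Amateur contrast."""
    ratio = [pe / pa for pe, pa in zip(p_expert, p_amateur)]
    total = sum(ratio)
    return [r / total for r in ratio]

def finetune_loss(p_model, v_c):
    """Stage (2): KL divergence between the soft label v_C and the model's
    output distribution (KL direction assumed)."""
    return sum(vc * math.log(vc / pm)
               for vc, pm in zip(v_c, p_model) if vc > 0.0)
```

For a token where the Expert assigns 0.8 and the Amateur only 0.4, v<sub>C</sub> concentrates extra mass on that token, so minimizing the loss nudges the Expert further toward its own comparative strengths rather than toward behavior both models already share.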

🚀 Quick Start

LightReasoner is incredibly easy to use. We’ve designed it to be accessible — so anyone can try it out and experience its “counterintuitive effectiveness” firsthand. No sweat — you’ll have it set up and running with your model of choice in just a few 🪄 simple steps below!

📦 Get Ready

git clone https://github.com/HKUDS/LightReasoner.git
cd LightReasoner

1️⃣ Install all dependencies:

pip install -r requirements.txt

2️⃣ Download the Expert and Amateur models of your choice. For example:

🦉 Expert Model

huggingface-cli download Qwen/Qwen2.5-Math-1.5B --local-dir ./Qwen2.5-Math-1.5B

🐣 Amateur Model

huggingface-cli download Qwen/Qwen2.5-0.5B --local-dir ./Qwen2.5-0.5B

3️⃣ Prepare the training data:

python data_prep.py

⚠️ Caveat

LightReasoner relies on Expert–Amateur model pairing to generate supervision signals. Thus, the choice of this pair is crucial to the method’s success.

⚖️ Rule of Thumb:

The Expert should significantly outperform the Amateur, while the Amateur must remain competent enough to produce coherent reasoning. In practice, performance peaks at a balanced “sweet spot” rather than simply widening the capability gap.

In our experiments, the Experts include Qwen2.5-Math-1.5B, 7B, their Instruct counterparts, and DeepSeek-R1-Distill variants. The Amateur is fixed as Qwen2.5-0.5B, which offers strong contrast while maintaining sufficient reasoning ability to yield meaningful signals.

You’re encouraged to explore other model families (e.g., Llama), but keep this balance principle in mind when setting up your Expert–Amateur collaboration.

📋 Note

  • We use GSM8K by default for its emphasis on step-by-step, broadly applicable logical reasoning rather than domain-specific notation. This ensures that the Amateur, despite lacking math-specific training, can still produce interpretable outputs suitable for contrastive supervision.

  • You’re absolutely free to try other datasets — LightReasoner is fully adaptable. However, depending on your dataset, you may need to adjust hyperparameters and the choice of Amateur model to ensure stable training and meaningful contrasts.

    • For instance, if you experiment with the MATH dataset — a collection of high-school competition problems that are significantly harder than GSM8K — it’s recommended to upgrade the Amateur model from a generic Qwen2.5 base model to the specialized Qwen2.5-Math variant. The base models were not math-pretrained and may struggle to produce coherent outputs on MATH, potentially destabilizing the expert–amateur contrast.

