R2R
[NeurIPS'25] The official code implementation for paper "R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing"
Roads to Rome (R2R) intelligently combines small and large language models by routing only critical, reasoning-divergent tokens to the large model.
https://github.com/user-attachments/assets/382fabd8-a816-44ba-b100-b8dd047c3bcb
By combining DeepSeek's R1-1.5B and R1-32B models, R2R-5.6B achieves a 2.8× speedup over R1-32B while surpassing R1-7B and R1-14B by 1.6× and 1.1× in accuracy on challenging math, coding, and QA benchmarks.
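Conceptually, R2R acts as a token-level gate between the two models. A minimal sketch of the idea (all names here are hypothetical illustrations, not the repository's API):

```python
# Conceptual sketch only: slm, llm, and router are hypothetical callables,
# not the repository's actual classes.
def r2r_decode_step(slm, llm, router, context, threshold=0.5):
    draft = slm(context)                         # cheap draft prediction from the SLM
    if router(draft.hidden_state) > threshold:   # router flags a reasoning-divergent token
        return llm(context).token                # route the critical token to the LLM
    return draft.token                           # otherwise keep the SLM token
```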
```bibtex
@article{fu2025r2r,
  title={R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing},
  author={Tianyu Fu and Yi Ge and Yichen You and Enshu Liu and Zhihang Yuan and Guohao Dai and Shengen Yan and Huazhong Yang and Yu Wang},
  journal={arXiv preprint arXiv:2505.21600},
  year={2025},
}
```
⭐ Feel free to star this repo or cite our paper if you find it useful!
📰 News
- [2026/01] v0.1 release! Major system update to support online serving with an OpenAI-compatible API. Supports batch inference, CUDA graphs, and much more, with a cleaner interface.
- [2025/10] Added support for the Qwen3 model family. Router checkpoints are now available here.
- [2025/09] Accepted by the NeurIPS'25 conference.
- [2025/06] Added support for sampling on DeepSeek's R1-1.5B and R1-32B models.
🔗 Interactive Demo
Check out our interactive demo and see R2R in action by visiting our project page.
🛠️ Environment Setup
Create a new conda environment:
```bash
conda create -n r2r python=3.10
conda activate r2r
```
Install all required packages with uv:
```bash
pip install uv
uv pip install -e .
```
<details>
<summary>Troubleshooting</summary>
- If you do not wish to use uv, you can also install using pip:
```bash
pip install -e .
pip install sgl-kernel==0.3.8
```
- If you accidentally install the wrong flashinfer and encounter related issues, please uninstall it before re-installation:
```bash
pip uninstall flashinfer-python
rm -rf ~/.cache/flashinfer/
rm -rf ~/.triton/cache
```
</details>
🚀 Quick Start
R2R is fully compatible with the SGLang chat completion API. Simply:
- Launch the server:
```bash
python script/inference/launch_r2r_server.py --config-path config/Qwen3-0.6B+Qwen3-32B.yaml --port 30000
```
- Send requests with any OpenAI-compatible API. An example from script/playground/simple_req.py:
```python
import requests

# Send a chat completion request to the local R2R server
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "default",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
response = requests.post(url, json=data)
print(response.json())
```
<details>
<summary>Custom download and OpenAI client</summary>
- To download existing R2R router checkpoints, like Qwen3-0.6B with Qwen3-8B, use:
```bash
hf download nics-efc/R2R_router_collections --repo-type model --include "Qwen3-0.6B+Qwen3-8B/**" --local-dir resource
```
See Pretrained routers for the full list of supported models.
- To use other request methods, like the OpenAI client, see examples in test/test_http_openai_client.py and test/test_http_openai_chat_completion.py; a minimal sketch is shown below.
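A minimal sketch with the official openai client, assuming the server from Quick Start is running on port 30000:

```python
from openai import OpenAI

# Point the client at the local R2R server; the API key is unused locally
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```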
</details>
📚 Usage
1. 💬 Run mixed inference with R2R
We provide an interactive example in interactive_chat.py. The main DynamicSimpleSGLangSelector class follows the SGLang offline Engine API and supports the .generate() method for getting responses.
You can use the provided config, or download the pre-trained router from this link and set router_path in the config to the router's local path:
```bash
python script/inference/interactive_chat.py --config-path config/Qwen3-0.6B+Qwen3-8B.yaml
```
The detailed model configurations are in the config folder.
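For reference, a hedged sketch of the offline mixed-inference flow described above (the import path and constructor keyword are assumptions; see script/inference/interactive_chat.py for the authoritative usage):

```python
# Hypothetical sketch; the import path and config_path kwarg are assumptions.
from r2r.models import DynamicSimpleSGLangSelector

# The selector wraps the SLM/LLM pair plus the token router defined in the config
selector = DynamicSimpleSGLangSelector(config_path="config/Qwen3-0.6B+Qwen3-8B.yaml")

# .generate() follows the SGLang offline Engine API, per the description above
output = selector.generate("What is the capital of France?")
print(output)
```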
2. 📊 Benchmark Performance
The following script evaluates R2R's accuracy and speed on AIME24-25, GPQA-Diamond, or LiveCodeBench:
```bash
python script/evaluate/hf_dataset_sglang.py --dataset aime --config-path config/Qwen3-0.6B+Qwen3-8B.yaml --use_hybrid
```
Detailed configurations for benchmark datasets and evaluation metrics are available in script/evaluate/eval_configs/dataset_configs.json. Moreover, our default router_path and threshold settings are provided through script/evaluate/eval_configs/r2r_configs.json.
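To sanity-check which defaults a run will pick up, a minimal sketch that only assumes both files are plain JSON (the exact schema may differ):

```python
import json

# Print the benchmark-dataset and R2R defaults used by the evaluation script
for path in (
    "script/evaluate/eval_configs/dataset_configs.json",
    "script/evaluate/eval_configs/r2r_configs.json",
):
    with open(path) as f:
        print(path, json.dumps(json.load(f), indent=2), sep="\n")
```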
For speed benchmarks, run the following commands:
```bash
# R2R speed benchmark
python script/playground/speed_benchmark.py --test_r2r --router_path resource/default_router.pt

# SLM/LLM speed benchmark
python script/playground/speed_benchmark.py --test_slm
python script/playground/speed_benchmark.py --test_llm
```
For an online serving comparison, test/test_speed_comparison.py benchmarks the OpenAI-compatible R2R and SGLang servers on AIME prompts and reports latency, throughput, and per-request speedup under either fixed-RPS or max-batch-size load.
```bash
# terminal 1: launch the R2R server
CUDA_VISIBLE_DEVICES=0,1 python script/inference/launch_r2r_server.py --config-path config/Qwen3-0.6B+Qwen3-32B.yaml --port 30000 --tp-size-quick 1 --tp-size-ref 2 --overlap-tp-schedule

# terminal 2: launch the SGLang server
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --tp 2 --port 30001 --model-path Qwen/Qwen3-32B

# terminal 3: run the benchmark
python test/test_speed_comparison.py --num-requests 8 --rps 0.1
```
3. 🧪 Train Your Own R2R Router
To train a custom R2R router for any LLM-SLM pair, you need to:
- Prepare a model preference label dataset
- Train the router using that dataset
<details>
<summary>Click to see detailed training instructions</summary>

💡 Remember to edit r2r/utils/model_configs.json according to your training setup before running the following steps.
3.1 Dataset Preparation
We provide a complete data generation pipeline in script/data_labeling/. You can either use our pre-generated training dataset from Hugging Face and skip to section 3.2, or follow the steps below to create your own dataset.
Initialize Dataset Conversion
Due to varying column names and data structures across different datasets, this step standardizes all datasets into a unified format for downstream processing. Customize datasets using --dataset_config:
```bash
python script/data_labeling/init_dataset_conversion.py --dataset_config aime,gpqa_extended,Bespoke-Stratos-17k-Code,Bespoke-Stratos-17k-QA --output_dir output/query_dataset
```
- Alternative: skip this step by using our pre-processed dataset nics-efc/R2R_query.
- Add a new dataset: customize the configuration file to standardize the new dataset following the format in script/data_labeling/support_dataset_config.json.
Step 0: Generate LLM Responses
Generate responses using a large language model (default: DeepSeek-R1-Distill-Qwen-32B):
```bash
python script/data_labeling/step_0_llm_response.py --model_path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dataset_path output/query_dataset --output_dir output/query_dataset/LLM_response --tp_size 2
```
We recommend using complete LLM responses within the 32K token limit for subsequent processing, saved under the datasets_finished/ folder. Alternatively, to use the pre-processed dataset, pass --dataset_path nics-efc/R2R_query --use_hf_dataset in the command above.
For faster data generation, we provide code using the SGLang API server:
```bash
# Start SGLang server
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 2

# Run API inference
python script/data_labeling_api/step_0_llm_response.py --api_url http://localhost:30000/v1 --model_path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dataset_path output/query_dataset --output_dir output/query_dataset/LLM_response --max_concurrent_requests 16
```
Step 1: SLM Prefill Analysis
Use the small language model (DeepSeek-R1-Distill-Qwen-1.5B) to prefill the LLM responses and find the positions where its predictions differ from the LLM's:
```bash
python script/data_labeling/step_1_slm_prefill.py --dataset_path output/query_dataset/LLM_response/dataset_finished --test_model_list deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output_path output/query_dataset/LLM_response/SLM_prefill
```
This generates SLM predictions, top-100 logits, and hidden states.
Step 2: LLM Continuation
Use the LLM to continue from SLM's non-identical prefill positions:
```bash
python script/data_labeling/step_2_llm_continuation.py --input_path output/query_dataset/LLM_response/SLM_prefill/prediction_comparison.csv --output_path output/query_dataset/LLM_response/SLM_prefill/LLM_continuation_verify --tp_size 2
```
Note: To use different models or loading paths, edit the configuration in r2r/utils/model_configs.json. Pay attention to configs like special token IDs and vocabulary size.
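Before editing, it can help to print the tokenizer-side values your entries must match; a small sanity-check sketch using transformers (the corresponding field names in model_configs.json are repo-specific, so this only shows the tokenizer side):

```python
# Print the tokenizer values that the entries in r2r/utils/model_configs.json
# must agree with (special token IDs, vocabulary size).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
print("eos_token_id:", tok.eos_token_id)
print("vocab size:", len(tok))
```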
For faster data generation, we provide code using the SGLang API server:
```bash
# Start SGLang server
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 2 --skip-tokenizer-init --enable-custom-logit-processor

# Run API inference
python script/data_labeling_api/step_2_llm_continuation.py --input_path output/query_dataset/LLM_response/SLM_prefill/prediction_comparison.csv --output_path output/query_dataset/LLM_response/SLM_prefill/LLM_continuation_verify --max_concurrent_requests 16
```
</details>