
[NeurIPS'25] The official code implementation for paper "R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing"

<div align="center"> <img src="resource/logo.png" alt="R2R Logo" width="100"/> <h1>Roads to Rome (R2R)</h1> <h3>Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing</h3> <p> <a href="https://fuvty.github.io/R2R_Project_Page/">🌐 <b>Project Page</b></a> • <a href="https://arxiv.org/abs/2505.21600">📑 <b>arXiv</b></a> • <a href="https://huggingface.co/collections/nics-efc/r2r">🤗 <b>HuggingFace</b></a> </p> </div>

Roads to Rome (R2R) intelligently combines small and large language models by routing only critical, reasoning-divergent tokens to the large model.

https://github.com/user-attachments/assets/382fabd8-a816-44ba-b100-b8dd047c3bcb

By combining DeepSeek's R1-1.5B and R1-32B models, R2R-5.6B achieves a 2.8× speedup over R1-32B while surpassing R1-7B and R1-14B by 1.6× and 1.1× in accuracy on challenging math, coding, and QA benchmarks.
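The routing idea can be pictured with a toy sketch. This is purely illustrative: the real router is a learned neural module operating on SLM hidden states, and the scores and `threshold` below are made-up values, not the repo's API.

```python
def route_tokens(divergence_scores, threshold=0.5):
    """Toy router: send tokens whose predicted divergence score exceeds
    the threshold to the large model (LLM); keep the rest on the small
    model (SLM). Scores and threshold are illustrative only."""
    return ["LLM" if score > threshold else "SLM" for score in divergence_scores]

# Most tokens stay on the cheap SLM; only critical ones hit the LLM.
print(route_tokens([0.1, 0.9, 0.2, 0.7]))  # → ['SLM', 'LLM', 'SLM', 'LLM']
```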

@article{fu2025r2r,
    title={R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing}, 
    author={Tianyu Fu and Yi Ge and Yichen You and Enshu Liu and Zhihang Yuan and Guohao Dai and Shengen Yan and Huazhong Yang and Yu Wang},
    journal={arXiv preprint arXiv:2505.21600},
    year={2025},
}

Feel free to star this repo or cite our paper if you find it useful!

📰 News

  • [2026/01] v0.1 release! Major system update to support online serving with an OpenAI-compatible API; adds batch inference, CUDA graph capture, and much more, with a cleaner interface.

  • [2025/10] Added support for the Qwen3 model family. Router checkpoints are now available on Hugging Face.

  • [2025/09] Accepted by the NeurIPS'25 conference.

  • [2025/06] Added support for sampling with DeepSeek's R1-1.5B and R1-32B models.

🔗 Interactive Demo

Check out our interactive demo and see R2R in action by visiting our project page.

🛠️ Environment Setup

Create new conda environment

conda create -n r2r python=3.10
conda activate r2r

Install all required packages with uv

pip install uv
uv pip install -e .
<details> <summary>Troubleshooting</summary>
  1. If you do not wish to use uv, you can also install with pip:
pip install -e .
pip install sgl-kernel==0.3.8
  2. If you accidentally install the wrong flashinfer build and encounter related issues, uninstall it before re-installing:
pip uninstall flashinfer-python
rm -rf ~/.cache/flashinfer/
rm -rf ~/.triton/cache
</details>

🚀 Quick Start

R2R is fully compatible with the SGLang chat completion API. Simply:

  1. Launch the server.
python script/inference/launch_r2r_server.py --config-path config/Qwen3-0.6B+Qwen3-32B.yaml --port 30000
  2. Send requests with any OpenAI-compatible API. An example is shown in script/playground/simple_req.py:
import requests

url = "http://localhost:30000/v1/chat/completions"

data = {
    "model": "default",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print(response.json())
<details> <summary>Custom download and OpenAI client</summary>
  1. To download existing R2R router checkpoints, like Qwen3-0.6B with Qwen3-8B, use
hf download nics-efc/R2R_router_collections --repo-type model --include "Qwen3-0.6B+Qwen3-8B/**" --local-dir resource

See Pretrained routers for the full list of supported models.

  2. To use other request methods, like the OpenAI client, see the examples in test/test_http_openai_client.py and test/test_http_openai_chat_completion.py.
</details>
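The repo's own client examples live in the test files above; as a minimal hedged sketch, the same request can be sent with the official `openai` Python package. The endpoint, port, and model name `"default"` come from the quick-start snippet; the `build_chat_request` helper is our own, not part of the repo.

```python
def build_chat_request(prompt, model="default"):
    """Assemble kwargs for an OpenAI-compatible chat completion call.
    model="default" matches the quick-start payload above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

if __name__ == "__main__":
    # Requires `pip install openai` and a running R2R server on port 30000.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        **build_chat_request("What is the capital of France?")
    )
    print(resp.choices[0].message.content)
```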

📚 Usage

1. 💬 Run Mixed Inference with R2R

We provide an interactive example in interactive_chat.py. The main DynamicSimpleSGLangSelector class follows the SGLang offline Engine API and supports the .generate() method for getting responses.

You can use the provided config, or download the pre-trained router from this link and set router_path in the config to the router's local path:

python script/inference/interactive_chat.py --config-path config/Qwen3-0.6B+Qwen3-8B.yaml

The detailed model configurations are in the config folder.

2. 📊 Benchmark Performance

The following script evaluates R2R's accuracy and speed on AIME24-25, GPQA-Diamond, or LiveCodeBench:

python script/evaluate/hf_dataset_sglang.py --dataset aime --config-path config/Qwen3-0.6B+Qwen3-8B.yaml --use_hybrid 

Detailed configurations for benchmark datasets and evaluation metrics are available in script/evaluate/eval_configs/dataset_configs.json. Moreover, our default router_path and threshold settings are provided through script/evaluate/eval_configs/r2r_configs.json.

For the speed benchmark, run the following commands:

# R2R speed benchmark
python script/playground/speed_benchmark.py --test_r2r --router_path resource/default_router.pt
# SLM/LLM speed benchmark
python script/playground/speed_benchmark.py --test_slm
python script/playground/speed_benchmark.py --test_llm

For an online serving comparison, test/test_speed_comparison.py benchmarks the OpenAI-compatible R2R and SGLang servers on AIME prompts and reports latency, throughput, and per-request speedup under either fixed-RPS or max-batch-size load.

# terminal 1: launch the R2R server
CUDA_VISIBLE_DEVICES=0,1 python script/inference/launch_r2r_server.py --config-path config/Qwen3-0.6B+Qwen3-32B.yaml --port 30000 --tp-size-quick 1 --tp-size-ref 2 --overlap-tp-schedule
# terminal 2: launch the SGLang server
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --tp 2 --port 30001 --model-path Qwen/Qwen3-32B

# terminal 3: run the benchmark
python test/test_speed_comparison.py --num-requests 8 --rps 0.1
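The metrics the comparison script reports (latency, throughput, per-request speedup) boil down to simple arithmetic. A hedged sketch of that arithmetic, with our own helper names rather than the script's actual API:

```python
def summarize_run(latencies_s, output_tokens):
    """Mean per-request latency and aggregate token throughput
    for one server's benchmark run."""
    mean_latency = sum(latencies_s) / len(latencies_s)
    throughput = sum(output_tokens) / sum(latencies_s)
    return mean_latency, throughput

def per_request_speedup(baseline_latency_s, r2r_latency_s):
    """Speedup of R2R over the baseline server for one request."""
    return baseline_latency_s / r2r_latency_s

# e.g. two requests taking 4s and 2s, producing 400 and 200 tokens:
print(summarize_run([4.0, 2.0], [400, 200]))  # → (3.0, 100.0)
print(per_request_speedup(4.0, 2.0))          # → 2.0
```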

3. 🧪 Train Your Own R2R Router

To train a custom R2R router for any LLM-SLM pair, you need to:

  1. Prepare a model preference label dataset
  2. Train the router using that dataset

💡 Remember to edit r2r/utils/model_configs.json according to your training setup before running the following steps.

<details> <summary>Click to see detailed training instructions</summary>

3.1 Dataset Preparation

We provide a complete data generation pipeline in script/data_labeling/. You can either use our pre-generated training dataset from Hugging Face and skip to section 3.2, or follow the steps below to create your own dataset.

Initialize Dataset Conversion

Because column names and data structures vary across datasets, this step standardizes them all into a unified format for downstream processing. Select and customize datasets with --dataset_config:

python script/data_labeling/init_dataset_conversion.py --dataset_config aime,gpqa_extended,Bespoke-Stratos-17k-Code,Bespoke-Stratos-17k-QA --output_dir output/query_dataset

Alternative: Skip this step by using our pre-processed dataset nics-efc/R2R_query.

Add a new dataset: extend the configuration file script/data_labeling/support_dataset_config.json to describe how the new dataset maps onto the standardized format.
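As an illustration of what such a standardization step does, the sketch below maps one raw row onto a single question/answer layout. The field names are hypothetical; the repo's actual unified schema is whatever support_dataset_config.json defines.

```python
def to_unified(row, question_key, answer_key, source_name):
    """Map one raw dataset row onto a single question/answer schema.
    Field names here are illustrative, not the repo's actual format."""
    return {
        "question": row[question_key],
        "answer": row.get(answer_key),
        "source": source_name,
    }

# Datasets with different column names collapse to one layout:
print(to_unified({"Problem": "1+1=?", "Solution": "2"}, "Problem", "Solution", "aime"))
# → {'question': '1+1=?', 'answer': '2', 'source': 'aime'}
```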

Step 0: Generate LLM Responses

Generate responses using a large language model (default: DeepSeek-R1-Distill-Qwen-32B):

python script/data_labeling/step_0_llm_response.py --model_path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dataset_path output/query_dataset --output_dir output/query_dataset/LLM_response --tp_size 2

We recommend using complete LLM responses within the 32K token limit for subsequent processing; these are saved under the datasets_finished/ folder. Alternatively, to use the pre-processed dataset, pass --dataset_path nics-efc/R2R_query --use_hf_dataset in the command above.

For faster data generation, we also provide code that uses an SGLang API server:

# Start SGLang server
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 2
# Run API inference
python script/data_labeling_api/step_0_llm_response.py --api_url http://localhost:30000/v1 --model_path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dataset_path output/query_dataset --output_dir output/query_dataset/LLM_response --max_concurrent_requests 16
Step 1: SLM Prefill Analysis

Use the small language model (DeepSeek-R1-Distill-Qwen-1.5B) to prefill the LLM responses and identify positions where its token predictions differ:

python script/data_labeling/step_1_slm_prefill.py --dataset_path output/query_dataset/LLM_response/dataset_finished --test_model_list deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output_path output/query_dataset/LLM_response/SLM_prefill

This generates SLM predictions, top-100 logits, and hidden states.
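Conceptually, the "non-identical" positions are just token mismatches between the SLM's greedy prefill predictions and the tokens the LLM actually produced. This is a simplified view of the step: the real pipeline also records logits and hidden states, and Step 2 then verifies true divergence via LLM continuation.

```python
def non_identical_positions(slm_pred_tokens, llm_tokens):
    """Indices where the SLM's greedy next-token prediction disagrees
    with the token the LLM actually produced at that position."""
    return [
        i for i, (s, l) in enumerate(zip(slm_pred_tokens, llm_tokens))
        if s != l
    ]

# Token ids disagree at positions 1 and 3:
print(non_identical_positions([5, 17, 8, 99], [5, 23, 8, 42]))  # → [1, 3]
```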

Step 2: LLM Continuation

Use the LLM to continue from the SLM's non-identical prefill positions:

python script/data_labeling/step_2_llm_continuation.py --input_path output/query_dataset/LLM_response/SLM_prefill/prediction_comparison.csv --output_path output/query_dataset/LLM_response/SLM_prefill/LLM_continuation_verify --tp_size 2

Note: To use different models or loading paths, edit the configuration in r2r/utils/model_configs.json. Pay attention to configs like special token IDs and vocabulary size.

For faster data generation, we also provide code that uses an SGLang API server:

# Start SGLang server
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 2 --skip-tokenizer-init --enable-custom-logit-processor
# Run API inference
python script/data_labeling_api/step_2_llm_continuation.py --input_path output/query_dataset/LLM_response/SLM_prefill/prediction_comparison.csv --output_path output/query_dataset/LLM_response/SLM_prefill/LLM_continuation_verify --max_concurrent_requests 16
</details>
