POLAR
Pre-trained, Scalable, High-performance Reward Models via Policy Discriminative Learning.
<img src="./assets/logo.png" width="400"/><br>
🤗 HuggingFace | 🤖 ModelScope | 📜 Paper<br>
Latest News 🎉
- [2025/09] Our POLAR paper has been accepted to NeurIPS 2025.
- [2025/09] POLAR now supports RFT (Reinforcement Fine-tuning) training using VERL.
Introduction
POLAR represents a significant breakthrough in scalar-based reward models achieved through large-scale pre-training. It leverages the innovative POLicy DiscriminAtive LeaRning (POLAR) paradigm, a scalable, high-level optimization objective, to effectively discriminate between policies using a large-scale synthetic corpus. Following pre-training, POLAR RMs are fine-tuned with minimal preference data, rapidly aligning with human preferences. Key features of POLAR include:
- Innovative Pre-training Paradigm: POLAR trains a reward model to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods that rely on absolute preferences, POLAR captures the relative difference between two policies, a scalable, high-level optimization objective suitable for modeling generic ranking relationships.
- Tailored for Reinforcement Fine-tuning: POLAR assigns rewards to LLM trajectories based on given references, aligning naturally with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.
- Superior Performance and Generalization: POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reduce reward hacking.
- Easy to Customize: Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, facilitating straightforward adaptation to specific applications and experimental requirements.
<img src="./assets/intro.jpeg"/><br>
Model Zoo
We release POLAR reward models in sizes of 1.8B and 7B parameters. The "base" models (POLAR-1.8B-Base and POLAR-7B-Base) are pre-trained-only checkpoints, ideal for customized fine-tuning according to specific preferences. The "ready-to-use" checkpoints (POLAR-1.8B and POLAR-7B) have already been fine-tuned on general preference data, making them suitable for immediate use in most scenarios.
| Model | Transformers (HF) | ModelScope |
| --------------- | ------------------ | ------------------ |
| POLAR-1.8B-Base | 🤗 POLAR-1_8B-Base | 🤖 POLAR-1_8B-Base |
| POLAR-1.8B | 🤗 POLAR-1_8B | 🤖 POLAR-1_8B |
| POLAR-7B-Base | 🤗 POLAR-7B-Base | 🤖 POLAR-7B-Base |
| POLAR-7B | 🤗 POLAR-7B | 🤖 POLAR-7B |
Performance
We conducted a comprehensive evaluation of POLAR via the Proximal Policy Optimization (PPO) algorithm, evaluating the downstream RL performance of four different policy models using OpenCompass. More details are available in our paper.
<img src="./assets/result.png"/><br>
Quick Start
This repository provides a RewardModelClient class (src/polar/reward_func.py) for querying reward values from a remote POLAR server. It handles input encoding, communication with different backends (sglang, vllm, lmdeploy), and returns the reward scores.
```python
from src.polar import RewardModelClient
```
Alternatively, you can use XTuner's implementation by installing XTuner and importing the class from it.
```python
from xtuner.utils import RewardModelClient
```
For XTuner installation instructions, see the Fine-tune section below.
Inference
We support reward inference through lmdeploy, sglang, and vllm. We recommend setting up a virtual environment with conda when using these inference engines to prevent potential dependency conflicts.
Data format
Unlike traditional reward models, POLAR requires an additional reference trajectory as a demonstration and evaluates candidate trajectories by measuring their consistency with the provided reference.
```python
data = [
    {
        "prompt": [{"role": "user", "content": "What is the capital of China?"}],
        "reference": [{"role": "assistant", "content": "Beijing."}],
        "output": [{"role": "assistant", "content": "Beijing."}]
    },
    {
        "prompt": [{"role": "user", "content": "What is the capital of China?"}],
        "reference": [{"role": "assistant", "content": "Beijing."}],
        "output": [{"role": "assistant", "content": "Shanghai."}]
    }
]
```
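When building such records programmatically, a small helper keeps the three-field structure consistent. This helper is a hypothetical convenience, not part of the repo:

```python
def make_polar_record(prompt, reference, output):
    """Build one POLAR data record from plain strings.

    POLAR scores `output` by its consistency with `reference`,
    so every record needs all three fields.
    """
    return {
        "prompt": [{"role": "user", "content": prompt}],
        "reference": [{"role": "assistant", "content": reference}],
        "output": [{"role": "assistant", "content": output}],
    }

data = [
    make_polar_record("What is the capital of China?", "Beijing.", "Beijing."),
    make_polar_record("What is the capital of China?", "Beijing.", "Shanghai."),
]
```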
Inference with transformers
Reward request
To load the POLAR model using transformers, use the following code to get rewards:
```python
from transformers import AutoModel, AutoTokenizer
from src.polar import RewardModelClient
# from xtuner.utils import RewardModelClient

model_name = 'internlm/POLAR-7B'
model = AutoModel.from_pretrained(
    model_name,
    device_map="cuda",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# The client is used here only to encode the data into the model's input format.
client = RewardModelClient(model_name)
encoded_data = client.encode(data)
batch = tokenizer(encoded_data, return_tensors='pt', padding=True).to('cuda')
outputs = model(**batch)
rewards = outputs[0].squeeze(-1).cpu().tolist()
print(rewards)
# [-0.5702977776527405, -11.030370712280273] for previous example data
```
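Because POLAR scores each candidate by its consistency with the reference, the returned rewards can directly drive best-of-n selection. A minimal sketch, using the illustrative reward values printed above:

```python
# Rewards aligned with the two records in `data` (illustrative values
# from the transformers example above).
rewards = [-0.5702977776527405, -11.030370712280273]

# Pick the candidate whose output is most consistent with the reference.
best_idx = max(range(len(rewards)), key=rewards.__getitem__)
print(best_idx)  # 0 -> "Beijing." outranks "Shanghai."
```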
Inference with lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Requirements
- lmdeploy >= 0.9.1
Server Launch
```bash
lmdeploy serve api_server internlm/POLAR-7B --backend pytorch --server-port 30000
```
Client Request
```python
from src.polar import RewardModelClient
# from xtuner.utils import RewardModelClient

client = RewardModelClient("internlm/POLAR-7B",
                           server_type="lmdeploy",
                           server_address="127.0.0.1:30000")

# Request rewards directly.
rewards = client(data)
print(rewards)

# Alternatively, first encode the data, then get rewards via the request function.
encoded_data = client.encode(data)
rewards = client.lmdeploy_request_reward(encoded_data)
print(rewards)
```
Inference with sglang
Requirements
- 0.4.3.post4 <= sglang <= 0.4.4.post1
Server Launch
```bash
python3 -m sglang.launch_server --model internlm/POLAR-7B --trust-remote-code --is-embedding --dp 4 --tp 2 --mem-fraction-static 0.9 --port 30000
```
Client Request
```python
from src.polar import RewardModelClient
# from xtuner.utils import RewardModelClient

client = RewardModelClient("internlm/POLAR-7B",
                           server_type="sglang",
                           server_address="127.0.0.1:30000")

# Request rewards directly.
rewards = client(data)
print(rewards)

# Alternatively, first encode the data, then get rewards via the request function.
encoded_data = client.encode(data)
rewards = client.sglang_request_reward(encoded_data)
print(rewards)
```
Inference with vllm
Requirements
- vllm >= 0.8.0
Server Launch
```bash
vllm serve internlm/POLAR-7B --task=reward --trust-remote-code --tensor-parallel-size=2 --port 30000
```
Client Request
```python
from src.polar import RewardModelClient
# from xtuner.utils import RewardModelClient

client = RewardModelClient("internlm/POLAR-7B",
                           server_type="vllm",
                           server_address="127.0.0.1:30000")

# Request rewards directly.
rewards = client(data)
print(rewards)

# Alternatively, first encode the data, then get rewards via the request function.
encoded_data = client.encode(data)
rewards = client.vllm_request_reward(encoded_data)
print(rewards)
```
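All three backends follow the same encode-then-request pattern, differing only in which request method is called. A small hypothetical dispatcher (not part of the repo) makes that explicit; it assumes a RewardModelClient already connected to a running server:

```python
def request_rewards(client, data, server_type):
    """Encode once, then dispatch to the backend-specific request method."""
    encoded = client.encode(data)
    request_fn = {
        "lmdeploy": client.lmdeploy_request_reward,
        "sglang": client.sglang_request_reward,
        "vllm": client.vllm_request_reward,
    }[server_type]
    return request_fn(encoded)
```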
RFT with VERL
POLAR can be easily integrated into various reinforcement learning frameworks. This repository provides an example showing how to use VERL for reinforcement fine-tuning (RFT) with POLAR reward models.
Environment Setup
Please refer to the VERL official installation guide for detailed environment setup instructions.
Note: For training the Qwen2.5 series, we recommend the vLLM 0.8.3 inference backend and Transformers 4.50.3 for optimal performance. A higher version of Transformers may cause training issues.
