POLAR

Pre-trained, Scalable, High-performance Reward Models via Policy Discriminative Learning.


<div align="center">

<img src="./assets/logo.png" width="400"/><br>


🤗 HuggingFace | 🤖 ModelScope | 📜 Paper<br>

English | 简体中文

</div>

Latest News 🎉

  • [2025/09] Our POLAR paper has been accepted by NeurIPS 2025.
  • [2025/09] POLAR now supports RFT (Reinforcement Fine-tuning) training using VERL.

Introduction

POLAR is a breakthrough in scalar reward modeling achieved through large-scale pre-training. It adopts the POLicy DiscriminAtive LeaRning (POLAR) paradigm, a scalable, high-level optimization objective, to discriminate between policies using a large-scale synthetic corpus. After pre-training, POLAR RMs are fine-tuned with a small amount of preference data to rapidly align with human preferences. Key features of POLAR include:

  • Innovative Pre-training Paradigm: POLAR trains a reward model to recognize identical policies and discriminate between different ones. Unlike traditional reward modeling methods that rely on absolute preferences, POLAR captures the relative difference between two policies, a scalable, high-level optimization objective suitable for modeling generic ranking relationships.

  • Tailored for Reinforcement Fine-tuning: POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.

  • Superior Performance and Generalization: POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios, significantly reducing reward hacking.

  • Easy to Customize: Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.

<img src="./assets/intro.jpeg"/><br>

Model Zoo

We release POLAR reward models in sizes of 1.8B and 7B parameters. The "base" models (POLAR-1.8B-Base and POLAR-7B-Base) are pre-trained-only checkpoints, ideal for customized fine-tuning according to specific preferences. The "ready-to-use" checkpoints (POLAR-1.8B and POLAR-7B) have already been fine-tuned on general preference data, making them suitable for immediate use in most scenarios.

| Model | Transformers (HF) | ModelScope |
| --- | --- | --- |
| POLAR-1.8B-Base | 🤗 POLAR-1_8B-Base | 🤖 POLAR-1_8B-Base |
| POLAR-1.8B | 🤗 POLAR-1_8B | 🤖 POLAR-1_8B |
| POLAR-7B-Base | 🤗 POLAR-7B-Base | 🤖 POLAR-7B-Base |
| POLAR-7B | 🤗 POLAR-7B | 🤖 POLAR-7B |

Performance

We conducted a comprehensive evaluation of POLAR via the Proximal Policy Optimization (PPO) algorithm, assessing the downstream RL performance of four different policy models with OpenCompass. More details are available in our Paper.

<img src="./assets/result.png"/><br>

Quick Start

This repository provides a RewardModelClient class (src/polar/reward_func.py) for querying reward values from a remote POLAR server. It handles input encoding, communication with different backends (sglang, vllm, lmdeploy), and returns the reward scores.

from src.polar import RewardModelClient

Alternatively, you can use XTuner’s implementation by installing XTuner and importing the class from it.

from xtuner.utils import RewardModelClient

For XTuner installation instructions, see the Fine-tune section below.

Inference

We support reward inference through lmdeploy, sglang, and vllm. We recommend setting up a virtual environment with conda when using these inference engines to prevent potential dependency conflicts.

Data format

Unlike traditional reward models, POLAR requires an additional reference trajectory as a demonstration and evaluates candidate trajectories by measuring their consistency with the provided reference.

data = [
    {
        "prompt": [{"role": "user", "content": "What is the capital of China?"}],
        "reference": [{"role": "assistant", "content": "Beijing."}],
        "output": [{"role": "assistant", "content": "Beijing."}]
    },
    {
        "prompt": [{"role": "user", "content": "What is the capital of China?"}],
        "reference": [{"role": "assistant", "content": "Beijing."}],
        "output": [{"role": "assistant", "content": "Shanghai."}]
    }
]
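The payload above can also be assembled programmatically. The helper below is an illustrative sketch; the function name `make_polar_sample` is ours, not part of the repository:

```python
def make_polar_sample(prompt: str, reference: str, output: str) -> dict:
    """Build one POLAR request item: a user prompt, a reference
    trajectory, and the candidate output to be scored against it."""
    return {
        "prompt": [{"role": "user", "content": prompt}],
        "reference": [{"role": "assistant", "content": reference}],
        "output": [{"role": "assistant", "content": output}],
    }

data = [
    make_polar_sample("What is the capital of China?", "Beijing.", "Beijing."),
    make_polar_sample("What is the capital of China?", "Beijing.", "Shanghai."),
]
```

This reproduces exactly the two-item `data` list shown above, which the later examples reuse.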

Inference with transformers

Reward request

To load the POLAR model using transformers, use the following code to get rewards:

from transformers import AutoModel, AutoTokenizer
from src.polar import RewardModelClient
# from xtuner.utils import RewardModelClient

model_name = 'internlm/POLAR-7B'

model = AutoModel.from_pretrained(
    model_name,
    device_map="cuda", 
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

client = RewardModelClient(model_name)
encoded_data = client.encode(data)
batch = tokenizer(encoded_data, return_tensors='pt', padding=True).to('cuda')
outputs = model(**batch)
rewards = outputs[0].squeeze(-1).cpu().tolist()
print(rewards)
# [-0.5702977776527405, -11.030370712280273] for previous example data
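Because POLAR scores each candidate by its consistency with the reference, a common usage pattern is best-of-N selection: score several candidates and keep the one with the highest reward. A minimal sketch (the helper name `select_best` is ours):

```python
def select_best(candidates, rewards):
    """Return the candidate with the highest POLAR reward."""
    best_idx = max(range(len(rewards)), key=lambda i: rewards[i])
    return candidates[best_idx]

# Rewards from the example above: "Beijing." scores far higher than "Shanghai."
candidates = ["Beijing.", "Shanghai."]
rewards = [-0.5702977776527405, -11.030370712280273]
print(select_best(candidates, rewards))  # Beijing.
```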

Inference with lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Requirements

  • lmdeploy >= 0.9.1

Server Launch

lmdeploy serve api_server internlm/POLAR-7B --backend pytorch --server-port 30000

Client Request

from src.polar import RewardModelClient
# from xtuner.utils import RewardModelClient

client = RewardModelClient("internlm/POLAR-7B",
                           server_type="lmdeploy",
                           server_address="127.0.0.1:30000")

# Request rewards directly
rewards = client(data)
print(rewards)

# First encode data and then get rewards via the request function.
encoded_data = client.encode(data)
rewards = client.lmdeploy_request_reward(encoded_data)
print(rewards)

Inference with sglang

Requirements

  • 0.4.3.post4 <= sglang <= 0.4.4.post1

Server Launch

python3 -m sglang.launch_server --model internlm/POLAR-7B --trust-remote-code --is-embedding --dp 4 --tp 2 --mem-fraction-static 0.9 --port 30000

Client Request

from src.polar import RewardModelClient
# from xtuner.utils import RewardModelClient

client = RewardModelClient("internlm/POLAR-7B",
                           server_type="sglang",
                           server_address="127.0.0.1:30000")

# Request rewards directly
rewards = client(data)
print(rewards)

# First encode data and then get rewards via the request function.
encoded_data = client.encode(data)
rewards = client.sglang_request_reward(encoded_data)
print(rewards)

Inference with vllm

Requirements

  • vllm >= 0.8.0

Server Launch

vllm serve internlm/POLAR-7B --task=reward --trust-remote-code --tensor-parallel-size=2 --port 30000

Client Request

from src.polar import RewardModelClient
# from xtuner.utils import RewardModelClient

client = RewardModelClient("internlm/POLAR-7B",
                           server_type="vllm",
                           server_address="127.0.0.1:30000")

# Request rewards directly
rewards = client(data)
print(rewards)

# First encode data and then get rewards via the request function.
encoded_data = client.encode(data)
rewards = client.vllm_request_reward(encoded_data)
print(rewards)
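Raw POLAR rewards are unnormalized scalars. If you need a probability that one output is preferred over another, a Bradley-Terry style sigmoid over the reward difference is a common post-processing step; this is our illustration, not an API provided by the repository:

```python
import math

def preference_probability(reward_a: float, reward_b: float) -> float:
    """P(output A preferred over output B) under a Bradley-Terry model
    applied to the difference of the two reward scores."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Using the rewards from the transformers example: "Beijing." is
# strongly preferred over "Shanghai."
p = preference_probability(-0.5702977776527405, -11.030370712280273)
print(round(p, 4))  # 1.0
```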

RFT with VERL

POLAR can be easily integrated into various reinforcement learning frameworks. This repository provides an example showing how to use VERL for reinforcement fine-tuning (RFT) with POLAR reward models.

Environment Setup

Please refer to the VERL official installation guide for detailed environment setup instructions.

Note: For training the Qwen2.5 series, we recommend vLLM 0.8.3 as the inference backend and Transformers 4.50.3 for optimal performance. A higher version of transformers may cause training issues.
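To plug POLAR into an RFT loop, the trainer typically needs a batch reward function that maps (prompt, reference, rollout) triples to scalar rewards. The sketch below assembles POLAR request payloads and delegates scoring to an injected callable; in practice that callable would be a `RewardModelClient` pointed at a live server, but the wiring shown here is our own illustration, so the scorer is passed in rather than hard-coded:

```python
def polar_batch_rewards(prompts, references, outputs, scorer):
    """Assemble POLAR payloads for a batch and return one reward per sample.

    `scorer` is any callable mapping a list of POLAR data dicts to a list
    of floats, e.g. a RewardModelClient instance querying a POLAR server.
    """
    data = [
        {
            "prompt": [{"role": "user", "content": p}],
            "reference": [{"role": "assistant", "content": r}],
            "output": [{"role": "assistant", "content": o}],
        }
        for p, r, o in zip(prompts, references, outputs)
    ]
    return scorer(data)

# Stub scorer for illustration: +1 when the output matches the reference.
def stub_scorer(data):
    return [
        1.0 if d["output"][0]["content"] == d["reference"][0]["content"] else 0.0
        for d in data
    ]

rewards = polar_batch_rewards(
    ["Capital of China?"], ["Beijing."], ["Beijing."], scorer=stub_scorer
)
print(rewards)  # [1.0]
```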
