[ICLR 2026] Official repository for the paper "LaSeR: Reinforcement Learning with Last-Token Self-Rewarding"
# LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
We propose LaSeR, a lightweight and effective algorithm that simultaneously optimizes both the reasoning and self-rewarding capabilities of LLMs with minimal additional cost, by introducing a simple MSE loss into the standard RLVR objective. The optimized self-rewarding scores can serve as auxiliary reward signals in both training and testing stages to enhance model performance.
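The core idea above can be sketched in a few lines of PyTorch. This is a conceptual illustration, not the verl implementation: the self-rewarding token id, the loss coefficient, and the function name are all hypothetical placeholders, and in real training this term is added on top of the RLVR policy loss.

```python
import torch
import torch.nn.functional as F

# Conceptual sketch of the LaSeR auxiliary loss (NOT the verl implementation).
# SELF_REWARD_TOKEN_ID and MSE_COEF are hypothetical placeholders.
SELF_REWARD_TOKEN_ID = 9454  # id of the pre-specified self-rewarding token
MSE_COEF = 1.0               # weight of the auxiliary MSE term

def laser_aux_loss(last_logits: torch.Tensor,
                   verifier_reward: torch.Tensor,
                   token_id: int = SELF_REWARD_TOKEN_ID,
                   coef: float = MSE_COEF) -> torch.Tensor:
    """MSE between the model's self-rewarding score and the verifier reward.

    last_logits:     (batch, vocab) logits at the last-token position
    verifier_reward: (batch,) rule-based verifier rewards in [0, 1]
    """
    # Self-rewarding score = probability assigned to the pre-specified token.
    self_reward = torch.softmax(last_logits, dim=-1)[:, token_id]
    return coef * F.mse_loss(self_reward, verifier_reward)

# In training this term is simply added to the standard RLVR objective:
# total_loss = rlvr_policy_loss + laser_aux_loss(last_logits, verifier_reward)
```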
<div style="display: flex; justify-content: center; flex-wrap: wrap;"> <img src="./imgs/laser.png" style="width: 98%" /> </div>

## News
- [2025.10.16] We release our paper on arXiv, along with the source code and checkpoints.
## Models
| Name | Link |
|------------------------|:------|
| Octothinker-3B-Short-LaSeR | hf model |
| Qwen2.5-7B-LaSeR | hf model |
| ORZ-7B-LaSeR | hf model |
## Data
The evaluation data is in the `data/` directory. The processed training data can be downloaded from here.
## Installation
Our code is mainly based on verl (v0.5.0). To prepare the environment, please follow these steps:
```bash
conda create -n verl python==3.10
conda activate verl
cd verl/
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install math-verify
```
## Training
We provide example scripts for GRPO and LaSeR training in `examples/grpo_trainer/`. Before running, please download the related datasets to the appropriate locations.
### Quick Start
For experiments on Qwen2.5-7B-Base and ORZ-7B:
```bash
cd verl/
bash examples/grpo_trainer/run_qwen2_5_7b.sh
```
For experiments on OctoThinker-3B-Short-Base:
```bash
cd verl/
bash examples/grpo_trainer/run_octothinker_3b.sh
```
You can modify these scripts to adapt the training parameters and paths to your own setup. The scripts include all necessary hyper-parameters; detailed explanations are in `verl/verl/trainer/config/actor/actor.yaml`.
Make sure to set your `WANDB_API_KEY` if you want to use Weights & Biases logging.
## Evaluation
Our evaluation code is in the `src/` folder.
Ideally, the self-rewarding score could be computed directly with one additional forward step after the model generates the `<EOS>` token, reading off the predicted probability of the pre-specified self-rewarding token. However, this requires modifying the underlying sampling logic of vLLM. In the current version, we instead concatenate the pre-specified token after each fully generated solution and then perform a separate forward pass to obtain the self-rewarding scores. We welcome the community to contribute a PR that adapts vLLM to our method for more efficient self-rewarding!
For now, users can run the following script to evaluate the reasoning and self-rewarding capabilities of the target model:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 sh scripts/run_eval_math.sh
```
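The concatenate-then-forward scoring described above can be sketched as follows. This is a hedged illustration, not the code in `src/`: we assume a causal LM callable that maps `(batch, seq_len)` token ids to `(batch, seq_len, vocab)` logits (e.g. the `.logits` of a Hugging Face model), and the function name is hypothetical.

```python
import torch

def self_reward_score(model, solution_ids: torch.Tensor,
                      self_reward_token_id: int) -> torch.Tensor:
    """Probability of the pre-specified self-rewarding token after the solution.

    solution_ids: (batch, seq_len) prompt + fully generated solution (incl. <EOS>)
    model:        callable returning (batch, seq_len, vocab) logits
    """
    batch = solution_ids.size(0)
    # Concatenate the pre-specified token after the generated solution ...
    appended = torch.full((batch, 1), self_reward_token_id,
                          dtype=solution_ids.dtype)
    ids = torch.cat([solution_ids, appended], dim=1)
    # ... and run one separate forward pass over the extended sequence.
    with torch.no_grad():
        logits = model(ids)  # (batch, seq_len + 1, vocab)
    # The logits at position -2 predict the appended token at position -1.
    probs = torch.softmax(logits[:, -2, :], dim=-1)
    return probs[:, self_reward_token_id]
```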
## Acknowledgments
Our training code is mainly based on verl. Our training data is adapted from DeepMath-103K. We sincerely thank the contributors for open-sourcing their work!
## Citation
If you find our work helpful, please cite it as:
```bibtex
@article{yang2025laser,
  title={LaSeR: Reinforcement Learning with Last-Token Self-Rewarding},
  author={Yang, Wenkai and Liu, Weijie and Xie, Ruobing and Guo, Yiju and Wu, Lulu and Yang, Saiyong and Lin, Yankai},
  journal={arXiv preprint arXiv:2510.14943},
  year={2025}
}
```
