CoVerRL

[ACL 2026] CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution

Generate Convert Improve

Install / Use

/learn @ZJU-REAL/CoVerRL

About this skill

Quality Score

0/100

README

<div align="center"> <h1>CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution</h1> </div> <div align='center'>

Teng Pan1,2, Yuchen Yan1, Zixuan Wang1,2, Ruiqing Zhang2, Guiyang Hou1, Wenqi Zhang1, Weiming Lu1, Jun Xiao1, Yongliang Shen1,†

1Zhejiang University, 2Baidu Inc.
Preprint.
†Corresponding Author

</div> <img src="docs/static/images/arxiv_logo.png" alt="arXiv" height="14"> <a href="https://arxiv.org/abs/2603.17775">Arxiv</a> | 📑 <a href="https://zju-real.github.io/CoVerRL/">WebPage</a>

🔥 News

2026.04.06: Our paper has been accepted at ACL 2026 Main Conference 🎉🎉🎉!
2026.03.18: We release our paper.

📖 Overview

Label-free reinforcement learning for LLMs typically adopts majority voting to generate pseudo-labels, but suffers from a consensus trap—output diversity collapses during training, leading the model to confidently reinforce systematic self-consistent errors. To address this issue, we propose CoVerRL, a novel framework that unifies generator and verifier roles into a single model via multi-turn reinforcement learning, enabling their mutual bootstrapping and co-evolution without external ground-truth labels.

Our contributions can be summarized as follows:

We identify the consensus trap in majority voting based label-free RL, where diversity collapse causes reward accuracy degradation as models become overconfident in systematic errors, explaining why such methods eventually stagnate.
We propose CoVerRL, a co-evolution framework that unifies generation and verification into a multi-turn RL process, enabling mutual bootstrapping where each capability supervises improvement of the other without external labels.
We validate CoVerRL across Qwen and Llama model families, demonstrating 4-6% improvements over label-free baselines on mathematical reasoning benchmarks while producing verifiers that generalize well to held-out evaluation.

🚀 QuickStart

Preparation

This repository is based on verl v0.6.x branch. Please refer to <a href='https://verl.readthedocs.io/en/latest/start/install.html'>verl installation</a> for setup instructions. Additionally, install <a href='https://github.com/huggingface/Math-Verify'>Math-Verify</a> as the verifier: pip install math-verify. It is recommended to install swanlab or wandb to visualize the training dynamics. pip install swanlab

Before running the script, set the model path in it.

BACKBONE="your backbone"
BACKBONE_PATH="path to your backbone"

TTRL baseline

bash recipe/cover_rl/scripts/gpu/ttrl_baseline.sh

CoVerRL

bash recipe/cover_rl/scripts/gpu/cover_rl.sh

If you want to run with NPU, we also provide scripts in the "npu" folder, feel free to use it.

📊 Dataset

The training data is stored in verl/recipe/cover_rl/data/MATH-7500/math7500_train.parquet. And the validation data is stored in un. If you want to prepare your own dataset, refer to verl/recipe/cover_rl/data/preprocess.py

📈 Main results

Results are reported as Acc.@first / Acc.@final. CoVerRL consistently outperforms TTRL across all models and benchmarks, achieving average improvements of 5.7%, 5.9%, and 4.7% in Acc.@final for the three models respectively.

| Model | Method | MATH500 | AMC | AIME24 | GPQA | Average | |----------------------------|-------------------|----------------|----------------|----------------|----------------|----------------| | Qwen3-1.7B -Base | Base Model | 53.5 / 53.3 | 24.6 / 24.5 | 3.8 / 3.3 | 27.5 / 27.3 | 27.4 / 27.1 | | | TTRL | 65.1 / 65.0 | 31.1 / 30.9 | 5.2 / 5.2 | 30.9 / 30.7 | 33.1 / 33.0 | | | CoVerRL (Ours) | 69.0 / 71.9 | 36.0 / 38.6 | 9.8 / 10.6 | 32.9 / 33.6 | 36.9 / 38.7 | | | Δ | +3.9 / +6.9 | +4.9 / +7.7 | +4.6 / +5.4 | +2.0 / +2.9 | +3.8 / +5.7 | | Llama-3.2-3B -Instruct | Base Model | 42.7 / 41.0 | 17.0 / 15.7 | 4.6 / 5.0 | 26.9 / 26.1 | 22.8 / 22.0 | | | TTRL | 52.6 / 52.2 | 23.8 / 23.3 | 13.8 / 14.0 | 29.8 / 28.2 | 30.0 / 29.4 | | | CoVerRL (Ours) | 55.9 / 59.3 | 28.3 / 32.2 | 16.3 / 16.9 | 32.3 / 32.6 | 33.2 / 35.3 | | | Δ | +3.3 / +7.1 | +4.5 / +8.9 | +2.5 / +2.9 | +2.5 / +4.4 | +3.2 / +5.9 | | Qwen2.5-7B | Base Model | 50.1 / 51.4 | 25.5 / 26.4 | 5.2 / 6.5 | 29.9 / 29.7 | 27.7 / 28.5 | | | TTRL | 73.8 / 74.2 | 42.2 / 42.2 | 12.7 / 12.5 | 35.8 / 35.6 | 41.1 / 41.1 | | | CoVerRL (Ours) | 76.8 / 79.6 | 47.6 / 49.2 | 14.6 / 17.1 | 36.2 / 37.2 | 43.8 / 45.8 | | | Δ | +3.0 / +5.4 | +5.4 / +7.0 | +1.9 / +4.6 | +0.4 / +1.6 | +2.7 / +4.7 |

The figure below shows the training dynamics of reward/label accuracy for TTRL and CoVerRL on Qwen3-1.7B-Base. CoVerRL maintains reward accuracy above around 90% and boosts label accuracy via generator-verifier co-evolution, while TTRL faces reward accuracy degradation and stagnant label accuracy due to the consensus trap.

📄 Citation

If you find our work helpful, feel free to give us a cite.

@misc{pan2026coverrlbreakingconsensustrap,
      title={CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution}, 
      author={Teng Pan and Yuchen Yan and Zixuan Wang and Ruiqing Zhang and Gaiyang Han and Wanqi Zhang and Weiming Lu and Jun Xiao and Yongliang Shen},
      year={2026},
      eprint={2603.17775},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.17775}, 
}

🙏 Acknowledgement

The RL training stack is built on top of the excellent verl framework. Many thanks to the verl team for open-sourcing the infrastructure that this project extends.

📨 Contact Us

If you have any questions, please contact us by email: pt6@zju.edu.cn

Related Skills

node-connect

352.0k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

111.1k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

352.0k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

352.0k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。