GCPO
Group Contrastive Policy Optimazation. Read the paper on arXiv: ๐ https://arxiv.org/abs/2510.07790
Install / Use
/learn @AchoWu/GCPOREADME
GCPO
โก Group Contrastive Policy Optimization (GCPO)
Official repository of the paper: GCPO: When Contrast Fails, Go Gold
About
GCPO (Group Contrastive Policy Optimization) is a novel reinforcement learning algorithm designed to enhance the reasoning capabilities of language models, especially in scenarios where the model fails to generate correct responses. Unlike previous methods like GRPO, which rely solely on the modelโs own rollouts, GCPO introduces Golden Answers (GAs) โ external reference answers โ to guide the modelโs updates when all sampled responses are incorrect.
This approach ensures:
โ
Full sample utilization โ no training data is wasted
๐ง Knowledge transfer โ small models learn reasoning strategies from larger models
๐ Faster convergence and better generalization
๐ฏ Key Features
- โ Golden Answer Injection โ handles failure rollouts by injecting correct reference solutions
- โ๏ธ Sequence-Level Importance Sampling โ stabilizes training under sparse reward settings
- ๐ฅ Contrastive Optimization โ enhances separation between good and bad reasoning traces
- โจ No KL Penalty Needed โ encourages diverse yet effective reasoning behaviors
- ๐ Generalizable โ works on math, code, and logical QA tasks
๐ Coming Soon
| Item | Status | |------|--------| | Paper | โ Released | | Model Checkpoints | โ Released | | GCPO Dataset | โณ Coming soon | | Code (Training + Evaluation) | โณ Coming soon |
๐ ๏ธ Model Use
We provide the model weights of GCPO-R1-1.5B, which is trained based on DeepSeek-R1-Distill-Qwen-1.5B using the GCPO algorithm. You can find the model at https://huggingface.co/Ach0/GCPO-R1-1.5B.
โ๏ธ Evaluation
To evaluate the model on AIME 2024, run:
python3 vllm_eval.py --model_path Ach0/GCPO-R1-1.5B --test_file dataset/AIME24/aime_2024.jsonl --output_path aime2024_result.jsonl --tensor_parallel_size 4 --mode all
๐ GCPO Improves Reasoning Performance
GCPO consistently outperforms DAPO.
<p align="center"> <img src="assets/performance.png" alt="Performance Comparison" width="1000"> </p>๐ง GCPO Training Pipeline
<p align="center"> <img src="assets/gcpo_pipeline.png" alt="GCPO Pipeline" width="1200"> </p>โ๏ธ Citation
If you find this work useful, please cite:
@article{wu2025gcpo,
title={GCPO: When Contrast Fails, Go Gold},
author={Hao Wu and Wei Liu},
journal={arXiv preprint arXiv:2510.07790},
year={2025}
}
Related Skills
node-connect
344.1kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
96.8kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
344.1kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
344.1kQQBot ๅฏๅชไฝๆถๅ่ฝๅใไฝฟ็จ <qqmedia> ๆ ็ญพ๏ผ็ณป็ปๆ นๆฎๆไปถๆฉๅฑๅ่ชๅจ่ฏๅซ็ฑปๅ๏ผๅพ็/่ฏญ้ณ/่ง้ข/ๆไปถ๏ผใ
