
GCPO

Group Contrastive Policy Optimization. Read the paper on arXiv: 👉 https://arxiv.org/abs/2510.07790

Install / Use

/learn @AchoWu/GCPO


⚡ Group Contrastive Policy Optimization (GCPO)


Official repository of the paper: GCPO: When Contrast Fails, Go Gold


About

GCPO (Group Contrastive Policy Optimization) is a novel reinforcement learning algorithm designed to enhance the reasoning capabilities of language models, especially in scenarios where the model fails to generate correct responses. Unlike previous methods such as GRPO, which rely solely on the model's own rollouts, GCPO introduces Golden Answers (GAs), external reference answers that guide the model's updates when all sampled responses are incorrect.

This approach ensures:

✅ Full sample utilization: no training data is wasted
🧠 Knowledge transfer: small models learn reasoning strategies from larger models
🚀 Faster convergence and better generalization
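
The Golden Answer mechanism can be sketched in a few lines. This is an illustrative sketch of the idea described above, not the official implementation; the function name `build_group` and the binary 0/1 reward convention are our assumptions.

```python
import random

def build_group(rollouts, rewards, golden_answer):
    """Sketch of GCPO-style group construction: if every sampled
    rollout earned zero reward, inject the external Golden Answer so
    the group still contains a positive example to learn from."""
    if all(r == 0 for r in rewards):
        rollouts, rewards = list(rollouts), list(rewards)
        # Replace one failed rollout with the golden reference answer.
        idx = random.randrange(len(rollouts))
        rollouts[idx] = golden_answer
        rewards[idx] = 1.0  # assumed binary 0/1 reward convention
    return rollouts, rewards
```

The point of the injection: with a group-relative advantage as in GRPO, an all-zero-reward group yields zero gradient, whereas the injected Golden Answer restores a contrast within the group.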


🎯 Key Features

  • ✅ Golden Answer Injection: handles failed rollouts by injecting correct reference solutions
  • ⚖️ Sequence-Level Importance Sampling: stabilizes training under sparse-reward settings
  • 🔥 Contrastive Optimization: sharpens the separation between good and bad reasoning traces
  • ✨ No KL Penalty Needed: encourages diverse yet effective reasoning behaviors
  • 📚 Generalizable: works on math, code, and logical QA tasks
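
As a rough illustration of the sequence-level idea (our reading of the feature above; the exact formulation in the paper may differ), the importance ratio is formed once per sequence from summed log-probabilities rather than clipped token by token:

```python
import math

def sequence_level_ratio(new_logps, old_logps, clip_eps=0.2):
    """Form a single importance ratio for a whole sequence from summed
    per-token log-probabilities, then clip it once. PPO-style
    objectives would instead clip each per-token ratio separately;
    clip_eps=0.2 is an assumed hyperparameter, not the paper's."""
    log_ratio = sum(new_logps) - sum(old_logps)
    ratio = math.exp(log_ratio)
    # Clip once at the sequence level.
    return max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
```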

🚀 Coming Soon

| Item | Status |
|------|--------|
| Paper | ✅ Released |
| Model Checkpoints | ✅ Released |
| GCPO Dataset | ⏳ Coming soon |
| Code (Training + Evaluation) | ⏳ Coming soon |


๐Ÿ› ๏ธ Model Use

We provide the model weights of GCPO-R1-1.5B, trained from DeepSeek-R1-Distill-Qwen-1.5B with the GCPO algorithm. The model is available at https://huggingface.co/Ach0/GCPO-R1-1.5B.

โš–๏ธ Evaluation

To evaluate the model on AIME 2024, run:

```bash
python3 vllm_eval.py --model_path Ach0/GCPO-R1-1.5B --test_file dataset/AIME24/aime_2024.jsonl --output_path aime2024_result.jsonl --tensor_parallel_size 4 --mode all
```

📊 GCPO Improves Reasoning Performance

GCPO consistently outperforms DAPO.

<p align="center"> <img src="assets/performance.png" alt="Performance Comparison" width="1000"> </p>

🔧 GCPO Training Pipeline

<p align="center"> <img src="assets/gcpo_pipeline.png" alt="GCPO Pipeline" width="1200"> </p>

โœ๏ธ Citation

If you find this work useful, please cite:

```bibtex
@article{wu2025gcpo,
  title={GCPO: When Contrast Fails, Go Gold},
  author={Hao Wu and Wei Liu},
  journal={arXiv preprint arXiv:2510.07790},
  year={2025}
}
```
