
GCPO

Group Contrastive Policy Optimization. Read the paper on arXiv: 👉 https://arxiv.org/abs/2510.07790

Install / Use

/learn @AchoWu/GCPO


⚡ Group Contrastive Policy Optimization (GCPO)


Official repository of the paper: GCPO: When Contrast Fails, Go Gold


About

GCPO (Group Contrastive Policy Optimization) is a novel reinforcement learning algorithm designed to enhance the reasoning capabilities of language models, especially in scenarios where the model fails to generate correct responses. Unlike previous methods such as GRPO, which rely solely on the model's own rollouts, GCPO introduces Golden Answers (GAs), external reference answers that guide the model's updates when all sampled responses are incorrect.

This approach ensures:

✅ Full sample utilization: no training data is wasted
🧠 Knowledge transfer: small models learn reasoning strategies from larger models
🚀 Faster convergence and better generalization
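
The Golden Answer mechanism can be sketched in a few lines. This is an illustrative sketch of the idea described above, not the official implementation; the function name `build_group` and the binary 0/1 reward convention are our assumptions.

```python
import random

def build_group(rollouts, rewards, golden_answer):
    """Sketch of GCPO-style group construction: if every sampled
    rollout earned zero reward, inject the external Golden Answer so
    the group still contains a positive example to learn from."""
    if all(r == 0 for r in rewards):
        rollouts, rewards = list(rollouts), list(rewards)
        # Replace one failed rollout with the golden reference answer.
        idx = random.randrange(len(rollouts))
        rollouts[idx] = golden_answer
        rewards[idx] = 1.0  # assumed binary 0/1 reward convention
    return rollouts, rewards
```

The point of the injection: with a group-relative advantage as in GRPO, an all-zero-reward group yields zero gradient, whereas the injected Golden Answer restores a contrast within the group.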


🎯 Key Features

  • ✅ Golden Answer Injection: handles failed rollouts by injecting correct reference solutions
  • ⚖️ Sequence-Level Importance Sampling: stabilizes training under sparse-reward settings
  • 🔥 Contrastive Optimization: sharpens the separation between good and bad reasoning traces
  • ✨ No KL Penalty Needed: encourages diverse yet effective reasoning behaviors
  • 📚 Generalizable: works on math, code, and logical QA tasks
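
As a rough illustration of the sequence-level idea (our reading of the feature above; the exact formulation in the paper may differ), the importance ratio is formed once per sequence from summed log-probabilities rather than clipped token by token:

```python
import math

def sequence_level_ratio(new_logps, old_logps, clip_eps=0.2):
    """Form a single importance ratio for a whole sequence from summed
    per-token log-probabilities, then clip it once. PPO-style
    objectives would instead clip each per-token ratio separately;
    clip_eps=0.2 is an assumed hyperparameter, not the paper's."""
    log_ratio = sum(new_logps) - sum(old_logps)
    ratio = math.exp(log_ratio)
    # Clip once at the sequence level.
    return max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
```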

🚀 Coming Soon

| Item | Status |
|------|--------|
| Paper | ✅ Released |
| Model Checkpoints | ✅ Released |
| GCPO Dataset | ⏳ Coming soon |
| Code (Training + Evaluation) | ⏳ Coming soon |


๐Ÿ› ๏ธ Model Use

We provide the model weights of GCPO-R1-1.5B, trained from DeepSeek-R1-Distill-Qwen-1.5B with the GCPO algorithm. The model is available at https://huggingface.co/Ach0/GCPO-R1-1.5B.

โš–๏ธ Evaluation

To evaluate the model on AIME 2024, run:

```bash
python3 vllm_eval.py --model_path Ach0/GCPO-R1-1.5B --test_file dataset/AIME24/aime_2024.jsonl --output_path aime2024_result.jsonl --tensor_parallel_size 4 --mode all
```

📊 GCPO Improves Reasoning Performance

GCPO consistently outperforms DAPO.

<p align="center"> <img src="assets/performance.png" alt="Performance Comparison" width="1000"> </p>

🔧 GCPO Training Pipeline

<p align="center"> <img src="assets/gcpo_pipeline.png" alt="GCPO Pipeline" width="1200"> </p>

โœ๏ธ Citation

If you find this work useful, please cite:

```bibtex
@article{wu2025gcpo,
  title={GCPO: When Contrast Fails, Go Gold},
  author={Hao Wu and Wei Liu},
  journal={arXiv preprint arXiv:2510.07790},
  year={2025}
}
```
