# CPRet

[NeurIPS'25] CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming
Email contact: 2317757009@qq.com
## 🌐 Try Online Demo
We provide an online demo of the CPRet retrieval service, available at:
This demo can assist in duplicate problem detection by retrieving potentially similar problems, though final identification still requires manual verification.
It also supports similar problem retrieval to help broaden your problem-solving perspective.
You can input either a full problem description or a simplified version, and the system will return the most relevant existing problems.
You can refer to the usage examples of the retrieval platform at: https://github.com/coldchair/CPRet/blob/main/TestCases.md
It runs the same codebase and embedding model as the local deployment (see below), so you can preview its capabilities before setting up your own instance.
## 🚀 News
### Oct 2025: CPRetriever-Prob-Qwen3-4B-2510 released with enhanced retrieval performance!
We're excited to announce a major update to the CPRetriever model series! The new CPRetriever-Prob-Qwen3-4B-2510 model, trained on top of Qwen3-Embedding-4B (released in June 2025), achieves state-of-the-art results on problem-related retrieval tasks. We've also updated our website's retrieval problem database to the latest Oct 2025 version.
Here's a comparison of model performance:
| model | type | size | Text-to-Code | Code-to-Code | Problem-to-Duplicate | Simplified-to-Full | Avg |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CPRetriever-Code | code | 2B | 70.40 | 70.59 | 38.68 | 81.45 | 65.28 |
| CPRetriever-Prob | code | 2B | 56.50 | 70.68 | 60.06 | 90.74 | 69.50 |
| CPRetriever-Prob-Qwen3-4B | code | 4B | 65.85 | 70.19 | 71.45 | 95.03 | 75.63 |
| CPRetriever-Prob-Qwen3-4B-2510 | code | 4B | 80.84 | 87.10 | 74.33 | 96.15 | 84.61 |
The CPRetriever-Prob-Qwen3-4B-2510 model follows the same training procedure and dataset as CPRetriever-Prob-Qwen3-4B, but was retrained in October 2025 with adjusted data proportions, an extended maximum sequence length of 2048, and optimized hyperparameters for improved performance.
### Sept 2025: 🎉 Our paper has been accepted to the NeurIPS 2025 D&B Track!
## 📌 Overview
CPRet is a comprehensive suite for competitive programming retrieval research, consisting of:
- A large-scale dataset and benchmark for retrieval tasks in coding contests.
- A dual-stage training pipeline with contrastive pretraining and task-specific fine-tuning.
- A local retrieval server for simplified description and duplicate problem search, powered by our trained model CPRetriever-Prob-Qwen3-4B-2510.
We define the following four core retrieval tasks to support both practical applications and academic benchmarking:
- Text-to-Code (T2C): Retrieve relevant code given a natural language problem description.
- Code-to-Code (C2C): Retrieve other implementations of the same problem based on a given solution.
- Problem-to-Duplicate (P2D): Detect duplicate or near-duplicate problems from existing contest archives.
- Simplified-to-Full (S2F): Retrieve the original full version of a simplified problem.
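All four tasks boil down to dense retrieval: queries and candidates are embedded into a shared vector space and ranked by cosine similarity. A minimal sketch of that scoring step, using made-up toy vectors in place of real CPRetriever embeddings:

```python
import numpy as np

# Toy corpus embeddings (one row per problem). In the real system these
# would come from a CPRetriever model; here they are invented for illustration.
corpus = np.array([
    [0.9, 0.1, 0.0],   # problem A
    [0.1, 0.9, 0.0],   # problem B
    [0.0, 0.1, 0.9],   # problem C
], dtype=np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = np.array([0.8, 0.2, 0.0], dtype=np.float32)
query /= np.linalg.norm(query)

# On L2-normalized vectors, cosine similarity is just a dot product.
scores = corpus @ query
ranking = np.argsort(-scores)      # best match first
print(ranking[0])                  # -> 0 (problem A is most similar)
```

The same ranking loop serves T2C, C2C, P2D, and S2F; only what gets embedded on each side changes.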
## 🧰 Repository Contents
- `cp-retrieval-server/`: Code for running a local retrieval web service.
- `stage1/`: Code for stage-1 contrastive pretraining.
- `stage2/`: Code for stage-2 problem-level fine-tuning.
## ⚙️ Setup

### Environment
- Recommended: `python >= 3.10`
- Install dependencies: `pip install -r requirements.txt`
- Install PyTorch (with CUDA support if needed): refer to https://pytorch.org/get-started/locally/. PyTorch ≥ 2.0 is recommended.
## 🔁 Accessing Hugging Face from Restricted Regions
If you're experiencing connectivity issues with Hugging Face, consider using the official mirror:

```python
import os

# Must be set before importing transformers / huggingface_hub.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
```

Or set it as an environment variable:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```
## 🚀 Run Local Retrieval Service
1. **Download embeddings:**

   Run `cp-retrieval-server/download.py` to download the problems and embeddings.

   - If you are using the new model, `CPRetriever-Prob-Qwen3-4B-2510`, download the following files from the HF dataset CPRet-Embeddings into the `cp-retrieval-server/` directory: `probs_2603.jsonl`, `probs_2603_embs.npy`.
   - If you are using the old model, `CPRetriever-Prob`, download the following files from the HF dataset CPRet-Embeddings into the `cp-retrieval-server/` directory: `probs.jsonl`, `probs_embs.npy`.

2. **Start the service:**

   ```bash
   cd cp-retrieval-server
   ```

   If you're using the old model (`CPRetriever-Prob`), set the following environment variables before starting the service:

   ```bash
   export MODEL_PATH=coldchair16/CPRetriever-Prob
   export EMB_PATH=./probs_embs.npy
   export PROB_PATH=./probs.jsonl
   ```

   Note: bf16 is enabled by default. If your device does not support it, set the environment variable `BF_16=0`:

   ```bash
   export BF_16=0
   ```

   Then, run:

   ```bash
   python app.py
   ```
**About the dataset:**

The current retrieval problem database (as of Mar 2026) includes problems from the following online judges:

The data is collected up to Mar 2026. You can add your own data source and generate embeddings with `compute_embs.py`; running this for the current database takes approximately 4 GPU-hours on an H100. If you have access to a larger or more diverse problem dataset, we welcome contributions and are happy to update the collection: feel free to contact us (2317757009@qq.com) or open an issue/pull request.
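Judging by the file names, the server pairs a JSONL file of problems with a `.npy` embedding matrix, row *i* matching line *i*. A small self-contained sketch of that assumed layout (the schema and field names here are guesses for illustration, not the repo's actual format):

```python
import json
import os
import tempfile

import numpy as np

# Hypothetical miniature versions of the server's data files.
problems = [
    {"title": "Two Sum", "url": "https://example.com/1"},
    {"title": "Longest Path", "url": "https://example.com/2"},
]
embs = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)

tmp = tempfile.mkdtemp()
jsonl_path = os.path.join(tmp, "probs.jsonl")
npy_path = os.path.join(tmp, "probs_embs.npy")
with open(jsonl_path, "w") as f:
    for p in problems:
        f.write(json.dumps(p) + "\n")
np.save(npy_path, embs)

# Loading mirrors what a retrieval server would do at startup.
with open(jsonl_path) as f:
    loaded = [json.loads(line) for line in f]
matrix = np.load(npy_path)
assert len(loaded) == matrix.shape[0]  # every problem has an embedding row

query = np.array([0.9, 0.1], dtype=np.float32)
best = int(np.argmax(matrix @ query))
print(loaded[best]["title"])  # -> Two Sum
```

Keeping metadata and embeddings aligned by row index like this avoids any database dependency: a custom data source only needs to append matching rows to both files.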
**System requirements:**

This service can run on CPU or GPU, depending on your environment. We recommend the following memory for smooth operation:

- For the 2B old models (e.g., `CPRetriever-Prob`): at least 16 GB of system memory or GPU VRAM.
- For the 4B new model (`CPRetriever-Prob-Qwen3-4B-2510`): 32 GB or more of system memory or GPU VRAM.

The above requirements are for fp32; if the device supports bf16, only about half of the memory/VRAM is needed.

Typical query latency (inference time depends on the input length):

- On CPU (8 cores): 10–20 seconds.
- On GPU (e.g., A800): 0.1–1 seconds.
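The fp32-vs-bf16 halving follows directly from bytes per parameter. A back-of-the-envelope check for the weights alone (this deliberately ignores activations, caches, and framework overhead, which is why the recommended totals above are higher):

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30

# 4B-parameter model: fp32 uses 4 bytes/param, bf16 uses 2 bytes/param.
fp32 = weight_memory_gib(4e9, 4)
bf16 = weight_memory_gib(4e9, 2)
print(round(fp32, 1), round(bf16, 1))  # -> 14.9 7.5
```

The same arithmetic puts the 2B models at roughly 7.5 GiB of weights in fp32, consistent with the 16 GB recommendation once runtime overhead is added.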
## 🏋️‍♀️ Training Instructions
⚠️ Note: Recommended GPU memory ≥ 50 GB to avoid OOM.
### 🔧 Stage 1: Contrastive Pretraining
```bash
cd stage1
torchrun --nproc_per_node=8 train.py
```

- Change `--nproc_per_node` to match the number of available GPUs.
- Use `--help` to see all configurable hyperparameters.
#### ⚠️ Note on Using Salesforce/SFR-Embedding-Code-2B_R

If you are using `Salesforce/SFR-Embedding-Code-2B_R` as your encoder, make sure to manually disable `device_map="auto"` when loading the model.

The original code might look like this:

```python
self.model = Gemma2Model.from_pretrained(config._name_or_path, trust_remote_code=True, is_causal=False, device_map="auto")
self.tokenizer = AutoTokenizer.from_pretrained(config._name_or_path, trust_remote_code=True, device_map="auto")
```

This setting can cause the model to skip training due to automatic device placement. Please change it to:

```python
self.model = Gemma2Model.from_pretrained(config._name_or_path, trust_remote_code=True, is_causal=False, device_map=None)
self.tokenizer = AutoTokenizer.from_pretrained(config._name_or_path, trust_remote_code=True, device_map=None)
```

Alternatively, you can directly copy the patched file from our repo: 👉 modeling_gemma2.py
### Stage 2: Problem-Level Fine-Tuning
```bash
cd stage2
torchrun --nproc_per_node=1 train.py
```

- Also supports `--help` to inspect all args.
### 🔧 Notable Hyperparameters

- `--model_path`: Can be either an HF model repo (e.g. `coldchair16/CPRetriever-Code`) or a local directory supporting SentenceTransformer.
- `--eval_only True`: Run evaluation without training.
## 📫 Citation & License
If you find CPRet useful in your research or applications, please consider citing our paper:
```bibtex
@misc{deng2025cpretdatasetbenchmarkmodel,
  title = {CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming},
}
```
