# CPRet

[NeurIPS'25] CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming
Email contact: 2317757009@qq.com
## 🌐 Try Online Demo
We provide an online demo of the CPRet retrieval service, available at:
This demo can assist in duplicate problem detection by retrieving potentially similar problems, though final identification still requires manual verification.
It also supports similar problem retrieval to help broaden your problem-solving perspective.
You can input either a full problem description or a simplified version, and the system will return the most relevant existing problems.
You can refer to the usage examples of the retrieval platform at: https://github.com/coldchair/CPRet/blob/main/TestCases.md
It runs the same codebase and embedding model as the local deployment (see below), so you can preview its capabilities before setting up your own instance.
## 🚀 News
### Oct 2025: CPRetriever-Prob-Qwen3-4B-2510 released with enhanced retrieval performance!
We're excited to announce a major update to the CPRetriever model series! The new CPRetriever-Prob-Qwen3-4B-2510 model, trained on top of Qwen3-Embedding-4B (released in June 2025), achieves state-of-the-art results on problem-related retrieval tasks. We've also updated our website's retrieval problem database to the latest Oct 2025 version.
Here's a comparison of model performance:
| model | type | size | Text-to-Code | Code-to-Code | Problem-to-Duplicate | Simplified-to-Full | Avg |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CPRetriever-Code | code | 2B | 70.40 | 70.59 | 38.68 | 81.45 | 65.28 |
| CPRetriever-Prob | code | 2B | 56.50 | 70.68 | 60.06 | 90.74 | 69.50 |
| CPRetriever-Prob-Qwen3-4B | code | 4B | 65.85 | 70.19 | 71.45 | 95.03 | 75.63 |
| CPRetriever-Prob-Qwen3-4B-2510 | code | 4B | 80.84 | 87.10 | 74.33 | 96.15 | 84.61 |
The CPRetriever-Prob-Qwen3-4B-2510 model follows the same training procedure and dataset as CPRetriever-Prob-Qwen3-4B, but was retrained in October 2025 with adjusted data proportions, an extended maximum sequence length of 2048, and optimized hyperparameters for improved performance.
### Sept 2025: 🎉 Our paper has been accepted to the NeurIPS 2025 D&B Track!
## 📌 Overview
CPRet is a comprehensive suite for competitive programming retrieval research, consisting of:
- A large-scale dataset and benchmark for retrieval tasks in coding contests.
- A dual-stage training pipeline with contrastive pretraining and task-specific fine-tuning.
- A local retrieval server for simplified description and duplicate problem search, powered by our trained model CPRetriever-Prob-Qwen3-4B-2510.
We define the following four core retrieval tasks to support both practical applications and academic benchmarking:
- Text-to-Code (T2C): Retrieve relevant code given a natural language problem description.
- Code-to-Code (C2C): Retrieve other implementations of the same problem based on a given solution.
- Problem-to-Duplicate (P2D): Detect duplicate or near-duplicate problems from existing contest archives.
- Simplified-to-Full (S2F): Retrieve the original full version of a simplified problem.
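All four tasks boil down to dense retrieval: queries and candidates are embedded into a shared vector space and ranked by cosine similarity. A minimal sketch of that scoring step, using made-up toy vectors in place of real CPRetriever embeddings:

```python
import numpy as np

# Toy corpus embeddings (one row per problem). In the real system these
# would come from a CPRetriever model; here they are invented for illustration.
corpus = np.array([
    [0.9, 0.1, 0.0],   # problem A
    [0.1, 0.9, 0.0],   # problem B
    [0.0, 0.1, 0.9],   # problem C
], dtype=np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = np.array([0.8, 0.2, 0.0], dtype=np.float32)
query /= np.linalg.norm(query)

# On L2-normalized vectors, cosine similarity is just a dot product.
scores = corpus @ query
ranking = np.argsort(-scores)      # best match first
print(ranking[0])                  # -> 0 (problem A is most similar)
```

The same ranking loop serves T2C, C2C, P2D, and S2F; only what gets embedded on each side changes.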
## 🧰 Repository Contents
- `cp-retrieval-server/`: Code for running a local retrieval web service.
- `stage1/`: Code for stage-1 contrastive pretraining.
- `stage2/`: Code for stage-2 problem-level fine-tuning.
## ⚙️ Setup

### Environment
- Recommended: `python >= 3.10`
- Install dependencies: `pip install -r requirements.txt`
- Install PyTorch (with CUDA support if needed): refer to https://pytorch.org/get-started/locally/. PyTorch ≥ 2.0 is recommended.
## 🔁 Accessing Hugging Face from Restricted Regions
If you're experiencing connectivity issues with Hugging Face, consider using the official mirror:

```python
import os

# Must be set before importing transformers / huggingface_hub.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
```

Or set it as an environment variable:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```
## 🚀 Run Local Retrieval Service
1. **Download embeddings:**

   Run `cp-retrieval-server/download.py` to download the problems and embeddings.

   - If you are using the new model, `CPRetriever-Prob-Qwen3-4B-2510`, download the following files from the HF dataset CPRet-Embeddings into the `cp-retrieval-server/` directory: `probs_2603.jsonl`, `probs_2603_embs.npy`.
   - If you are using the old model, `CPRetriever-Prob`, download the following files from the HF dataset CPRet-Embeddings into the `cp-retrieval-server/` directory: `probs.jsonl`, `probs_embs.npy`.

2. **Start the service:**

   ```bash
   cd cp-retrieval-server
   ```

   If you're using the old model (`CPRetriever-Prob`), set the following environment variables before starting the service:

   ```bash
   export MODEL_PATH=coldchair16/CPRetriever-Prob
   export EMB_PATH=./probs_embs.npy
   export PROB_PATH=./probs.jsonl
   ```

   Note: bf16 is enabled by default. If your device does not support it, set the environment variable `BF_16=0`:

   ```bash
   export BF_16=0
   ```

   Then, run:

   ```bash
   python app.py
   ```
**About the dataset:**

The current retrieval problem database (as of Mar 2026) includes problems from the following online judges:

The data is collected up to Mar 2026. You can add your own data source and generate embeddings with `compute_embs.py`; running this for the current database takes approximately 4 GPU-hours on an H100. If you have access to a larger or more diverse problem dataset, we welcome contributions and are happy to update the collection: feel free to contact us (2317757009@qq.com) or open an issue/pull request.
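Judging by the file names, the server pairs a JSONL file of problems with a `.npy` embedding matrix, row *i* matching line *i*. A small self-contained sketch of that assumed layout (the schema and field names here are guesses for illustration, not the repo's actual format):

```python
import json
import os
import tempfile

import numpy as np

# Hypothetical miniature versions of the server's data files.
problems = [
    {"title": "Two Sum", "url": "https://example.com/1"},
    {"title": "Longest Path", "url": "https://example.com/2"},
]
embs = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)

tmp = tempfile.mkdtemp()
jsonl_path = os.path.join(tmp, "probs.jsonl")
npy_path = os.path.join(tmp, "probs_embs.npy")
with open(jsonl_path, "w") as f:
    for p in problems:
        f.write(json.dumps(p) + "\n")
np.save(npy_path, embs)

# Loading mirrors what a retrieval server would do at startup.
with open(jsonl_path) as f:
    loaded = [json.loads(line) for line in f]
matrix = np.load(npy_path)
assert len(loaded) == matrix.shape[0]  # every problem has an embedding row

query = np.array([0.9, 0.1], dtype=np.float32)
best = int(np.argmax(matrix @ query))
print(loaded[best]["title"])  # -> Two Sum
```

Keeping metadata and embeddings aligned by row index like this avoids any database dependency: a custom data source only needs to append matching rows to both files.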
**System requirements:**

This service can run on CPU or GPU, depending on your environment. We recommend the following memory for smooth operation:

- For the 2B old models (e.g., `CPRetriever-Prob`): at least 16 GB of system memory or GPU VRAM.
- For the 4B new model (`CPRetriever-Prob-Qwen3-4B-2510`): 32 GB or more of system memory or GPU VRAM.

The above requirements are for fp32; if the device supports bf16, only about half of the memory/VRAM is needed.

Typical query latency (inference time depends on the input length):

- On CPU (8 cores): 10–20 seconds.
- On GPU (e.g., A800): 0.1–1 seconds.
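The fp32-vs-bf16 halving follows directly from bytes per parameter. A back-of-the-envelope check for the weights alone (this deliberately ignores activations, caches, and framework overhead, which is why the recommended totals above are higher):

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30

# 4B-parameter model: fp32 uses 4 bytes/param, bf16 uses 2 bytes/param.
fp32 = weight_memory_gib(4e9, 4)
bf16 = weight_memory_gib(4e9, 2)
print(round(fp32, 1), round(bf16, 1))  # -> 14.9 7.5
```

The same arithmetic puts the 2B models at roughly 7.5 GiB of weights in fp32, consistent with the 16 GB recommendation once runtime overhead is added.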
## 🏋️‍♀️ Training Instructions
⚠️ Note: Recommended GPU memory ≥ 50 GB to avoid OOM.
### 🔧 Stage 1: Contrastive Pretraining
```bash
cd stage1
torchrun --nproc_per_node=8 train.py
```

- Change `--nproc_per_node` to match the number of available GPUs.
- Use `--help` to see all configurable hyperparameters.
#### ⚠️ Note on Using Salesforce/SFR-Embedding-Code-2B_R

If you are using `Salesforce/SFR-Embedding-Code-2B_R` as your encoder, make sure to manually disable `device_map="auto"` when loading the model.

The original code might look like this:

```python
self.model = Gemma2Model.from_pretrained(config._name_or_path, trust_remote_code=True, is_causal=False, device_map="auto")
self.tokenizer = AutoTokenizer.from_pretrained(config._name_or_path, trust_remote_code=True, device_map="auto")
```

This setting can cause the model to skip training due to automatic device placement. Please change it to:

```python
self.model = Gemma2Model.from_pretrained(config._name_or_path, trust_remote_code=True, is_causal=False, device_map=None)
self.tokenizer = AutoTokenizer.from_pretrained(config._name_or_path, trust_remote_code=True, device_map=None)
```

Alternatively, you can directly copy the patched file from our repo: 👉 modeling_gemma2.py
### Stage 2: Problem-Level Fine-Tuning
```bash
cd stage2
torchrun --nproc_per_node=1 train.py
```

- Also supports `--help` to inspect all args.
### 🔧 Notable Hyperparameters

- `--model_path`: Can be either an HF model repo (e.g. `coldchair16/CPRetriever-Code`) or a local directory supporting SentenceTransformer.
- `--eval_only True`: Run evaluation without training.
## 📫 Citation & License
If you find CPRet useful in your research or applications, please consider citing our paper:
```bibtex
@misc{deng2025cpretdatasetbenchmarkmodel,
  title = {CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming},
}
```
