CodeElo

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

Generate Convert Improve

Install / Use

/learn @QwenLM/CodeElo

About this skill

Quality Score

0/100

README

CodeElo

This repository is used to evaluate a model's competition-level code generation abilities on CodeForces with human-comparable Elo ratings and percentiles among humans, using the method proposed in CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings.

[!IMPORTANT] We have open-sourced all of the Elo calculation logic and ranking methods. The BASE_URL provided here points to our automated submission system. In order to prevent meaningless mass submissions and to comply with CodeForces policies, we require verified submissions. Due to ethical considerations, you need to agree to the AGREEMENT to obtain a TOKEN and BASE_URL to use the repository. Please fill in the blanks and email the letter to binyuan.hby@alibaba-inc.com, and we will review it and respond as soon as possible. If you prefer not to use our automated system, you are free to implement your own submission mechanism by configuring the interfaces in api.py.

Quick Start

Send a request via email to obtain your access TOKEN, then set TOKEN variable in environment.

export TOKEN="your_actual_token" # replace with your actual token
export BASE_URL="your_base_url" # replace with base url

To test a local model, you need first host an LLM server. Here's an example:
```
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct
```
If you're testing models via a third-party API, you can modify the get_response function with your custom calling method in llm_client.
To test the model, use the following command:
```
python main.py --model Qwen/Qwen2.5-Coder-7B-Instruct \
    --bid 2000 --eid 2030
```
This command will test all eligible contests with IDs ranging from 2000 to 2030.

Citation

@article{quan2025codeelo,
  title={CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings},
  author={Quan, Shanghaoran and Yang, Jiaxi and Yu, Bowen and Zheng, Bo and Liu, Dayiheng and Yang, An and Ren, Xuancheng and Gao, Bofei and Miao, Yibo and Feng, Yunlong and others},
  journal={arXiv preprint arXiv:2501.01257},
  year={2025}
}

Related Skills

node-connect

351.2k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

110.6k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

351.2k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

351.2k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。