Mercury: A Code Efficiency Benchmark for Code Large Language Models 🪐
- Welcome to Mercury!
- Mercury is the first code efficiency benchmark designed for LLM code synthesis tasks.
- It consists of 1,889 programming tasks covering diverse difficulty levels, along with test case generators that produce unlimited cases for comprehensive evaluation.
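To make the idea of a test case generator concrete, here is a minimal Python sketch. It is not Mercury's actual generator: the two-sum task, the function names `generate_case`, `two_sum`, and `time_solution`, and the size parameters are all hypothetical. They only illustrate how a seeded generator can produce as many cases as needed and how a solution's runtime can be measured on them.

```python
import random
import time


def generate_case(rng, size=1000):
    """Return (nums, target) for a hypothetical two-sum task with a guaranteed answer."""
    nums = [rng.randint(-10**6, 10**6) for _ in range(size)]
    i, j = rng.sample(range(size), 2)  # two distinct positions
    return nums, nums[i] + nums[j]


def two_sum(nums, target):
    """Hash-map solution: indices of two entries that sum to target."""
    seen = {}
    for idx, value in enumerate(nums):
        if target - value in seen:
            return seen[target - value], idx
        seen[value] = idx
    return None


def time_solution(solution, cases):
    """Check every case for correctness and return the total wall-clock runtime."""
    total = 0.0
    for nums, target in cases:
        start = time.perf_counter()
        result = solution(nums, target)
        total += time.perf_counter() - start
        assert result is not None
        i, j = result
        assert i != j and nums[i] + nums[j] == target
    return total


rng = random.Random(0)  # fixed seed so the generated cases are reproducible
cases = [generate_case(rng) for _ in range(20)]
elapsed = time_solution(two_sum, cases)
print(f"20 cases passed in {elapsed:.4f} s")
```

Because the generator takes its own random source, raising the case count or the input size is a one-line change, which is what makes generated cases useful for separating fast solutions from slow ones.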
[March 6, 2026] Released Mercury_Eval for Mercury evaluation!
[October 8, 2024] Mercury has been accepted to NeurIPS 2024 🌟
[September 20, 2024] We released Venus, a much larger dataset that supports more languages and measures memory usage in addition to runtime.
[July 10, 2024] We are now building Code Arena for more efficient evaluation of Code LLMs!
[June 24, 2024] We are currently working on the Multilingual Mercury 🚧
[May 26, 2024] Mercury is now available on BigCode 🌟
Mercury Datasets Access
We publish and maintain our datasets at Mercury@HF
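A minimal sketch of loading the data with the Hugging Face `datasets` library follows. The repository id `Elfsong/Mercury`, the split name `eval`, and the `difficulty` field are assumptions that this README does not state, so check the dataset card before relying on them. The summary helper accepts any iterable of dict-like rows, so it can be tried without downloading anything.

```python
from collections import Counter


def count_by_difficulty(rows):
    """Count tasks per difficulty level; rows without the field are grouped as 'unknown'."""
    return Counter(row.get("difficulty", "unknown") for row in rows)


def load_mercury(split="eval"):
    """Download one split from the Hugging Face Hub (requires `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily so the helper above has no dependency

    # Assumed repository id and split name; not confirmed by this README.
    return load_dataset("Elfsong/Mercury", split=split)


# Offline demonstration with hand-written rows standing in for real tasks.
sample_rows = [
    {"difficulty": "Easy"},
    {"difficulty": "Hard"},
    {"difficulty": "Easy"},
    {},
]
print(count_by_difficulty(sample_rows))
```

With network access, and if the assumed names hold, `count_by_difficulty(load_mercury())` would summarize the real split in the same way.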

How to use Mercury Evaluation
# Option 0 (Mercury Evaluation)
git clone https://github.com/Elfsong/Mercury_Eval.git
cd Mercury_Eval
uv sync --extra all
# Evaluate with a specific model (backend auto-detected)
mercury-eval gpt-4.1 # full evaluation
mercury-eval gemini-2.5-pro --timeout 120 # with a custom timeout
mercury-eval Qwen/Qwen2.5-Coder-32B-Instruct --limit 20 # limit the number of tasks evaluated
# Option 1 (with BigCode):
# See https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs#mercury
accelerate launch --main_process_port 30003 main.py \
    --model bigcode/starcoder2-7b \
    --load_in_4bit \
    --max_length_generation 2048 \
    --tasks mercury \
    --n_samples 5 \
    --temperature 0.2 \
    --batch_size 5 \
    --allow_code_execution \
    --save_generations \
    --metric_output_path starcoder2-7b-mercury-result.json
# Option 2 (this library):
import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_KEY'
# Instantiate the evaluator with a model name.
# Set do_generate=True to load the language model when the evaluator is initialized.
from src import evaluator as Evaluator
evaluator = Evaluator.DistributeWiseEvaluator(model_name_or_path='openai/gpt-3.5-turbo-1106', do_generate=True)
# Generate code samples
evaluator.generate(num_samples_per_task=1)
# Evaluate code samples using the Mercury benchmark
evaluator.evaluate(num_samples_per_task=1)
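To compare several models with Option 0, the `mercury-eval` calls can be scripted. The sketch below is illustrative only: `build_command` and `run_all` are hypothetical helpers, they use just the `--timeout` and `--limit` flags shown in Option 0, and it is an assumption that the two flags may be combined. By default nothing is executed; the commands are only printed.

```python
import shlex
import subprocess


def build_command(model, timeout=None, limit=None):
    """Build a mercury-eval command using only the flags shown in Option 0."""
    command = ["mercury-eval", model]
    if timeout is not None:
        command += ["--timeout", str(timeout)]
    if limit is not None:
        command += ["--limit", str(limit)]
    return command


def run_all(models, dry_run=True, **options):
    """Print each command; execute it only when dry_run is False."""
    for model in models:
        command = build_command(model, **options)
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # raises if mercury-eval exits non-zero


models = ["gpt-4.1", "gemini-2.5-pro", "Qwen/Qwen2.5-Coder-32B-Instruct"]
run_all(models, limit=20)  # dry run: prints the commands without executing them
```

Passing `dry_run=False` would run the evaluations one after another, which requires `mercury-eval` to be installed and the relevant API keys to be configured.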
Benchmark Visualization
Citation
@article{du2024mercury,
  title={Mercury: A Code Efficiency Benchmark for Code Large Language Models},
  volume={37},
  journal={Advances in Neural Information Processing Systems},
  author={Du, Mingzhe and Tuan, Luu Anh and Ji, Bin and Liu, Qian and Ng, See-Kiong},
  year={2024},
}
Questions?
Should you have any questions regarding this paper, please feel free to email us (mingzhe@nus.edu.sg). Thank you for your attention!
