Mercury: A Code Efficiency Benchmark for Code Large Language Models 🪐


  • Welcome to Mercury!
  • Mercury is the first code efficiency benchmark designed for LLM code synthesis tasks.
  • It consists of 1,889 programming tasks covering diverse difficulty levels, along with test case generators that produce unlimited cases for comprehensive evaluation.
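
The sketch below illustrates, in plain Python, the kind of measurement Mercury performs: a functionally correct solution is timed on freshly generated test cases, so a faster algorithm scores better than a slower one. The generator, function names, and scoring here are illustrative assumptions, not the actual Mercury_Eval API; see the evaluation options below for the real harness.

# Illustrative sketch only (hypothetical names, not the Mercury_Eval API):
# time a candidate solution on freshly generated test cases.
import random
import time

def generate_case(rng: random.Random) -> list[int]:
    # Hypothetical test-case generator: a random integer array of varying size.
    return [rng.randint(-10**6, 10**6) for _ in range(rng.randint(1, 10_000))]

def mean_runtime(solution, num_cases: int = 100, seed: int = 0) -> float:
    # Average wall-clock runtime of `solution` over `num_cases` generated cases.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_cases):
        case = generate_case(rng)
        start = time.perf_counter()
        solution(case)
        total += time.perf_counter() - start
    return total / num_cases

# Two functionally correct solutions to "find the maximum" with different efficiency.
def slow_max(xs):
    return sorted(xs)[-1]   # O(n log n)

def fast_max(xs):
    return max(xs)          # O(n)

print(f"slow: {mean_runtime(slow_max):.6f} s/case")
print(f"fast: {mean_runtime(fast_max):.6f} s/case")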

[March 6, 2026] Released Mercury_Eval for Mercury evaluation!

[October 8, 2024] Mercury has been accepted to NeurIPS 2024 🌟

[September 20, 2024] We released a much larger dataset, Venus, which supports more languages and measures memory usage in addition to runtime.

[July 10, 2024] We are now building Code Arena for more efficient Code LLM evaluation!

[June 24, 2024] We are currently working on Multilingual Mercury 🚧

[May 26, 2024] Mercury is now available on BigCode 🌟

Mercury Datasets Access

We publish and maintain our datasets at Mercury@HF
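
As a quick start, the dataset can presumably be pulled straight from the Hugging Face Hub with the datasets library; the split and field names below are assumptions, so check the dataset card for the actual schema.

from datasets import load_dataset

# Load Mercury from the Hugging Face Hub (repository id taken from Mercury@HF).
mercury = load_dataset("Elfsong/Mercury", split="eval")  # split name is an assumption
print(len(mercury))        # number of tasks in this split
print(mercury[0].keys())   # inspect the fields of one task record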

Mercury_Eval

How to use Mercury Evaluation

# Option 0 (Mercury Evaluation)
git clone https://github.com/Elfsong/Mercury_Eval.git
cd Mercury_Eval
uv sync --extra all

# Evaluate with a specific model (backend auto-detected)
mercury-eval gpt-4.1                                    # full evaluation
mercury-eval gemini-2.5-pro --timeout 120               # timeout
mercury-eval Qwen/Qwen2.5-Coder-32B-Instruct --limit 20 # tasks limit
# Option 1 (with BigCode):
# See https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs#mercury
accelerate launch --main_process_port 30003 main.py \
    --model bigcode/starcoder2-7b \
    --load_in_4bit \
    --max_length_generation 2048 \
    --tasks mercury \
    --n_samples 5 \
    --temperature 0.2 \
    --batch_size 5 \
    --allow_code_execution \
    --save_generations \
    --metric_output_path starcoder2-7b-mercury-result.json
# Option 2 (this library):
import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_KEY'

# Instantiate the evaluator with a model name.
# Set do_generate=True to load the specified language model during evaluator initialization.
from src import evaluator as Evaluator
evaluator = Evaluator.DistributeWiseEvaluator(model_name_or_path='openai/gpt-3.5-turbo-1106', do_generate=True)

# Generate code samples
evaluator.generate(num_samples_per_task=1)

# Evaluate code samples using the Mercury benchmark
evaluator.evaluate(num_samples_per_task=1)
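
If code samples have already been generated, the do_generate flag above suggests the evaluator can be constructed without loading a model and used for evaluation only. This is a sketch based on that flag, not a documented workflow; consult the repository if it does not match.

# Assumed evaluation-only usage (based on the do_generate flag, not a documented example).
evaluator = Evaluator.DistributeWiseEvaluator(
    model_name_or_path='openai/gpt-3.5-turbo-1106',
    do_generate=False,  # skip model loading; reuse previously generated samples
)
evaluator.evaluate(num_samples_per_task=1)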

Benchmark Visualization

(Figure: Mercury benchmark visualization)

Citation

@article{du2024mercury, 
    title={Mercury: A Code Efficiency Benchmark for Code Large Language Models},
    volume={37}, 
    journal={Advances in Neural Information Processing Systems},
    author={Du, Mingzhe and Tuan, Luu Anh and Ji, Bin and Liu, Qian and Ng, See-Kiong}, 
    year={2024}, 
}

Questions?

Should you have any questions regarding this paper, please feel free to email us at mingzhe@nus.edu.sg. Thank you for your attention!
