Mercury: A Code Efficiency Benchmark for Code Large Language Models 🪐


  • Welcome to Mercury!
  • Mercury is the first code efficiency benchmark designed for LLM code synthesis tasks.
  • It consists of 1,889 programming tasks covering diverse difficulty levels, along with test case generators that produce unlimited cases for comprehensive evaluation.
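
The sketch below illustrates, in plain Python, the kind of measurement Mercury performs: a functionally correct solution is timed on freshly generated test cases, so a faster algorithm scores better than a slower one. The generator, function names, and scoring here are illustrative assumptions, not the actual Mercury_Eval API; see the evaluation options below for the real harness.

# Illustrative sketch only (hypothetical names, not the Mercury_Eval API):
# time a candidate solution on freshly generated test cases.
import random
import time

def generate_case(rng: random.Random) -> list[int]:
    # Hypothetical test-case generator: a random integer array of varying size.
    return [rng.randint(-10**6, 10**6) for _ in range(rng.randint(1, 10_000))]

def mean_runtime(solution, num_cases: int = 100, seed: int = 0) -> float:
    # Average wall-clock runtime of `solution` over `num_cases` generated cases.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_cases):
        case = generate_case(rng)
        start = time.perf_counter()
        solution(case)
        total += time.perf_counter() - start
    return total / num_cases

# Two functionally correct solutions to "find the maximum" with different efficiency.
def slow_max(xs):
    return sorted(xs)[-1]   # O(n log n)

def fast_max(xs):
    return max(xs)          # O(n)

print(f"slow: {mean_runtime(slow_max):.6f} s/case")
print(f"fast: {mean_runtime(fast_max):.6f} s/case")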

[March 6, 2026] Released Mercury_Eval for Mercury evaluation!

[October 8, 2024] Mercury has been accepted to NeurIPS 2024 🌟

[September 20, 2024] We released a much larger dataset, Venus, which supports more languages and measures memory usage in addition to runtime.

[July 10, 2024] We are now building Code Arena for more efficient Code LLM evaluation!

[June 24, 2024] We are currently working on Multilingual Mercury 🚧

[May 26, 2024] Mercury is now available on BigCode 🌟

Mercury Datasets Access

We publish and maintain our datasets at Mercury@HF
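
As a quick start, the dataset can presumably be pulled straight from the Hugging Face Hub with the datasets library; the split and field names below are assumptions, so check the dataset card for the actual schema.

from datasets import load_dataset

# Load Mercury from the Hugging Face Hub (repository id taken from Mercury@HF).
mercury = load_dataset("Elfsong/Mercury", split="eval")  # split name is an assumption
print(len(mercury))        # number of tasks in this split
print(mercury[0].keys())   # inspect the fields of one task record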

Mercury_Eval

How to use Mercury Evaluation

# Option 0 (Mercury Evaluation)
git clone https://github.com/Elfsong/Mercury_Eval.git
cd Mercury_Eval
uv sync --extra all

# Evaluate with a specific model (backend auto-detected)
mercury-eval gpt-4.1                                    # full evaluation
mercury-eval gemini-2.5-pro --timeout 120               # timeout
mercury-eval Qwen/Qwen2.5-Coder-32B-Instruct --limit 20 # tasks limit
# Option 1 (with BigCode):
# See https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs#mercury
accelerate launch --main_process_port 30003 main.py \
    --model bigcode/starcoder2-7b \
    --load_in_4bit \
    --max_length_generation 2048 \
    --tasks mercury \
    --n_samples 5 \
    --temperature 0.2 \
    --batch_size 5 \
    --allow_code_execution \
    --save_generations \
    --metric_output_path starcoder2-7b-mercury-result.json
# Option 2 (this library):
import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_KEY'

# Instantiate the evaluator with a model name.
# Set do_generate=True to load the specified language model during evaluator initialization.
from src import evaluator as Evaluator
evaluator = Evaluator.DistributeWiseEvaluator(model_name_or_path='openai/gpt-3.5-turbo-1106', do_generate=True)

# Generate code samples
evaluator.generate(num_samples_per_task=1)

# Evaluate code samples using the Mercury benchmark
evaluator.evaluate(num_samples_per_task=1)
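
If code samples have already been generated, the do_generate flag above suggests the evaluator can be constructed without loading a model and used for evaluation only. This is a sketch based on that flag, not a documented workflow; consult the repository if it does not match.

# Assumed evaluation-only usage (based on the do_generate flag, not a documented example).
evaluator = Evaluator.DistributeWiseEvaluator(
    model_name_or_path='openai/gpt-3.5-turbo-1106',
    do_generate=False,  # skip model loading; reuse previously generated samples
)
evaluator.evaluate(num_samples_per_task=1)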

Benchmark Visualization

(Figure: Mercury benchmark visualization)

Citation

@article{du2024mercury, 
    title={Mercury: A Code Efficiency Benchmark for Code Large Language Models},
    volume={37}, 
    journal={Advances in Neural Information Processing Systems},
    author={Du, Mingzhe and Tuan, Luu Anh and Ji, Bin and Liu, Qian and Ng, See-Kiong}, 
    year={2024}, 
}

Questions?

Should you have any questions regarding this paper, please feel free to email us at mingzhe@nus.edu.sg. Thank you for your attention!
