# EvalPlus

Rigorous evaluation of LLM-synthesized code (NeurIPS 2023 & COLM 2024)
EvalPlus(📖) => 📚
<p align="center">
<a href="https://evalplus.github.io"><img src="https://img.shields.io/badge/%F0%9F%8F%86-leaderboard-8A2BE2"></a>
<a href="https://openreview.net/forum?id=1qvx610Cu7"><img src="https://img.shields.io/badge/EvalPlus-NeurIPS'23-a55fed.svg"></a>
<a href="https://openreview.net/forum?id=IBCBMeAhmC"><img src="https://img.shields.io/badge/EvalPerf-COLM'24-a55fed.svg"></a>
<a href="https://huggingface.co/evalplus/"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-evalplus-%23ff8811.svg"></a>
<a href="https://pypi.org/project/evalplus/"><img src="https://img.shields.io/pypi/v/evalplus?color=g"></a>
<a href="https://hub.docker.com/r/ganler/evalplus" title="Docker"><img src="https://img.shields.io/docker/image-size/ganler/evalplus"></a>
</p>
<p align="center">
<a href="#-about">📙About</a> •
<a href="#-quick-start">🔥Quick Start</a> •
<a href="#-llm-backends">🚀LLM Backends</a> •
<a href="#-documents">📚Documents</a> •
<a href="#-citation">📜Citation</a> •
<a href="#-acknowledgement">🙏Acknowledgement</a>
</p>
## 📢 News
**Who's using EvalPlus datasets?** EvalPlus has been used by various LLM teams, including:
- Meta Llama 3.1 and 3.3
- Allen AI TÜLU 1/2/3
- Qwen2.5-Coder
- CodeQwen 1.5
- DeepSeek-Coder V2
- Qwen2
- Snowflake Arctic
- StarCoder2
- Magicoder
- WizardCoder
Below tracks the notable updates of EvalPlus:

- [2024-10-20 `v0.3.1`]: EvalPlus `v0.3.1` is officially released! Highlights: (i) code efficiency evaluation via EvalPerf, (ii) one command to run everything: generation + post-processing + evaluation, (iii) support for more inference backends such as Google Gemini & Anthropic, etc.
- [2024-06-09 pre `v0.3.0`]: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
- [2024-04-17 pre `v0.3.0`]: MBPP+ is upgraded to `v0.2.0` by removing some broken tasks (399 -> 378 tasks). A ~4pp pass@1 improvement can be expected.
- (`v0.2.1`) You can use EvalPlus datasets via bigcode-evaluation-harness! HumanEval+ oracle fixes (32).
- (`v0.2.0`) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
- (`v0.1.7`) Leaderboard release; HumanEval+ contract and input fixes (32/166/126/6).
- (`v0.1.6`) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140).
- (`v0.1.5`) HumanEval+ Mini is released for ultra-fast evaluation when you have too many samples!
- (`v0.1.1`) Optimized user experience: evaluation speed, PyPI package, Docker, etc.
- (`v0.1.0`) HumanEval+ is released!
## 📙 About
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
- ✨ HumanEval+: 80x more tests than the original HumanEval!
- ✨ MBPP+: 35x more tests than the original MBPP!
- ✨ EvalPerf: evaluating the efficiency of LLM-generated code!
- ✨ Framework: our packages/images/tools let you easily and safely evaluate LLMs on the above benchmarks.
Why EvalPlus?
- ✨ Precise evaluation: See our leaderboard for latest LLM rankings before & after rigorous evaluation.
- ✨ Code rigorousness: Look at the score differences before & after applying EvalPlus tests! A smaller drop means the code generation is more rigorous, while a bigger drop means the generated code tends to be fragile.
- ✨ Code efficiency: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.
Want to know more details? Read our papers & materials!
- EvalPlus: NeurIPS'23 paper, Slides, Poster, Leaderboard
- EvalPerf: COLM'24 paper, Poster, Documentation, Leaderboard
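For a quick feel of what the extended benchmarks contain, here is a minimal sketch using the `evalplus.data` API (assuming `evalplus` is installed via pip; problem fields follow the HumanEval format):

```python
from evalplus.data import get_human_eval_plus

# Each dataset is a dict mapping task_id -> problem; each problem carries the
# incomplete-function prompt and the entry point that the extended tests exercise.
problems = get_human_eval_plus()
print(len(problems), "HumanEval+ tasks")

task = problems["HumanEval/0"]
print(task["entry_point"])  # name of the function the model must implement
print(task["prompt"])       # the incomplete code handed to the LLM
```

`get_mbpp_plus()` works the same way for MBPP+.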
## 🔥 Quick Start
### Code Correctness Evaluation: HumanEval(+) or MBPP(+)
```bash
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --greedy
```
<details><summary>🛡️ Safe code execution within Docker <i>:: click to expand ::</i></summary>
<div>
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset humaneval \
                 --backend vllm \
                 --greedy

# Code execution within Docker
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evaluate --dataset humaneval \
           --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```
</div>
</details>
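If you would rather drive generation yourself instead of using `evalplus.codegen`, the sketch below shows how to produce a `samples.jsonl` that `evalplus.evaluate --samples` accepts (`generate_one` is a hypothetical stand-in for your own model call):

```python
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one(prompt: str) -> str:
    # Hypothetical placeholder: call your own model here and return the
    # completed solution for the given incomplete-function prompt.
    raise NotImplementedError

samples = [
    dict(task_id=task_id, solution=generate_one(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
```

The result can then be scored with `evalplus.evaluate --dataset humaneval --samples samples.jsonl`.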
### Code Efficiency Evaluation: EvalPerf (*nix only)
```bash
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
```
<details><summary>🛡️ Safe code execution within Docker <i>:: click to expand ::</i></summary>
<div>
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf \
                 --backend vllm \
                 --temperature 1.0 \
                 --n-samples 100

# Code execution within Docker
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
docker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
```
</div>
</details>
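Before launching a long EvalPerf run, it can be worth confirming that the kernel setting above actually took effect; a small sketch (Linux only, standard procfs path):

```python
from pathlib import Path

# EvalPerf relies on hardware performance counters; the sysctl command above
# sets the paranoid level to 0 so unprivileged processes may use them.
level = int(Path("/proc/sys/kernel/perf_event_paranoid").read_text().strip())
if level > 0:
    print(f"perf_event_paranoid is {level}; rerun the `sudo sh -c 'echo 0 > ...'` command")
else:
    print(f"perf_event_paranoid is {level}; perf counters should be available")
```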
## 🚀 LLM Backends
### HuggingFace models

> [!Note]
>
> EvalPlus uses different prompts for base and chat models. By default, the mode is detected via `tokenizer.chat_template` when using `hf`/`vllm` as the backend; for other backends, only chat mode is allowed.
>
> Therefore, if your base model comes with a `tokenizer.chat_template`, please add `--force-base-prompt` to avoid being evaluated in chat mode.
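A quick way to check which mode a model will get, sketched with the `transformers` API (the model name is just an example):

```python
from transformers import AutoTokenizer

# If the tokenizer ships a chat template, EvalPlus treats the model as a chat
# model by default; pass --force-base-prompt to override this for base models.
tok = AutoTokenizer.from_pretrained("ise-uiuc/Magicoder-S-DS-6.7B")
print("chat template present:", tok.chat_template is not None)
```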
`transformers` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --greedy
```
<details><summary>Enable Flash Attention 2 <i>:: click to expand ::</i></summary>
<div>
```bash
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problems, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy
```
</div>
</details>
`vllm` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --tp [TENSOR_PARALLEL_SIZE] \
                  --greedy
```
`openai` compatible servers (e.g., vLLM):
```bash
# OpenAI models
export OPENAI_API_KEY="{KEY}" # https://platform.openai.com/settings/organization/api-keys
evalplus.evaluate --model "gpt-4o-2024-08-06" \
                  --dataset [humaneval|mbpp] \
                  --backend openai --greedy

# DeepSeek
export OPENAI_API_KEY="{KEY}" # https://platform.deepseek.com/api_keys
evalplus.evaluate --model "deepseek-chat" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.deepseek.com \
                  --backend openai --greedy

# Grok
export OPENAI_API_KEY="{KEY}" # https://console.x.ai/
evalplus.evaluate --model "grok-beta" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.x.ai/v1 \
                  --backend openai --greedy
```