EvalScope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
⭐ If you like this project, please click the "Star" button in the upper right corner to support us. Your support is our motivation to move forward!
📝 Introduction
EvalScope is a powerful and easily extensible model evaluation framework created by the ModelScope Community, aiming to provide a one-stop evaluation solution for large model developers.
Whether you want to evaluate the general capabilities of models, conduct multi-model performance comparisons, or need to stress test models, EvalScope can meet your needs.
✨ Key Features
- 📚 Comprehensive Evaluation Benchmarks: Built-in multiple industry-recognized evaluation benchmarks including MMLU, C-Eval, GSM8K, and more.
- 🧩 Multi-modal and Multi-domain Support: Supports evaluation of various model types including Large Language Models (LLM), Vision Language Models (VLM), Embedding, Reranker, AIGC, and more.
- 🚀 Multi-backend Integration: Seamlessly integrates multiple evaluation backends including OpenCompass, VLMEvalKit, RAGEval to meet different evaluation needs.
- ⚡ Inference Performance Testing: Provides powerful model service stress testing tools, supporting multiple performance metrics such as TTFT, TPOT.
- 📊 Interactive Reports: Provides WebUI visualization interface, supporting multi-dimensional model comparison, report overview and detailed inspection.
- ⚔️ Arena Mode: Supports multi-model battles (Pairwise Battle), intuitively ranking and evaluating models.
- 🔧 Highly Extensible: Developers can easily add custom datasets, models and evaluation metrics.
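Among the performance metrics mentioned above, TTFT (time to first token) and TPOT (time per output token) are the core streaming-latency measures. The sketch below is illustrative only (the function name is hypothetical, not EvalScope's internal code) and shows how these metrics can be derived from token arrival timestamps:

```python
# Illustrative sketch: deriving TTFT and TPOT from the arrival
# timestamps of a streaming response. Not EvalScope's internal code.

def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT and TPOT from a request start time and the
    arrival timestamps (in seconds) of each generated token."""
    if not token_times:
        raise ValueError("no tokens received")
    # TTFT: delay between sending the request and the first token
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        # TPOT: average gap between consecutive output tokens
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return {"ttft": ttft, "tpot": tpot}

# Example: request sent at t=0.0, first token at 0.5 s, then one every 0.1 s
times = [0.5 + 0.1 * i for i in range(11)]
m = latency_metrics(0.0, times)
print(round(m["ttft"], 3), round(m["tpot"], 3))  # 0.5 0.1
```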
Input Layer
- Model Sources: API models (OpenAI API), Local models (ModelScope)
- Datasets: Standard evaluation benchmarks (MMLU, GSM8K, etc.), Custom data (MCQ/QA)
Core Functions
- Multi-backend Evaluation: Native backend, OpenCompass, MTEB, VLMEvalKit, RAGAS
- Performance Monitoring: Supports multiple model service APIs and data formats, tracking TTFT/TPOT and other metrics
- Tool Extensions: Integrates Tool-Bench, Needle-in-a-Haystack, etc.
Output Layer
- Structured Reports: Supports JSON, Table, Logs
- Visualization Platform: Supports Gradio, Wandb, SwanLab
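A minimal end-to-end run can be sketched as follows. This is a quick-start sketch assuming the PyPI package name `evalscope` and the `eval` subcommand flags described in the project documentation; verify the exact model ID and flags against the current docs before use:

```shell
# Install the core framework (sketch; see the docs for optional extras)
pip install evalscope

# Evaluate a small model on GSM8K, limiting to 5 samples for a smoke test
evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --datasets gsm8k \
  --limit 5
```

Results are written as structured reports (JSON/table) that can then be inspected in the WebUI described above.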
🎉 What's New
> [!IMPORTANT]
> Version 1.0 Refactoring
>
> Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under `evalscope/api`. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.
- 🔥 [2026.03.09] Added support for evaluation progress tracking and HTML format visualization report generation.
- 🔥 [2026.03.02] Added support for Anthropic Claude API evaluation. Use `--eval-type anthropic_api` to evaluate models via the Anthropic API service.
- 🔥 [2026.02.03] Comprehensive update to dataset documentation, adding data statistics, data samples, usage instructions and more. Refer to Supported Datasets.
- 🔥 [2026.01.13] Added support for Embedding and Rerank model service stress testing. Refer to the usage documentation.
- 🔥 [2025.12.26] Added support for Terminal-Bench-2.0, which evaluates AI Agent performance on 89 real-world multi-step terminal tasks. Refer to the usage documentation.
- 🔥 [2025.12.18] Added support for SLA auto-tuning model API services, automatically testing the maximum concurrency of model services under specific latency, TTFT, and throughput conditions. Refer to the usage documentation.
- 🔥 [2025.12.16] Added support for audio evaluation benchmarks such as Fleurs, LibriSpeech; added support for multilingual code evaluation benchmarks such as MultiPL-E, MBPP.
- 🔥 [2025.12.02] Added support for custom multimodal VQA evaluation; refer to the usage documentation. Added support for visualizing model service stress testing in ClearML; refer to the usage documentation.
- 🔥 [2025.11.26] Added support for OpenAI-MRCR, GSM8K-V, MGSM, MicroVQA, IFBench, SciCode benchmarks.
- 🔥 [2025.11.18] Added support for custom Function-Call (tool invocation) datasets to test whether models can timely and correctly call tools. Refer to the usage documentation.
- 🔥 [2025.11.14] Added support for SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini code evaluation benchmarks. Refer to the usage documentation.
- 🔥 [2025.11.12] Added `pass@k`, `vote@k`, `pass^k` and other metric aggregation methods; added support for multimodal evaluation benchmarks such as A_OKVQA, CMMU, ScienceQA, V*Bench.
- 🔥 [2025.11.07] Added support for τ²-bench, an extended and enhanced version of τ-bench that includes a series of code fixes and adds telecom domain troubleshooting scenarios. Refer to the usage documentation.
- 🔥 [2025.10.30] Added support for BFCL-v4, enabling evaluation of agent capabilities including web search and long-term memory. See the usage documentation.
- 🔥 [2025.10.27] Added support for LogiQA, HaluEval, MathQA, MRI-QA, PIQA, QASC, CommonsenseQA and other evaluation benchmarks. Thanks to @penguinwang96825 for the code implementation.
- 🔥 [2025.10.26] Added support for Conll-2003, CrossNER, Copious, GeniaNER, HarveyNER, MIT-Movie-Trivia, MIT-Restaurant, OntoNotes5, WNUT2017 and other Named Entity Recognition evaluation benchmarks. Thanks to @penguinwang96825 for the code implementation.
- 🔥 [2025.10.21] Optimized sandbox environment usage in code evaluation, supporting both local and remote operation modes. For details, refer to the documentation.
- 🔥 [2025.10.20] Added support for evaluation benchmarks including PolyMath, SimpleVQA, MathVerse, MathVision, AA-LCR; optimized evalscope perf performance to align with vLLM Bench. For details, refer to the documentation.
- 🔥 [2025.10.14] Added support for OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, and BLINK multimodal image-text evaluation benchmarks.
- 🔥 [2025.09.22] Code evaluation benchmarks (HumanEval, LiveCodeBench) now support running in a sandbox environment. To use this feature, please install ms-enclave first.
- 🔥 [2025.09.19] Added support for multimodal image-text evaluation benchmarks including RealWorldQA, AI2D, MMStar, MMBench, and OmniBench, as well as pure text evaluation benchmarks such as Multi-IF, HealthBench, and AMC.
- 🔥 [2025.09.05] Added support for vision-language multimodal model evaluation tasks, such as MathVista and MMMU. For more supported datasets, please refer to the documentation.
- 🔥 [2025.09.04] Added support for image editing task evaluation, including the GEdit-Bench benchmark. For usage instructions, refer to the documentation.
- 🔥 [2025.08.22] Version 1.0 refactoring released. This version contains breaking changes; see the Version 1.0 Refactoring note above.
- 🔥 [2025.07.18] Model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the documentation.
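On the `pass@k` aggregation mentioned in the [2025.11.12] entry above: `pass@k` is commonly computed with the unbiased combinatorial estimator over n generated samples of which c pass. The sketch below illustrates that standard estimator and is not necessarily EvalScope's exact implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes. Standard formula: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # fewer than k incorrect samples exist, so any draw of k
        # must include at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations, 3 correct; pass@1 equals the raw pass rate
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```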
