EvalScope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
⭐ If you like this project, please click the "Star" button in the upper right corner to support us. Your support is our motivation to move forward!
📝 Introduction
EvalScope is a powerful and easily extensible model evaluation framework created by the ModelScope Community, aiming to provide a one-stop evaluation solution for large model developers.
Whether you want to evaluate the general capabilities of models, conduct multi-model performance comparisons, or need to stress test models, EvalScope can meet your needs.
✨ Key Features
- 📚 Comprehensive Evaluation Benchmarks: Built-in multiple industry-recognized evaluation benchmarks including MMLU, C-Eval, GSM8K, and more.
- 🧩 Multi-modal and Multi-domain Support: Supports evaluation of various model types including Large Language Models (LLM), Vision Language Models (VLM), Embedding, Reranker, AIGC, and more.
- 🚀 Multi-backend Integration: Seamlessly integrates multiple evaluation backends including OpenCompass, VLMEvalKit, RAGEval to meet different evaluation needs.
- ⚡ Inference Performance Testing: Provides powerful model service stress testing tools, supporting multiple performance metrics such as TTFT, TPOT.
- 📊 Interactive Reports: Provides WebUI visualization interface, supporting multi-dimensional model comparison, report overview and detailed inspection.
- ⚔️ Arena Mode: Supports multi-model battles (Pairwise Battle), intuitively ranking and evaluating models.
- 🔧 Highly Extensible: Developers can easily add custom datasets, models and evaluation metrics.
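Among the performance metrics mentioned above, TTFT (time to first token) and TPOT (time per output token) are the core streaming-latency measures. The sketch below is illustrative only (the function name is hypothetical, not EvalScope's internal code) and shows how these metrics can be derived from token arrival timestamps:

```python
# Illustrative sketch: deriving TTFT and TPOT from the arrival
# timestamps of a streaming response. Not EvalScope's internal code.

def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT and TPOT from a request start time and the
    arrival timestamps (in seconds) of each generated token."""
    if not token_times:
        raise ValueError("no tokens received")
    # TTFT: delay between sending the request and the first token
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        # TPOT: average gap between consecutive output tokens
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return {"ttft": ttft, "tpot": tpot}

# Example: request sent at t=0.0, first token at 0.5 s, then one every 0.1 s
times = [0.5 + 0.1 * i for i in range(11)]
m = latency_metrics(0.0, times)
print(round(m["ttft"], 3), round(m["tpot"], 3))  # 0.5 0.1
```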
Input Layer
- Model Sources: API models (OpenAI API), Local models (ModelScope)
- Datasets: Standard evaluation benchmarks (MMLU, GSM8K, etc.), Custom data (MCQ/QA)
Core Functions
- Multi-backend Evaluation: Native backend, OpenCompass, MTEB, VLMEvalKit, RAGAS
- Performance Monitoring: Supports multiple model service APIs and data formats, tracking TTFT/TPOT and other metrics
- Tool Extensions: Integrates Tool-Bench, Needle-in-a-Haystack, etc.
Output Layer
- Structured Reports: Supports JSON, Table, Logs
- Visualization Platform: Supports Gradio, Wandb, SwanLab
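A minimal end-to-end run can be sketched as follows. This is a quick-start sketch assuming the PyPI package name `evalscope` and the `eval` subcommand flags described in the project documentation; verify the exact model ID and flags against the current docs before use:

```shell
# Install the core framework (sketch; see the docs for optional extras)
pip install evalscope

# Evaluate a small model on GSM8K, limiting to 5 samples for a smoke test
evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --datasets gsm8k \
  --limit 5
```

Results are written as structured reports (JSON/table) that can then be inspected in the WebUI described above.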
🎉 What's New
> [!IMPORTANT]
> Version 1.0 Refactoring
>
> Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under `evalscope/api`. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.
- 🔥 [2026.03.09] Added support for evaluation progress tracking and HTML format visualization report generation.
- 🔥 [2026.03.02] Added support for Anthropic Claude API evaluation. Use `--eval-type anthropic_api` to evaluate models via the Anthropic API service.
- 🔥 [2026.02.03] Comprehensive update to dataset documentation, adding data statistics, data samples, usage instructions and more. Refer to Supported Datasets.
- 🔥 [2026.01.13] Added support for Embedding and Rerank model service stress testing. Refer to the usage documentation.
- 🔥 [2025.12.26] Added support for Terminal-Bench-2.0, which evaluates AI Agent performance on 89 real-world multi-step terminal tasks. Refer to the usage documentation.
- 🔥 [2025.12.18] Added support for SLA auto-tuning model API services, automatically testing the maximum concurrency of model services under specific latency, TTFT, and throughput conditions. Refer to the usage documentation.
- 🔥 [2025.12.16] Added support for audio evaluation benchmarks such as Fleurs, LibriSpeech; added support for multilingual code evaluation benchmarks such as MultiPL-E, MBPP.
- 🔥 [2025.12.02] Added support for custom multimodal VQA evaluation; refer to the usage documentation. Added support for visualizing model service stress testing in ClearML; refer to the usage documentation.
- 🔥 [2025.11.26] Added support for OpenAI-MRCR, GSM8K-V, MGSM, MicroVQA, IFBench, SciCode benchmarks.
- 🔥 [2025.11.18] Added support for custom Function-Call (tool invocation) datasets to test whether models can timely and correctly call tools. Refer to the usage documentation.
- 🔥 [2025.11.14] Added support for SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini code evaluation benchmarks. Refer to the usage documentation.
- 🔥 [2025.11.12] Added `pass@k`, `vote@k`, `pass^k` and other metric aggregation methods; added support for multimodal evaluation benchmarks such as A_OKVQA, CMMU, ScienceQA, V*Bench.
- 🔥 [2025.11.07] Added support for τ²-bench, an extended and enhanced version of τ-bench that includes a series of code fixes and adds telecom domain troubleshooting scenarios. Refer to the usage documentation.
- 🔥 [2025.10.30] Added support for BFCL-v4, enabling evaluation of agent capabilities including web search and long-term memory. See the usage documentation.
- 🔥 [2025.10.27] Added support for LogiQA, HaluEval, MathQA, MRI-QA, PIQA, QASC, CommonsenseQA and other evaluation benchmarks. Thanks to @penguinwang96825 for the code implementation.
- 🔥 [2025.10.26] Added support for Conll-2003, CrossNER, Copious, GeniaNER, HarveyNER, MIT-Movie-Trivia, MIT-Restaurant, OntoNotes5, WNUT2017 and other Named Entity Recognition evaluation benchmarks. Thanks to @penguinwang96825 for the code implementation.
- 🔥 [2025.10.21] Optimized sandbox environment usage in code evaluation, supporting both local and remote operation modes. For details, refer to the documentation.
- 🔥 [2025.10.20] Added support for evaluation benchmarks including PolyMath, SimpleVQA, MathVerse, MathVision, AA-LCR; optimized evalscope perf performance to align with vLLM Bench. For details, refer to the documentation.
- 🔥 [2025.10.14] Added support for OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, and BLINK multimodal image-text evaluation benchmarks.
- 🔥 [2025.09.22] Code evaluation benchmarks (HumanEval, LiveCodeBench) now support running in a sandbox environment. To use this feature, please install ms-enclave first.
- 🔥 [2025.09.19] Added support for multimodal image-text evaluation benchmarks including RealWorldQA, AI2D, MMStar, MMBench, and OmniBench, as well as pure text evaluation benchmarks such as Multi-IF, HealthBench, and AMC.
- 🔥 [2025.09.05] Added support for vision-language multimodal model evaluation tasks, such as MathVista and MMMU. For more supported datasets, please refer to the documentation.
- 🔥 [2025.09.04] Added support for image editing task evaluation, including the GEdit-Bench benchmark. For usage instructions, refer to the documentation.
- 🔥 [2025.08.22] Version 1.0 refactoring released. This version contains breaking changes; see the Version 1.0 Refactoring note above.
- 🔥 [2025.07.18] Model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the documentation.
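On the `pass@k` aggregation mentioned in the [2025.11.12] entry above: `pass@k` is commonly computed with the unbiased combinatorial estimator over n generated samples of which c pass. The sketch below illustrates that standard estimator and is not necessarily EvalScope's exact implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes. Standard formula: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # fewer than k incorrect samples exist, so any draw of k
        # must include at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations, 3 correct; pass@1 equals the raw pass rate
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```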
