
Evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

Install / Use

/learn @modelscope/Evalscope
README

<p align="center"> <br> <img src="docs/en/_static/images/evalscope_logo.png"/> <br> </p> <p align="center"> <a href="README_zh.md">Chinese</a> &nbsp;|&nbsp; English </p> <p align="center"> <img src="https://img.shields.io/badge/python-%E2%89%A53.10-5be.svg"> <a href="https://badge.fury.io/py/evalscope"><img src="https://badge.fury.io/py/evalscope.svg" alt="PyPI version" height="18"></a> <a href="https://pypi.org/project/evalscope"><img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/evalscope"></a> <a href="https://github.com/modelscope/evalscope/pulls"><img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a> <a href='https://evalscope.readthedocs.io/en/latest/?badge=latest'><img src='https://readthedocs.org/projects/evalscope/badge/?version=latest' alt='Documentation Status' /></a> </p> <p align="center"> <a href="https://evalscope.readthedocs.io/zh-cn/latest/"> 📖 Chinese Documentation</a> &nbsp;|&nbsp; <a href="https://evalscope.readthedocs.io/en/latest/"> 📖 English Documentation</a> </p>

⭐ If you like this project, please click the "Star" button in the upper right corner to support us. Your support is our motivation to move forward!

📝 Introduction

EvalScope is a powerful and easily extensible model evaluation framework created by the ModelScope Community, aiming to provide a one-stop evaluation solution for large model developers.

Whether you want to evaluate the general capabilities of models, conduct multi-model performance comparisons, or need to stress test models, EvalScope can meet your needs.

✨ Key Features

  • 📚 Comprehensive Evaluation Benchmarks: Built-in multiple industry-recognized evaluation benchmarks including MMLU, C-Eval, GSM8K, and more.
  • 🧩 Multi-modal and Multi-domain Support: Supports evaluation of various model types including Large Language Models (LLM), Vision Language Models (VLM), Embedding, Reranker, AIGC, and more.
  • 🚀 Multi-backend Integration: Seamlessly integrates multiple evaluation backends including OpenCompass, VLMEvalKit, RAGEval to meet different evaluation needs.
  • ⚡ Inference Performance Testing: Provides powerful model service stress testing tools, supporting multiple performance metrics such as TTFT, TPOT.
  • 📊 Interactive Reports: Provides WebUI visualization interface, supporting multi-dimensional model comparison, report overview and detailed inspection.
  • ⚔️ Arena Mode: Supports multi-model pairwise battles, producing an intuitive ranking of the evaluated models.
  • 🔧 Highly Extensible: Developers can easily add custom datasets, models and evaluation metrics.
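Arena mode's pairwise battles ultimately reduce to standard Elo-style rating updates. A minimal sketch of that bookkeeping; the function below is illustrative and not part of EvalScope's API:

```python
def elo_update(r_a, r_b, winner, k=32):
    """Apply one pairwise-battle result to two Elo ratings.

    winner is "a", "b", or "tie"; k controls the update step size.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start even at 1000; model A wins one battle.
r_a, r_b = elo_update(1000, 1000, "a")
print(round(r_a), round(r_b))  # 1016 984
```

Repeating this update over many sampled battles converges toward a leaderboard ordering; production arena systems typically also report confidence intervals over the ratings.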
<details><summary>🏛️ Overall Architecture</summary> <p align="center"> <img src="https://sail-moe.oss-cn-hangzhou.aliyuncs.com/yunlin/images/evalscope/doc/EvalScope%E6%9E%B6%E6%9E%84%E5%9B%BE.png" style="width: 70%;"> <br>EvalScope Overall Architecture. </p>
  1. Input Layer

    • Model Sources: API models (OpenAI API), Local models (ModelScope)
    • Datasets: Standard evaluation benchmarks (MMLU/GSM8k etc.), Custom data (MCQ/QA)
  2. Core Functions

    • Multi-backend Evaluation: Native backend, OpenCompass, MTEB, VLMEvalKit, RAGAS
    • Performance Monitoring: Supports multiple model service APIs and data formats, tracking TTFT/TPOT and other metrics
    • Tool Extensions: Integrates Tool-Bench, Needle-in-a-Haystack, etc.
  3. Output Layer

    • Structured Reports: Supports JSON, Table, Logs
    • Visualization Platform: Supports Gradio, Wandb, SwanLab
</details>
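The TTFT and TPOT figures tracked by the performance monitor are simple functions of each request's streaming timeline. A back-of-the-envelope sketch using the common definitions (EvalScope's exact accounting may differ):

```python
def latency_metrics(t_request, t_first_token, t_last_token, n_output_tokens):
    """Compute time-to-first-token (TTFT) and time-per-output-token (TPOT).

    TPOT averages the decode time after the first token over the
    remaining tokens, which is the usual convention.
    """
    ttft = t_first_token - t_request
    tpot = (t_last_token - t_first_token) / max(n_output_tokens - 1, 1)
    return ttft, tpot

# A request that streams 101 tokens: first token after 250 ms,
# last token 2 s later -> TTFT 0.25 s, TPOT 20 ms/token.
ttft, tpot = latency_metrics(0.0, 0.25, 2.25, 101)
print(ttft, tpot)  # 0.25 0.02
```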

🎉 What's New

[!IMPORTANT] Version 1.0 Refactoring

Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under evalscope/api. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.
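A registry-based design like the one described above usually amounts to a decorator that maps a name to a component class. An illustrative sketch; the names below are hypothetical, not the actual evalscope/api symbols:

```python
BENCHMARK_REGISTRY = {}

def register_benchmark(name):
    """Record a benchmark adapter class under a lookup name."""
    def wrap(cls):
        BENCHMARK_REGISTRY[name] = cls
        return cls
    return wrap

@register_benchmark("gsm8k")
class GSM8KAdapter:
    """Hypothetical adapter; a real one would load and score samples."""
    def load(self):
        return ["A train travels 60 km in 1.5 hours. What is its speed?"]

# The evaluator can now resolve components by name instead of importing them.
adapter_cls = BENCHMARK_REGISTRY["gsm8k"]
print(adapter_cls.__name__)  # GSM8KAdapter
```

The payoff of this pattern is that adding a benchmark requires no changes to the core evaluator, only a new registered class.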

  • 🔥 [2026.03.09] Added support for evaluation progress tracking and HTML format visualization report generation.
  • 🔥 [2026.03.02] Added support for Anthropic Claude API evaluation. Use --eval-type anthropic_api to evaluate models via Anthropic API service.
  • 🔥 [2026.02.03] Comprehensive update to dataset documentation, adding data statistics, data samples, usage instructions, and more. Refer to Supported Datasets.
  • 🔥 [2026.01.13] Added support for Embedding and Rerank model service stress testing. Refer to the usage documentation.
  • 🔥 [2025.12.26] Added support for Terminal-Bench-2.0, which evaluates AI Agent performance on 89 real-world multi-step terminal tasks. Refer to the usage documentation.
  • 🔥 [2025.12.18] Added support for SLA auto-tuning model API services, automatically testing the maximum concurrency of model services under specific latency, TTFT, and throughput conditions. Refer to the usage documentation.
  • 🔥 [2025.12.16] Added support for audio evaluation benchmarks such as Fleurs and LibriSpeech; added support for multilingual code evaluation benchmarks such as MultiPL-E and MBPP.
  • 🔥 [2025.12.02] Added support for custom multimodal VQA evaluation; refer to the usage documentation. Added support for visualizing model service stress testing in ClearML; refer to the usage documentation.
  • 🔥 [2025.11.26] Added support for OpenAI-MRCR, GSM8K-V, MGSM, MicroVQA, IFBench, SciCode benchmarks.
  • 🔥 [2025.11.18] Added support for custom Function-Call (tool invocation) datasets to test whether models can timely and correctly call tools. Refer to the usage documentation.
  • 🔥 [2025.11.14] Added support for SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini code evaluation benchmarks. Refer to the usage documentation.
  • 🔥 [2025.11.12] Added pass@k, vote@k, pass^k and other metric aggregation methods; added support for multimodal evaluation benchmarks such as A_OKVQA, CMMU, ScienceQA, V*Bench.
  • 🔥 [2025.11.07] Added support for τ²-bench, an extended and enhanced version of τ-bench that includes a series of code fixes and adds telecom domain troubleshooting scenarios. Refer to the usage documentation.
  • 🔥 [2025.10.30] Added support for BFCL-v4, enabling evaluation of agent capabilities including web search and long-term memory. See the usage documentation.
  • 🔥 [2025.10.27] Added support for LogiQA, HaluEval, MathQA, MRI-QA, PIQA, QASC, CommonsenseQA and other evaluation benchmarks. Thanks to @penguinwang96825 for the code implementation.
  • 🔥 [2025.10.26] Added support for Conll-2003, CrossNER, Copious, GeniaNER, HarveyNER, MIT-Movie-Trivia, MIT-Restaurant, OntoNotes5, WNUT2017 and other Named Entity Recognition evaluation benchmarks. Thanks to @penguinwang96825 for the code implementation.
  • 🔥 [2025.10.21] Optimized sandbox environment usage in code evaluation, supporting both local and remote operation modes. For details, refer to the documentation.
  • 🔥 [2025.10.20] Added support for evaluation benchmarks including PolyMath, SimpleVQA, MathVerse, MathVision, AA-LCR; optimized evalscope perf performance to align with vLLM Bench. For details, refer to the documentation.
  • 🔥 [2025.10.14] Added support for OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, and BLINK multimodal image-text evaluation benchmarks.
  • 🔥 [2025.09.22] Code evaluation benchmarks (HumanEval, LiveCodeBench) now support running in a sandbox environment. To use this feature, please install ms-enclave first.
  • 🔥 [2025.09.19] Added support for multimodal image-text evaluation benchmarks including RealWorldQA, AI2D, MMStar, MMBench, and OmniBench, as well as pure text evaluation benchmarks such as Multi-IF, HealthBench, and AMC.
  • 🔥 [2025.09.05] Added support for vision-language multimodal model evaluation tasks, such as MathVista and MMMU. For more supported datasets, please refer to the documentation.
  • 🔥 [2025.09.04] Added support for image editing task evaluation, including the GEdit-Bench benchmark. For usage instructions, refer to the documentation.
  • 🔥 [2025.08.22] Version 1.0 refactoring. This release includes breaking changes; please refer to the documentation.
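For the pass@k aggregation added in [2025.11.12], the conventional unbiased estimator gives the probability that at least one of k samples drawn from n generations (c of them correct) passes. A sketch assuming that standard formula is the one in use:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator over n generations with c correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 of 10 generations correct, pass@1 is the raw success rate.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```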
<details><summary>More</summary>
  • 🔥 [2025.07.18] The model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the documentation.
</details>