LLMEvaluation
A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.
Evaluation of LLMs and LLM-based Systems
Compendium of LLM Evaluation methods
Introduction
The aim of this compendium is to help academics and industry professionals create effective evaluation suites tailored to their specific needs by reviewing the top industry practices for assessing large language models (LLMs) and their applications. This work goes beyond merely cataloging benchmarks and evaluation studies; it provides a comprehensive overview of effective and practical evaluation techniques, including those embedded within papers whose primary focus is new LLM methodologies and tasks. I plan to update this survey periodically with noteworthy and shareable evaluation methods as I come across them. The goal is a resource where anyone with a question about LLM evaluation, whether it concerns evaluating an LLM or an LLM application for specific tasks, choosing the best methods to assess LLM effectiveness, or understanding how well an LLM performs in a particular domain, can easily find all the relevant information. Additionally, I highlight methods for evaluating the evaluation tasks themselves, to ensure that these evaluations align with business or academic objectives.
My view on LLM Evaluation: Deck (2024), SF Big Analytics and AICamp 2024 talk video, Analytics Vidhya, Data Phoenix (Mar 5, 2024) (by Andrei Lopatenko)
Adjacent compendium on LLMs, Search, and Recommender engines: the GitHub repository

Table of contents
- Reviews and Surveys
- Leaderboards and Arenas
- Evaluation Software
- LLM Evaluation articles in tech media and blog posts from companies
- Frontier models
- Large benchmarks
- Evaluation of evaluation, evaluation theory, evaluation methods, analysis of evaluation
- Long Comprehensive Studies
- HITL (Human in the Loop)
- LLM as Judge
- LLM Evaluation
- Embeddings
- In Context Learning
- Hallucinations
- Question Answering
- Multi-Turn
- Reasoning
- Multi-Lingual
- Multi-Modal
- Instruction Following
- Ethical AI
- Biases
- Safe AI
- Cybersecurity
- Code Generating LLMs
- Summarization
- LLM quality (generic methods: overfitting, redundant layers, etc.)
- Inference Performance
- Agent LLM architectures
- AGI Evaluation
- Long Text Generation
- Document Understanding
- Graph Understanding
- Reward Models
- Various unclassified tasks
- LLM Systems
- Other collections
- Citation
Reviews and Surveys
- Order in the Evaluation Court: A Critical Analysis of NLG Evaluation Trends, Jan 2026, arxiv
- Benchmark^2: Systematic Evaluation of LLM Benchmarks, Jan 2026, arxiv
- Toward an evaluation science for generative AI systems, Mar 2025, arxiv
- Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey, UMD, Jan 2025, arxiv
- AI Benchmarks and Datasets for LLM Evaluation, Dec 2024, arxiv, a survey of many LLM benchmarks
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, Dec 2024, arxiv
- A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations, EMNLP 2024, ACLAnthology
- A Survey on Evaluation of Multimodal Large Language Models, Aug 2024, arxiv
- A Survey of Useful LLM Evaluation, Jun 2024, arxiv
- Evaluating Large Language Models: A Comprehensive Survey, Oct 2023, arxiv
- A Survey on Evaluation of Large Language Models, Jul 2023, arxiv
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models, Aug 2023, arxiv
- For industry-specific surveys of evaluation methods (for example, in the medical domain), see the respective sections of this compendium
Leaderboards and Arenas
- New Hard Leaderboard by Hugging Face: leaderboard description, blog post
- MathArena: Evaluating LLMs on Uncontaminated Math Competitions; evaluation code
- ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval, the Visual Document Retrieval Benchmark, Mar 2025, Hugging Face Space; see the leaderboard in the document
- The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input, DeepMind, Jan 2025, arxiv Leaderboard
- LMSys Arena (explanation)
- Aider Polyglot, code edit benchmark, Aider Polyglot
- Salesforce's Contextual Bench leaderboard on Hugging Face, an overview of how different LLMs perform across a variety of contextual tasks
- GAIA leaderboard; GAIA is a benchmark developed by Meta and Hugging Face to measure general AI assistants, see GAIA: a benchmark for General AI Assistants
- WebQA, multimodal and multihop QA; WebQA leaderboard
- ArenaHard Leaderboard. Paper: From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline, UC Berkeley, Jun 2024, arxiv; GitHub repo; ArenaHard benchmark
- OpenGPT-X Multilingual European LLM Leaderboard, evaluation of LLMs for many European languages, on Hugging Face
- AllenAI's ZeroEval Leaderboard; benchmark: ZeroEval from AllenAI, a unified framework for evaluating (large) language models on various reasoning tasks
- OpenLLM Leaderboard
- MTEB
- SWE Bench
- AlpacaEval leaderboard. Paper: Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv; code
- Open Medical LLM Leaderboard from Hugging Face; explanation
- Gorilla, Berkeley Function Calling Leaderboard; explanation
- WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
- Enterprise Scenarios, Patronus
- Vectara Hallucination Leaderboard
- Ray/Anyscale's LLM Performance Leaderboard (explanation)
- Hugging Face LLM Performance [hugging face leaderboard](https://
