LLMEvaluation
A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.
Evaluation of LLMs and LLM-based Systems
Compendium of LLM Evaluation methods
Introduction
The aim of this compendium is to help academics and industry professionals create effective evaluation suites tailored to their specific needs by reviewing the top industry practices for assessing large language models (LLMs) and their applications. This work goes beyond merely cataloging benchmarks and evaluation studies; it provides a comprehensive overview of effective and practical evaluation techniques, including those embedded within papers whose primary focus is new LLM methodologies and tasks. I plan to update this survey periodically with noteworthy and shareable evaluation methods as I come across them. The goal is a resource where anyone with a question about LLM evaluation, whether it concerns evaluating an LLM or an LLM application for specific tasks, choosing the best methods to assess LLM effectiveness, or understanding how well an LLM performs in a particular domain, can easily find all the relevant information. Additionally, I highlight methods for evaluating the evaluation tasks themselves, to ensure that these evaluations align with business or academic objectives.
My view on LLM Evaluation: Deck (2024), SF Big Analytics and AICamp 2024 talk video, Analytics Vidhya, Data Phoenix (Mar 5, 2024) (by Andrei Lopatenko)
Adjacent compendium on LLMs, Search, and Recommender engines: the GitHub repository

Table of contents
- Reviews and Surveys
- Leaderboards and Arenas
- Evaluation Software
- LLM Evaluation articles in tech media and blog posts from companies
- Frontier models
- Large benchmarks
- Evaluation of evaluation, evaluation theory, evaluation methods, analysis of evaluation
- Long Comprehensive Studies
- HITL (Human in the Loop)
- LLM as Judge
- LLM Evaluation
- Embeddings
- In Context Learning
- Hallucinations
- Question Answering
- Multi-Turn
- Reasoning
- Multi-Lingual
- Multi-Modal
- Instruction Following
- Ethical AI
- Biases
- Safe AI
- Cybersecurity
- Code Generating LLMs
- Summarization
- LLM quality (generic methods: overfitting, redundant layers, etc.)
- Inference Performance
- Agent LLM architectures
- AGI Evaluation
- Long Text Generation
- Document Understanding
- Graph Understanding
- Reward Models
- Various unclassified tasks
- LLM Systems
- Other collections
- Citation
Reviews and Surveys
- Order in the Evaluation Court: A Critical Analysis of NLG Evaluation Trends, Jan 2026, arxiv
- Benchmark^2: Systematic Evaluation of LLM Benchmarks, Jan 2026, arxiv
- Toward an evaluation science for generative AI systems, Mar 2025, arxiv
- Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey, UMD, Jan 2025, arxiv
- AI Benchmarks and Datasets for LLM Evaluation, Dec 2024, arxiv, a survey of many LLM benchmarks
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, Dec 2024, arxiv
- A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations, EMNLP 2024, ACLAnthology
- A Survey on Evaluation of Multimodal Large Language Models, Aug 2024, arxiv
- A Survey of Useful LLM Evaluation, Jun 2024, arxiv
- Evaluating Large Language Models: A Comprehensive Survey, Oct 2023, arxiv
- A Survey on Evaluation of Large Language Models, Jul 2023, arxiv
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models, Aug 2023, arxiv
- For industry-specific surveys of evaluation methods (for example, in the medical domain), see the respective sections of this compendium
Leaderboards and Arenas
- New Hard Leaderboard by Hugging Face: leaderboard description, blog post
- MathArena: Evaluating LLMs on Uncontaminated Math Competitions; evaluation code
- ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval, the Visual Document Retrieval Benchmark, Mar 2025, Hugging Face Space; see the leaderboard in the document
- The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input, DeepMind, Jan 2025, arxiv Leaderboard
- LMSys Arena (explanation)
- Aider Polyglot, code edit benchmark, Aider Polyglot
- Salesforce's Contextual Bench leaderboard on Hugging Face, an overview of how different LLMs perform across a variety of contextual tasks
- GAIA leaderboard; GAIA is a benchmark developed by Meta and Hugging Face to measure general AI assistants, see GAIA: a benchmark for General AI Assistants
- WebQA, multimodal and multihop QA; WebQA leaderboard
- ArenaHard Leaderboard. Paper: From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline, UC Berkeley, Jun 2024, arxiv; GitHub repo; ArenaHard benchmark
- OpenGPT-X Multilingual European LLM Leaderboard, evaluation of LLMs for many European languages, on Hugging Face
- AllenAI's ZeroEval Leaderboard; benchmark: ZeroEval from AllenAI, a unified framework for evaluating (large) language models on various reasoning tasks
- OpenLLM Leaderboard
- MTEB
- SWE Bench
- AlpacaEval leaderboard. Paper: Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv; code
- Open Medical LLM Leaderboard from Hugging Face; explanation
- Gorilla, Berkeley Function Calling Leaderboard; explanation
- WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
- Enterprise Scenarios, Patronus
- Vectara Hallucination Leaderboard
- Ray/Anyscale's LLM Performance Leaderboard (explanation)
- Hugging Face LLM Performance [hugging face leaderboard](https://
