
LLMEvaluation

A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.

Install / Use

`/learn @alopatenko/LLMEvaluation`

Awesome LLM Evaluation

Evaluation of LLMs and LLM-based Systems

Compendium of LLM Evaluation Methods


Introduction

The aim of this compendium is to help academics and industry professionals create effective evaluation suites tailored to their specific needs, by reviewing the top industry practices for assessing large language models (LLMs) and their applications. This work goes beyond merely cataloging benchmarks and evaluation studies: it offers a comprehensive overview of all effective and practical evaluation techniques, including those embedded in papers whose primary contribution is a new LLM methodology or task. I plan to update this survey periodically with any noteworthy and shareable evaluation methods that I come across.

My goal is a resource where anyone with a question about LLM evaluation, whether it concerns evaluating an LLM or an LLM application for a specific task, choosing the best methods to assess LLM effectiveness, or understanding how well an LLM performs in a particular domain, can easily find all the relevant information. I also want to highlight methods for evaluating the evaluation tasks themselves, to ensure that these evaluations align effectively with business or academic objectives.
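
To make the idea of a tailored evaluation suite concrete, here is a minimal sketch (an illustrative assumption, not a method from this compendium): a small prompt set with reference answers, a pluggable scoring function, and an aggregate accuracy. The eval set, the `exact_match` metric, and the stub model are all hypothetical.

```python
# Minimal sketch of a task-specific evaluation suite.
# Everything here (eval set, metric, stub model) is hypothetical.
from typing import Callable

# Tiny illustrative eval set; a real suite uses domain-specific data.
EVAL_SET = [
    {"prompt": "What is the capital of France?", "reference": "Paris"},
    {"prompt": "2 + 2 =", "reference": "4"},
]

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model: Callable[[str], str],
             metric: Callable[[str, str], float] = exact_match) -> float:
    """Run the model over the eval set and return the mean metric score."""
    scores = [metric(model(ex["prompt"]), ex["reference"]) for ex in EVAL_SET]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Stub standing in for a real LLM call.
    canned = {"What is the capital of France?": "Paris", "2 + 2 =": "4"}
    print(f"accuracy = {evaluate(lambda p: canned.get(p, '')):.2f}")
```

A real suite would swap in domain data, richer metrics (for example, an LLM-as-judge score), and calls to an actual model; the same harness shape also supports meta-evaluation, e.g., measuring how well a judge metric agrees with human labels.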

My view on LLM Evaluation: Deck (2024); SF Big Analytics and AICamp video (2024); Analytics Vidhya (Data Phoenix, Mar 5, 2024) (by Andrei Lopatenko)

An adjacent compendium on LLMs, Search, and Recommender engines

The GitHub repository

Evals are surprisingly often all you need

Table of contents


Reviews and Surveys

  • Order in the Evaluation Court: A Critical Analysis of NLG Evaluation Trends, Jan 2026, arxiv
  • Benchmark^2: Systematic Evaluation of LLM Benchmarks, Jan 2026, arxiv
  • Toward an evaluation science for generative AI systems, Mar 2025, arxiv
  • Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey, UMD, Jan 2025, arxiv
  • AI Benchmarks and Datasets for LLM Evaluation, Dec 2024, arxiv, a survey of many LLM benchmarks
  • LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, Dec 2024, arxiv
  • A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations, EMNLP 2024, ACLAnthology
  • A Survey on Evaluation of Multimodal Large Language Models, Aug 2024, arxiv
  • A Survey of Useful LLM Evaluation, Jun 2024, arxiv
  • Evaluating Large Language Models: A Comprehensive Survey, Oct 2023, arxiv
  • A Survey on Evaluation of Large Language Models, Jul 2023, arxiv
  • Through the Lens of Core Competency: Survey on Evaluation of Large Language Models, Aug 2023, arxiv
  • For industry-specific surveys of evaluation methods for domains such as medical, see the respective sections of this compendium

Leaderboards and Arenas
