PAIR-code / LLM Comparator - LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR team.
relari-ai / Continuous Eval - Data-driven evaluation for LLM-powered applications.
oxbshw / LLM Agents Ecosystem Handbook - One-stop handbook for building, deploying, and understanding LLM agents, with 60+ skeletons, tutorials, ecosystem guides, and evaluation tools.
ethz-spylab / Agentdojo - A dynamic environment to evaluate attacks and defenses for LLM agents.
zeno-ml / Zeno Build - Build, evaluate, understand, and fix LLM-based apps.
ByteDance-Seed / EvaLearn - EvaLearn is a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks.
abacaj / Code Eval - Run evaluations of LLMs on the HumanEval benchmark.
arthur-ai / Bench - A tool for evaluating LLMs.
baaivision / JudgeLM - [ICLR 2025 Spotlight] An open-source LLM judge for evaluating LLM-generated answers.
hkust-nlp / AgentBoard - An analytical evaluation board for multi-turn LLM agents. [NeurIPS 2024 Oral]
IceBearAI / LLM And More - LLM-And-More is a professional, plug-and-play LLM trainer and application builder that guides you through the complete LLM workflow: from data to evaluation, from training to deployment, from idea to service.
allenai / OLMo Eval Legacy - Evaluation suite for LLMs.
JonathanChavezTamales / LLM Leaderboard - A comprehensive set of LLM benchmark scores and provider prices. (Deprecated; see the README for details.)
AILab-CVC / SEED Bench - (CVPR 2024) A benchmark for evaluating multimodal LLMs using multiple-choice questions.
allenai / Olmes - Reproducible, flexible LLM evaluations.
agiresearch / OpenP5 - An open-source platform for developing, training, and evaluating LLM-based recommender systems.
GPT-Fathom / GPT Fathom - GPT-Fathom is an open-source and reproducible LLM evaluation suite, benchmarking 10+ leading open-source and closed-source LLMs as well as OpenAI's earlier models on 20+ curated benchmarks under aligned settings.
palico-ai / Palico AI - Build, improve performance, and productionize your LLM application with an integrated framework.
sci-m-wang / OpenCE - OpenCE (Open Context Engineering): a community toolkit to implement, evaluate, and combine LLM context strategies (RAG, ACE, compression). Evolved from the `ACE-open` reproduction.
thunlp / ChatEval - Code for our paper "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate".