PAIR-code / LLM Comparator - LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR team.
relari-ai / Continuous Eval - Data-driven evaluation for LLM-powered applications.
oxbshw / LLM Agents Ecosystem Handbook - One-stop handbook for building, deploying, and understanding LLM agents, with 60+ skeletons, tutorials, ecosystem guides, and evaluation tools.
ethz-spylab / Agentdojo - A dynamic environment to evaluate attacks and defenses for LLM agents.
zeno-ml / Zeno Build - Build, evaluate, understand, and fix LLM-based apps.
ByteDance-Seed / EvaLearn - EvaLearn is a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks.
abacaj / Code Eval - Run evaluations of LLMs on the HumanEval benchmark.
arthur-ai / Bench - A tool for evaluating LLMs.
baaivision / JudgeLM - [ICLR 2025 Spotlight] An open-source LLM judge for evaluating LLM-generated answers.
hkust-nlp / AgentBoard - An analytical evaluation board for multi-turn LLM agents. [NeurIPS 2024 Oral]
IceBearAI / LLM And More - LLM-And-More is a professional, plug-and-play LLM trainer and application builder that guides you through the complete LLM workflow: from data to evaluation, from training to deployment, from idea to service.
allenai / OLMo Eval Legacy - Evaluation suite for LLMs.
JonathanChavezTamales / LLM Leaderboard - A comprehensive set of LLM benchmark scores and provider prices. (Deprecated; see the README for details.)
AILab-CVC / SEED Bench - (CVPR 2024) A benchmark for evaluating multimodal LLMs using multiple-choice questions.
allenai / Olmes - Reproducible, flexible LLM evaluations.
agiresearch / OpenP5 - An open-source platform for developing, training, and evaluating LLM-based recommender systems.
GPT-Fathom / GPT Fathom - GPT-Fathom is an open-source and reproducible LLM evaluation suite, benchmarking 10+ leading open-source and closed-source LLMs as well as OpenAI's earlier models on 20+ curated benchmarks under aligned settings.
palico-ai / Palico AI - Build, improve performance, and productionize your LLM application with an integrated framework.
sci-m-wang / OpenCE - OpenCE (Open Context Engineering): a community toolkit to implement, evaluate, and combine LLM context strategies (RAG, ACE, compression). Evolved from the `ACE-open` reproduction.
thunlp / ChatEval - Code for our paper "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate".