lm-sys / FastChat: An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
mlflow / mlflow: The open-source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
google / adk-python: An open-source, code-first Python toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.
comet-ml / opik: Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
openai / evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
raga-ai-hub / RagaAI-Catalyst: Python SDK for agentic AI observability, monitoring, and evaluation. Features include agent, LLM, and tool tracing; multi-agent system debugging; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.
confident-ai / deepeval: The LLM Evaluation Framework.
trycua / cua: Open-source infrastructure for computer-use agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, and Windows).
vibrantlabsai / ragas: Supercharge your LLM application evaluations 🚀
ShishirPatil / gorilla: Training and evaluating LLMs for function calls (tool calls).
EleutherAI / lm-evaluation-harness: A framework for few-shot evaluation of language models.
dataelement / bisheng: BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Features include GenAI workflows, RAG, agents, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.
tensorzero / tensorzero: TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
facebookresearch / ParlAI: A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
Theano / Theano: Theano was a Python library for defining, optimizing, and evaluating mathematical expressions involving multi-dimensional arrays efficiently. It is continued as PyTensor: www.github.com/pymc-devs/pytensor
Arize-ai / phoenix: AI observability and evaluation.
oumi-ai / oumi: Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM / VLM!
expr-lang / expr: Expression language and expression evaluation for Go.
evidentlyai / evidently: Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline, from tabular data to Gen AI, with 100+ metrics.
google / adk-go: An open-source, code-first Go toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.