1,019 skills found
mlflow / MLflow: The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
comet-ml / Opik: Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
openai / Evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
raga-ai-hub / RagaAI Catalyst: Python SDK for agent AI observability, monitoring, and evaluation. Features include agent, LLM, and tool tracing; debugging for multi-agent systems; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.
confident-ai / DeepEval: The LLM Evaluation Framework
vibrantlabsai / Ragas: Supercharge Your LLM Application Evaluations 🚀
ShishirPatil / Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
dataelement / Bisheng: BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Powerful and comprehensive features include GenAI workflows, RAG, agents, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.
tensorzero / TensorZero: TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
oumi-ai / Oumi: Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
evidentlyai / Evidently: Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline, from tabular data to GenAI. 100+ metrics.
open-compass / OpenCompass: OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Helicone / Helicone: 🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
Giskard-AI / Giskard OSS: 🐢 Open-Source Evaluation & Testing library for LLM Agents
lm-sys / RouteLLM: A framework for serving and evaluating LLM routers that saves LLM costs without compromising quality.
Agenta-AI / Agenta: The open-source LLMOps platform with prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Tencent / AI Infra Guard: A full-stack AI red-teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP Scan, AI Infra Scan, and LLM jailbreak evaluation.
THUDM / AgentBench: A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
truera / TruLens: Evaluation and Tracking for LLM Experiments and AI Agents
langwatch / LangWatch: The platform for LLM evaluations and AI agent testing