1,019 skills found
mlflow / MLflow: The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
comet-ml / Opik: Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
openai / Evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
raga-ai-hub / RagaAI Catalyst: Python SDK for agent AI observability, monitoring, and evaluation. Features include agent, LLM, and tool tracing; debugging for multi-agent systems; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.
confident-ai / DeepEval: The LLM Evaluation Framework
vibrantlabsai / Ragas: Supercharge Your LLM Application Evaluations 🚀
ShishirPatil / Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
dataelement / Bisheng: BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Powerful and comprehensive features include GenAI workflows, RAG, agents, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.
tensorzero / TensorZero: TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
oumi-ai / Oumi: Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
evidentlyai / Evidently: Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline, from tabular data to GenAI. 100+ metrics.
open-compass / OpenCompass: OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Helicone / Helicone: 🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
Giskard-AI / Giskard OSS: 🐢 Open-Source Evaluation & Testing library for LLM Agents
lm-sys / RouteLLM: A framework for serving and evaluating LLM routers that saves LLM costs without compromising quality.
Agenta-AI / Agenta: The open-source LLMOps platform with prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Tencent / AI Infra Guard: A full-stack AI red-teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP Scan, AI Infra Scan, and LLM jailbreak evaluation.
THUDM / AgentBench: A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
truera / TruLens: Evaluation and Tracking for LLM Experiments and AI Agents
langwatch / LangWatch: The platform for LLM evaluations and AI agent testing