lm-sys / FastChat: An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
mlflow / mlflow: The open-source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
google / adk-python: An open-source, code-first Python toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.
comet-ml / opik: Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
openai / evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
raga-ai-hub / RagaAI-Catalyst: Python SDK for agentic AI observability, monitoring, and evaluation. Features include agent, LLM, and tool tracing; multi-agent system debugging; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.
confident-ai / deepeval: The LLM Evaluation Framework.
trycua / cua: Open-source infrastructure for computer-use agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, and Windows).
vibrantlabsai / ragas: Supercharge your LLM application evaluations 🚀
ShishirPatil / gorilla: Training and evaluating LLMs for function calls (tool calls).
EleutherAI / lm-evaluation-harness: A framework for few-shot evaluation of language models.
dataelement / bisheng: BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Features include GenAI workflows, RAG, agents, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.
tensorzero / tensorzero: TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.
facebookresearch / ParlAI: A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
Theano / Theano: Theano was a Python library for defining, optimizing, and evaluating mathematical expressions involving multi-dimensional arrays efficiently. It is continued as PyTensor: www.github.com/pymc-devs/pytensor
Arize-ai / phoenix: AI observability and evaluation.
oumi-ai / oumi: Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM / VLM!
expr-lang / expr: Expression language and expression evaluation for Go.
evidentlyai / evidently: Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline, from tabular data to Gen AI, with 100+ metrics.
google / adk-go: An open-source, code-first Go toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.