1,160 skills found · Page 1 of 39
mlflow / Mlflow
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
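For orientation, a minimal sketch of logging a run with MLflow's core tracking API (set_experiment, start_run, log_param, log_metric); the experiment name, parameter, and metric values are illustrative only:

```python
import mlflow

# Defaults to a local ./mlruns store unless a tracking server is configured.
mlflow.set_experiment("demo-eval")  # experiment name is illustrative

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("model", "gpt-4o-mini")    # hypothetical parameter value
    mlflow.log_metric("answer_accuracy", 0.87)  # hypothetical metric value
```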
langfuse / Langfuse
🪢 Open source LLM engineering platform: LLM observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
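A minimal sketch of tracing a function with the Langfuse Python SDK's observe decorator; the import path varies by SDK version, and the function body is a stand-in for a real LLM call:

```python
from langfuse.decorators import observe  # v2-style import; newer SDKs expose `observe` at the top level

@observe()  # records this function's inputs, outputs, and latency as a trace
def answer(question: str) -> str:
    # Credentials are read from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY env vars.
    return f"echo: {question}"  # canned reply keeps the sketch self-contained

answer("What does Langfuse trace?")
```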
comet-ml / Opik
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
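A sketch of the same decorator pattern with Opik's track decorator, assuming credentials are configured via environment; the traced function is a placeholder, not a real LLM call:

```python
from opik import track  # Opik's tracing decorator

@track  # records this call as a trace in the Opik dashboard
def summarize(text: str) -> str:
    # Stand-in for a real LLM call, so the sketch runs offline.
    return text[:40]

summarize("Opik traces LLM app calls end to end.")
```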
openai / Evals
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
promptfoo / Promptfoo
Test your prompts, agents, and RAG pipelines. Red teaming, pentesting, and vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
raga-ai-hub / RagaAI Catalyst
Python SDK for agentic AI observability, monitoring, and evaluation. Features include agent, LLM, and tool tracing; multi-agent system debugging; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.
confident-ai / Deepeval
The LLM Evaluation Framework
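A minimal sketch of a DeepEval test using its test-case and metric classes; the question, answer, and 0.7 threshold are illustrative, and the relevancy metric calls an LLM judge under the hood (an OpenAI key is expected by default):

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One test case: the model's actual output is judged for relevancy to the input.
test_case = LLMTestCase(
    input="What is MLflow?",
    actual_output="MLflow is an open source platform for the ML lifecycle.",
)
assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])  # threshold is an illustrative passing bar
```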
vibrantlabsai / Ragas
Supercharge Your LLM Application Evaluations 🚀
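A sketch of scoring a RAG sample with the classic ragas evaluate API; column names and metric imports have shifted across versions, and the single row of data here is illustrative:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy  # classic v0.1-style imports

# Each row pairs a question with the generated answer and the retrieved contexts.
data = Dataset.from_dict({
    "question": ["What is Ragas?"],
    "answer": ["Ragas is a library for evaluating LLM applications."],
    "contexts": [["Ragas provides metrics for RAG pipeline evaluation."]],
})
print(evaluate(data, metrics=[faithfulness, answer_relevancy]))  # metrics use an LLM judge by default
```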
ShishirPatil / Gorilla
Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
dataelement / Bisheng
BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Powerful and comprehensive features include: GenAI workflows, RAG, agents, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.
tensorzero / Tensorzero
TensorZero is an open-source stack for industrial-grade LLM applications. It unifies an LLM gateway, observability, optimization, evaluation, and experimentation.
Arize-ai / Phoenix
AI Observability & Evaluation
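A minimal sketch of starting the local Phoenix UI, to which traces sent via OpenTelemetry instrumentation then appear:

```python
import phoenix as px

# Launches the local Phoenix app and returns a session handle.
session = px.launch_app()
print(session.url)  # open this in a browser to inspect traces and evals
```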
oumi-ai / Oumi
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
evidentlyai / Evidently
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
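A sketch of a data-drift check with Evidently's report API as it existed in the 0.4.x releases (newer versions reorganized these imports); the two tiny DataFrames are illustrative:

```python
import pandas as pd
from evidently.report import Report                 # 0.4.x-style imports
from evidently.metric_preset import DataDriftPreset

reference = pd.DataFrame({"score": [0.1, 0.2, 0.3, 0.4]})  # baseline window
current = pd.DataFrame({"score": [0.5, 0.6, 0.7, 0.8]})    # production window

# Compares the current window against the reference and flags drifted columns.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")
```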
NVIDIA / Garak
The LLM vulnerability scanner
open-compass / Opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
jeinlee1991 / Chinese Llm Benchmark
ReLE: a capability benchmark for Chinese AI large models (continuously updated). Currently covers 359 models, including commercial models such as ChatGPT, GPT-5.2, o4-mini, Google Gemini-3-Pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, Qwen3-Max, Qwen3.5-Plus, Baichuan, iFlytek Spark, and SenseTime SenseChat, as well as open-source models such as Step3.5-Flash, Kimi-K2.5, ERNIE4.5, MiniMax-M2.5, DeepSeek-V3.2, Qwen3.5, Llama4, Zhipu GLM-5, GLM-4.7, LongCat, Gemma3, and Mistral. Beyond the leaderboard, it also provides a defect library of over 2 million model failure cases for the community to study, analyze, and use to improve large models.
Helicone / Helicone
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
Giskard-AI / Giskard Oss
🐢 Open-Source Evaluation & Testing library for LLM Agents
PacktPublishing / LLM Engineers Handbook
A practical LLM guide: from the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices.