3,079 skills found · Page 1 of 103
lm-sys / FastChat
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
mlflow / mlflow
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
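For orientation, a minimal sketch of MLflow's core tracking API; the experiment name, parameters, and metric values below are hypothetical, and MLflow also ships higher-level evaluation helpers not shown here:

```python
import mlflow

# Hypothetical experiment name for illustration.
mlflow.set_experiment("llm-eval-demo")

with mlflow.start_run():
    mlflow.log_param("model", "vicuna-7b")    # model under evaluation (example value)
    mlflow.log_param("temperature", 0.2)      # sampling setting (example value)
    mlflow.log_metric("exact_match", 0.71)    # evaluation result (example value)
    mlflow.log_metric("latency_ms", 842)
```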
EleutherAI / lm-evaluation-harness
A framework for few-shot evaluation of language models.
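A hedged sketch of the harness's Python entry point, `lm_eval.simple_evaluate`; the checkpoint and task names are illustrative stand-ins (any Hugging Face model and registered task work):

```python
import lm_eval

# Evaluate a small Hugging Face checkpoint on one task, zero-shot.
results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # example checkpoint
    tasks=["hellaswag"],                             # example task
    num_fewshot=0,
)
print(results["results"]["hellaswag"])  # per-task metric dict
```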
dataelement / bisheng
BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Its features include GenAI workflows, RAG, agents, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.
facebookresearch / ParlAI
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
open-compass / opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) over 100+ datasets.
OpenBMB / ToolBench
[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning.
transformerlab / transformerlab-app
The open source research environment for AI researchers to seamlessly train, evaluate, and scale models from local hardware to GPU clusters.
CLUEbenchmark / CLUE
Chinese Language Understanding Evaluation Benchmark (CLUE): datasets, baselines, pre-trained models, corpus, and leaderboard.
open-compass / VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
evalstate / fast-agent
Code, build, and evaluate agents, with excellent model and Skills/MCP/ACP support.
openai / human-eval
Code for the paper "Evaluating Large Language Models Trained on Code".
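A minimal sketch of preparing completions for HumanEval scoring, following the repo's README pattern; `generate` here is a hypothetical stub standing in for your model's sampling code:

```python
from human_eval.data import read_problems, write_jsonl

def generate(prompt: str) -> str:
    """Placeholder for a real model completion; returns a trivial body."""
    return "    pass\n"

problems = read_problems()
samples = [
    dict(task_id=task_id, completion=generate(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Score pass@k with the repo's CLI: evaluate_functional_correctness samples.jsonl
```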
FreedomIntelligence / LLMZoo
⚡ LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models. ⚡
microsoft / table-transformer
Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
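A sketch of running the TATR table-detection checkpoint via the Hugging Face transformers port (not the repo's own inference scripts); `page.png` is a hypothetical image of a document page:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

# Hypothetical input: a rendered document page.
image = Image.open("page.png").convert("RGB")

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above a 0.9 confidence threshold.
sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=sizes
)[0]
for score, box in zip(detections["scores"], detections["boxes"]):
    print(f"table at {box.tolist()} (score {score:.2f})")
```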
microsoftarchive / promptbench
A unified evaluation framework for large language models.
stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
modelscope / evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
young-geng / EasyLM
Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, fine-tuning, evaluating, and serving LLMs in JAX/Flax.
huggingface / evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
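A minimal sketch of the library's load-and-compute pattern; the labels below are toy values for illustration:

```python
import evaluate

# Metrics are loaded by name from the Hugging Face Hub.
accuracy = evaluate.load("accuracy")
result = accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
print(result)  # e.g. {'accuracy': 0.75}
```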
beir-cellar / beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
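A sketch adapted from the project's quickstart, assuming the SciFact dataset and one of the dense retriever checkpoints its README uses as an example:

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download one BEIR dataset (SciFact) and load its test split.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Score a dense retriever with exact (brute-force) search.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```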