3,079 skills found · Page 1 of 103
lm-sys / FastChat
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
mlflow / mlflow
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
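For orientation, a minimal sketch of MLflow's core tracking API; the experiment name, parameters, and metric values below are hypothetical, and MLflow also ships higher-level evaluation helpers not shown here:

```python
import mlflow

# Hypothetical experiment name for illustration.
mlflow.set_experiment("llm-eval-demo")

with mlflow.start_run():
    mlflow.log_param("model", "vicuna-7b")    # model under evaluation (example value)
    mlflow.log_param("temperature", 0.2)      # sampling setting (example value)
    mlflow.log_metric("exact_match", 0.71)    # evaluation result (example value)
    mlflow.log_metric("latency_ms", 842)
```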
EleutherAI / lm-evaluation-harness
A framework for few-shot evaluation of language models.
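A hedged sketch of the harness's Python entry point, `lm_eval.simple_evaluate`; the checkpoint and task names are illustrative stand-ins (any Hugging Face model and registered task work):

```python
import lm_eval

# Evaluate a small Hugging Face checkpoint on one task, zero-shot.
results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # example checkpoint
    tasks=["hellaswag"],                             # example task
    num_fewshot=0,
)
print(results["results"]["hellaswag"])  # per-task metric dict
```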
dataelement / bisheng
BISHENG is an open LLM DevOps platform for next-generation enterprise AI applications. Its features include GenAI workflows, RAG, agents, unified model management, evaluation, SFT, dataset management, enterprise-level system management, observability, and more.
facebookresearch / ParlAI
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
open-compass / opencompass
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) over 100+ datasets.
OpenBMB / ToolBench
[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning.
transformerlab / transformerlab-app
The open source research environment for AI researchers to seamlessly train, evaluate, and scale models from local hardware to GPU clusters.
CLUEbenchmark / CLUE
Chinese Language Understanding Evaluation Benchmark (CLUE): datasets, baselines, pre-trained models, corpus, and leaderboard.
open-compass / VLMEvalKit
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
evalstate / fast-agent
Code, build, and evaluate agents, with excellent model and Skills/MCP/ACP support.
openai / human-eval
Code for the paper "Evaluating Large Language Models Trained on Code".
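A minimal sketch of preparing completions for HumanEval scoring, following the repo's README pattern; `generate` here is a hypothetical stub standing in for your model's sampling code:

```python
from human_eval.data import read_problems, write_jsonl

def generate(prompt: str) -> str:
    """Placeholder for a real model completion; returns a trivial body."""
    return "    pass\n"

problems = read_problems()
samples = [
    dict(task_id=task_id, completion=generate(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Score pass@k with the repo's CLI: evaluate_functional_correctness samples.jsonl
```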
FreedomIntelligence / LLMZoo
⚡ LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models. ⚡
microsoft / table-transformer
Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
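A sketch of running the TATR table-detection checkpoint via the Hugging Face transformers port (not the repo's own inference scripts); `page.png` is a hypothetical image of a document page:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

# Hypothetical input: a rendered document page.
image = Image.open("page.png").convert("RGB")

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above a 0.9 confidence threshold.
sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=sizes
)[0]
for score, box in zip(detections["scores"], detections["boxes"]):
    print(f"table at {box.tolist()} (score {score:.2f})")
```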
microsoftarchive / promptbench
A unified evaluation framework for large language models.
stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
modelscope / evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
young-geng / EasyLM
Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, fine-tuning, evaluating, and serving LLMs in JAX/Flax.
huggingface / evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
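A minimal sketch of the library's load-and-compute pattern; the labels below are toy values for illustration:

```python
import evaluate

# Metrics are loaded by name from the Hugging Face Hub.
accuracy = evaluate.load("accuracy")
result = accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
print(result)  # e.g. {'accuracy': 0.75}
```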
beir-cellar / beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
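A sketch adapted from the project's quickstart, assuming the SciFact dataset and one of the dense retriever checkpoints its README uses as an example:

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download one BEIR dataset (SciFact) and load its test split.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Score a dense retriever with exact (brute-force) search.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```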