trycua / Cua: Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).
AgentOps-AI / Agentops: Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI.
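For orientation, a minimal sketch of how the SDK is typically wired in. This assumes `agentops.init()` reads the API key from the environment and auto-instruments supported LLM clients; the model name is a placeholder:

```python
# Hedged sketch: assumes agentops.init() picks up AGENTOPS_API_KEY from the
# environment and auto-instruments supported clients such as the OpenAI SDK.
import agentops
from openai import OpenAI

agentops.init()  # start a monitored session

client = OpenAI()  # calls through this client are tracked for cost/latency
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```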
microsoft / PhiCookBook: A cookbook for getting started with the Phi family of models. Phi is a family of open-source AI models developed by Microsoft; Phi models are the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and the next size up across a variety of language, reasoning, coding, and math benchmarks.
zzyfight / Genai Compliance Bench: An evaluation benchmark for generative AI in regulated industries.
openai / Mle Bench: MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering.
Barca0412 / Introduction To Quantitative Finance: A curated set of introductory materials: 1. an open-source tutorial on a multi-factor equity quant framework; 2. a collection of classic academic and industry references; 3. related work on AI + finance, including LLMs, agents, benchmarks (evaluation), etc.
Tencent / AICGSecEval: A.S.E (AICGSecEval) is a repository-level security evaluation benchmark for AI-generated code, developed by the Tencent Wukong Code Security Team.
HumanCompatibleAI / Overcooked AI: A benchmark environment for fully cooperative human-AI performance.
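As a sketch of what "fully cooperative" means in practice here, the snippet below rolls out a random joint policy: both agents act simultaneously and share one reward signal. The `from_layout_name`/`from_mdp` constructors and the `cramped_room` layout name are assumptions about the repo's `overcooked_ai_py` package:

```python
# Hedged sketch: constructor names, the "cramped_room" layout, and the
# step() return signature are assumptions about overcooked_ai_py.
import random
from overcooked_ai_py.mdp.overcooked_mdp import OvercookedGridworld
from overcooked_ai_py.mdp.overcooked_env import OvercookedEnv
from overcooked_ai_py.mdp.actions import Action

mdp = OvercookedGridworld.from_layout_name("cramped_room")
env = OvercookedEnv.from_mdp(mdp, horizon=400)

total_reward, done = 0, False
while not done:
    # Both agents act at once; the environment returns a single shared reward.
    joint_action = (random.choice(Action.ALL_ACTIONS),
                    random.choice(Action.ALL_ACTIONS))
    state, reward, done, info = env.step(joint_action)
    total_reward += reward
print("shared episode reward:", total_reward)
```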
microsoft / WindowsAgentArena: Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.
The-FinAI / PIXIU: This repository introduces PIXIU, an open-source resource featuring the first financial large language models (LLMs), instruction-tuning data, and evaluation benchmarks to holistically assess financial LLMs. Our goal is to continually push forward the open-source development of financial artificial intelligence (AI).
pinchbench / Skill: PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai.
giancarloerra / SocratiCode: Enterprise-grade (40M+ lines) codebase intelligence in a zero-setup, private, and local Claude Plugin or MCP: managed indexing, hybrid semantic search, polyglot code dependency graphs, and DB/API/infra knowledge. Benchmark: 61% fewer tokens, 84% fewer calls, 37x faster than standard AI grep.
mlcommons / Ck: Community-driven projects around Collective Knowledge (CK), Collective Mind (CM/CMX), and MLPerf automations, built to facilitate collaborative and reproducible research and to learn how to run AI, ML, and other emerging workloads more efficiently and cost-effectively across diverse models, datasets, software, and hardware using MLPerf methodology and benchmarks.
SanMuzZzZz / LuaN1aoAgent: LuaN1aoAgent is a cognition-driven "AI hacker": a fully autonomous AI penetration-testing agent powered by DeepSeek V3.2. Using dual-graph reasoning, LuaN1ao achieves a success rate of over 90% on the XBOW Benchmark, with a median exploit cost of just $0.09.
onejune2018 / Awesome LLM Eval: Awesome-LLM-Eval is a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of LLMs, aiming to probe the technical boundaries of generative AI.
abacusai / Long Context: This repository contains code and tooling for the Abacus.AI LLM Context Expansion project, including evaluation scripts and benchmark tasks that test a model's information-retrieval capabilities under context expansion, along with key experimental results and instructions for reproducing and building on them.
facebookresearch / MLGym: A new framework and benchmark for advancing AI research agents.
SalesforceAIResearch / MCP Universe: MCP-Universe is a comprehensive framework designed for RL training, benchmarking, and development of AI agents for general tool use.
facebookresearch / Meta Agents Research Environments: A comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike static benchmarks, it introduces evolving environments where agents must adapt their strategies as new information becomes available, mirroring real-world challenges.
camel-ai / Crab: 🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. https://crab.camel-ai.org/