trycua / Cua: Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).
AgentOps-AI / Agentops: Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI.
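For orientation, a minimal sketch of how the SDK is typically wired in. This assumes `agentops.init()` reads the API key from the environment and auto-instruments supported LLM clients; the model name is a placeholder:

```python
# Hedged sketch: assumes agentops.init() picks up AGENTOPS_API_KEY from the
# environment and auto-instruments supported clients such as the OpenAI SDK.
import agentops
from openai import OpenAI

agentops.init()  # start a monitored session

client = OpenAI()  # calls through this client are tracked for cost/latency
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```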
microsoft / PhiCookBook: A cookbook for getting started with the Phi family of models. Phi is a family of open-source AI models developed by Microsoft; Phi models are the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and the next size up across a variety of language, reasoning, coding, and math benchmarks.
zzyfight / Genai Compliance Bench: An evaluation benchmark for generative AI in regulated industries.
openai / Mle Bench: MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering.
Barca0412 / Introduction To Quantitative Finance: A curated set of introductory materials: 1. an open-source tutorial on a multi-factor equity quant framework; 2. a collection of classic academic and industry references; 3. related work on AI + finance, including LLMs, agents, benchmarks (evaluation), etc.
Tencent / AICGSecEval: A.S.E (AICGSecEval) is a repository-level security evaluation benchmark for AI-generated code, developed by the Tencent Wukong Code Security Team.
HumanCompatibleAI / Overcooked AI: A benchmark environment for fully cooperative human-AI performance.
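As a sketch of what "fully cooperative" means in practice here, the snippet below rolls out a random joint policy: both agents act simultaneously and share one reward signal. The `from_layout_name`/`from_mdp` constructors and the `cramped_room` layout name are assumptions about the repo's `overcooked_ai_py` package:

```python
# Hedged sketch: constructor names, the "cramped_room" layout, and the
# step() return signature are assumptions about overcooked_ai_py.
import random
from overcooked_ai_py.mdp.overcooked_mdp import OvercookedGridworld
from overcooked_ai_py.mdp.overcooked_env import OvercookedEnv
from overcooked_ai_py.mdp.actions import Action

mdp = OvercookedGridworld.from_layout_name("cramped_room")
env = OvercookedEnv.from_mdp(mdp, horizon=400)

total_reward, done = 0, False
while not done:
    # Both agents act at once; the environment returns a single shared reward.
    joint_action = (random.choice(Action.ALL_ACTIONS),
                    random.choice(Action.ALL_ACTIONS))
    state, reward, done, info = env.step(joint_action)
    total_reward += reward
print("shared episode reward:", total_reward)
```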
microsoft / WindowsAgentArena: Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.
The-FinAI / PIXIU: This repository introduces PIXIU, an open-source resource featuring the first financial large language models (LLMs), instruction-tuning data, and evaluation benchmarks to holistically assess financial LLMs. Our goal is to continually push forward the open-source development of financial artificial intelligence (AI).
pinchbench / Skill: PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai.
giancarloerra / SocratiCode: Enterprise-grade (40M+ lines) codebase intelligence in a zero-setup, private, and local Claude Plugin or MCP: managed indexing, hybrid semantic search, polyglot code dependency graphs, and DB/API/infra knowledge. Benchmark: 61% fewer tokens, 84% fewer calls, 37x faster than standard AI grep.
mlcommons / Ck: Community-driven projects around Collective Knowledge (CK), Collective Mind (CM/CMX), and MLPerf automations, built to facilitate collaborative and reproducible research and to learn how to run AI, ML, and other emerging workloads more efficiently and cost-effectively across diverse models, datasets, software, and hardware using MLPerf methodology and benchmarks.
SanMuzZzZz / LuaN1aoAgent: LuaN1aoAgent is a cognition-driven "AI hacker": a fully autonomous AI penetration-testing agent powered by DeepSeek V3.2. Using dual-graph reasoning, LuaN1ao achieves a success rate of over 90% on the XBOW Benchmark, with a median exploit cost of just $0.09.
onejune2018 / Awesome LLM Eval: Awesome-LLM-Eval is a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of LLMs, aiming to probe the technical boundaries of generative AI.
abacusai / Long Context: This repository contains code and tooling for the Abacus.AI LLM Context Expansion project, including evaluation scripts and benchmark tasks that test a model's information-retrieval capabilities under context expansion, along with key experimental results and instructions for reproducing and building on them.
facebookresearch / MLGym: A new framework and benchmark for advancing AI research agents.
SalesforceAIResearch / MCP Universe: MCP-Universe is a comprehensive framework designed for RL training, benchmarking, and development of AI agents for general tool use.
facebookresearch / Meta Agents Research Environments: A comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike static benchmarks, it introduces evolving environments where agents must adapt their strategies as new information becomes available, mirroring real-world challenges.
camel-ai / Crab: 🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. https://crab.camel-ai.org/