Mcpbr

Model Context Protocol Benchmark Runner

Install / Use

/learn @supermodeltools/Mcpbr

Supported Platforms: Claude Code, Cursor

README

mcpbr

```bash
# One-liner install (installs + runs quick test)
curl -sSL https://raw.githubusercontent.com/greynewell/mcpbr/main/install.sh | bash

# Or install and run manually
pip install mcpbr && mcpbr run -n 1
```

Benchmark your MCP server against real GitHub issues. One command, hard numbers.


<p align="center"> <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-logo.jpg" alt="MCPBR Logo" width="400"> </p>

Model Context Protocol Benchmark Runner

PyPI version npm version Python 3.11+ CI License: MIT DOI Documentation CodeRabbit Pull Request Reviews


Stop guessing if your MCP server actually helps. Get hard numbers comparing tool-assisted vs. baseline agent performance on real GitHub issues.

<p align="center"> <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-demo.gif" alt="mcpbr in action" width="700"> </p>

⭐ Star the Supermodel Ecosystem

If this is useful, please star our tools — it helps us grow:

mcp  mcpbr  typescript-sdk  arch-docs  dead-code-hunter  Uncompact  narsil-mcp


What You Get

<p align="center"> <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-eval-results.png" alt="MCPBR Evaluation Results" width="600"> </p>

Real metrics showing whether your MCP server improves agent performance on SWE-bench tasks. No vibes, just data.

Why mcpbr?

MCP servers promise to make LLMs better at coding tasks. But how do you prove it?

mcpbr runs controlled experiments: same model, same tasks, same environment; the only variable is your MCP server. You get:

  • Apples-to-apples comparison against a baseline agent
  • Real GitHub issues from SWE-bench (not toy examples)
  • Reproducible results via Docker containers with pinned dependencies


Research Paper

mcpbr: Benchmarking Model Context Protocol Servers on Software Engineering Tasks
Grey Newell, Georgia Institute of Technology, 2026

We evaluated a code graph analysis MCP server on all 500 tasks from SWE-bench Verified using Claude Sonnet as the base agent. Key findings:

| Metric | Baseline | MCP-Augmented | Change |
|--------|----------|---------------|--------|
| Resolution Rate | 49.8% | 42.4% | -14.9% |
| Tool Calls | — | — | -42.3% |
| Tokens Used | — | — | -14.0% |
| Cost per Task | — | — | -15.2% |

MCP tools alter the agent's exploration strategy, trading general-purpose search for opinionated shortcuts. The effect varies by codebase: the server helped on 1 of 12 repositories and hurt on 10, revealing an efficiency-resolution tradeoff that developers should evaluate before deploying MCP tools in production.

Methodology: Paired comparison experiments with Docker-isolated task environments, pinned dependencies, and identical model configurations. The only variable is the presence of MCP tools.
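The headline changes in the results table follow directly from the paired rates. A minimal sketch of the relative-change computation (the helper name is ours; values are taken from the table above):

```python
def relative_change(baseline: float, current: float) -> float:
    """Percent change of `current` relative to `baseline`."""
    return (current - baseline) / baseline * 100

# Resolution rates from the paired SWE-bench Verified runs.
print(round(relative_change(49.8, 42.4), 1))  # -14.9
```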

<details> <summary>Cite this work</summary>
```bibtex
@software{newell2026mcpbr,
  author    = {Newell, Grey},
  title     = {mcpbr: Benchmarking Model Context Protocol Servers on Software Engineering Tasks},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18627369},
  url       = {https://doi.org/10.5281/zenodo.18627369}
}
```
</details>

Supported Benchmarks

mcpbr supports 30+ benchmarks across 12 categories through a flexible abstraction layer:

| Category | Benchmarks |
|----------|-----------|
| Software Engineering | SWE-bench (Verified/Lite/Full), APPS, CodeContests, BigCodeBench, LeetCode, CoderEval, Aider Polyglot |
| Code Generation | HumanEval, MBPP |
| Math & Reasoning | GSM8K, MATH, BigBench-Hard |
| Knowledge & QA | TruthfulQA, HellaSwag, ARC, GAIA |
| Tool Use & Agents | MCPToolBench++, ToolBench, AgentBench, WebArena, TerminalBench, InterCode |
| ML Research | MLAgentBench |
| Code Understanding | RepoQA |
| Multimodal | MMMU |
| Long Context | LongBench |
| Safety & Adversarial | Adversarial (HarmBench) |
| Security | CyberGym |
| Custom | User-defined benchmarks via YAML |

Featured Benchmarks

SWE-bench (Default) - Real GitHub issues requiring bug fixes. Three variants: Verified (500 manually validated), Lite (300 curated), and Full (2,294 complete). Pre-built Docker images available.

CyberGym - Security vulnerabilities requiring PoC exploits. Four difficulty levels that control how much context the agent receives. Uses AddressSanitizer for crash detection.

MCPToolBench++ - Large-scale MCP tool use evaluation across 45+ categories. Tests tool discovery, selection, invocation, and result interpretation.

GSM8K - Grade-school math word problems testing chain-of-thought reasoning with numeric answer matching.

```bash
# Run SWE-bench Verified (default)
mcpbr run -c config.yaml

# Run any benchmark
mcpbr run -c config.yaml --benchmark humaneval -n 20
mcpbr run -c config.yaml --benchmark gsm8k -n 50
mcpbr run -c config.yaml --benchmark cybergym --level 2

# List all available benchmarks
mcpbr benchmarks
```

See the benchmarks guide for details on each benchmark and how to configure them.

Overview

This harness runs two parallel evaluations for each task:

  1. MCP Agent: LLM with access to tools from your MCP server
  2. Baseline Agent: LLM without tools (single-shot generation)

By comparing these, you can measure the effectiveness of your MCP server for different software engineering tasks. See the MCP integration guide for tips on testing your server.
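Conceptually, the comparison reduces to pairing both arms' outcomes per task and tallying where each arm succeeds. A sketch using a hypothetical `TaskResult` type (not mcpbr's actual data model):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    resolved: bool

def compare(mcp_results: list[TaskResult],
            baseline_results: list[TaskResult]) -> dict[str, int]:
    """Pair results by task_id and tally agreement between the two arms."""
    base = {r.task_id: r.resolved for r in baseline_results}
    tally = {"both": 0, "mcp_only": 0, "baseline_only": 0, "neither": 0}
    for r in mcp_results:
        b = base[r.task_id]
        if r.resolved and b:
            tally["both"] += 1
        elif r.resolved:
            tally["mcp_only"] += 1
        elif b:
            tally["baseline_only"] += 1
        else:
            tally["neither"] += 1
    return tally
```

The off-diagonal counts (`mcp_only` vs. `baseline_only`) are what a paired analysis of the two arms would examine.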

Regression Detection

mcpbr includes built-in regression detection to catch performance degradations between MCP server versions:

Key Features

  • Automatic Detection: Compare current results against a baseline to identify regressions
  • Detailed Reports: See exactly which tasks regressed and which improved
  • Threshold-Based Exit Codes: Fail CI/CD pipelines when regression rate exceeds acceptable limits
  • Multi-Channel Alerts: Send notifications via Slack, Discord, or email

How It Works

A regression is detected when a task that passed in the baseline now fails in the current run. This helps you catch issues before deploying new versions of your MCP server.
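That rule can be stated in a few lines. A sketch assuming pass/fail results keyed by task id (our illustration, not mcpbr's internal format):

```python
def find_regressions(baseline: dict[str, bool],
                     current: dict[str, bool]) -> list[str]:
    """Tasks that passed in the baseline run but fail in the current run."""
    return [task for task, passed in baseline.items()
            if passed and not current.get(task, False)]

def regression_rate(baseline: dict[str, bool],
                    current: dict[str, bool]) -> float:
    """Fraction of previously-passing tasks that regressed."""
    passing = [t for t, ok in baseline.items() if ok]
    if not passing:
        return 0.0
    return len(find_regressions(baseline, current)) / len(passing)
```

A CI step could compare this rate against a configured threshold and exit nonzero when it is exceeded, in the spirit of the threshold-based exit codes listed above.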

```bash
# First, run a baseline evaluation and save results
```

Languages: Python

Security Score: 90/100 (audited on Mar 18, 2026; no findings)
No findings