mcpbr
```bash
# One-liner install (installs + runs quick test)
curl -sSL https://raw.githubusercontent.com/greynewell/mcpbr/main/install.sh | bash

# Or install and run manually
pip install mcpbr && mcpbr run -n 1
```
Benchmark your MCP server against real GitHub issues. One command, hard numbers.
<p align="center"> <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-logo.jpg" alt="MCPBR Logo" width="400"> </p>
Model Context Protocol Benchmark Runner
<p align="center"> <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-demo.gif" alt="mcpbr in action" width="700"> </p>

Stop guessing if your MCP server actually helps. Get hard numbers comparing tool-assisted vs. baseline agent performance on real GitHub issues.
⭐ Star the Supermodel Ecosystem
If this is useful, please star our tools; it helps us grow.
What You Get
<p align="center"> <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-eval-results.png" alt="MCPBR Evaluation Results" width="600"> </p>

Real metrics showing whether your MCP server improves agent performance on SWE-bench tasks. No vibes, just data.
Why mcpbr?
MCP servers promise to make LLMs better at coding tasks. But how do you prove it?
mcpbr runs controlled experiments: same model, same tasks, same environment - the only variable is your MCP server. You get:
- Apples-to-apples comparison against a baseline agent
- Real GitHub issues from SWE-bench (not toy examples)
- Reproducible results via Docker containers with pinned dependencies
Blog
- SWE-bench Verified Is Broken: 5 Things I Found in the Source Code
- SWE-bench Tests Run 6x Faster on ARM64 with Native Containers
Research Paper
mcpbr: Benchmarking Model Context Protocol Servers on Software Engineering Tasks
Grey Newell, Georgia Institute of Technology, 2026
We evaluated a code graph analysis MCP server on all 500 tasks from SWE-bench Verified using Claude Sonnet as the base agent. Key findings:
| Metric | Baseline | MCP-Augmented | Change |
|--------|----------|---------------|--------|
| Resolution Rate | 49.8% | 42.4% | -14.9% |
| Tool Calls | — | — | -42.3% |
| Tokens Used | — | — | -14.0% |
| Cost per Task | — | — | -15.2% |
MCP tools alter the agent's exploration strategy, trading general-purpose search for opinionated shortcuts. The effect varies by codebase: the server helped on 1 of 12 repositories and hurt on 10, revealing an efficiency-resolution tradeoff that developers should evaluate before deploying MCP tools in production.
Methodology: Paired comparison experiments with Docker-isolated task environments, pinned dependencies, and identical model configurations. The only variable is the presence of MCP tools.
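Because the design is paired, each headline metric reduces to a relative change between the two arms. A minimal sketch of that computation (the function name is illustrative, not part of mcpbr's API), using the resolution rates from the table above:

```python
# Relative-change computation for a paired comparison.
# The function name and inputs are illustrative, not mcpbr's API.
def relative_change(baseline: float, augmented: float) -> float:
    """Percent change of the MCP-augmented metric vs. the baseline."""
    return (augmented - baseline) / baseline * 100.0

baseline_rate = 0.498  # resolution rate without MCP tools (from the paper)
mcp_rate = 0.424       # resolution rate with the code-graph MCP server

print(f"{relative_change(baseline_rate, mcp_rate):+.1f}%")  # -14.9%
```

Note that the -14.9% figure is relative to the baseline rate (a 7.4-point absolute drop), which is how the table reports change.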
<details>
<summary>Cite this work</summary>

```bibtex
@software{newell2025mcpbr,
  author = {Newell, Grey},
  title = {mcpbr: Benchmarking Model Context Protocol Servers on Software Engineering Tasks},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18627369},
  url = {https://doi.org/10.5281/zenodo.18627369}
}
```

</details>
Supported Benchmarks
mcpbr supports 30+ benchmarks across 10 categories through a flexible abstraction layer:
| Category | Benchmarks |
|----------|-----------|
| Software Engineering | SWE-bench (Verified/Lite/Full), APPS, CodeContests, BigCodeBench, LeetCode, CoderEval, Aider Polyglot |
| Code Generation | HumanEval, MBPP |
| Math & Reasoning | GSM8K, MATH, BigBench-Hard |
| Knowledge & QA | TruthfulQA, HellaSwag, ARC, GAIA |
| Tool Use & Agents | MCPToolBench++, ToolBench, AgentBench, WebArena, TerminalBench, InterCode |
| ML Research | MLAgentBench |
| Code Understanding | RepoQA |
| Multimodal | MMMU |
| Long Context | LongBench |
| Safety & Adversarial | Adversarial (HarmBench) |
| Security | CyberGym |
| Custom | User-defined benchmarks via YAML |
Featured Benchmarks
SWE-bench (Default) - Real GitHub issues requiring bug fixes. Three variants: Verified (500 manually validated), Lite (300 curated), and Full (2,294 complete). Pre-built Docker images available.
CyberGym - Security vulnerabilities requiring proof-of-concept (PoC) exploits. Four difficulty levels control how much context the agent receives. Uses AddressSanitizer for crash detection.
MCPToolBench++ - Large-scale MCP tool use evaluation across 45+ categories. Tests tool discovery, selection, invocation, and result interpretation.
GSM8K - Grade-school math word problems testing chain-of-thought reasoning with numeric answer matching.
```bash
# Run SWE-bench Verified (default)
mcpbr run -c config.yaml

# Run any benchmark
mcpbr run -c config.yaml --benchmark humaneval -n 20
mcpbr run -c config.yaml --benchmark gsm8k -n 50
mcpbr run -c config.yaml --benchmark cybergym --level 2

# List all available benchmarks
mcpbr benchmarks
```
See the benchmarks guide for details on each benchmark and how to configure them.
Overview
This harness runs two parallel evaluations for each task:
- MCP Agent: LLM with access to tools from your MCP server
- Baseline Agent: LLM without tools (single-shot generation)
By comparing these, you can measure the effectiveness of your MCP server for different software engineering tasks. See the MCP integration guide for tips on testing your server.
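Because both agents run the same tasks, the comparison reduces to paired per-task outcomes. A minimal sketch, with hypothetical task IDs and a made-up pass/fail result format (not mcpbr's actual schema):

```python
# Hypothetical per-task pass/fail results from the two parallel evaluations.
# Keys and structure are illustrative only.
baseline = {"task-1": True, "task-2": False, "task-3": False}
mcp      = {"task-1": True, "task-2": True,  "task-3": False}

# Tasks the MCP server helped: failed at baseline, passed with tools.
helped = [t for t in baseline if mcp[t] and not baseline[t]]
# Tasks the MCP server hurt: passed at baseline, failed with tools.
hurt = [t for t in baseline if baseline[t] and not mcp[t]]

print(helped, hurt)  # ['task-2'] []
```

Aggregating `helped` and `hurt` across tasks (and across repositories) is what surfaces effects like the per-repository tradeoff reported in the paper.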
Regression Detection
mcpbr includes built-in regression detection to catch performance degradations between MCP server versions.
Key Features
- Automatic Detection: Compare current results against a baseline to identify regressions
- Detailed Reports: See exactly which tasks regressed and which improved
- Threshold-Based Exit Codes: Fail CI/CD pipelines when regression rate exceeds acceptable limits
- Multi-Channel Alerts: Send notifications via Slack, Discord, or email
How It Works
A regression is detected when a task that passed in the baseline now fails in the current run. This helps you catch issues before deploying new versions of your MCP server.
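That rule, plus the threshold-based exit code, can be sketched as follows; the per-task result dicts, task IDs, and function names are illustrative, not mcpbr's actual report schema:

```python
# Sketch of pass/fail regression detection between two runs.
# Data shapes and names are illustrative, not mcpbr's report format.
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Tasks that passed in the baseline run but fail in the current run."""
    return [t for t, ok in baseline.items() if ok and not current.get(t, False)]

def exit_code(baseline: dict[str, bool], current: dict[str, bool],
              threshold: float = 0.05) -> int:
    """Non-zero when the regression rate exceeds the acceptable limit,
    so a CI/CD pipeline step can fail the build."""
    rate = len(find_regressions(baseline, current)) / max(len(baseline), 1)
    return 1 if rate > threshold else 0

baseline = {"django__django-11099": True, "astropy__astropy-12907": True}
current  = {"django__django-11099": True, "astropy__astropy-12907": False}
print(find_regressions(baseline, current))  # ['astropy__astropy-12907']
```

A task that fails in both runs is not a regression, and a task that newly passes counts as an improvement; only baseline-pass-to-current-fail transitions trip the threshold.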
```bash
# First, run a baseline evaluation and save results
```