mcpbr
```bash
# One-liner install (installs + runs quick test)
curl -sSL https://raw.githubusercontent.com/greynewell/mcpbr/main/install.sh | bash

# Or install and run manually
pip install mcpbr && mcpbr run -n 1
```
Benchmark your MCP server against real GitHub issues. One command, hard numbers.
<p align="center"> <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-logo.jpg" alt="MCPBR Logo" width="400"> </p>
Model Context Protocol Benchmark Runner
<p align="center"> <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-demo.gif" alt="mcpbr in action" width="700"> </p>

Stop guessing if your MCP server actually helps. Get hard numbers comparing tool-assisted vs. baseline agent performance on real GitHub issues.
⭐ Star the Supermodel Ecosystem
If this is useful, please star our tools; it helps us grow.
What You Get
<p align="center"> <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-eval-results.png" alt="MCPBR Evaluation Results" width="600"> </p>

Real metrics showing whether your MCP server improves agent performance on SWE-bench tasks. No vibes, just data.
Why mcpbr?
MCP servers promise to make LLMs better at coding tasks. But how do you prove it?
mcpbr runs controlled experiments: same model, same tasks, same environment - the only variable is your MCP server. You get:
- Apples-to-apples comparison against a baseline agent
- Real GitHub issues from SWE-bench (not toy examples)
- Reproducible results via Docker containers with pinned dependencies
Blog
- SWE-bench Verified Is Broken: 5 Things I Found in the Source Code
- SWE-bench Tests Run 6x Faster on ARM64 with Native Containers
Research Paper
mcpbr: Benchmarking Model Context Protocol Servers on Software Engineering Tasks
Grey Newell, Georgia Institute of Technology, 2026
We evaluated a code graph analysis MCP server on all 500 tasks from SWE-bench Verified using Claude Sonnet as the base agent. Key findings:
| Metric | Baseline | MCP-Augmented | Change |
|--------|----------|---------------|--------|
| Resolution Rate | 49.8% | 42.4% | -14.9% |
| Tool Calls | — | — | -42.3% |
| Tokens Used | — | — | -14.0% |
| Cost per Task | — | — | -15.2% |
MCP tools alter the agent's exploration strategy, trading general-purpose search for opinionated shortcuts. The effect varies by codebase: the server helped on 1 of 12 repositories and hurt on 10, revealing an efficiency-resolution tradeoff that developers should evaluate before deploying MCP tools in production.
Methodology: Paired comparison experiments with Docker-isolated task environments, pinned dependencies, and identical model configurations. The only variable is the presence of MCP tools.
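Because the design is paired, each headline metric reduces to a relative change between the two arms. A minimal sketch of that computation (the function name is illustrative, not part of mcpbr's API), using the resolution rates from the table above:

```python
# Relative-change computation for a paired comparison.
# The function name and inputs are illustrative, not mcpbr's API.
def relative_change(baseline: float, augmented: float) -> float:
    """Percent change of the MCP-augmented metric vs. the baseline."""
    return (augmented - baseline) / baseline * 100.0

baseline_rate = 0.498  # resolution rate without MCP tools (from the paper)
mcp_rate = 0.424       # resolution rate with the code-graph MCP server

print(f"{relative_change(baseline_rate, mcp_rate):+.1f}%")  # -14.9%
```

Note that the -14.9% figure is relative to the baseline rate (a 7.4-point absolute drop), which is how the table reports change.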
<details>
<summary>Cite this work</summary>

```bibtex
@software{newell2025mcpbr,
  author = {Newell, Grey},
  title = {mcpbr: Benchmarking Model Context Protocol Servers on Software Engineering Tasks},
  year = {2026},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18627369},
  url = {https://doi.org/10.5281/zenodo.18627369}
}
```

</details>
Supported Benchmarks
mcpbr supports 30+ benchmarks across 10 categories through a flexible abstraction layer:
| Category | Benchmarks |
|----------|-----------|
| Software Engineering | SWE-bench (Verified/Lite/Full), APPS, CodeContests, BigCodeBench, LeetCode, CoderEval, Aider Polyglot |
| Code Generation | HumanEval, MBPP |
| Math & Reasoning | GSM8K, MATH, BigBench-Hard |
| Knowledge & QA | TruthfulQA, HellaSwag, ARC, GAIA |
| Tool Use & Agents | MCPToolBench++, ToolBench, AgentBench, WebArena, TerminalBench, InterCode |
| ML Research | MLAgentBench |
| Code Understanding | RepoQA |
| Multimodal | MMMU |
| Long Context | LongBench |
| Safety & Adversarial | Adversarial (HarmBench) |
| Security | CyberGym |
| Custom | User-defined benchmarks via YAML |
Featured Benchmarks
SWE-bench (Default) - Real GitHub issues requiring bug fixes. Three variants: Verified (500 manually validated), Lite (300 curated), and Full (2,294 complete). Pre-built Docker images available.
CyberGym - Security vulnerabilities requiring proof-of-concept (PoC) exploits. Four difficulty levels control how much context the agent receives. Uses AddressSanitizer for crash detection.
MCPToolBench++ - Large-scale MCP tool use evaluation across 45+ categories. Tests tool discovery, selection, invocation, and result interpretation.
GSM8K - Grade-school math word problems testing chain-of-thought reasoning with numeric answer matching.
```bash
# Run SWE-bench Verified (default)
mcpbr run -c config.yaml

# Run any benchmark
mcpbr run -c config.yaml --benchmark humaneval -n 20
mcpbr run -c config.yaml --benchmark gsm8k -n 50
mcpbr run -c config.yaml --benchmark cybergym --level 2

# List all available benchmarks
mcpbr benchmarks
```
See the benchmarks guide for details on each benchmark and how to configure them.
Overview
This harness runs two parallel evaluations for each task:
- MCP Agent: LLM with access to tools from your MCP server
- Baseline Agent: LLM without tools (single-shot generation)
By comparing these, you can measure the effectiveness of your MCP server for different software engineering tasks. See the MCP integration guide for tips on testing your server.
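Because both agents run the same tasks, the comparison reduces to paired per-task outcomes. A minimal sketch, with hypothetical task IDs and a made-up pass/fail result format (not mcpbr's actual schema):

```python
# Hypothetical per-task pass/fail results from the two parallel evaluations.
# Keys and structure are illustrative only.
baseline = {"task-1": True, "task-2": False, "task-3": False}
mcp      = {"task-1": True, "task-2": True,  "task-3": False}

# Tasks the MCP server helped: failed at baseline, passed with tools.
helped = [t for t in baseline if mcp[t] and not baseline[t]]
# Tasks the MCP server hurt: passed at baseline, failed with tools.
hurt = [t for t in baseline if baseline[t] and not mcp[t]]

print(helped, hurt)  # ['task-2'] []
```

Aggregating `helped` and `hurt` across tasks (and across repositories) is what surfaces effects like the per-repository tradeoff reported in the paper.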
Regression Detection
mcpbr includes built-in regression detection to catch performance degradations between MCP server versions.
Key Features
- Automatic Detection: Compare current results against a baseline to identify regressions
- Detailed Reports: See exactly which tasks regressed and which improved
- Threshold-Based Exit Codes: Fail CI/CD pipelines when regression rate exceeds acceptable limits
- Multi-Channel Alerts: Send notifications via Slack, Discord, or email
How It Works
A regression is detected when a task that passed in the baseline now fails in the current run. This helps you catch issues before deploying new versions of your MCP server.
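That rule, plus the threshold-based exit code, can be sketched as follows; the per-task result dicts, task IDs, and function names are illustrative, not mcpbr's actual report schema:

```python
# Sketch of pass/fail regression detection between two runs.
# Data shapes and names are illustrative, not mcpbr's report format.
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Tasks that passed in the baseline run but fail in the current run."""
    return [t for t, ok in baseline.items() if ok and not current.get(t, False)]

def exit_code(baseline: dict[str, bool], current: dict[str, bool],
              threshold: float = 0.05) -> int:
    """Non-zero when the regression rate exceeds the acceptable limit,
    so a CI/CD pipeline step can fail the build."""
    rate = len(find_regressions(baseline, current)) / max(len(baseline), 1)
    return 1 if rate > threshold else 0

baseline = {"django__django-11099": True, "astropy__astropy-12907": True}
current  = {"django__django-11099": True, "astropy__astropy-12907": False}
print(find_regressions(baseline, current))  # ['astropy__astropy-12907']
```

A task that fails in both runs is not a regression, and a task that newly passes counts as an improvement; only baseline-pass-to-current-fail transitions trip the threshold.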
```bash
# First, run a baseline evaluation and save results
```