LiveMCPBench
LiveMCPBench is a benchmark for evaluating the ability of agents to navigate and utilize a large-scale MCP toolset. It provides a comprehensive set of tasks that challenge agents to effectively use various tools in daily scenarios.
<a id="readme-top"></a>
<!-- PROJECT --> <br /> <div align="center"> <h3 align="center">LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?</h3> <p align="center"> Benchmarking the agent in real-world tasks within a large-scale MCP toolset. </p> </div> <p align="center"> <a href="https://www.python.org/downloads/release/python-31113/"><img src="https://img.shields.io/badge/python-3.11-blue.svg" alt="Python 3.11"></a> <a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/badge/code%20style-ruff-000000.svg" alt="Code style: ruff"></a> </p> <p align="center"> 🌐 <a href="https://icip-cas.github.io/LiveMCPBench" target="_blank">Website</a> | 📄 <a href="https://arxiv.org/abs/2508.01780" target="_blank">Paper</a> | 🤗 <a href="https://huggingface.co/datasets/ICIP/LiveMCPBench" target="_blank">Dataset</a> | 🐳 <a href="https://hub.docker.com/r/hysdhlx/livemcpbench" target="_blank">Docker</a> | 🏆 <a href="https://docs.google.com/spreadsheets/d/1EXpgXq1VKw5A7l7-N2E9xt3w0eLJ2YPVPT-VrRxKZBw/edit?usp=sharing" target="_blank">Leaderboard</a> | 🙏 <a href="#citation" target="_blank">Citation</a> </p>
News
- [8/18/2025] We released Docker images and added evaluation results to the leaderboard for three new models: GLM 4.5, GPT-5-Mini, and Kimi-K2.
- [8/3/2025] We released LiveMCPBench.
Getting Started
Prerequisites
We recommend using our Docker image. If you prefer to run the code locally, you will need to install the following tools:
- npm
- uv
Installation
- Pull the Docker image

  ```bash
  docker pull hysdhlx/livemcpbench:latest
  ```
- Clone the repo and run the Docker image

  ```bash
  git clone https://github.com/icip-cas/LiveMCPBench.git
  cd LiveMCPBench
  docker run -itd \
    -v "$(pwd):/outside" \
    --gpus all \
    --ipc=host \
    --net=host \
    --name LiveMCPBench_container \
    hysdhlx/livemcpbench:latest \
    bash
  ```
- Prepare the .env file

  ```bash
  cp .env_template .env
  ```

  You can modify the .env file to set your own environment variables:

  ```bash
  # MCP Copilot Agent Configuration
  BASE_URL=
  OPENAI_API_KEY=
  MODEL=

  # Tool Retrieval Configuration
  EMBEDDING_MODEL=
  EMBEDDING_BASE_URL=
  EMBEDDING_API_KEY=
  EMBEDDING_DIMENSIONS=1024
  TOP_SERVERS=5
  TOP_TOOLS=3

  # Abstract API Configuration (optional)
  ABSTRACT_MODEL=
  ABSTRACT_API_KEY=
  ABSTRACT_BASE_URL=

  # Proxy Configuration (optional)
  http_proxy=
  https_proxy=
  no_proxy=127.0.0.1,localhost
  HTTP_PROXY=
  HTTPS_PROXY=
  NO_PROXY=127.0.0.1,localhost

  # lark report (optional)
  LARK_WEBHOOK_URL=
  ```
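The agent reads these settings from the environment at startup. How the repo actually loads the file is not shown here (it may use a library such as python-dotenv); as a rough, stdlib-only sketch of what parsing a simple KEY=VALUE .env file involves:

```python
import os


def load_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env-style file.

    Comments (#) and blank lines are skipped. This is a minimal sketch:
    it does not handle quoting, multi-line values, or 'export' prefixes.
    """
    values: dict[str, str] = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def apply_env(values: dict[str, str]) -> None:
    """Export parsed values without overriding variables already set."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

Keeping `setdefault` here means variables exported in the shell take precedence over the file, which is the usual convention for .env handling.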
- Enter the container & reset the environment

  Since the code repo is mounted at /outside, you can access it inside the container at /outside/.

  ```bash
  docker exec -it LiveMCPBench_container bash
  ```

  Because the agent may change the environment, we recommend resetting it before each run:

  ```bash
  cd /LiveMCPBench/
  bash scripts/env_reset.sh
  ```

  This copies the repo code from /outside to /LiveMCPBench and links the annotated_data to /root/.
- Check the MCP tools

  ```bash
  bash ./tools/scripts/tool_check.sh
  ```

  After running this command, check ./tools/test/tools.json to see the tools. You can run this script multiple times if you find that some tools are not working.
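If you want to sanity-check the generated file programmatically rather than by eye, a small sketch like the following can summarize it. The exact schema of tools.json is an assumption here (a mapping of server names to their tools, or a flat list); adapt the field access to the real layout.

```python
import json


def count_entries(path: str):
    """Load a JSON file and report its top-level entry counts.

    Handles either a mapping (e.g. server -> tools) or a flat list;
    the actual schema of tools.json is an assumption in this sketch.
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if isinstance(data, dict):
        return {
            name: len(value) if isinstance(value, (list, dict)) else 1
            for name, value in data.items()
        }
    return len(data)
```

A server whose count drops to zero between runs is a quick signal that one of its tools stopped responding.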
- Index the servers

  The MCP Copilot Agent requires the servers to be indexed before running. You can run the following command to warm up the agent:

  ```bash
  uv run -m baseline.mcp_copilot.arg_generation
  ```
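The index built here backs the tool-retrieval step configured in .env (EMBEDDING_*, TOP_SERVERS, TOP_TOOLS): at run time the agent narrows the large toolset down to a few candidate servers and tools. As a simplified illustration of such top-k retrieval over precomputed embeddings (cosine similarity; this is not the repo's actual implementation):

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Return the k entry names whose embeddings are closest to the query.

    Applied twice in spirit: once over servers (TOP_SERVERS), then over
    the surviving servers' tools (TOP_TOOLS).
    """
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]
```

The two-stage narrowing (servers first, then tools) keeps the candidate set small enough to fit in the agent's context even when the full toolset is very large.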
Quick Start
MCP Copilot Agent
Example Run
```bash
bash ./baseline/scripts/run_example.sh
```

This will run the agent on a simple example and save the results in ./baseline/output/.
Full Run
By default, the data that the agent will access is stored under /root. If you run locally, make sure the files are in the correct paths.
- Run the MCP Copilot Agent

  Make sure you have set the environment variables in the .env file.

  ```bash
  bash ./baseline/scripts/run_baselines.sh
  ```

- Check the results

  After running the agent, you can check the trajectories in ./baseline/output.
Evaluation using LiveMCPEval
- Set MODEL in .env to choose the evaluation model.
- Run the evaluation script

  ```bash
  bash ./evaluator/scripts/run_baseline.sh
  ```

- Check the results

  After running the evaluation, you can check the results in ./evaluator/output.

- Calculate the success rate

  ```bash
  uv run ./evaluator/stat_success_rate.py --result_path /path/to/evaluation/
  ```
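The success-rate script aggregates the per-task judgments produced by the evaluator. As a sketch of what that aggregation amounts to (the result schema, including the boolean 'success' field, is an assumption; the repo's script may differ):

```python
def success_rate(results: list[dict]) -> float:
    """Fraction of tasks judged successful.

    Each result is assumed to carry a boolean 'success' field; the real
    evaluator output format may differ.
    """
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("success")) / len(results)
```

Guarding the empty case explicitly avoids a division-by-zero error when an output directory contains no evaluated tasks yet.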
Project Structure
```
LiveMCPBench/
├── annotated_data/      # Tasks and task files
├── baseline/            # MCP Copilot Agent
│   ├── scripts/         # Scripts for running the agent
│   ├── output/          # Output for the agent
│   └── mcp_copilot/     # Source code for the agent
├── evaluator/           # LiveMCPEval
│   ├── scripts/         # Scripts for evaluation
│   └── output/          # Output for evaluation
├── tools/               # LiveMCPTool
│   ├── LiveMCPTool/     # Tool data
│   └── scripts/         # Scripts for the tools
├── scripts/             # Path preparation scripts
├── utils/               # Utility functions
└── .env_template        # Template for environment variables
```
Citation
If you find this project helpful, please use the following to cite it:
```bibtex
@misc{mo2025livemcpbenchagentsnavigateocean,
  title={LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?},
  author={Guozhao Mo and Wenliang Zhong and Jiawei Chen and Xuanang Chen and Yaojie Lu and Hongyu Lin and Ben He and Xianpei Han and Le Sun},
  year={2025},
  eprint={2508.01780},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.01780},
}
```
