# OpenClawProBench
OpenClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
Transparent live-first benchmark harness for evaluating model capability inside the OpenClaw runtime: 102 active scenarios, 162 catalog scenarios, deterministic grading, and OpenClaw-native coverage.
OpenClawProBench focuses on real OpenClaw execution with deterministic grading, structured reports, and benchmark-profile selection. The default ranking path is the core profile; broader active coverage remains available through intelligence, coverage, native, and full.
The current worktree inventory reports 102 active scenarios and 162 total catalog scenarios (60 incubating) via `python3 run.py inventory --json` and `python3 run.py inventory --benchmark-status all --json`.
## Leaderboard
Browse the public leaderboard and benchmark cases at suyoumo.github.io/bench.
## 📢 Updates

- **v1.0.1** - Added `qwen3-coder-next`, `doubao-seed-code`, `qwen3-max-2026-01-23`, and a `qwen3.6plus` rerun with `bailiancodingplan`; added model image download and benchmark sharing to Twitter; fixed completed-report resume overwrite, `tool_use_14` graceful fallback on skills inventory load failure, `tool_use_17` invalid-JSON and missing-file tolerance, and `audit_scenario_quality.py` compatibility.
- **v1.0.0** - OpenClawProBench released with 102 tasks across 6 domains, with 3-try runs, checkpoint resume, and cross-environment resume support.
## Evaluation Logic

- Default ranking path: `core`
- Extended active capability suite: `intelligence`
- Native-only slice: `native`
- Multi-trial runs are supported via `--trials N`
- Reports expose `avg_score`, `max_score`, coverage-aware summaries, cost, latency, and resume metadata
- Interrupted runs can continue with `--continue` or `--resume-from`, and execution failures can be re-queued with `--rerun-execution-failures`
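To make the multi-trial report fields concrete, here is a minimal sketch of how `avg_score` and `max_score` could be aggregated across trials. The per-trial record shape (a list of dicts with a `"score"` key) is a hypothetical assumption for illustration, not the harness's actual report schema.

```python
# Hypothetical sketch: aggregate per-trial scores into avg_score / max_score.
# The {"score": float} record shape is an assumption, not run.py's real format.
def summarize_trials(trials):
    """Return avg_score and max_score across a scenario's trial scores."""
    scores = [t["score"] for t in trials]
    return {
        "avg_score": sum(scores) / len(scores),
        "max_score": max(scores),
    }

summary = summarize_trials([{"score": 0.5}, {"score": 1.0}, {"score": 0.75}])
print(summary)  # {'avg_score': 0.75, 'max_score': 1.0}
```

With `--trials 3`, this kind of aggregation is what separates a model's best-case run (`max_score`) from its reliability across repeats (`avg_score`).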
## Quick Start

We recommend using uv for fast, reliable Python environment setup:

```bash
pip install uv
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt
```
Before running the benchmark, make sure your local OpenClaw runtime is available:

```bash
openclaw --help
openclaw agents list --json
```
Inspect the benchmark catalog and validate the scenario set:

```bash
python3 run.py inventory
python3 run.py inventory --json
python3 run.py dry
```
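The `--json` output is meant for programmatic consumption. As an illustrative sketch only, counting scenarios by status could look like the following; the JSON shape (`"scenarios"` as a list of `{"id", "status"}` records) is an assumption, not the real output schema.

```python
# Hypothetical sketch of consuming `run.py inventory --json` output.
# The {"scenarios": [{"id", "status"}, ...]} shape is an assumption.
import json

def count_by_status(inventory_json):
    """Tally scenarios by their status field (e.g. active vs incubating)."""
    counts = {}
    for scenario in json.loads(inventory_json)["scenarios"]:
        counts[scenario["status"]] = counts.get(scenario["status"], 0) + 1
    return counts

sample = json.dumps({"scenarios": [
    {"id": "tool_use_14", "status": "active"},
    {"id": "tool_use_17", "status": "active"},
    {"id": "web_03", "status": "incubating"},
]})
print(count_by_status(sample))  # {'active': 2, 'incubating': 1}
```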
Run a one-trial smoke test on the default ranking benchmark:

```bash
python3 run.py run \
  --model '<MODEL>' \
  --execution-mode live \
  --benchmark-profile core \
  --trials 1 \
  --cleanup-agents
```
Run the full default benchmark:

```bash
python3 run.py run \
  --model '<MODEL>' \
  --execution-mode live \
  --benchmark-profile core \
  --trials 3 \
  --cleanup-agents
```
Compare generated reports:

```bash
python3 run.py compare --results-dir results
```
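Conceptually, a comparison like this ranks run reports by score. A minimal sketch of the idea, assuming a hypothetical layout of one JSON report per run with `"model"` and `"avg_score"` keys (not `run.py`'s actual report format):

```python
# Hypothetical sketch: rank run reports in a results directory by avg_score.
# The one-JSON-per-run layout and key names are assumptions for illustration.
import json
from pathlib import Path

def compare_reports(results_dir):
    """Return (model, avg_score) pairs from results_dir, best first."""
    rows = []
    for path in Path(results_dir).glob("*.json"):
        report = json.loads(path.read_text())
        rows.append((report["model"], report["avg_score"]))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```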
For isolated same-host runs, the harness also supports:

- `--openclaw-profile`
- `--openclaw-state-dir`
- `--openclaw-config-path`
- `--openclaw-gateway-port`
- `--openclaw-binary`
## Benchmark Profiles
| Profile | Active scenarios | Purpose |
| --- | ---: | --- |
| core | 26 | Default ranking suite |
| intelligence | 95 | Extended active capability benchmark |
| coverage | 7 | Lower-stakes breadth and regression slice |
| native | 36 | Active OpenClaw-native slice only |
| full | 102 | Union of all active scenarios |
The benchmark catalog also includes 60 incubating scenarios that can be inspected with `--benchmark-status all`.
## OpenClaw Runtime

Live runs expect a working local `openclaw` CLI plus the auth and config required by the surfaces exercised by the selected scenarios. If your binary is not on PATH, set `OPENCLAW_BINARY` or pass `--openclaw-binary`.
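The binary-resolution order described above (explicit override first, then PATH lookup) can be sketched as follows. This is an illustrative sketch of the policy, not the harness's actual implementation:

```python
# Minimal sketch: honor OPENCLAW_BINARY if set, otherwise look up
# `openclaw` on PATH. Illustrative only; not the harness's real code.
import os
import shutil

def resolve_openclaw_binary():
    """Return the path to the OpenClaw binary to invoke for live runs."""
    override = os.environ.get("OPENCLAW_BINARY")
    if override:
        return override
    found = shutil.which("openclaw")
    if found is None:
        raise FileNotFoundError(
            "openclaw not found on PATH; set OPENCLAW_BINARY or pass --openclaw-binary"
        )
    return found
```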
`config/openclaw.json.template` is provided as a reference template for local OpenClaw configuration and isolated-run setups.
## Repo Map

- `run.py`: CLI entrypoint for `inventory`, `dry`, `run`, and `compare`
- `harness/`: loader, runner, scoring, reporting, and live OpenClaw bridge
- `scenarios/`: benchmark tasks in YAML
- `datasets/`: seeded live-task data and optional setup / teardown scripts
- `custom_checks/`: scenario-specific grading logic
- `tests/`: regression coverage for loader, runner, scoring, and reporting
- `docs/`: public assets plus evaluation validation and benchmark-profile policy
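To give a feel for scenario-specific grading logic, here is a hypothetical custom check. The interface (transcript string in, float score out) and the marker strings are assumptions for illustration only, not the harness's real check API.

```python
# Hypothetical custom check: award partial credit for each required
# behavior observed in an agent transcript. Interface and markers are
# illustrative assumptions, not the harness's actual grading contract.
def check(transcript: str) -> float:
    """Return a score in [0, 1] based on required behaviors present."""
    required = ["openclaw agents list", "agent created", "cleanup complete"]
    hits = sum(1 for marker in required if marker in transcript)
    return hits / len(required)

print(check("ran `openclaw agents list`; agent created; cleanup complete"))  # 1.0
```

Deterministic string or state checks like this are what keep grading reproducible across repeated trials.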
## Generated Output

Benchmark reports are written to `results/`. They are generated runtime artifacts and are intentionally ignored by version control in this repo layout.
## Citation

If you use OpenClawProBench in your research, please cite:

```bibtex
@misc{openclawprobench2026,
  title={OpenClawProBench: A transparent benchmark for true intelligence in real-world AI agents},
  author={suyoumo},
  year={2026},
  url={https://github.com/suyoumo/OpenClawProBench}
}
```
## Contribution
We welcome issues, documentation fixes, scenario improvements, grader hardening, and benchmark-engine contributions. See CONTRIBUTING.md for setup and validation guidance.
## Acknowledgements
This project was informed by prior open-source work on agent evaluation, benchmark design, and real-world task assessment.
We drew ideas from projects such as PinchBench, Claw-Eval, AgencyBench, and related agent-benchmark efforts, especially in areas like task design, evaluation methodology, harness structure, and public benchmark presentation.
Some tasks in this repository are adapted and reworked from earlier public benchmark-style task sets into the OpenClaw runtime and grading framework.
## Contributors

Public contributor list: coming soon.
## Discussion Group

Join our WeChat discussion group to discuss OpenClaw with other users and builders.