Skill

PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai

Install / Use

/learn @pinchbench/Skill

🦀 PinchBench

Real-world benchmarks for AI coding agents

Note: This repository contains the benchmark skill/tasks. It is NOT the source of official leaderboard results. To add models to the official results, modify pinchbench/scripts/default-models.yml.

PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Instead of synthetic tests, we throw real tasks at agents: scheduling meetings, writing code, triaging email, researching topics, and managing files.

Results are collected on a public leaderboard at pinchbench.com.

Why PinchBench?

Most LLM benchmarks test isolated capabilities. PinchBench tests what actually matters for coding agents:

Tool usage — Can the model call the right tools with the right parameters?
Multi-step reasoning — Can it chain together actions to complete complex tasks?
Real-world messiness — Can it handle ambiguous instructions and incomplete information?
Practical outcomes — Did it actually create the file, send the email, or schedule the meeting?

Quick Start

# Clone the skill
git clone https://github.com/pinchbench/skill.git
cd skill

# Run benchmarks with your model of choice
./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4

# Or run specific tasks
./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_01_calendar,task_02_stock

Note: Model IDs must include their provider prefix (e.g. openrouter/, anthropic/). OpenRouter is the default provider used for routing.
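
The prefix rule above can be checked before starting a run. The sketch below is illustrative only: `has_provider_prefix` is a hypothetical helper, not part of PinchBench, and it assumes nothing more than that a valid ID has the form `provider/model` with non-empty text on both sides of the first slash.

```python
def has_provider_prefix(model_id: str) -> bool:
    """Return True if model_id looks like 'provider/model' (hypothetical helper)."""
    provider, sep, rest = model_id.partition("/")
    # Require a separator plus non-empty text on both sides of it.
    return bool(sep) and bool(provider.strip()) and bool(rest.strip())


if __name__ == "__main__":
    for candidate in ("openrouter/anthropic/claude-sonnet-4", "claude-sonnet-4"):
        print(candidate, has_provider_prefix(candidate))
```

This only checks the shape of the ID. It does not confirm that the provider or the model actually exists.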

Requirements:

Python 3.10+
uv package manager
A running OpenClaw instance

What Gets Tested

PinchBench includes 23 tasks across real-world categories:

| Category | Tasks | What's tested |
| --- | --- | --- |
| Productivity | Calendar, daily summaries | Event creation, time parsing, scheduling |
| Research | Stock prices, conferences, markets | Web search, data extraction, synthesis |
| Writing | Blog posts, emails, humanization | Content generation, tone, formatting |
| Coding | Weather scripts, file structures | Code generation, file operations |
| Analysis | Spreadsheets, PDFs, documents | Data processing, summarization |
| Email | Triage, search | Inbox management, filtering |
| Memory | Context retrieval, knowledge management | Long-term memory, recall |
| Skills | ClawHub, skill discovery | OpenClaw ecosystem integration |

Each task is graded by automated checks, by an LLM judge, or by both, so that evaluation covers objective outcomes as well as more nuanced quality.
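
To make the three grading modes concrete, here is a minimal sketch of how a per-task score might be derived. How PinchBench actually weights automated and judge scores is not stated here, so the equal weighting below is purely an assumption, as are the function name `combine_scores` and the 0 to 1 score range.

```python
from typing import Optional


def combine_scores(automated: Optional[float], judge: Optional[float]) -> float:
    """Combine per-task scores in [0, 1]; equal weighting is an assumption."""
    # Keep only the grading methods that were actually used for this task.
    scores = [s for s in (automated, judge) if s is not None]
    if not scores:
        raise ValueError("a task needs at least one grading method")
    # One method: its score is returned as-is. Both: a plain average.
    return sum(scores) / len(scores)
```

For example, a task with an automated score of 1.0 and a judge score of 0.5 would come out at 0.75 under this assumed weighting.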

Submitting Results

To get your results on the leaderboard:

# Register for an API token (one-time)
./scripts/run.sh --register

# Run benchmark — results auto-upload with your token
./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4

Skip uploading with --no-upload if you just want local results.

Official Results

To submit an official run (marked on the leaderboard):

# Using environment variable
export PINCHBENCH_OFFICIAL_KEY=your_official_key
./scripts/run.sh --model anthropic/claude-sonnet-4

# Using command line flag
./scripts/run.sh --model anthropic/claude-sonnet-4 --official-key your_official_key
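
The two ways of supplying the official key can be pictured as a small lookup. This is a sketch, not the project's code: `resolve_official_key` is a hypothetical helper, and the precedence shown (an explicit flag overrides the environment variable) is an assumption that the text above does not confirm.

```python
import os
from typing import Mapping, Optional


def resolve_official_key(
    cli_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Pick the official key from the flag or the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: an explicit --official-key value overrides the environment variable.
    if cli_key:
        return cli_key
    # Treat an unset or empty variable as "no official key".
    return env.get("PINCHBENCH_OFFICIAL_KEY") or None
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.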

Command Reference

| Flag | Description |
| --- | --- |
| --model MODEL | Model to test (e.g., openrouter/anthropic/claude-sonnet-4) |
| --judge MODEL | Judge model for LLM grading (default: openrouter/anthropic/claude-opus-4.5) |
| --suite SUITE | all, automated-only, or comma-separated task IDs |
| --runs N | Number of runs per task for averaging |
| --timeout-multiplier N | Scale timeouts for slower models |
| --output-dir DIR | Where to save results (default: results/) |
| --no-upload | Skip uploading to leaderboard |
| --register | Request an API token for submissions |
| --upload FILE | Upload a previous results JSON |
| --official-key KEY | Mark submission as official (or use PINCHBENCH_OFFICIAL_KEY env var) |
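
For readers who think in code, the flags in the table above can be mirrored with a small `argparse` definition. This is a hypothetical sketch and not the project's actual parser: the defaults for `--judge` and `--output-dir` come from the table, while the defaults for `--suite`, `--runs`, and `--timeout-multiplier` are assumptions, as is the `parse_suite` helper.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags in the table above."""
    parser = argparse.ArgumentParser(prog="run.sh")
    parser.add_argument("--model", help="Model to test, including provider prefix")
    parser.add_argument("--judge", default="openrouter/anthropic/claude-opus-4.5")
    parser.add_argument("--suite", default="all")  # default is an assumption
    parser.add_argument("--runs", type=int, default=1)  # default is an assumption
    parser.add_argument("--timeout-multiplier", type=float, default=1.0)  # assumption
    parser.add_argument("--output-dir", default="results/")
    parser.add_argument("--no-upload", action="store_true")
    parser.add_argument("--register", action="store_true")
    parser.add_argument("--upload", metavar="FILE")
    parser.add_argument("--official-key", metavar="KEY")
    return parser


def parse_suite(value: str) -> List[str]:
    """Split a --suite value into task IDs; keywords pass through as one item."""
    if value in ("all", "automated-only"):
        return [value]
    # Comma-separated task IDs, ignoring stray whitespace and empty entries.
    return [task.strip() for task in value.split(",") if task.strip()]
```

A call such as `build_parser().parse_args(["--model", "openrouter/openai/gpt-4o", "--suite", "task_01_calendar,task_02_stock"])` would then yield a suite string that `parse_suite` splits into the two task IDs.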

Contributing Tasks

We welcome new tasks! Check out tasks/TASK_TEMPLATE.md for the format. Good tasks are:

Real-world — Something an actual user would ask an agent to do
Measurable — Clear success criteria that can be graded
Reproducible — Same task should produce consistent grading
Challenging — Tests agent capabilities, not just LLM knowledge

License

MIT — see LICENSE for details.

Claw-some AI agent testing 🦞

pinchbench

GitHub Stars: 829

Category: Development

Updated: 25m ago

Forks: 79

pinchbench/skill

Languages

Python

Security Score

95/100

Audited on Mar 29, 2026

No findings