SkillAgentSearch skills...

Skill

PinchBench is a benchmarking system for evaluating LLM models as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai

Install / Use

/learn @pinchbench/Skill
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

🦀 PinchBench

Real-world benchmarks for AI coding agents

Leaderboard License

Note: This repository contains the benchmark skill/tasks. It is NOT the source of official leaderboard results. To add models to the official results, modify pinchbench/scripts/default-models.yml.

PinchBench measures how well LLM models perform as the brain of an OpenClaw agent. Instead of synthetic tests, we throw real tasks at agents: scheduling meetings, writing code, triaging email, researching topics, and managing files.

Results are collected on a public leaderboard at pinchbench.com.

PinchBench

Why PinchBench?

Most LLM benchmarks test isolated capabilities. PinchBench tests what actually matters for coding agents:

  • Tool usage — Can the model call the right tools with the right parameters?
  • Multi-step reasoning — Can it chain together actions to complete complex tasks?
  • Real-world messiness — Can it handle ambiguous instructions and incomplete information?
  • Practical outcomes — Did it actually create the file, send the email, or schedule the meeting?

Quick Start

# Clone the skill
git clone https://github.com/pinchbench/skill.git
cd skill

# Run benchmarks with your model of choice
./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4

# Or run specific tasks
./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_01_calendar,task_02_stock

Note: Model IDs must include their provider prefix (e.g. openrouter/, anthropic/). OpenRouter is the default provider used for routing.

Requirements:

  • Python 3.10+
  • uv package manager
  • A running OpenClaw instance

What Gets Tested

PinchBench includes 23 tasks across real-world categories:

| Category | Tasks | What's tested | | ---------------- | --------------------------------------- | ---------------------------------------- | | Productivity | Calendar, daily summaries | Event creation, time parsing, scheduling | | Research | Stock prices, conferences, markets | Web search, data extraction, synthesis | | Writing | Blog posts, emails, humanization | Content generation, tone, formatting | | Coding | Weather scripts, file structures | Code generation, file operations | | Analysis | Spreadsheets, PDFs, documents | Data processing, summarization | | Email | Triage, search | Inbox management, filtering | | Memory | Context retrieval, knowledge management | Long-term memory, recall | | Skills | ClawHub, skill discovery | OpenClaw ecosystem integration |

Each task is graded automatically, by an LLM judge, or both — ensuring both objective and nuanced evaluation.

Submitting Results

To get your results on the leaderboard:

# Register for an API token (one-time)
./scripts/run.sh --register

# Run benchmark — results auto-upload with your token
./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4

Skip uploading with --no-upload if you just want local results.

Official Results

To submit an official run (marked on the leaderboard):

# Using environment variable
export PINCHBENCH_OFFICIAL_KEY=your_official_key
./scripts/run.sh --model anthropic/claude-sonnet-4

# Using command line flag
./scripts/run.sh --model anthropic/claude-sonnet-4 --official-key your_official_key

Command Reference

| Flag | Description | | ------------------------ | ----------------------------------------------------------------------------- | | --model MODEL | Model to test (e.g., openrouter/anthropic/claude-sonnet-4) | | --judge MODEL | Judge model for LLM grading (default: openrouter/anthropic/claude-opus-4.5) | | --suite SUITE | all, automated-only, or comma-separated task IDs | | --runs N | Number of runs per task for averaging | | --timeout-multiplier N | Scale timeouts for slower models | | --output-dir DIR | Where to save results (default: results/) | | --no-upload | Skip uploading to leaderboard | | --register | Request an API token for submissions | | --upload FILE | Upload a previous results JSON | | --official-key KEY | Mark submission as official (or use PINCHBENCH_OFFICIAL_KEY env var) |

Contributing Tasks

We welcome new tasks! Check out tasks/TASK_TEMPLATE.md for the format. Good tasks are:

  • Real-world — Something an actual user would ask an agent to do
  • Measurable — Clear success criteria that can be graded
  • Reproducible — Same task should produce consistent grading
  • Challenging — Tests agent capabilities, not just LLM knowledge

Links

License

MIT — see LICENSE for details.


Claw-some AI agent testing 🦞

Related Skills

View on GitHub
GitHub Stars829
CategoryDevelopment
Updated25m ago
Forks79

Languages

Python

Security Score

95/100

Audited on Mar 29, 2026

No findings