SkillAgentSearch skills...

Webclaw

Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.

Install / Use

/learn @0xMassi/Webclaw

README

<p align="center"> <a href="https://webclaw.io"> <img src=".github/banner.png" alt="webclaw" width="700" /> </a> </p> <h3 align="center"> The fastest web scraper for AI agents.<br/> <sub>67% fewer tokens. Sub-millisecond extraction. Zero browser overhead.</sub> </h3> <p align="center"> <a href="https://github.com/0xMassi/webclaw/stargazers"><img src="https://img.shields.io/github/stars/0xMassi/webclaw?style=for-the-badge&logo=github&logoColor=white&label=Stars&color=181717" alt="Stars" /></a> <a href="https://github.com/0xMassi/webclaw/releases"><img src="https://img.shields.io/github/v/release/0xMassi/webclaw?style=for-the-badge&logo=rust&logoColor=white&label=Version&color=B7410E" alt="Version" /></a> <a href="https://github.com/0xMassi/webclaw/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-10B981?style=for-the-badge" alt="License" /></a> <a href="https://www.npmjs.com/package/create-webclaw"><img src="https://img.shields.io/npm/dt/create-webclaw?style=for-the-badge&logo=npm&logoColor=white&label=Installs&color=CB3837" alt="npm installs" /></a> </p> <p align="center"> <a href="https://discord.gg/KDfd48EpnW"><img src="https://img.shields.io/badge/Discord-Join-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Discord" /></a> <a href="https://x.com/webclaw_io"><img src="https://img.shields.io/badge/Follow-@webclaw__io-000000?style=for-the-badge&logo=x&logoColor=white" alt="X / Twitter" /></a> <a href="https://webclaw.io"><img src="https://img.shields.io/badge/Website-webclaw.io-0A0A0A?style=for-the-badge&logo=safari&logoColor=white" alt="Website" /></a> <a href="https://webclaw.io/docs"><img src="https://img.shields.io/badge/Docs-Read-3B82F6?style=for-the-badge&logo=readthedocs&logoColor=white" alt="Docs" /></a> </p>
<p align="center"> <img src="assets/demo.gif" alt="Claude Code: web_fetch gets 403, webclaw extracts successfully" width="700" /> <br/> <sub>Claude Code's built-in web_fetch → 403 Forbidden. webclaw → clean markdown.</sub> </p>

Your AI agent calls fetch() and gets a 403. Or 142KB of raw HTML that burns through your token budget. webclaw fixes both.

It extracts clean, structured content from any URL using Chrome-level TLS fingerprinting — no headless browser, no Selenium, no Puppeteer. Output is optimized for LLMs: 67% fewer tokens than raw HTML, with metadata, links, and images preserved.

                     Raw HTML                          webclaw
┌──────────────────────────────────┐    ┌──────────────────────────────────┐
│ <div class="ad-wrapper">         │    │ # Breaking: AI Breakthrough      │
│ <nav class="global-nav">         │    │                                  │
│ <script>window.__NEXT_DATA__     │    │ Researchers achieved 94%         │
│ ={...8KB of JSON...}</script>    │    │ accuracy on cross-domain         │
│ <div class="social-share">       │    │ reasoning benchmarks.            │
│ <button>Tweet</button>           │    │                                  │
│ <footer class="site-footer">     │    │ ## Key Findings                  │
│ <!-- 142,847 characters -->      │    │ - 3x faster inference            │
│                                  │    │ - Open-source weights            │
│         4,820 tokens             │    │         1,590 tokens             │
└──────────────────────────────────┘    └──────────────────────────────────┘

Get Started (30 seconds)

For AI agents (Claude, Cursor, Windsurf, VS Code)

npx create-webclaw

Auto-detects your AI tools, downloads the MCP server, and configures everything. One command.

Homebrew (macOS/Linux)

brew tap 0xMassi/webclaw
brew install webclaw

Prebuilt binaries

Download from GitHub Releases for macOS (arm64, x86_64) and Linux (x86_64, aarch64).

Cargo (from source)

cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-mcp

Docker

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

Docker Compose (with Ollama for LLM features)

cp env.example .env
docker compose up -d

Why webclaw?

| | webclaw | Firecrawl | Trafilatura | Readability | |---|:---:|:---:|:---:|:---:| | Extraction accuracy | 95.1% | — | 80.6% | 83.5% | | Token efficiency | -67% | — | -55% | -51% | | Speed (100KB page) | 3.2ms | ~500ms | 18.4ms | 8.7ms | | TLS fingerprinting | Yes | No | No | No | | Self-hosted | Yes | No | Yes | Yes | | MCP (Claude/Cursor) | Yes | No | No | No | | No browser required | Yes | No | Yes | Yes | | Cost | Free | $$$$ | Free | Free |

Choose webclaw if you want fast local extraction, LLM-optimized output, and native AI agent integration.


What it looks like

$ webclaw https://stripe.com -f llm

> URL: https://stripe.com
> Title: Stripe | Financial Infrastructure for the Internet
> Language: en
> Word count: 847

# Stripe | Financial Infrastructure for the Internet

Stripe is a suite of APIs powering online payment processing
and commerce solutions for internet businesses of all sizes.

## Products
- Payments — Accept payments online and in person
- Billing — Manage subscriptions and invoicing
- Connect — Build a marketplace or platform
...
$ webclaw https://github.com --brand

{
  "name": "GitHub",
  "colors": [{"hex": "#59636E", "usage": "Primary"}, ...],
  "fonts": ["Mona Sans", "ui-monospace"],
  "logos": [{"url": "https://github.githubassets.com/...", "kind": "svg"}]
}
$ webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50

Crawling... 50/50 pages extracted
---
# Page 1: https://docs.rust-lang.org/
...
# Page 2: https://docs.rust-lang.org/book/
...

MCP Server — 10 tools for AI agents

<a href="https://glama.ai/mcp/servers/0xMassi/webclaw"><img src="https://glama.ai/mcp/servers/0xMassi/webclaw/badge" alt="webclaw MCP server" /></a>

webclaw ships as an MCP server that plugs into Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Antigravity, Codex CLI, and any MCP-compatible client.

npx create-webclaw    # auto-detects and configures everything

Or manual setup — add to your Claude Desktop config:

{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp"
    }
  }
}

Then in Claude: "Scrape the top 5 results for 'web scraping tools' and compare their pricing" — it just works.

Available tools

| Tool | Description | Requires API key? | |------|-------------|:-:| | scrape | Extract content from any URL | No | | crawl | Recursive site crawl | No | | map | Discover URLs from sitemaps | No | | batch | Parallel multi-URL extraction | No | | extract | LLM-powered structured extraction | No (needs Ollama) | | summarize | Page summarization | No (needs Ollama) | | diff | Content change detection | No | | brand | Brand identity extraction | No | | search | Web search + scrape results | Yes | | research | Deep multi-source research | Yes |

8 of 10 tools work locally — no account, no API key, fully private.


Features

Extraction

  • Readability scoring — multi-signal content detection (text density, semantic tags, link ratio)
  • Noise filtering — strips nav, footer, ads, modals, cookie banners (Tailwind-safe)
  • Data island extraction — catches React/Next.js JSON payloads, JSON-LD, hydration data
  • YouTube metadata — structured data from any YouTube video
  • PDF extraction — auto-detected via Content-Type
  • 5 output formats — markdown, text, JSON, LLM-optimized, HTML

Content control

webclaw URL --include "article, .content"       # CSS selector include
webclaw URL --exclude "nav, footer, .sidebar"    # CSS selector exclude
webclaw URL --only-main-content                  # Auto-detect main content

Crawling

webclaw URL --crawl --depth 3 --max-pages 100   # BFS same-origin crawl
webclaw URL --crawl --sitemap                    # Seed from sitemap
webclaw URL --map                                # Discover URLs only

LLM features (Ollama / OpenAI / Anthropic)

webclaw URL --summarize                          # Page summary
webclaw URL --extract-prompt "Get all prices"    # Natural language extraction
webclaw URL --extract-json '{"type":"object"}'   # Schema-enforced extraction

Change tracking

webclaw URL -f json > snap.json                  # Take snapshot
webclaw URL --diff-with snap.json                # Compare later

Brand extraction

webclaw URL --brand                              # Colors, fonts, logos, OG image

Proxy rotation

webclaw URL --proxy http://user:pass@host:port   # Single proxy
webclaw URLs --proxy-file proxies.txt            # Pool rotation

Benchmarks

All numbers from real tests on 50 diverse pages. See benchmarks/ for methodology and reproduction instructions.

Extraction quality

Accuracy      webclaw     ███████████████████ 95.1%
              readability ████████████████▋   83.5%
              trafilatura ████████████████    80.6%
              newspaper3k █████████████▎      66.4%

Noise removal webclaw     ███████████████████ 96.1%
              readability █████████████████▊  89.4%
              trafilatura ██████████████████▏ 91.2%
              newspaper3k ███████████████▎    76.8%

Speed (pure extraction, no network)

10KB page     webclaw     ██                   0.8ms
              readability █████                2.1ms
              trafilatura ██████████           4.3ms

100KB page    webclaw     ██                   3.2ms
              readability █████                8.7ms
              trafilatura ██████████           18.4ms

Token efficiency (feeding to Claude/GPT)

| Format | Tokens | vs R

Related Skills

View on GitHub
GitHub Stars241
CategoryDevelopment
Updated19m ago
Forks17

Languages

Rust

Security Score

100/100

Audited on Mar 30, 2026

No findings