# Reader

Open-source, production-grade web scraping engine built for LLMs. Scrape and crawl the entire web into clean markdown, ready for your agents.
## The Problem
Building agents that need web access is frustrating. You piece together Puppeteer, add stealth plugins, fight Cloudflare, manage proxies, and it still breaks in production.

That's because production-grade web scraping isn't about rendering a page and converting HTML to markdown. It's about everything underneath:
| Layer | What it actually takes |
| -------------------- | ------------------------------------------------------------------ |
| Browser architecture | Managing browser instances at scale, not one-off scripts |
| Anti-bot bypass | Cloudflare, Turnstile, JS challenges: they all block naive scrapers |
| TLS fingerprinting | Real browsers have fingerprints. Puppeteer doesn't. Sites know. |
| Proxy infrastructure | Datacenter vs. residential, rotation strategies, sticky sessions |
| Resource management | Browser pooling, memory limits, graceful recycling |
| Reliability | Rate limiting, retries, timeouts, caching, graceful degradation |
I built Reader, a production-grade web scraping engine on top of Ulixee Hero, a headless browser designed for exactly this.
## The Solution

Two primitives. That's it.
```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

// Scrape URLs → clean markdown
const result = await reader.scrape({ urls: ["https://example.com"] });
console.log(result.data[0].markdown);

// Crawl a site → discover + scrape pages
const pages = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  scrape: true,
});
console.log(`Found ${pages.urls.length} pages`);
```
All the hard stuff (browser pooling, challenge detection, proxy rotation, retries) happens under the hood. You get clean markdown. Your agents get the web.
> [!TIP]
> If Reader is useful to you, a star on GitHub helps others discover the project.
## Features
- Cloudflare Bypass - TLS fingerprinting, DNS over TLS, WebRTC masking
- Clean Output - Markdown and HTML with automatic main content extraction
- Smart Content Cleaning - Removes nav, headers, footers, popups, cookie banners
- CLI & API - Use from command line or programmatically
- Browser Pool - Auto-recycling, health monitoring, queue management
- Concurrent Scraping - Parallel URL processing with progress tracking
- Website Crawling - BFS link discovery with depth/page limits
- Proxy Support - Datacenter and residential with sticky sessions
## Installation

```bash
npm install @vakra-dev/reader
```

Requirements: Node.js >= 18
## Quick Start

### Basic Scrape

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com"],
  formats: ["markdown", "html"],
});

console.log(result.data[0].markdown);
console.log(result.data[0].html);

await reader.close();
```
### Batch Scraping with Concurrency

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com", "https://example.org", "https://example.net"],
  formats: ["markdown"],
  batchConcurrency: 3,
  onProgress: (progress) => {
    console.log(`${progress.completed}/${progress.total}: ${progress.currentUrl}`);
  },
});

console.log(`Scraped ${result.batchMetadata.successfulUrls} URLs`);

await reader.close();
```
### Crawling

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 20,
  scrape: true,
});

console.log(`Discovered ${result.urls.length} URLs`);
console.log(`Scraped ${result.scraped?.batchMetadata.successfulUrls} pages`);

await reader.close();
```
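Here, `depth: 2` limits discovery to pages within two link-hops of the start URL. As an illustration only (this is not Reader's implementation), depth-limited BFS discovery can be sketched over an in-memory link graph, where a real crawler would instead fetch each page and parse its `<a href>` links:

```typescript
// Illustrative only: depth-limited BFS link discovery.
// A hypothetical link graph stands in for fetched-and-parsed pages.
const linkGraph: Record<string, string[]> = {
  "https://example.com": ["https://example.com/a", "https://example.com/b"],
  "https://example.com/a": ["https://example.com/a/deep"],
  "https://example.com/b": [],
  "https://example.com/a/deep": ["https://example.com/too-deep"],
};

function discover(start: string, depth: number, maxPages: number): string[] {
  const seen = new Set<string>([start]);
  let frontier = [start];
  for (let d = 0; d < depth && seen.size < maxPages; d++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of linkGraph[url] ?? []) {
        if (!seen.has(link) && seen.size < maxPages) {
          seen.add(link); // dedupe: each URL is visited at most once
          next.push(link);
        }
      }
    }
    frontier = next; // expand one hop outward per iteration
  }
  return [...seen];
}

const urls = discover("https://example.com", 2, 20);
// depth 2 reaches /a and /b (hop 1) and /a/deep (hop 2), but not /too-deep
```

The same dedupe-and-expand loop, applied to live pages, is what the `depth` and `maxPages` options bound: discovery stops at `depth` hops or `maxPages` URLs, presumably whichever limit is hit first.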
### With Proxy

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com"],
  formats: ["markdown"],
  proxy: {
    type: "residential",
    host: "proxy.example.com",
    port: 8080,
    username: "username",
    password: "password",
    country: "us",
  },
});

await reader.close();
```
### With Proxy Rotation

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient({
  proxies: [
    { host: "proxy1.example.com", port: 8080, username: "user", password: "pass" },
    { host: "proxy2.example.com", port: 8080, username: "user", password: "pass" },
  ],
  proxyRotation: "round-robin", // or "random"
});

const result = await reader.scrape({
  urls: ["https://example.com", "https://example.org"],
  formats: ["markdown"],
  batchConcurrency: 2,
});

await reader.close();
```
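For clarity, this is what the two rotation strategies mean, sketched as a standalone selector (illustrative only; Reader's actual selection logic is internal):

```typescript
// Illustrative only: "round-robin" cycles through the proxy list in order,
// "random" picks any entry per request.
interface Proxy {
  host: string;
  port: number;
}

const proxies: Proxy[] = [
  { host: "proxy1.example.com", port: 8080 },
  { host: "proxy2.example.com", port: 8080 },
];

function makeRotator(strategy: "round-robin" | "random") {
  let i = 0;
  return (): Proxy =>
    strategy === "round-robin"
      ? proxies[i++ % proxies.length] // cycle in order, wrap at the end
      : proxies[Math.floor(Math.random() * proxies.length)]; // pick any
}

const next = makeRotator("round-robin");
const picks = [next().host, next().host, next().host];
// → proxy1, proxy2, proxy1 (wraps around)
```

Round-robin spreads load evenly and makes per-proxy rate limits predictable; random avoids the lockstep access pattern that some targets can learn to recognize.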
### With Browser Pool Configuration

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient({
  browserPool: {
    size: 5, // 5 browser instances
    retireAfterPages: 50, // Recycle after 50 pages
    retireAfterMinutes: 15, // Recycle after 15 minutes
  },
  verbose: true,
});

const result = await reader.scrape({
  urls: manyUrls,
  batchConcurrency: 5,
});

await reader.close();
```
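The two retire settings read as an either/or policy: a browser is recycled once it has served too many pages or lived too long, which keeps memory leaks and tab bloat from accumulating. A minimal sketch of that predicate (illustrative only, not Reader's code; the parameter names simply mirror the config above):

```typescript
// Illustrative only: when should a pooled browser instance be recycled?
interface BrowserSlot {
  pagesServed: number; // pages this instance has handled
  startedAt: number; // epoch ms when the instance launched
}

function shouldRetire(
  slot: BrowserSlot,
  retireAfterPages: number,
  retireAfterMinutes: number,
  now: number = Date.now(),
): boolean {
  const tooManyPages = slot.pagesServed >= retireAfterPages;
  const tooOld = now - slot.startedAt >= retireAfterMinutes * 60_000;
  return tooManyPages || tooOld; // recycle when either limit is hit
}

const fresh: BrowserSlot = { pagesServed: 10, startedAt: Date.now() };
const worn: BrowserSlot = { pagesServed: 50, startedAt: Date.now() };
const retireFresh = shouldRetire(fresh, 50, 15); // under both limits
const retireWorn = shouldRetire(worn, 50, 15); // page limit reached
```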
## CLI Reference

### Daemon Mode

For multiple requests, start a daemon to keep the browser pool warm:
```bash
# Start daemon with browser pool
npx reader start --pool-size 5

# All subsequent commands auto-connect to daemon
npx reader scrape https://example.com
npx reader crawl https://example.com -d 2

# Check daemon status
npx reader status

# Stop daemon
npx reader stop

# Force standalone mode (bypass daemon)
npx reader scrape https://example.com --standalone
```
### `reader scrape <urls...>`

Scrape one or more URLs.
```bash
# Scrape a single URL
npx reader scrape https://example.com

# Scrape with multiple formats
npx reader scrape https://example.com -f markdown,html

# Scrape multiple URLs concurrently
npx reader scrape https://example.com https://example.org -c 2

# Save to file
npx reader scrape https://example.com -o output.md
```
| Option | Type | Default | Description |
| ------------------------ | ------ | ------------ | ------------------------------------------------------- |
| `-f, --format <formats>` | string | `"markdown"` | Output formats (comma-separated: `markdown,html`) |
| `-o, --output <file>` | string | stdout | Output file path |
| `-c, --concurrency <n>` | number | 1 | Parallel requests |
| `-t, --timeout <ms>` | number | 30000 | Request timeout in milliseconds |
| `--batch-timeout <ms>` | number | 300000 | Total timeout for entire batch operation |
| `--proxy <url>` | string | - | Proxy URL (e.g., `http://user:pass@host:port`) |
| `--user-agent <string>` | string | - | Custom user agent string |
| `--show-chrome` | flag | - | Show browser window for debugging |
| `--no-main-content` | flag | - | Disable main content extraction (include full page) |
| `--include-tags <sel>` | string | - | CSS selectors for elements to include (comma-separated) |
| `--exclude-tags <sel>` | string | - | CSS selectors for elements to exclude (comma-separated) |
| `-v, --verbose` | flag | - | Enable verbose logging |
### `reader crawl <url>`

Crawl a website to discover pages.
```bash
# Crawl with default settings
npx reader crawl https://example.com

# Crawl deeper with more pages
npx reader crawl https://example.com -d 3 -m 50

# Crawl and scrape content
npx reader crawl https://example.com -d 2 --scrape

# Filter URLs with patterns
npx reader crawl https://example.com --include "blog/*" --exclude "admin/*"
```
| Option | Type | Default | Description |
| ------------------------ | ------ | ------------ | ----------------------------------------------- |
| `-d, --depth <n>` | number | 1 | Maximum crawl depth |
| `-m, --max-pages <n>` | number | 20 | Maximum pages to discover |
| `-s, --scrape` | flag | - | Also scrape content of discovered pages |
| `-f, --format <formats>` | string | `"markdown"` | Output formats when scraping (comma-separated) |
| `-o, --output <file>` | string | stdout | Output file path |