# Reader

Open-source, production-grade web scraping engine built for LLMs. Scrape and crawl the entire web into clean markdown, ready for your agents.
## The Problem
Building agents that need web access is frustrating. You piece together Puppeteer, add stealth plugins, fight Cloudflare, manage proxies, and it still breaks in production.

That's because production-grade web scraping isn't about rendering a page and converting HTML to markdown. It's about everything underneath:
| Layer | What it actually takes |
| -------------------- | ------------------------------------------------------------------ |
| Browser architecture | Managing browser instances at scale, not one-off scripts |
| Anti-bot bypass | Cloudflare, Turnstile, JS challenges: they all block naive scrapers |
| TLS fingerprinting | Real browsers have fingerprints. Puppeteer doesn't. Sites know. |
| Proxy infrastructure | Datacenter vs. residential, rotation strategies, sticky sessions |
| Resource management | Browser pooling, memory limits, graceful recycling |
| Reliability | Rate limiting, retries, timeouts, caching, graceful degradation |
I built Reader, a production-grade web scraping engine on top of Ulixee Hero, a headless browser designed for exactly this.
## The Solution

Two primitives. That's it.
```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

// Scrape URLs → clean markdown
const result = await reader.scrape({ urls: ["https://example.com"] });
console.log(result.data[0].markdown);

// Crawl a site → discover + scrape pages
const pages = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  scrape: true,
});
console.log(`Found ${pages.urls.length} pages`);
```
All the hard stuff (browser pooling, challenge detection, proxy rotation, retries) happens under the hood. You get clean markdown. Your agents get the web.
> [!TIP]
> If Reader is useful to you, a star on GitHub helps others discover the project.
## Features
- Cloudflare Bypass - TLS fingerprinting, DNS over TLS, WebRTC masking
- Clean Output - Markdown and HTML with automatic main content extraction
- Smart Content Cleaning - Removes nav, headers, footers, popups, cookie banners
- CLI & API - Use from command line or programmatically
- Browser Pool - Auto-recycling, health monitoring, queue management
- Concurrent Scraping - Parallel URL processing with progress tracking
- Website Crawling - BFS link discovery with depth/page limits
- Proxy Support - Datacenter and residential with sticky sessions
## Installation

```bash
npm install @vakra-dev/reader
```

Requirements: Node.js >= 18
## Quick Start

### Basic Scrape

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com"],
  formats: ["markdown", "html"],
});

console.log(result.data[0].markdown);
console.log(result.data[0].html);

await reader.close();
```
### Batch Scraping with Concurrency

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com", "https://example.org", "https://example.net"],
  formats: ["markdown"],
  batchConcurrency: 3,
  onProgress: (progress) => {
    console.log(`${progress.completed}/${progress.total}: ${progress.currentUrl}`);
  },
});

console.log(`Scraped ${result.batchMetadata.successfulUrls} URLs`);

await reader.close();
```
### Crawling

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 20,
  scrape: true,
});

console.log(`Discovered ${result.urls.length} URLs`);
console.log(`Scraped ${result.scraped?.batchMetadata.successfulUrls} pages`);

await reader.close();
```
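Here, `depth: 2` limits discovery to pages within two link-hops of the start URL. As an illustration only (this is not Reader's implementation), depth-limited BFS discovery can be sketched over an in-memory link graph, where a real crawler would instead fetch each page and parse its `<a href>` links:

```typescript
// Illustrative only: depth-limited BFS link discovery.
// A hypothetical link graph stands in for fetched-and-parsed pages.
const linkGraph: Record<string, string[]> = {
  "https://example.com": ["https://example.com/a", "https://example.com/b"],
  "https://example.com/a": ["https://example.com/a/deep"],
  "https://example.com/b": [],
  "https://example.com/a/deep": ["https://example.com/too-deep"],
};

function discover(start: string, depth: number, maxPages: number): string[] {
  const seen = new Set<string>([start]);
  let frontier = [start];
  for (let d = 0; d < depth && seen.size < maxPages; d++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of linkGraph[url] ?? []) {
        if (!seen.has(link) && seen.size < maxPages) {
          seen.add(link); // dedupe: each URL is visited at most once
          next.push(link);
        }
      }
    }
    frontier = next; // expand one hop outward per iteration
  }
  return [...seen];
}

const urls = discover("https://example.com", 2, 20);
// depth 2 reaches /a and /b (hop 1) and /a/deep (hop 2), but not /too-deep
```

The same dedupe-and-expand loop, applied to live pages, is what the `depth` and `maxPages` options bound: discovery stops at `depth` hops or `maxPages` URLs, presumably whichever limit is hit first.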
### With Proxy

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com"],
  formats: ["markdown"],
  proxy: {
    type: "residential",
    host: "proxy.example.com",
    port: 8080,
    username: "username",
    password: "password",
    country: "us",
  },
});

await reader.close();
```
### With Proxy Rotation

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient({
  proxies: [
    { host: "proxy1.example.com", port: 8080, username: "user", password: "pass" },
    { host: "proxy2.example.com", port: 8080, username: "user", password: "pass" },
  ],
  proxyRotation: "round-robin", // or "random"
});

const result = await reader.scrape({
  urls: ["https://example.com", "https://example.org"],
  formats: ["markdown"],
  batchConcurrency: 2,
});

await reader.close();
```
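For clarity, this is what the two rotation strategies mean, sketched as a standalone selector (illustrative only; Reader's actual selection logic is internal):

```typescript
// Illustrative only: "round-robin" cycles through the proxy list in order,
// "random" picks any entry per request.
interface Proxy {
  host: string;
  port: number;
}

const proxies: Proxy[] = [
  { host: "proxy1.example.com", port: 8080 },
  { host: "proxy2.example.com", port: 8080 },
];

function makeRotator(strategy: "round-robin" | "random") {
  let i = 0;
  return (): Proxy =>
    strategy === "round-robin"
      ? proxies[i++ % proxies.length] // cycle in order, wrap at the end
      : proxies[Math.floor(Math.random() * proxies.length)]; // pick any
}

const next = makeRotator("round-robin");
const picks = [next().host, next().host, next().host];
// → proxy1, proxy2, proxy1 (wraps around)
```

Round-robin spreads load evenly and makes per-proxy rate limits predictable; random avoids the lockstep access pattern that some targets can learn to recognize.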
### With Browser Pool Configuration

```typescript
import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient({
  browserPool: {
    size: 5, // 5 browser instances
    retireAfterPages: 50, // Recycle after 50 pages
    retireAfterMinutes: 15, // Recycle after 15 minutes
  },
  verbose: true,
});

const result = await reader.scrape({
  urls: manyUrls,
  batchConcurrency: 5,
});

await reader.close();
```
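The two retire settings read as an either/or policy: a browser is recycled once it has served too many pages or lived too long, which keeps memory leaks and tab bloat from accumulating. A minimal sketch of that predicate (illustrative only, not Reader's code; the parameter names simply mirror the config above):

```typescript
// Illustrative only: when should a pooled browser instance be recycled?
interface BrowserSlot {
  pagesServed: number; // pages this instance has handled
  startedAt: number; // epoch ms when the instance launched
}

function shouldRetire(
  slot: BrowserSlot,
  retireAfterPages: number,
  retireAfterMinutes: number,
  now: number = Date.now(),
): boolean {
  const tooManyPages = slot.pagesServed >= retireAfterPages;
  const tooOld = now - slot.startedAt >= retireAfterMinutes * 60_000;
  return tooManyPages || tooOld; // recycle when either limit is hit
}

const fresh: BrowserSlot = { pagesServed: 10, startedAt: Date.now() };
const worn: BrowserSlot = { pagesServed: 50, startedAt: Date.now() };
const retireFresh = shouldRetire(fresh, 50, 15); // under both limits
const retireWorn = shouldRetire(worn, 50, 15); // page limit reached
```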
## CLI Reference

### Daemon Mode

For multiple requests, start a daemon to keep the browser pool warm:
```bash
# Start daemon with browser pool
npx reader start --pool-size 5

# All subsequent commands auto-connect to daemon
npx reader scrape https://example.com
npx reader crawl https://example.com -d 2

# Check daemon status
npx reader status

# Stop daemon
npx reader stop

# Force standalone mode (bypass daemon)
npx reader scrape https://example.com --standalone
```
### `reader scrape <urls...>`

Scrape one or more URLs.
```bash
# Scrape a single URL
npx reader scrape https://example.com

# Scrape with multiple formats
npx reader scrape https://example.com -f markdown,html

# Scrape multiple URLs concurrently
npx reader scrape https://example.com https://example.org -c 2

# Save to file
npx reader scrape https://example.com -o output.md
```
| Option | Type | Default | Description |
| ------------------------ | ------ | ------------ | ------------------------------------------------------- |
| `-f, --format <formats>` | string | `"markdown"` | Output formats (comma-separated: `markdown,html`) |
| `-o, --output <file>` | string | stdout | Output file path |
| `-c, --concurrency <n>` | number | 1 | Parallel requests |
| `-t, --timeout <ms>` | number | 30000 | Request timeout in milliseconds |
| `--batch-timeout <ms>` | number | 300000 | Total timeout for entire batch operation |
| `--proxy <url>` | string | - | Proxy URL (e.g., `http://user:pass@host:port`) |
| `--user-agent <string>` | string | - | Custom user agent string |
| `--show-chrome` | flag | - | Show browser window for debugging |
| `--no-main-content` | flag | - | Disable main content extraction (include full page) |
| `--include-tags <sel>` | string | - | CSS selectors for elements to include (comma-separated) |
| `--exclude-tags <sel>` | string | - | CSS selectors for elements to exclude (comma-separated) |
| `-v, --verbose` | flag | - | Enable verbose logging |
### `reader crawl <url>`

Crawl a website to discover pages.
```bash
# Crawl with default settings
npx reader crawl https://example.com

# Crawl deeper with more pages
npx reader crawl https://example.com -d 3 -m 50

# Crawl and scrape content
npx reader crawl https://example.com -d 2 --scrape

# Filter URLs with patterns
npx reader crawl https://example.com --include "blog/*" --exclude "admin/*"
```
| Option | Type | Default | Description |
| ------------------------ | ------ | ------------ | ----------------------------------------------- |
| `-d, --depth <n>` | number | 1 | Maximum crawl depth |
| `-m, --max-pages <n>` | number | 20 | Maximum pages to discover |
| `-s, --scrape` | flag | - | Also scrape content of discovered pages |
| `-f, --format <formats>` | string | `"markdown"` | Output formats when scraping (comma-separated) |
| `-o, --output <file>` | string | stdout | Output file path |