SkillAgentSearch skills...

Docrawl

Docs‑focused crawler that converts documentation sites to clean Markdown.

Install / Use

/learn @neur0map/Docrawl
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

<div align="center">

docrawl

@@@@@@@ @@@@@@ @@@@@@@ @@@@@@@ @@@@@@ @@@ @@@ @@@ @@@
@@@@@@@@ @@@@@@@@ @@@@@@@@ @@@@@@@@ @@@@@@@@ @@@ @@@ @@@ @@@
@@! @@@ @@! @@@ !@@ @@! @@@ @@! @@@ @@! @@! @@! @@!
!@! @!@ !@! @!@ !@! !@! @!@ !@! @!@ !@! !@! !@! !@!
@!@ !@! @!@ !@! !@! @!@!!@! @!@!@!@! @!! !!@ @!@ @!!
!@! !!! !@! !!! !!! !!@!@! !!!@!!!! !@! !!! !@! !!!
!!: !!! !!: !!! :!! !!: :!! !!: !!! !!: !!: !!: !!:
:!: !:! :!: !:! :!: :!: !:! :!: !:! :!: :!: :!: :!:
:::: :: ::::: :: ::: ::: :: ::: :: ::: :::: :: ::: :: ::::
:: : : : : : :: :: : : : : : : : :: : : : : :: : :
#jhhjhj

A documentation-focused web crawler that converts sites to clean Markdown while preserving structure and staying polite.

Demo VideoCrates.ioGitHub

Crates.io Documentation License: MIT

</div>

Installation

# Install from crates.io
cargo install docrawl

# Or build from source
git clone https://github.com/neur0map/docrawl
cd docrawl
cargo build --release

Quick Start

docrawl "https://docs.rust-lang.org"          # crawl with default depth
docrawl "https://docs.python.org" --all       # full site crawl
docrawl "https://react.dev" --depth 2         # shallow crawl
docrawl "https://nextjs.org/docs" --fast      # quick scan without assets
docrawl "https://example.com/docs" --silence  # suppress progress/status output
docrawl --update                              # update to latest version

Key Features

  • Documentation-optimized extraction - Built-in selectors for Docusaurus, MkDocs, Sphinx, Next.js docs
  • Clean Markdown output - Preserves code blocks, tables, and formatting with YAML frontmatter metadata
  • Path-mirroring structure - Maintains original URL hierarchy as folders with index.md files
  • Polite crawling - Respects robots.txt, rate limits, and sitemap hints
  • Security-first - Sanitizes content, detects prompt injections, quarantines suspicious pages
  • Self-updating - Built-in update mechanism via docrawl --update

Why docrawl?

Unlike general-purpose crawlers, docrawl is purpose-built for documentation:

| Tool | Purpose | Output | Documentation Support | |------|---------|--------|----------------------| | wget/curl | File downloading | Raw HTML | No extraction | | httrack | Website mirroring | Full HTML site | No Markdown conversion | | scrapy | Web scraping framework | Custom formats | Requires coding | | docrawl | Documentation crawler | Clean Markdown | Auto-detects docs frameworks |

docrawl combines crawling, extraction, and conversion in a single tool optimized for technical documentation.

Library Usage

Add to your Cargo.toml:

[dependencies]
docrawl = "0.1"
tokio = { version = "1", features = ["full"] }

Minimal example:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cfg = docrawl::CrawlConfig {
        base_url: url::Url::parse("https://example.com/docs")?,
        output_dir: std::path::PathBuf::from("./out"),
        max_depth: Some(3),
        ..Default::default()
    };
    let stats = docrawl::crawl(cfg).await?;
    println!("Crawled {} pages", stats.pages);
    Ok(())
}

CLI Options

| Option | Description | Default | |--------|-------------|---------| | --depth <n> | Maximum crawl depth | 10 | | --all | Crawl entire site | - | | --output <dir> | Output directory | Current dir | | --rate <n> | Requests per second | 10 | | --concurrency <n> | Parallel workers | 16 | | --selector <css> | Custom content selector | Auto-detect | | --fast | Quick mode (no assets, rate=50, concurrency=32) | - | | --resume | Continue previous crawl | - | | --silence | Suppress built-in progress/status output | - | | --update | Update to latest version from crates.io | - |

Configuration

Place an optional docrawl.config.json in the output directory (-o) or the current working directory. The output directory is checked first. CLI arguments always take precedence over config file values.

{
  "selectors": [".content", "article"],
  "exclude_patterns": ["\\.pdf$", "/api/"],
  "max_pages": 1000,
  "host_only": true
}

Output Structure

output/
└── example.com/
    ├── index.md
    ├── guide/
    │   └── index.md
    ├── assets/
    │   └── images/
    └── manifest.json

Each Markdown file includes frontmatter:

---
title: Page Title
source_url: https://example.com/page
fetched_at: 2025-01-18T12:00:00Z
---

Performance

docrawl is optimized for speed and efficiency:

  • Fast HTML to Markdown conversion using fast_html2md
  • Concurrent processing with configurable worker pools
  • Intelligent rate limiting to respect server resources
  • Persistent caching to avoid duplicate work
  • Memory-efficient streaming for large sites

Security

docrawl includes built-in security features:

  • Content sanitization removes potentially harmful HTML
  • Prompt injection detection identifies and quarantines suspicious content
  • URL validation prevents malicious redirects
  • File system sandboxing restricts output to specified directories
  • Rate limiting prevents overwhelming target servers

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

View on GitHub
GitHub Stars42
CategoryDevelopment
Updated6h ago
Forks2

Languages

Rust

Security Score

90/100

Audited on Apr 2, 2026

No findings