Docpull
Crawl any website and convert it to clean, AI-ready Markdown — async Python CLI with MCP support, crawl profiles, caching, and RAG-optimized output
docpull
Pull documentation from any website and convert it to clean, AI-ready Markdown.
<p align="center"> <a href="https://docpull.raintree.technology"> <img src="https://pub-e85a1abca36f4fd8b4300a6ec2d6f45f.r2.dev/marketing/docpull/1768954147343-iaiziy-docpull-terminal-hero.gif" alt="docpull demo" width="600"> </a> </p>
Install
pip install docpull
Usage
# Basic fetch
docpull https://docs.example.com
# With options
docpull https://aptos.dev --max-pages 100 --output-dir ./docs
# Filter paths
docpull https://docs.example.com --include-paths "/api/*" --exclude-paths "/changelog/*"
# Enable caching for incremental updates
docpull https://docs.example.com --cache
# JavaScript-heavy sites
pip install "docpull[js]"  # quotes keep shells such as zsh from expanding the brackets
docpull https://spa-site.com --js
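When docpull is driven from a script, the flags above can be assembled in Python instead of typed by hand. The sketch below builds an argument list using only options shown in this README. The helper name `build_docpull_command` is hypothetical and is not part of docpull.

```python
import shlex


def build_docpull_command(url, *, max_pages=None, output_dir=None,
                          include_paths=None, exclude_paths=None,
                          cache=False, js=False):
    """Assemble a docpull argument list (hypothetical helper, not a docpull API)."""
    cmd = ["docpull", url]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if include_paths:
        cmd += ["--include-paths", include_paths]
    if exclude_paths:
        cmd += ["--exclude-paths", exclude_paths]
    if cache:
        cmd.append("--cache")
    if js:
        cmd.append("--js")
    return cmd


cmd = build_docpull_command(
    "https://docs.example.com",
    max_pages=100,
    include_paths="/api/*",
    cache=True,
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run the crawl. Building a list rather than a shell string avoids quoting problems with patterns such as `/api/*`.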
Profiles
docpull https://site.com --profile rag # Optimized for RAG/LLM (default)
docpull https://site.com --profile mirror # Full site archive with caching
docpull https://site.com --profile quick # Fast sampling (50 pages, depth 2)
Options
Crawl:
  --max-pages N          Maximum pages to fetch
  --max-depth N          Maximum crawl depth
  --include-paths P      Only crawl matching URL patterns
  --exclude-paths P      Skip matching URL patterns
  --js                   Enable JavaScript rendering

Cache:
  --cache                Enable caching for incremental updates
  --cache-dir DIR        Cache directory (default: .docpull-cache)
  --cache-ttl DAYS       Days before cache expires (default: 30)

Content:
  --streaming-dedup      Real-time duplicate detection
  --language CODE        Filter by language (e.g., en)

Output:
  --output-dir, -o DIR   Output directory (default: ./docs)
  --dry-run              Show what would be fetched
  --verbose, -v          Verbose output
See docpull --help for all options.
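To make the effect of `--include-paths` and `--exclude-paths` concrete, here is a minimal sketch of glob-style path filtering. It assumes shell-style wildcards matched against the URL path, with exclusions taking precedence. docpull's actual matching rules are not documented here and may differ, and `should_crawl` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def should_crawl(url, include=(), exclude=()):
    """Return True if the URL path passes the include/exclude patterns.

    Assumed semantics (not confirmed for docpull): exclusions win, and an
    empty include list means "allow everything not excluded".
    """
    path = urlsplit(url).path or "/"
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    if include:
        return any(fnmatchcase(path, pattern) for pattern in include)
    return True


include = ["/api/*"]
exclude = ["/changelog/*"]
for candidate in (
    "https://docs.example.com/api/users",
    "https://docs.example.com/changelog/v2",
    "https://docs.example.com/guide",
):
    print(candidate, should_crawl(candidate, include, exclude))
```

Running a check like this against a handful of known URLs, together with `--dry-run`, is a quick way to confirm that a pattern selects the pages you expect before starting a large crawl.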
Python API
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main():
    config = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.RAG,
        crawl={"max_pages": 100},
        cache={"enabled": True},
    )
    async with Fetcher(config) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")
        print(f"Done: {fetcher.stats.pages_fetched} pages")


asyncio.run(main())
Output
Each page becomes a Markdown file with YAML frontmatter:
---
title: "Getting Started"
source: https://docs.example.com/guide
---
# Getting Started
...
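Downstream tools such as a RAG indexer usually need the metadata and the body separately. The sketch below splits a file shaped like the example above using only the standard library. It assumes flat `key: value` frontmatter, as shown; real output may contain additional fields or nested YAML, in which case a YAML parser is the safer choice. `split_frontmatter` is a hypothetical helper, not part of docpull.

```python
def split_frontmatter(text):
    """Split a Markdown document into (metadata dict, body string).

    Handles only flat "key: value" frontmatter delimited by '---' lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        return {}, text  # no closing delimiter: treat everything as body
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            # Drop surrounding whitespace and optional double quotes.
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:])
    return meta, body


sample = """---
title: "Getting Started"
source: https://docs.example.com/guide
---
# Getting Started
Welcome.
"""

meta, body = split_frontmatter(sample)
print(meta["title"], "<-", meta["source"])
```

Because `partition` splits on the first colon only, URL values such as the `source` field survive intact.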
Security
- HTTPS-only, mandatory robots.txt compliance
- Blocks private/internal network IPs
- Path traversal and XXE protection
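As an illustration of the first two checks, the sketch below rejects non-HTTPS URLs and URLs whose host is a literal private, loopback, or link-local IP address. This is not docpull's implementation, and `is_allowed_target` is a hypothetical name. A real crawler would also need to resolve hostnames and check every resolved address, which is omitted here to keep the example offline.

```python
import ipaddress
from urllib.parse import urlsplit


def is_allowed_target(url):
    """Return True for HTTPS URLs that do not point at a literal internal IP."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    try:
        ip = ipaddress.ip_address(parts.hostname)
    except ValueError:
        # Host is a name, not a literal IP. DNS resolution and a check of
        # each resolved address would go here in a full implementation.
        return True
    blocked = (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
    return not blocked


for candidate in (
    "https://docs.example.com/guide",
    "http://docs.example.com/guide",
    "https://10.0.0.5/admin",
    "https://169.254.169.254/latest/meta-data",
):
    print(candidate, is_allowed_target(candidate))
```

Checks of this kind guard against server-side request forgery, where a crawled page links to an internal service or a cloud metadata endpoint.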
Troubleshooting
docpull --doctor # Check installation
docpull URL --verbose # Verbose output
docpull URL --dry-run # Test without downloading
License
MIT
