Markdocify
Transform any documentation site into clean, LLM-ready Markdown
markdocify
Comprehensively scrape documentation sites into beautiful, LLM-ready Markdown
markdocify is a CLI tool that crawls a documentation website and converts it into a single, well-formatted Markdown file. It is suited to building LLM training data, offline documentation, or knowledge bases.
Features
- Comprehensive Coverage: Scrapes deep hierarchical documentation (8 levels by default)
- Intelligent Content Detection: Auto-detects documentation patterns across popular frameworks
- Smart Filtering: Automatically excludes navigation, ads, and non-documentation content
- High Performance: Concurrent scraping with configurable workers and delays
- Progress Reporting: Real-time progress updates for long scrapes
- Zero Configuration: Works out-of-the-box for most documentation sites
- Clean Output: Generates well-formatted Markdown with table of contents
- Respectful Scraping: Built-in rate limiting and robots.txt compliance
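The "configurable workers and delays" item can be pictured as a bounded worker pool. The sketch below is not markdocify's implementation (the project builds on Colly). `fetchAll` and its `fetch` callback are hypothetical names, and the example only illustrates how a fixed number of workers plus a pause after each request limits the load placed on a site.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchAll runs fetch over urls with at most `workers` goroutines. Each
// worker pauses for `delay` after every request. Results are keyed by URL.
func fetchAll(urls []string, workers int, delay time.Duration, fetch func(string) string) map[string]string {
	if workers < 1 {
		workers = 1 // avoid blocking forever with no consumers
	}
	jobs := make(chan string)
	results := make(map[string]string, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				body := fetch(u)
				mu.Lock()
				results[u] = body
				mu.Unlock()
				time.Sleep(delay) // crude per-worker politeness delay
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	urls := []string{"https://example.com/docs/a", "https://example.com/docs/b"}
	// Stand-in for a real HTTP request so the sketch runs offline.
	fake := func(u string) string { return "content of " + u }
	pages := fetchAll(urls, 3, 10*time.Millisecond, fake)
	fmt.Println(len(pages), "pages fetched")
}
```

Raising the worker count shortens a crawl but increases pressure on the target server, which is why a delay is usually tuned together with it.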
Quick Start
Installation
Homebrew (macOS/Linux) - Recommended
# Add our tap and install
brew tap vladkampov/tap
brew install markdocify
# Or install directly
brew install vladkampov/tap/markdocify
Direct Download
# Download latest release for your platform
curl -L https://github.com/vladkampov/markdocify/releases/latest/download/markdocify-linux-amd64 -o markdocify
chmod +x markdocify
# Or for macOS
curl -L https://github.com/vladkampov/markdocify/releases/latest/download/markdocify-darwin-amd64 -o markdocify
chmod +x markdocify
Docker
# Run directly with Docker
docker run --rm -v "$(pwd)":/workspace ghcr.io/vladkampov/markdocify:latest https://example.com/docs
# Or use as base image
FROM ghcr.io/vladkampov/markdocify:latest
Build from Source
git clone https://github.com/vladkampov/markdocify.git
cd markdocify
make build
Go Install
go install github.com/vladkampov/markdocify/cmd/markdocify@latest
Basic Usage
# Comprehensive scrape (recommended) - captures full documentation
markdocify https://vercel.com/docs
# Quick scrape - lighter, faster
markdocify https://docs.example.com -d 3
# Custom output file
markdocify https://react.dev/docs -o react-complete-docs.md
# Adjust performance settings
markdocify https://site.com/docs -d 5 --concurrency 4
Use Cases
LLM Training Data
Create comprehensive, clean Markdown datasets from documentation sites:
markdocify https://nextjs.org/docs -o nextjs-training-data.md
markdocify https://docs.python.org -o python-docs.md
markdocify https://kubernetes.io/docs -o k8s-complete.md
Offline Documentation
Generate complete offline documentation archives:
markdocify https://docs.aws.amazon.com/ec2 -o aws-ec2-offline.md
markdocify https://tailwindcss.com/docs -o tailwind-offline.md
Knowledge Bases
Create searchable, comprehensive knowledge bases:
markdocify https://docs.github.com -o github-docs-complete.md
markdocify https://api.stripe.com/docs -o stripe-api-complete.md
Supported Sites
markdocify works great with most documentation sites, including:
- Frameworks: React, Vue, Angular, Next.js, Nuxt, SvelteKit, Astro
- Platforms: Vercel, Netlify, AWS, Google Cloud, Azure
- Languages: Python, Go, Rust, JavaScript, TypeScript docs
- Tools: Docker, Kubernetes, Terraform, GitHub, GitLab
- Databases: PostgreSQL, MongoDB, Redis documentation
- And many more!
Configuration
Command Line Options
markdocify [URL] [flags]
Flags:
  -c, --config string     Configuration file path
  -o, --output string     Output file path
  -d, --depth int         Maximum crawl depth (default 8)
      --concurrency int   Number of concurrent workers (default 3)
  -h, --help              Help for markdocify
  -v, --version           Version information
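To illustrate what a maximum crawl depth means, here is a minimal sketch of a depth-limited, breadth-first walk over an in-memory link graph. It is an illustration under assumptions, not markdocify's code: `crawl` is a hypothetical function, the site map is made up, and treating the start page as depth 0 is an assumption about how depth is counted.

```go
package main

import "fmt"

// crawl walks a link graph breadth-first from start and returns the pages
// reached within maxDepth hops, in visit order. The start page is depth 0.
func crawl(links map[string][]string, start string, maxDepth int) []string {
	type item struct {
		url   string
		depth int
	}
	seen := map[string]bool{start: true}
	queue := []item{{start, 0}}
	var visited []string

	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		visited = append(visited, cur.url)
		if cur.depth == maxDepth {
			continue // do not follow links beyond the limit
		}
		for _, next := range links[cur.url] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, item{next, cur.depth + 1})
			}
		}
	}
	return visited
}

func main() {
	// Hypothetical site: /docs links to two pages, one of which links deeper.
	site := map[string][]string{
		"/docs":     {"/docs/intro", "/docs/api"},
		"/docs/api": {"/docs/api/auth"},
	}
	fmt.Println(crawl(site, "/docs", 1)) // [/docs /docs/intro /docs/api]
	fmt.Println(crawl(site, "/docs", 2)) // [/docs /docs/intro /docs/api /docs/api/auth]
}
```

A lower depth trades coverage for speed: pages that are only reachable through longer link chains are skipped.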
Advanced Configuration
For complex sites, use YAML configuration files:
# custom-config.yml
name: "Custom Documentation"
base_url: "https://example.com"
output_file: "custom-docs.md"
start_urls:
  - "https://example.com/docs"
  - "https://example.com/api"
follow_patterns:
  - "^https://example\\.com/docs/.*"
  - "^https://example\\.com/api/.*"
processing:
  max_depth: 10
  concurrency: 5
  delay: 0.5
  preserve_code_blocks: true
  generate_toc: true
selectors:
  title: "h1, .page-title"
  content: "main, .documentation"
  exclude:
    - "nav"
    - ".sidebar"
    - "footer"
Use with: markdocify -c custom-config.yml
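The `follow_patterns` entries are regular expressions. As a rough sketch of how such patterns could gate which links get crawled, the example below compiles them with Go's standard `regexp` package and tests URLs against them. `compilePatterns` and `shouldFollow` are hypothetical helpers written for this illustration, and how markdocify itself applies the patterns is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// compilePatterns compiles each pattern, failing on the first invalid one.
func compilePatterns(patterns []string) ([]*regexp.Regexp, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid follow pattern %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

// shouldFollow reports whether url matches at least one pattern.
func shouldFollow(url string, patterns []*regexp.Regexp) bool {
	for _, re := range patterns {
		if re.MatchString(url) {
			return true
		}
	}
	return false
}

func main() {
	// Same patterns as in the YAML file; inside YAML double quotes,
	// "\\." denotes the regex escape "\.".
	patterns, err := compilePatterns([]string{
		`^https://example\.com/docs/.*`,
		`^https://example\.com/api/.*`,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(shouldFollow("https://example.com/docs/intro", patterns)) // true
	fmt.Println(shouldFollow("https://example.com/blog/post", patterns))  // false
}
```

Note that a pattern ending in `/docs/.*` does not match the bare `https://example.com/docs` URL, so pages like that have to be reached through `start_urls` rather than through link following.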
Performance & Output
Typical Results
| Site | Pages Scraped | Output Size | Time |
|------|---------------|-------------|------|
| Vercel Docs | 100+ pages | 2-5MB | 3-5 min |
| Next.js Docs | 80+ pages | 1-3MB | 2-4 min |
| React Docs | 50+ pages | 800KB-2MB | 1-3 min |
Output Quality
markdocify generates:
- Table of Contents with deep linking
- Metadata including source URLs and timestamps
- Clean formatting with preserved code blocks
- Resolved links and proper heading hierarchy
- Filtered content with navigation/ads removed
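A table of contents with deep links can be derived from the headings of the aggregated Markdown. The following is a minimal sketch of that idea, assuming ATX-style headings and GitHub-like anchor slugs. `slugify` and `buildTOC` are hypothetical names, and markdocify's actual aggregator may work differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// slugify lowercases a heading, keeps letters, digits, '-' and '_',
// turns spaces into hyphens, and drops everything else.
func slugify(heading string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(strings.TrimSpace(heading)) {
		switch {
		case unicode.IsLetter(r) || unicode.IsDigit(r) || r == '-' || r == '_':
			b.WriteRune(r)
		case r == ' ':
			b.WriteRune('-')
		}
	}
	return b.String()
}

// buildTOC scans Markdown for ATX headings ("# ", "## ", ...) and returns
// a nested bullet list that links to each heading's anchor.
func buildTOC(markdown string) string {
	fence := strings.Repeat("`", 3) // code fence marker
	inFence := false
	var toc strings.Builder
	for _, line := range strings.Split(markdown, "\n") {
		if strings.HasPrefix(line, fence) {
			inFence = !inFence // ignore "#" comments inside code blocks
			continue
		}
		if inFence {
			continue
		}
		level := len(line) - len(strings.TrimLeft(line, "#"))
		if level == 0 || level > 6 || !strings.HasPrefix(line[level:], " ") {
			continue
		}
		title := strings.TrimSpace(line[level:])
		indent := strings.Repeat("  ", level-1)
		fmt.Fprintf(&toc, "%s- [%s](#%s)\n", indent, title, slugify(title))
	}
	return toc.String()
}

func main() {
	doc := "# Getting Started\n\n## Install & Run\n\ntext\n"
	fmt.Print(buildTOC(doc))
	// - [Getting Started](#getting-started)
	//   - [Install & Run](#install--run)
}
```

Skipping fenced code blocks matters for scraped documentation, where shell comments starting with `#` would otherwise show up as headings.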
Development
Prerequisites
- Go 1.21+
- Make
Building
# Clone repository
git clone https://github.com/vladkampov/markdocify.git
cd markdocify
# Download dependencies
go mod tidy
# Build
make build
# Run tests
make test
# Cross-platform build
make build-all
Project Structure
markdocify/
├── cmd/markdocify/        # CLI application
├── internal/
│   ├── config/            # Configuration handling
│   ├── scraper/           # Web scraping engine
│   ├── converter/         # HTML to Markdown conversion
│   ├── aggregator/        # Document aggregation & TOC
│   └── types/             # Shared types
├── configs/examples/      # Example configurations
└── README.md
Contributing
We welcome contributions! Please see CONTRIBUTING.md for details.
Quick Contribution Guide
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes with tests
- Test thoroughly: make test && make lint
- Commit with clear messages
- Submit a pull request
Areas We Need Help
- JavaScript rendering support (ChromeDP integration)
- More content selectors for different documentation frameworks
- Output formats (JSON, HTML, etc.)
- Performance optimizations
- Documentation improvements
- Test coverage expansion
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with Colly for web scraping
- Powered by html-to-markdown for conversion
- CLI built with Cobra
- Inspired by the need for high-quality LLM training data
Support
- Bug Reports: GitHub Issues
- Feature Requests: GitHub Discussions
- Documentation: Project Wiki
<p align="center"> <strong>Made with ❤️ for the developer community</strong><br> Star ⭐ this repo if you find it useful! </p>
