Markdocify

🤖 Transform any documentation site into clean, LLM-ready markdown


📚 markdocify

Comprehensively scrape documentation sites into beautiful, LLM-ready Markdown

markdocify is a CLI tool that scrapes a documentation website and converts it into a single, well-formatted Markdown file. It is suited to creating LLM training data, offline documentation, and knowledge bases.

✨ Features

🎯 Comprehensive Coverage: Scrapes deep hierarchical documentation (8 levels by default)
🧠 Intelligent Content Detection: Auto-detects documentation patterns across popular frameworks
🚫 Smart Filtering: Automatically excludes navigation, ads, and non-documentation content
⚡ High Performance: Concurrent scraping with configurable workers and delays
📊 Progress Reporting: Real-time progress updates for long scrapes
🔧 Zero Configuration: Works out-of-the-box for most documentation sites
🎨 Clean Output: Generates well-formatted Markdown with table of contents
🛡️ Respectful Scraping: Built-in rate limiting and robots.txt compliance
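
To picture what "concurrent scraping with configurable workers and delays" means in practice, here is a minimal sketch of a bounded worker pool using only the Go standard library. It is not markdocify's actual implementation (the tool builds on Colly); `fetchAll` and the `fetch` callback are hypothetical names used for illustration only.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchAll processes urls with a fixed number of workers. Each worker
// pauses for delay after every request, which caps the request rate.
// fetch is a stand-in for downloading and converting one page.
func fetchAll(urls []string, workers int, delay time.Duration, fetch func(string) string) map[string]string {
	if workers < 1 {
		workers = 1 // avoid blocking forever when no worker reads the queue
	}
	jobs := make(chan string)
	results := make(map[string]string, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				body := fetch(u)
				mu.Lock()
				results[u] = body
				mu.Unlock()
				time.Sleep(delay) // per-worker pause between requests
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Stand-in for an HTTP request; a real scraper would download the page here.
	fetch := func(u string) string { return "content of " + u }
	urls := []string{"/docs", "/docs/intro", "/docs/api"}
	pages := fetchAll(urls, 3, 100*time.Millisecond, fetch)
	fmt.Println(len(pages), "pages fetched")
}
```

More workers shorten the scrape, while a longer delay is gentler on the target site; the two settings trade off against each other.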

🚀 Quick Start

Installation

🍺 Homebrew (macOS/Linux) - Recommended

# Add our tap and install
brew tap vladkampov/tap
brew install markdocify

# Or install directly
brew install vladkampov/tap/markdocify

⬇️ Direct Download

# Download latest release for your platform
curl -L https://github.com/vladkampov/markdocify/releases/latest/download/markdocify-linux-amd64 -o markdocify
chmod +x markdocify

# Or for macOS (Intel)
curl -L https://github.com/vladkampov/markdocify/releases/latest/download/markdocify-darwin-amd64 -o markdocify
chmod +x markdocify

🐳 Docker

# Run directly with Docker (quote the path so directories with spaces work)
docker run --rm -v "$(pwd)":/workspace ghcr.io/vladkampov/markdocify:latest https://example.com/docs

# Or use as a base image (in a Dockerfile)
FROM ghcr.io/vladkampov/markdocify:latest

🔧 Build from Source

git clone https://github.com/vladkampov/markdocify.git
cd markdocify
make build

Go Install

go install github.com/vladkampov/markdocify/cmd/markdocify@latest

Basic Usage

# Comprehensive scrape (recommended) - captures full documentation
markdocify https://vercel.com/docs

# Quick scrape - lighter, faster
markdocify https://docs.example.com -d 3

# Custom output file
markdocify https://react.dev/docs -o react-complete-docs.md

# Adjust performance settings
markdocify https://site.com/docs -d 5 --concurrency 4

💡 Use Cases

📖 LLM Training Data

Create comprehensive, clean Markdown datasets from documentation sites:

markdocify https://nextjs.org/docs -o nextjs-training-data.md
markdocify https://docs.python.org -o python-docs.md  
markdocify https://kubernetes.io/docs -o k8s-complete.md

📚 Offline Documentation

Generate complete offline documentation archives:

markdocify https://docs.aws.amazon.com/ec2 -o aws-ec2-offline.md
markdocify https://tailwindcss.com/docs -o tailwind-offline.md

🔍 Knowledge Bases

Create searchable, comprehensive knowledge bases:

markdocify https://docs.github.com -o github-docs-complete.md
markdocify https://docs.stripe.com/api -o stripe-api-complete.md

🎯 Supported Sites

markdocify works great with most documentation sites, including:

Frameworks: React, Vue, Angular, Next.js, Nuxt, SvelteKit, Astro
Platforms: Vercel, Netlify, AWS, Google Cloud, Azure
Languages: Python, Go, Rust, JavaScript, TypeScript docs
Tools: Docker, Kubernetes, Terraform, GitHub, GitLab
Databases: PostgreSQL, MongoDB, Redis documentation
And many more!

⚙️ Configuration

Command Line Options

markdocify [URL] [flags]

Flags:
  -c, --config string      Configuration file path
  -o, --output string      Output file path
  -d, --depth int          Maximum crawl depth (default 8)
      --concurrency int    Number of concurrent workers (default 3)
  -h, --help               Help for markdocify
  -v, --version            Version information
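
The effect of a maximum crawl depth can be sketched as a breadth-first walk that stops following links once the limit is reached. This is an illustration under assumptions, not markdocify's code: the `crawl` function and the in-memory link graph are hypothetical, and whether markdocify counts the start URL as depth 0 or 1 is not stated in this README.

```go
package main

import "fmt"

// crawl walks a link graph breadth-first from start and returns every page
// reachable within maxDepth link hops, in visit order. The links map is a
// stand-in for fetching a page and extracting its links.
func crawl(links map[string][]string, start string, maxDepth int) []string {
	type item struct {
		url   string
		depth int
	}
	visited := map[string]bool{start: true}
	queue := []item{{start, 0}}
	var order []string

	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		order = append(order, cur.url)

		if cur.depth == maxDepth {
			continue // do not follow links beyond the depth limit
		}
		for _, next := range links[cur.url] {
			if !visited[next] {
				visited[next] = true
				queue = append(queue, item{next, cur.depth + 1})
			}
		}
	}
	return order
}

func main() {
	// Hypothetical site structure used only for illustration.
	links := map[string][]string{
		"/docs":             {"/docs/intro", "/docs/api"},
		"/docs/intro":       {"/docs/intro/setup"},
		"/docs/api":         {"/docs/api/auth"},
		"/docs/intro/setup": {"/docs/intro/setup/linux"},
	}
	fmt.Println(len(crawl(links, "/docs", 1))) // start page plus its direct links
	fmt.Println(len(crawl(links, "/docs", 8))) // the whole (small) site
}
```

A lower depth visits fewer pages and finishes sooner, which is why `-d 3` is described above as the quick option.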

Advanced Configuration

For complex sites, use YAML configuration files:

# custom-config.yml
name: "Custom Documentation"
base_url: "https://example.com"
output_file: "custom-docs.md"

start_urls:
  - "https://example.com/docs"
  - "https://example.com/api"

follow_patterns:
  - "^https://example\\.com/docs/.*"
  - "^https://example\\.com/api/.*"

processing:
  max_depth: 10
  concurrency: 5
  delay: 0.5
  preserve_code_blocks: true
  generate_toc: true

selectors:
  title: "h1, .page-title"
  content: "main, .documentation"
  exclude:
    - "nav"
    - ".sidebar"
    - "footer"

Use with: markdocify -c custom-config.yml
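
As a rough illustration of how `follow_patterns` could filter links, the sketch below compiles the two patterns from the example config and tests a few URLs against them. It assumes the patterns are ordinary Go (RE2) regular expressions matched against absolute URLs; `compilePatterns` and `shouldFollow` are hypothetical helpers, not part of markdocify's API. Note that `\\.` inside a double-quoted YAML string becomes `\.` once parsed.

```go
package main

import (
	"fmt"
	"regexp"
)

// compilePatterns compiles each raw pattern, failing on the first invalid one.
func compilePatterns(raw []string) ([]*regexp.Regexp, error) {
	out := make([]*regexp.Regexp, 0, len(raw))
	for _, r := range raw {
		re, err := regexp.Compile(r)
		if err != nil {
			return nil, fmt.Errorf("invalid follow pattern %q: %w", r, err)
		}
		out = append(out, re)
	}
	return out, nil
}

// shouldFollow reports whether url matches at least one pattern.
func shouldFollow(patterns []*regexp.Regexp, url string) bool {
	for _, p := range patterns {
		if p.MatchString(url) {
			return true
		}
	}
	return false
}

func main() {
	// The patterns from custom-config.yml, as they look after YAML parsing.
	patterns, err := compilePatterns([]string{
		`^https://example\.com/docs/.*`,
		`^https://example\.com/api/.*`,
	})
	if err != nil {
		panic(err)
	}
	for _, u := range []string{
		"https://example.com/docs/intro",
		"https://example.com/api/v1/users",
		"https://example.com/blog/post",
		"https://example.com/docs",
	} {
		fmt.Printf("%-36s follow=%v\n", u, shouldFollow(patterns, u))
	}
}
```

Under this assumption, `https://example.com/docs` itself (no trailing slash) does not match `^https://example\.com/docs/.*`, so listing it under `start_urls` as the example config does is what seeds the crawl.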

📊 Performance & Output

Typical Results

| Site | Pages Scraped | Output Size | Time |
|------|---------------|-------------|------|
| Vercel Docs | 100+ pages | 2-5MB | 3-5 min |
| Next.js Docs | 80+ pages | 1-3MB | 2-4 min |
| React Docs | 50+ pages | 800KB-2MB | 1-3 min |

Output Quality

markdocify generates:

📑 Table of Contents with deep linking
🏷️ Metadata including source URLs and timestamps
🎨 Clean formatting with preserved code blocks
🔗 Resolved links and proper heading hierarchy
🧹 Filtered content with navigation/ads removed
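
A table of contents with deep linking can be pictured as a nested Markdown list whose entries point at heading anchors. The sketch below is one way to build such a list, assuming GitHub-style lowercase, hyphenated anchors; the `heading` type, `slugify`, and `buildTOC` are hypothetical names, and markdocify's real anchor scheme may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// heading is one document heading: its level (1 for h1) and its text.
type heading struct {
	level int
	text  string
}

// slugify lowercases text, keeps letters and digits, and collapses every
// run of other characters into a single hyphen.
func slugify(text string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(text) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			if pendingHyphen && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingHyphen = false
			b.WriteRune(r)
		} else {
			pendingHyphen = true
		}
	}
	return b.String()
}

// buildTOC renders headings as a nested Markdown list of anchor links,
// indenting two spaces per heading level below the first.
func buildTOC(hs []heading) string {
	var b strings.Builder
	for _, h := range hs {
		level := h.level
		if level < 1 {
			level = 1 // treat malformed levels as top-level entries
		}
		indent := strings.Repeat("  ", level-1)
		fmt.Fprintf(&b, "%s- [%s](#%s)\n", indent, h.text, slugify(h.text))
	}
	return b.String()
}

func main() {
	fmt.Print(buildTOC([]heading{
		{1, "Getting Started"},
		{2, "Quick Start"},
		{2, "API: Auth & Tokens"},
	}))
}
```

This sketch does not de-duplicate anchors, so two headings with the same text would share a link target; a fuller version would append a counter to repeated slugs.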

🛠️ Development

Prerequisites

Go 1.21+
Make

Building

# Clone repository
git clone https://github.com/vladkampov/markdocify.git
cd markdocify

# Download dependencies
go mod tidy

# Build
make build

# Run tests
make test

# Cross-platform build
make build-all

Project Structure

markdocify/
├── cmd/markdocify/          # CLI application
├── internal/
│   ├── config/             # Configuration handling
│   ├── scraper/            # Web scraping engine
│   ├── converter/          # HTML to Markdown conversion
│   ├── aggregator/         # Document aggregation & TOC
│   └── types/              # Shared types
├── configs/examples/       # Example configurations
└── README.md

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

Quick Contribution Guide

1. Fork the repository
2. Create a feature branch: git checkout -b feature/amazing-feature
3. Make your changes with tests
4. Test thoroughly: make test && make lint
5. Commit with clear messages
6. Submit a pull request

Areas We Need Help

🌐 JavaScript rendering support (ChromeDP integration)
🔍 More content selectors for different documentation frameworks
🎨 Output formats (JSON, HTML, etc.)
🚀 Performance optimizations
📚 Documentation improvements
🧪 Test coverage expansion

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with Colly for web scraping
Powered by html-to-markdown for conversion
CLI built with Cobra
Inspired by the need for high-quality LLM training data

📞 Support

🐛 Bug Reports: GitHub Issues
💡 Feature Requests: GitHub Discussions
📖 Documentation: Project Wiki

Made with ❤️ for the developer community

Star ⭐ this repo if you find it useful!
