DeepHarvest
DeepHarvest is a Python web crawler with JavaScript rendering, distributed crawling, ML-based trap detection, and multilingual support. It extracts content from HTML, PDFs, Office documents, images, and media.
The World's Most Complete, Resilient, Multilingual Web Crawler
Features
Core Capabilities
- Complete Coverage: Crawls entire websites including all subpages
- All Content Types: HTML, PDF, DOCX, PPTX, XLSX, images, audio, video
- JavaScript Support: Full SPA support with Playwright
- Multilingual: Handles all languages, encodings, and scripts
- Distributed: Redis-based distributed crawling with multiple workers
- Resumable: Full checkpoint and resume support for interrupted crawls (local mode)
- Intelligent: ML-based trap detection, content extraction, deduplication
Advanced Features
- Smart Trap Detection: Calendar, pagination, session ID, faceted navigation
- ML Content Extraction: Page classification, soft-404 detection, quality scoring
- Advanced URL Management: SimHash, MinHash, LSH deduplication
- Site Graph Analysis: PageRank, clustering, GraphML export
- Observability: Prometheus metrics, Grafana dashboards
- Extensible: Plugin system for custom extractors
- OSINT Mode: Entity extraction, technology detection, link graph analysis
- Browser Automation: High-level Playwright integration with screenshot capture
- Pipeline Execution: YAML-based pipeline runner for complex workflows
- API Server: REST API for programmatic access
- Multiple Exporters: JSONL, Parquet, SQLite, VectorDB (FAISS/Chroma) support
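Of the exporters listed, JSONL is the simplest to picture: one JSON object per line. A minimal self-contained sketch of the format follows; it is not DeepHarvest's actual exporter, and the record field names are illustrative.

```python
import json

def export_jsonl(records, path):
    """Write crawl records in JSON Lines format: one object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

records = [
    {"url": "https://example.com/", "status": 200, "title": "Example Domain"},
    {"url": "https://example.com/about", "status": 404, "title": None},
]
export_jsonl(records, "pages.jsonl")
```

JSONL streams well: each record can be appended as it is crawled and read back line by line, which is why crawl pipelines favor it over a single JSON array.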
Quick Start
Installation
pip install deepharvest
Basic Usage
Simple Crawls
# Basic crawl with depth limit
deepharvest crawl https://example.com --depth 5 --output ./output
# Crawl without JavaScript rendering (faster)
deepharvest crawl https://example.com --no-js --depth 3
# Crawl with JavaScript rendering (for SPAs)
deepharvest crawl https://example.com --js --depth 3
Limiting Crawl Scope
# Limit total number of URLs crawled
deepharvest crawl https://example.com --max-urls 1000 --depth 5
# Limit response size (skip large files)
deepharvest crawl https://example.com --max-size 10 --depth 3
# Limit pages per domain (useful for multi-domain crawls)
deepharvest crawl https://example.com --max-pages-per-domain 50 --depth 5
# Set time limit (stop after specified seconds)
deepharvest crawl https://example.com --time-limit 3600 --depth 5
# Combine multiple limits
deepharvest crawl https://example.com \
--depth 5 \
--max-urls 500 \
--max-pages-per-domain 100 \
--max-size 5 \
--time-limit 1800 \
--output ./output
Distributed Crawling
# Run in distributed mode with Redis
deepharvest crawl https://example.com \
--distributed \
--redis-url redis://localhost:6379 \
--workers 5 \
--depth 10
Using Configuration Files
# Use a YAML config file
deepharvest crawl --config config.yaml
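A config file might look like the following. This is a hypothetical example: the key names mirror the CLI flags and the CrawlConfig fields shown elsewhere in this README, but the actual schema may differ.

```yaml
# Hypothetical config.yaml - key names mirror the CLI flags and
# CrawlConfig fields in this README; the real schema may differ.
seed_urls:
  - https://example.com
max_depth: 5
max_urls: 1000
enable_js: true
output: ./output
```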
Resuming Interrupted Crawls
# Resume from a checkpoint file
deepharvest resume --state-file crawl_state.json
# Resume with custom config
deepharvest resume --state-file crawl_state.json --config config.yaml
# Resume with different output directory
deepharvest resume --state-file crawl_state.json --output ./new_output
Note: Resume functionality works in local mode only. In distributed mode, Redis persistence handles state management.
OSINT Mode
# Basic OSINT collection
deepharvest osint https://example.com
# With JSON output and link graph
deepharvest osint https://example.com --json --graph
# With screenshots
deepharvest osint https://example.com --screenshot
API Server
# Start API server
deepharvest serve --host 0.0.0.0 --port 8000
Pipeline Execution
# Run a pipeline from YAML file
deepharvest run pipeline.yaml
Python API
import asyncio
from deepharvest import DeepHarvest, CrawlConfig

async def main():
    config = CrawlConfig(
        seed_urls=["https://example.com"],
        max_depth=5,
        enable_js=True
    )
    crawler = DeepHarvest(config)
    await crawler.initialize()
    await crawler.crawl()
    await crawler.shutdown()

asyncio.run(main())
Installation
From PyPI
pip install deepharvest
From Source
git clone https://github.com/deepharvest/deepharvest
cd deepharvest
pip install -e .
Using Docker
docker-compose up
Documentation
Comprehensive documentation is available in the docs/ directory:
- API Reference - Complete API documentation
- Plugin Development Guide - Create and use plugins
- OSINT Usage - OSINT mode examples
- Browser Automation - Browser automation guide
- Benchmarks - Performance benchmarks
- Troubleshooting - Common issues and solutions
- Architecture - System architecture overview
Architecture
┌─────────────────────────────────────────────────────────┐
│ DeepHarvest Core │
├─────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Frontier │ │ Fetcher │ │ JS Renderer │ │
│ │ (BFS/DFS) │ │ (HTTP/2) │ │ (Playwright) │ │
│ └─────────────┘ └──────────────┘ └───────────────┘ │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Extractors │ │ Trap Det. │ │ URL Dedup │ │
│ │ (50+ fmt) │ │ (ML+Rules) │ │ (SimHash) │ │
│ └─────────────┘ └──────────────┘ └───────────────┘ │
├─────────────────────────────────────────────────────────┤
│ Distributed Layer │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Redis │ │ Workers │ │ Storage │ │
│ │ Frontier │ │ (N proc) │ │ (S3/FS) │ │
│ └──────────┘ └───────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
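The Frontier box above can be sketched as a priority queue whose ordering encodes the crawl strategy. This is a minimal illustration, not DeepHarvest's implementation; the class and parameter names are invented.

```python
import heapq
from itertools import count

class Frontier:
    """Priority-queue frontier sketch: lower priority values pop first.
    BFS orders by increasing depth; DFS by decreasing depth."""
    def __init__(self, strategy="bfs"):
        self.strategy = strategy
        self._heap = []
        self._seen = set()
        self._tie = count()  # preserves FIFO order among equal priorities

    def push(self, url, depth):
        if url in self._seen:  # cheap exact-URL dedup at enqueue time
            return
        self._seen.add(url)
        priority = depth if self.strategy == "bfs" else -depth
        heapq.heappush(self._heap, (priority, next(self._tie), url, depth))

    def pop(self):
        priority, _, url, depth = heapq.heappop(self._heap)
        return url, depth

frontier = Frontier("bfs")
frontier.push("https://example.com/", 0)
frontier.push("https://example.com/a", 1)
frontier.push("https://example.com/a", 1)  # ignored: already seen
```

Swapping the sign of the priority is all it takes to flip between breadth-first and depth-first, which is why a single heap can serve both strategies.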
How It Works
DeepHarvest operates as a distributed web crawling system that systematically discovers, fetches, and extracts content from websites. The architecture follows a modular design with clear separation of concerns.
Core Workflow
1. Initialization: The crawler initializes components (frontier, fetcher, extractors, ML models) based on configuration.
2. URL Management (Frontier): A priority queue manages URLs to be crawled. Supports BFS, DFS, and priority-based strategies. In distributed mode, Redis coordinates URL distribution across workers.
3. Content Fetching: The fetcher downloads web pages with retry logic, timeout handling, and rate limiting. Attempts HTTP/2 with fallback to HTTP/1.1.
4. HTML Parsing: A multi-strategy parser with a fallback chain (lxml → html5lib → html.parser) ensures robust parsing of malformed HTML.
5. JavaScript Rendering: For Single Page Applications (SPAs), Playwright renders pages, executes JavaScript, handles infinite scroll, and captures the final DOM state.
6. Content Extraction: Specialized extractors process different content types:
   - Text: HTML text extraction with boilerplate removal
   - Documents: PDF, DOCX, PPTX, XLSX text extraction
   - Media: Image metadata, OCR, audio transcription, video metadata
   - Structured Data: JSON-LD, Microdata, OpenGraph, Schema.org
7. Link Discovery: The advanced link extractor finds URLs from multiple sources:
   - HTML attributes (href, src, srcset)
   - JavaScript code (router.push, window.location)
   - Structured data (JSON-LD, Microdata)
   - Meta tags and data URIs
8. Deduplication: Three-tier deduplication system:
   - SHA256: exact URL/content duplicates
   - SimHash: near-duplicate detection (64-bit hashing)
   - MinHash LSH: scalable similarity search for large datasets
9. Trap Detection: ML and rule-based detection prevents infinite loops from:
   - Calendar-based URLs (date patterns)
   - Session ID parameters
   - Pagination traps
   - Query parameter explosions
10. Storage: Extracted content is stored with metadata. Supports filesystem, S3, and PostgreSQL backends.
11. Resume Support: DeepHarvest can resume interrupted crawls by:
    - Saving checkpoint state periodically (configurable interval)
    - Restoring visited URLs to prevent duplicates
    - Restoring the pending frontier queue to continue where it left off
    - Automatically skipping seed URLs when resuming from a checkpoint
    - Note: Resume is supported in local mode only; distributed mode relies on Redis persistence
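The SimHash tier of the deduplication step can be illustrated with a tiny self-contained version: every token votes on each of 64 bits, and a small Hamming distance between fingerprints flags a near-duplicate. This is a sketch of the technique, not DeepHarvest's implementation.

```python
import hashlib

def simhash64(text):
    """64-bit SimHash: each token votes on every bit; the sign wins."""
    votes = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(64):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def hamming(a, b):
    """Number of differing bits; small distance means near-duplicate."""
    return bin(a ^ b).count("1")

page = ("deep harvest is a python web crawler that extracts "
        "content from html pdf and office documents")
near = page.replace("office", "word")   # one token changed
other = ("a short poem about quiet morning rain falling "
         "softly on distant green hills")
```

Changing one token out of many flips only the bits that token happened to decide, so near-duplicate pages land a few bits apart while unrelated pages differ in roughly half the bits.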
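The rule-based side of trap detection can be sketched with a few URL patterns. These heuristics are invented for illustration; DeepHarvest's real rules and its ML side are more involved.

```python
import re
from urllib.parse import urlparse, parse_qs

# Illustrative rules only: calendar paths, session-ID query
# parameters, and query-parameter explosions.
CALENDAR_RE = re.compile(r"/\d{4}/\d{1,2}(/\d{1,2})?(/|$)")
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def looks_like_trap(url, max_params=8):
    parsed = urlparse(url)
    if CALENDAR_RE.search(parsed.path):
        return True  # calendar pages generate unbounded date URLs
    params = parse_qs(parsed.query)
    if any(k.lower() in SESSION_PARAMS for k in params):
        return True  # session IDs make every URL look unique
    if len(params) > max_params:
        return True  # query-parameter explosion
    return False
```

A crawler that skips URLs matching such rules avoids spending its budget on infinite calendar or session-ID URL spaces.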
Distributed Architecture
In distributed mode, multiple workers share a Redis-based frontier. Each worker:
- Pulls URLs from the shared queue
- Processes pages independently
- Respects per-host concurrency limits
- Reports metrics to centralized monitoring
This enables linear scaling: N workers process approximately N times the throughput of a single worker.
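The worker loop with a per-host concurrency cap can be sketched in-process, using an asyncio.Queue as a stand-in for the Redis frontier. All names here are illustrative, not DeepHarvest's API.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

class HostLimiter:
    """Per-host concurrency cap, as each worker enforces (sketch)."""
    def __init__(self, per_host=2):
        self._sems = defaultdict(lambda: asyncio.Semaphore(per_host))
    def for_url(self, url):
        return self._sems[urlparse(url).netloc]

async def worker(name, queue, limiter, results):
    while True:
        url = await queue.get()
        async with limiter.for_url(url):
            await asyncio.sleep(0)  # stand-in for fetch + extract
            results.append((name, url))
        queue.task_done()

async def main():
    queue = asyncio.Queue()  # stand-in for the shared Redis frontier
    for i in range(6):
        queue.put_nowait(f"https://example.com/page{i}")
    limiter, results = HostLimiter(per_host=2), []
    workers = [asyncio.create_task(worker(f"w{n}", queue, limiter, results))
               for n in range(3)]
    await queue.join()
    for w in workers:
        w.cancel()
    return results

results = asyncio.run(main())
```

In the real distributed mode the queue lives in Redis, so the same pull-process-report loop scales across machines rather than across tasks in one process.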
Resilience Features
- Parser Fallback: Automatic fallback between parsers when HTML is malformed
- Network Resilience: Exponential backoff retry, timeout handling, proxy support
- Memory Management: Streaming for large files, memory guards per worker
- Checkpointing: Periodic state saves enable resuming interrupted crawls
- Error Taxonomy: Structured error handling with detailed reporting
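The exponential-backoff retry mentioned above follows a standard pattern: double the wait after each failure and add jitter. This sketch shows the pattern only; the retry counts and delays DeepHarvest actually uses are not documented here.

```python
import random
import time

def fetch_with_retry(fetch, url, retries=4, base_delay=0.5, max_delay=30.0):
    """Retry a fetch with capped exponential backoff plus jitter."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except IOError:
            if attempt == retries:
                raise  # out of retries: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jittered wait

# A flaky fetcher that fails twice, then succeeds:
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("temporary network failure")
    return "<html>ok</html>"

result = fetch_with_retry(flaky, "https://example.com", base_delay=0.01)
```

Jitter matters in a distributed crawler: without it, many workers that failed together retry together and hammer the host in synchronized waves.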
Machine Learning Integration
- Page Classification: Identifies page types (article, product, forum, etc.) for intelligent prioritization
- Soft-404 Detection: Identifies pages that return 200 but are effectively 404s
- Quality Scoring: ML-based content quality scoring
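The soft-404 idea can be illustrated with a crude non-ML heuristic: a 200 response whose body is tiny and dominated by not-found language is treated as a dead page. The phrase list and threshold are invented; the ML detector is far more robust than this.

```python
NOT_FOUND_PHRASES = (
    "page not found", "404", "does not exist",
    "no longer available", "nothing was found",
)

def is_soft_404(status, text, min_length=200):
    """Heuristic stand-in for the ML detector: flag 200 responses
    that are short and contain not-found language."""
    if status != 200:
        return False  # real error codes are handled elsewhere
    body = text.lower()
    return len(body) < min_length and any(p in body for p in NOT_FOUND_PHRASES)
```

Filtering soft 404s keeps junk "not found" pages out of the extracted corpus even when the server insists everything returned successfully.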
