DeepHarvest
DeepHarvest is a Python web crawler with JavaScript rendering, distributed crawling, ML-based trap detection, and multilingual support. It extracts content from HTML, PDFs, Office documents, images, and media.
The World's Most Complete, Resilient, Multilingual Web Crawler
Features
Core Capabilities
- Complete Coverage: Crawls entire websites including all subpages
- All Content Types: HTML, PDF, DOCX, PPTX, XLSX, images, audio, video
- JavaScript Support: Full SPA support with Playwright
- Multilingual: Handles all languages, encodings, and scripts
- Distributed: Redis-based distributed crawling with multiple workers
- Resumable: Full checkpoint and resume support for interrupted crawls (local mode)
- Intelligent: ML-based trap detection, content extraction, deduplication
Advanced Features
- Smart Trap Detection: Calendar, pagination, session ID, faceted navigation
- ML Content Extraction: Page classification, soft-404 detection, quality scoring
- Advanced URL Management: SimHash, MinHash, LSH deduplication
- Site Graph Analysis: PageRank, clustering, GraphML export
- Observability: Prometheus metrics, Grafana dashboards
- Extensible: Plugin system for custom extractors
- OSINT Mode: Entity extraction, technology detection, link graph analysis
- Browser Automation: High-level Playwright integration with screenshot capture
- Pipeline Execution: YAML-based pipeline runner for complex workflows
- API Server: REST API for programmatic access
- Multiple Exporters: JSONL, Parquet, SQLite, VectorDB (FAISS/Chroma) support
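Of the exporters listed, JSONL is the simplest to picture: one JSON object per line. A minimal self-contained sketch of the format follows; it is not DeepHarvest's actual exporter, and the record field names are illustrative.

```python
import json

def export_jsonl(records, path):
    """Write crawl records in JSON Lines format: one object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

records = [
    {"url": "https://example.com/", "status": 200, "title": "Example Domain"},
    {"url": "https://example.com/about", "status": 404, "title": None},
]
export_jsonl(records, "pages.jsonl")
```

JSONL streams well: each record can be appended as it is crawled and read back line by line, which is why crawl pipelines favor it over a single JSON array.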
Quick Start
Installation
pip install deepharvest
Basic Usage
Simple Crawls
# Basic crawl with depth limit
deepharvest crawl https://example.com --depth 5 --output ./output
# Crawl without JavaScript rendering (faster)
deepharvest crawl https://example.com --no-js --depth 3
# Crawl with JavaScript rendering (for SPAs)
deepharvest crawl https://example.com --js --depth 3
Limiting Crawl Scope
# Limit total number of URLs crawled
deepharvest crawl https://example.com --max-urls 1000 --depth 5
# Limit response size (skip large files)
deepharvest crawl https://example.com --max-size 10 --depth 3
# Limit pages per domain (useful for multi-domain crawls)
deepharvest crawl https://example.com --max-pages-per-domain 50 --depth 5
# Set time limit (stop after specified seconds)
deepharvest crawl https://example.com --time-limit 3600 --depth 5
# Combine multiple limits
deepharvest crawl https://example.com \
--depth 5 \
--max-urls 500 \
--max-pages-per-domain 100 \
--max-size 5 \
--time-limit 1800 \
--output ./output
Distributed Crawling
# Run in distributed mode with Redis
deepharvest crawl https://example.com \
--distributed \
--redis-url redis://localhost:6379 \
--workers 5 \
--depth 10
Using Configuration Files
# Use a YAML config file
deepharvest crawl --config config.yaml
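A config file might look like the following. This is a hypothetical example: the key names mirror the CLI flags and the CrawlConfig fields shown elsewhere in this README, but the actual schema may differ.

```yaml
# Hypothetical config.yaml - key names mirror the CLI flags and
# CrawlConfig fields in this README; the real schema may differ.
seed_urls:
  - https://example.com
max_depth: 5
max_urls: 1000
enable_js: true
output: ./output
```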
Resuming Interrupted Crawls
# Resume from a checkpoint file
deepharvest resume --state-file crawl_state.json
# Resume with custom config
deepharvest resume --state-file crawl_state.json --config config.yaml
# Resume with different output directory
deepharvest resume --state-file crawl_state.json --output ./new_output
Note: Resume functionality works in local mode only. In distributed mode, Redis persistence handles state management.
OSINT Mode
# Basic OSINT collection
deepharvest osint https://example.com
# With JSON output and link graph
deepharvest osint https://example.com --json --graph
# With screenshots
deepharvest osint https://example.com --screenshot
API Server
# Start API server
deepharvest serve --host 0.0.0.0 --port 8000
Pipeline Execution
# Run a pipeline from YAML file
deepharvest run pipeline.yaml
Python API
import asyncio
from deepharvest import DeepHarvest, CrawlConfig

async def main():
    config = CrawlConfig(
        seed_urls=["https://example.com"],
        max_depth=5,
        enable_js=True
    )
    crawler = DeepHarvest(config)
    await crawler.initialize()
    await crawler.crawl()
    await crawler.shutdown()

asyncio.run(main())
Installation
From PyPI
pip install deepharvest
From Source
git clone https://github.com/deepharvest/deepharvest
cd deepharvest
pip install -e .
Using Docker
docker-compose up
Documentation
Comprehensive documentation is available in the docs/ directory:
- API Reference - Complete API documentation
- Plugin Development Guide - Create and use plugins
- OSINT Usage - OSINT mode examples
- Browser Automation - Browser automation guide
- Benchmarks - Performance benchmarks
- Troubleshooting - Common issues and solutions
- Architecture - System architecture overview
Architecture
┌─────────────────────────────────────────────────────────┐
│ DeepHarvest Core │
├─────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Frontier │ │ Fetcher │ │ JS Renderer │ │
│ │ (BFS/DFS) │ │ (HTTP/2) │ │ (Playwright) │ │
│ └─────────────┘ └──────────────┘ └───────────────┘ │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Extractors │ │ Trap Det. │ │ URL Dedup │ │
│ │ (50+ fmt) │ │ (ML+Rules) │ │ (SimHash) │ │
│ └─────────────┘ └──────────────┘ └───────────────┘ │
├─────────────────────────────────────────────────────────┤
│ Distributed Layer │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Redis │ │ Workers │ │ Storage │ │
│ │ Frontier │ │ (N proc) │ │ (S3/FS) │ │
│ └──────────┘ └───────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
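The Frontier box above can be sketched as a priority queue whose ordering encodes the crawl strategy. This is a minimal illustration, not DeepHarvest's implementation; the class and parameter names are invented.

```python
import heapq
from itertools import count

class Frontier:
    """Priority-queue frontier sketch: lower priority values pop first.
    BFS orders by increasing depth; DFS by decreasing depth."""
    def __init__(self, strategy="bfs"):
        self.strategy = strategy
        self._heap = []
        self._seen = set()
        self._tie = count()  # preserves FIFO order among equal priorities

    def push(self, url, depth):
        if url in self._seen:  # cheap exact-URL dedup at enqueue time
            return
        self._seen.add(url)
        priority = depth if self.strategy == "bfs" else -depth
        heapq.heappush(self._heap, (priority, next(self._tie), url, depth))

    def pop(self):
        priority, _, url, depth = heapq.heappop(self._heap)
        return url, depth

frontier = Frontier("bfs")
frontier.push("https://example.com/", 0)
frontier.push("https://example.com/a", 1)
frontier.push("https://example.com/a", 1)  # ignored: already seen
```

Swapping the sign of the priority is all it takes to flip between breadth-first and depth-first, which is why a single heap can serve both strategies.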
How It Works
DeepHarvest operates as a distributed web crawling system that systematically discovers, fetches, and extracts content from websites. The architecture follows a modular design with clear separation of concerns.
Core Workflow
1. Initialization: The crawler initializes components (frontier, fetcher, extractors, ML models) based on configuration.
2. URL Management (Frontier): A priority queue manages URLs to be crawled. Supports BFS, DFS, and priority-based strategies. In distributed mode, Redis coordinates URL distribution across workers.
3. Content Fetching: The fetcher downloads web pages with retry logic, timeout handling, and rate limiting. Attempts HTTP/2 with fallback to HTTP/1.1.
4. HTML Parsing: A multi-strategy parser with a fallback chain (lxml → html5lib → html.parser) ensures robust parsing of malformed HTML.
5. JavaScript Rendering: For Single Page Applications (SPAs), Playwright renders pages, executes JavaScript, handles infinite scroll, and captures the final DOM state.
6. Content Extraction: Specialized extractors process different content types:
   - Text: HTML text extraction with boilerplate removal
   - Documents: PDF, DOCX, PPTX, XLSX text extraction
   - Media: Image metadata, OCR, audio transcription, video metadata
   - Structured Data: JSON-LD, Microdata, OpenGraph, Schema.org
7. Link Discovery: The advanced link extractor finds URLs from multiple sources:
   - HTML attributes (href, src, srcset)
   - JavaScript code (router.push, window.location)
   - Structured data (JSON-LD, Microdata)
   - Meta tags and data URIs
8. Deduplication: Three-tier deduplication system:
   - SHA256: exact URL/content duplicates
   - SimHash: near-duplicate detection (64-bit hashing)
   - MinHash LSH: scalable similarity search for large datasets
9. Trap Detection: ML and rule-based detection prevents infinite loops from:
   - Calendar-based URLs (date patterns)
   - Session ID parameters
   - Pagination traps
   - Query parameter explosions
10. Storage: Extracted content is stored with metadata. Supports filesystem, S3, and PostgreSQL backends.
11. Resume Support: DeepHarvest can resume interrupted crawls by:
    - Saving checkpoint state periodically (configurable interval)
    - Restoring visited URLs to prevent duplicates
    - Restoring the pending frontier queue to continue where it left off
    - Automatically skipping seed URLs when resuming from a checkpoint
    - Note: Resume is supported in local mode only; distributed mode relies on Redis persistence
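The SimHash tier of the deduplication step can be illustrated with a tiny self-contained version: every token votes on each of 64 bits, and a small Hamming distance between fingerprints flags a near-duplicate. This is a sketch of the technique, not DeepHarvest's implementation.

```python
import hashlib

def simhash64(text):
    """64-bit SimHash: each token votes on every bit; the sign wins."""
    votes = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(64):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def hamming(a, b):
    """Number of differing bits; small distance means near-duplicate."""
    return bin(a ^ b).count("1")

page = ("deep harvest is a python web crawler that extracts "
        "content from html pdf and office documents")
near = page.replace("office", "word")   # one token changed
other = ("a short poem about quiet morning rain falling "
         "softly on distant green hills")
```

Changing one token out of many flips only the bits that token happened to decide, so near-duplicate pages land a few bits apart while unrelated pages differ in roughly half the bits.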
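The rule-based side of trap detection can be sketched with a few URL patterns. These heuristics are invented for illustration; DeepHarvest's real rules and its ML side are more involved.

```python
import re
from urllib.parse import urlparse, parse_qs

# Illustrative rules only: calendar paths, session-ID query
# parameters, and query-parameter explosions.
CALENDAR_RE = re.compile(r"/\d{4}/\d{1,2}(/\d{1,2})?(/|$)")
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def looks_like_trap(url, max_params=8):
    parsed = urlparse(url)
    if CALENDAR_RE.search(parsed.path):
        return True  # calendar pages generate unbounded date URLs
    params = parse_qs(parsed.query)
    if any(k.lower() in SESSION_PARAMS for k in params):
        return True  # session IDs make every URL look unique
    if len(params) > max_params:
        return True  # query-parameter explosion
    return False
```

A crawler that skips URLs matching such rules avoids spending its budget on infinite calendar or session-ID URL spaces.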
Distributed Architecture
In distributed mode, multiple workers share a Redis-based frontier. Each worker:
- Pulls URLs from the shared queue
- Processes pages independently
- Respects per-host concurrency limits
- Reports metrics to centralized monitoring
This enables linear scaling: N workers process approximately N times the throughput of a single worker.
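The worker loop with a per-host concurrency cap can be sketched in-process, using an asyncio.Queue as a stand-in for the Redis frontier. All names here are illustrative, not DeepHarvest's API.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

class HostLimiter:
    """Per-host concurrency cap, as each worker enforces (sketch)."""
    def __init__(self, per_host=2):
        self._sems = defaultdict(lambda: asyncio.Semaphore(per_host))
    def for_url(self, url):
        return self._sems[urlparse(url).netloc]

async def worker(name, queue, limiter, results):
    while True:
        url = await queue.get()
        async with limiter.for_url(url):
            await asyncio.sleep(0)  # stand-in for fetch + extract
            results.append((name, url))
        queue.task_done()

async def main():
    queue = asyncio.Queue()  # stand-in for the shared Redis frontier
    for i in range(6):
        queue.put_nowait(f"https://example.com/page{i}")
    limiter, results = HostLimiter(per_host=2), []
    workers = [asyncio.create_task(worker(f"w{n}", queue, limiter, results))
               for n in range(3)]
    await queue.join()
    for w in workers:
        w.cancel()
    return results

results = asyncio.run(main())
```

In the real distributed mode the queue lives in Redis, so the same pull-process-report loop scales across machines rather than across tasks in one process.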
Resilience Features
- Parser Fallback: Automatic fallback between parsers when HTML is malformed
- Network Resilience: Exponential backoff retry, timeout handling, proxy support
- Memory Management: Streaming for large files, memory guards per worker
- Checkpointing: Periodic state saves enable resuming interrupted crawls
- Error Taxonomy: Structured error handling with detailed reporting
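The exponential-backoff retry mentioned above follows a standard pattern: double the wait after each failure and add jitter. This sketch shows the pattern only; the retry counts and delays DeepHarvest actually uses are not documented here.

```python
import random
import time

def fetch_with_retry(fetch, url, retries=4, base_delay=0.5, max_delay=30.0):
    """Retry a fetch with capped exponential backoff plus jitter."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except IOError:
            if attempt == retries:
                raise  # out of retries: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jittered wait

# A flaky fetcher that fails twice, then succeeds:
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("temporary network failure")
    return "<html>ok</html>"

result = fetch_with_retry(flaky, "https://example.com", base_delay=0.01)
```

Jitter matters in a distributed crawler: without it, many workers that failed together retry together and hammer the host in synchronized waves.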
Machine Learning Integration
- Page Classification: Identifies page types (article, product, forum, etc.) for intelligent prioritization
- Soft-404 Detection: Identifies pages that return 200 but are effectively 404s
- Quality Scoring: ML-based content quality scoring
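The soft-404 idea can be illustrated with a crude non-ML heuristic: a 200 response whose body is tiny and dominated by not-found language is treated as a dead page. The phrase list and threshold are invented; the ML detector is far more robust than this.

```python
NOT_FOUND_PHRASES = (
    "page not found", "404", "does not exist",
    "no longer available", "nothing was found",
)

def is_soft_404(status, text, min_length=200):
    """Heuristic stand-in for the ML detector: flag 200 responses
    that are short and contain not-found language."""
    if status != 200:
        return False  # real error codes are handled elsewhere
    body = text.lower()
    return len(body) < min_length and any(p in body for p in NOT_FOUND_PHRASES)
```

Filtering soft 404s keeps junk "not found" pages out of the extracted corpus even when the server insists everything returned successfully.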
