DeepHarvest

The World's Most Complete, Resilient, Multilingual Web Crawler

License: Apache-2.0 Python 3.9+ Docker

Features

Core Capabilities

  • Complete Coverage: Crawls entire websites including all subpages
  • All Content Types: HTML, PDF, DOCX, PPTX, XLSX, images, audio, video
  • JavaScript Support: Full SPA support with Playwright
  • Multilingual: Handles all languages, encodings, and scripts
  • Distributed: Redis-based distributed crawling with multiple workers
  • Resumable: Full checkpoint and resume support for interrupted crawls (local mode)
  • Intelligent: ML-based trap detection, content extraction, deduplication

Advanced Features

  • Smart Trap Detection: Calendar, pagination, session ID, faceted navigation
  • ML Content Extraction: Page classification, soft-404 detection, quality scoring
  • Advanced URL Management: SimHash, MinHash, LSH deduplication
  • Site Graph Analysis: PageRank, clustering, GraphML export
  • Observability: Prometheus metrics, Grafana dashboards
  • Extensible: Plugin system for custom extractors
  • OSINT Mode: Entity extraction, technology detection, link graph analysis
  • Browser Automation: High-level Playwright integration with screenshot capture
  • Pipeline Execution: YAML-based pipeline runner for complex workflows
  • API Server: REST API for programmatic access
  • Multiple Exporters: JSONL, Parquet, SQLite, VectorDB (FAISS/Chroma) support

Quick Start

Installation

pip install deepharvest

Basic Usage

Simple Crawls

# Basic crawl with depth limit
deepharvest crawl https://example.com --depth 5 --output ./output

# Crawl without JavaScript rendering (faster)
deepharvest crawl https://example.com --no-js --depth 3

# Crawl with JavaScript rendering (for SPAs)
deepharvest crawl https://example.com --js --depth 3

Limiting Crawl Scope

# Limit total number of URLs crawled
deepharvest crawl https://example.com --max-urls 1000 --depth 5

# Limit response size (skip large files)
deepharvest crawl https://example.com --max-size 10 --depth 3

# Limit pages per domain (useful for multi-domain crawls)
deepharvest crawl https://example.com --max-pages-per-domain 50 --depth 5

# Set time limit (stop after specified seconds)
deepharvest crawl https://example.com --time-limit 3600 --depth 5

# Combine multiple limits
deepharvest crawl https://example.com \
  --depth 5 \
  --max-urls 500 \
  --max-pages-per-domain 100 \
  --max-size 5 \
  --time-limit 1800 \
  --output ./output

Distributed Crawling

# Run in distributed mode with Redis
deepharvest crawl https://example.com \
  --distributed \
  --redis-url redis://localhost:6379 \
  --workers 5 \
  --depth 10

Using Configuration Files

# Use a YAML config file
deepharvest crawl --config config.yaml
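The config file schema is not shown in this README; a plausible sketch, assuming the YAML keys mirror the `CrawlConfig` fields (`seed_urls`, `max_depth`, `enable_js`) and the CLI flags above:

```yaml
# config.yaml — key names are assumptions mirroring CrawlConfig and the CLI flags
seed_urls:
  - https://example.com
max_depth: 5
max_urls: 1000
enable_js: true
output: ./output
```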

Resuming Interrupted Crawls

# Resume from a checkpoint file
deepharvest resume --state-file crawl_state.json

# Resume with custom config
deepharvest resume --state-file crawl_state.json --config config.yaml

# Resume with different output directory
deepharvest resume --state-file crawl_state.json --output ./new_output

Note: Resume functionality works in local mode only. In distributed mode, Redis persistence handles state management.

OSINT Mode

# Basic OSINT collection
deepharvest osint https://example.com

# With JSON output and link graph
deepharvest osint https://example.com --json --graph

# With screenshots
deepharvest osint https://example.com --screenshot

API Server

# Start API server
deepharvest serve --host 0.0.0.0 --port 8000

Pipeline Execution

# Run a pipeline from YAML file
deepharvest run pipeline.yaml
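The pipeline format is not documented in this README; a purely hypothetical `pipeline.yaml` might chain a crawl with an export step (all step and key names below are illustrative assumptions, not a verified schema):

```yaml
# pipeline.yaml — hypothetical sketch; step types and keys are assumptions
steps:
  - name: crawl-site
    type: crawl
    seed_urls:
      - https://example.com
    max_depth: 3
  - name: export-results
    type: export
    format: jsonl
    output: ./output/results.jsonl
```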

Python API

import asyncio
from deepharvest import DeepHarvest, CrawlConfig

async def main():
    config = CrawlConfig(
        seed_urls=["https://example.com"],
        max_depth=5,
        enable_js=True
    )
    
    crawler = DeepHarvest(config)
    await crawler.initialize()
    await crawler.crawl()
    await crawler.shutdown()

asyncio.run(main())

Installation

From PyPI

pip install deepharvest

From Source

git clone https://github.com/deepharvest/deepharvest
cd deepharvest
pip install -e .

Using Docker

docker-compose up

Documentation

Comprehensive documentation is available in the docs/ directory.

Architecture

┌─────────────────────────────────────────────────────────┐
│                    DeepHarvest Core                       │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │  Frontier   │  │   Fetcher    │  │  JS Renderer  │  │
│  │  (BFS/DFS)  │  │  (HTTP/2)    │  │  (Playwright) │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ Extractors  │  │  Trap Det.   │  │  URL Dedup    │  │
│  │  (50+ fmt)  │  │  (ML+Rules)  │  │  (SimHash)    │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
├─────────────────────────────────────────────────────────┤
│                  Distributed Layer                       │
│  ┌──────────┐  ┌───────────┐  ┌──────────┐            │
│  │  Redis   │  │  Workers  │  │ Storage  │            │
│  │ Frontier │  │  (N proc) │  │ (S3/FS)  │            │
│  └──────────┘  └───────────┘  └──────────┘            │
└─────────────────────────────────────────────────────────┘

How It Works

DeepHarvest operates as a distributed web crawling system that systematically discovers, fetches, and extracts content from websites. The architecture follows a modular design with clear separation of concerns.

Core Workflow

  1. Initialization: The crawler initializes components (frontier, fetcher, extractors, ML models) based on configuration.

  2. URL Management (Frontier): A priority queue manages URLs to be crawled. Supports BFS, DFS, and priority-based strategies. In distributed mode, Redis coordinates URL distribution across workers.

  3. Content Fetching: The fetcher downloads web pages with retry logic, timeout handling, and rate limiting, using HTTP/2 where available and falling back to HTTP/1.1.

  4. HTML Parsing: Multi-strategy parser with fallback chain (lxml → html5lib → html.parser) ensures robust parsing of malformed HTML.

  5. JavaScript Rendering: For Single Page Applications (SPAs), Playwright renders pages, executes JavaScript, handles infinite scroll, and captures the final DOM state.

  6. Content Extraction: Specialized extractors process different content types:

    • Text: HTML text extraction with boilerplate removal
    • Documents: PDF, DOCX, PPTX, XLSX text extraction
    • Media: Image metadata, OCR, audio transcription, video metadata
    • Structured Data: JSON-LD, Microdata, OpenGraph, Schema.org
  7. Link Discovery: Advanced link extractor finds URLs from multiple sources:

    • HTML attributes (href, src, srcset)
    • JavaScript code (router.push, window.location)
    • Structured data (JSON-LD, Microdata)
    • Meta tags and data URIs
  8. Deduplication: Three-tier deduplication system:

    • SHA256: Exact URL/content duplicates
    • SimHash: Near-duplicate detection (64-bit hashing)
    • MinHash LSH: Scalable similarity search for large datasets
  9. Trap Detection: ML and rule-based detection prevents infinite loops from:

    • Calendar-based URLs (date patterns)
    • Session ID parameters
    • Pagination traps
    • Query parameter explosions
  10. Storage: Extracted content is stored with metadata. Supports filesystem, S3, and PostgreSQL backends.

  11. Resume Support: DeepHarvest can resume interrupted crawls by:

    • Saving checkpoint state periodically (configurable interval)
    • Restoring visited URLs to prevent duplicates
    • Restoring pending frontier queue to continue from where it left off
    • Automatically skipping seed URLs if resuming from checkpoint
    • Note: Resume is supported in local mode only; distributed mode relies on Redis persistence
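The SimHash tier of the deduplication step (step 8) can be sketched as follows. This is a minimal illustration of the technique, not DeepHarvest's actual implementation:

```python
# Minimal SimHash sketch: similar texts produce fingerprints that differ
# in few bits, so near-duplicates can be found via Hamming distance.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over whitespace tokens."""
    v = [0] * bits
    for token in text.lower().split():
        # Hash each token to a stable 64-bit value.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # The sign of each accumulated weight becomes one fingerprint bit.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
c = simhash("completely unrelated text about something else")
# Near-duplicates differ in far fewer bits than unrelated texts.
print(hamming(a, b) < hamming(a, c))
```

A production system indexes fingerprints (e.g. by bit-rotated prefixes) rather than comparing all pairs, which is where the MinHash LSH tier comes in.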

Distributed Architecture

In distributed mode, multiple workers share a Redis-based frontier. Each worker:

  • Pulls URLs from the shared queue
  • Processes pages independently
  • Respects per-host concurrency limits
  • Reports metrics to centralized monitoring

This enables near-linear scaling: N workers achieve approximately N times the throughput of a single worker.
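The worker loop can be sketched in-process. Here a plain `queue.Queue` stands in for the shared Redis frontier (the actual Redis commands are not shown in this README; real workers would use blocking pops against Redis instead):

```python
# In-process sketch of the distributed worker loop: N workers pull URLs
# from a shared frontier and process them independently. A queue.Queue
# stands in for the Redis-backed frontier used in distributed mode.
import queue
import threading

frontier = queue.Queue()
for url in ["https://a.example/1", "https://a.example/2", "https://b.example/1"]:
    frontier.put(url)

processed = {}          # url -> worker id that handled it
lock = threading.Lock()

def worker(worker_id: int) -> None:
    while True:
        try:
            url = frontier.get_nowait()  # Redis equivalent: a blocking pop
        except queue.Empty:
            return  # frontier drained; worker exits
        # "Process" the page independently; record which worker handled it.
        with lock:
            processed[url] = worker_id
        frontier.task_done()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(processed))  # 3 — each URL is processed exactly once
```

Because each URL is popped from the queue exactly once, no coordination beyond the shared frontier is needed; per-host politeness limits would be enforced inside each worker.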

Resilience Features

  • Parser Fallback: Automatic fallback between parsers when HTML is malformed
  • Network Resilience: Exponential backoff retry, timeout handling, proxy support
  • Memory Management: Streaming for large files, memory guards per worker
  • Checkpointing: Periodic state saves enable resuming interrupted crawls
  • Error Taxonomy: Structured error handling with detailed reporting
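The parser fallback chain can be sketched with the standard library. The chain below mirrors the order named above (lxml first, stdlib last); the helper names are illustrative, not DeepHarvest's API:

```python
# Sketch of a parser fallback chain: try the faster third-party parser
# first, and fall back to the lenient stdlib parser on any failure
# (including the dependency simply not being installed).
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Last-resort stdlib parser; event-based, so it tolerates unclosed tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

def parse_lxml(html):
    import lxml.html  # raises ImportError if lxml is absent -> next parser
    return [a.get("href") for a in lxml.html.fromstring(html).iter("a")]

def parse_stdlib(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

def extract_links(html):
    for parser in (parse_lxml, parse_stdlib):
        try:
            return parser(html)
        except Exception:  # parse failure or missing dependency -> fall through
            continue
    return []

# Malformed input: neither <a> is closed, yet both links are recovered.
print(extract_links('<a href="/a">one<a href="/b">two'))  # ['/a', '/b'] with either parser
```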

Machine Learning Integration

  • Page Classification: Identifies page types (article, product, forum, etc.) for intelligent prioritization
  • Soft-404 Detection: Identifies pages that return 200 but are effectively 404s
  • Quality Scoring: ML-based scoring of extracted content quality
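DeepHarvest's models themselves aren't shown in this README; a simple rule-based stand-in conveys what soft-404 detection does (the phrase list and length threshold below are illustrative assumptions, not the project's actual model):

```python
# Illustrative soft-404 heuristic: a page that returns HTTP 200 but reads
# like an error page. Phrases and threshold are assumptions for the sketch.
ERROR_PHRASES = ("page not found", "404", "no longer available", "does not exist")

def looks_like_soft_404(status: int, text: str, min_length: int = 200) -> bool:
    if status != 200:
        return False  # a real error status is not a *soft* 404
    body = text.lower()
    hits = sum(phrase in body for phrase in ERROR_PHRASES)
    # Short pages containing error language are likely soft 404s.
    return hits >= 1 and len(text) < min_length

print(looks_like_soft_404(200, "Sorry, this page does not exist."))  # True
```

An ML classifier generalizes this idea, learning from page features rather than a fixed phrase list.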