# 🕷️ Spider

<p align="center"> <img src="spider.png" /> </p>

<p align="center"> <strong>A modern, scalable, and extensible web crawler for efficient distributed crawling and data extraction</strong> </p>

<p align="center"> Built with asynchronous I/O, plugin architecture, and distributed task processing </p>

## ✨ Features
### Core Capabilities

- 🚀 Asynchronous Crawling - Non-blocking I/O with `aiohttp` and `asyncio` for high performance
- 🌐 Distributed Processing - Scale across multiple workers using Celery and Redis
- 💾 Database Persistence - PostgreSQL storage with SQLAlchemy ORM
- 🔌 Plugin Architecture - Extensible system for custom data processing
- 📊 Robust Logging - Console, file, and database logging for diagnostics
- 🔗 URL Normalization - Smart deduplication and link management (see the sketch after this list)
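URL normalization is what keeps the crawl frontier free of duplicates. The exact rules live in `src/spider/utils.py`; the sketch below is illustrative only (not Spider's actual implementation) and shows the usual idea: resolve relative links, drop fragments, and canonicalize case before deduplicating.

```python
from urllib.parse import urldefrag, urljoin, urlparse

def normalize_url(base: str, href: str) -> str:
    """Illustrative normalization (spider's utils.py may differ):
    resolve relative links, strip #fragments, lowercase the
    scheme and host, and collapse trailing slashes."""
    url, _fragment = urldefrag(urljoin(base, href))
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"

# Three spellings of the same page collapse to one frontier entry
seen = set()
for href in ["/about", "/about/", "http://Example.com/about#team"]:
    seen.add(normalize_url("http://example.com", href))
print(seen)  # {'http://example.com/about'}
```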
### Included Plugins
- 🕸️ Web Scraper - Comprehensive webpage data extraction
- 📝 Title Logger - Extract and store page titles
- 🤖 Entity Extraction - NLP-based named entity recognition (spaCy)
- 🎭 Dynamic Scraper - JavaScript-rendered pages (Playwright)
- 📈 Real-time Metrics - Live crawl statistics via WebSocket
## 🚀 Quick Start

### Prerequisites
- Python 3.11+
- PostgreSQL
- Redis
### Installation

```bash
# Clone repository
git clone https://github.com/roshanlam/spider.git
cd spider

# Install with Poetry (recommended)
poetry install

# Install Playwright browsers
poetry run playwright install chromium

# Download spaCy model
poetry run python -m spacy download en_core_web_sm
```
### Configuration

Edit `src/spider/config.yaml`:

```yaml
start_url: "http://example.com"
rate_limit: 1  # seconds between requests
threads: 8
timeout: 10

database:
  url: "postgresql://username@localhost/crawlerdb"

celery:
  broker_url: "redis://localhost:6379/0"
  result_backend: "redis://localhost:6379/0"
```
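The `config` object used in the examples below comes from this file. The loader in `src/spider/config.py` may do more (defaults, env overrides, validation), but conceptually it is a YAML load; a minimal sketch, assuming PyYAML:

```python
import yaml

# Minimal sketch of what a config loader does; spider/config.py
# may add defaults, overrides, or validation on top of this.
with open("src/spider/config.yaml") as f:
    config = yaml.safe_load(f)

print(config["start_url"])             # "http://example.com"
print(config["celery"]["broker_url"])  # "redis://localhost:6379/0"
```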
### Run the Crawler

Simple way:

```bash
poetry run python run.py
```

Or as a module:

```bash
poetry run python -m spider.main
```

### Query Scraped Data

```bash
# View all scraped data
poetry run python query_data.py
```

Or programmatically, from a REPL (`poetry run python`):

```python
>>> from spider.plugins.scraper_utils import ScraperDataQuery
>>> query = ScraperDataQuery()
>>> page = query.get_page_data("http://example.com")
>>> print(page['title'])
```
## 📦 Project Structure

```
spider/
├── src/spider/
│   ├── spider.py                  # Core async crawler
│   ├── plugin.py                  # Plugin system
│   ├── storage.py                 # Database persistence
│   ├── link_finder.py             # HTML parsing and link extraction
│   ├── tasks.py                   # Celery distributed tasks
│   ├── config.py                  # Configuration loader
│   ├── utils.py                   # URL normalization and utilities
│   └── plugins/
│       ├── web_scraper_plugin.py  # Comprehensive web scraper
│       ├── scraper_utils.py       # Query utilities
│       ├── title_logger_plugin.py # Title extraction
│       ├── entity_extraction.py   # NLP entity extraction
│       ├── dynamic_scraper.py     # JavaScript rendering
│       └── real_time_metrics.py   # Live metrics
├── docs/                          # Documentation
├── examples/                      # Usage examples
├── tests/                         # Test suite
├── run.py                         # Simple runner script
├── query_data.py                  # Data query script
└── pyproject.toml                 # Poetry dependencies
```
## 🔌 Plugin System
Spider uses a powerful plugin architecture for extensibility.
### Using the Web Scraper Plugin

The comprehensive web scraper extracts structured data from every page:

```python
from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()

# Get page data
page = query.get_page_data("http://example.com")
print(f"Title: {page['title']}")
print(f"Words: {page['word_count']}")
print(f"Links: {len(page['links'])}")

# Search pages
results = query.search_by_title("python")

# Get statistics
stats = query.get_page_statistics()
print(f"Total pages: {stats['total_pages']}")
```
What gets extracted:
- Metadata (title, description, keywords, author, language)
- Content structure (headings, word count, text analysis)
- Links (internal/external with anchor text)
- Images (URLs, alt text, dimensions)
- Forms (actions, methods, input fields)
- Social metadata (OpenGraph, Twitter Card)
- Structured data (JSON-LD)
- Page structure (semantic HTML)
📚 Full documentation: docs/web-scraper/
### Creating Custom Plugins

```python
from spider.plugin import Plugin

class MyPlugin(Plugin):
    async def should_run(self, url: str, content: str) -> bool:
        return True  # Run on all pages

    async def process(self, url: str, content: str) -> str:
        # Your processing logic here
        print(f"Processing {url}")
        return content

# Register in main.py
plugin_manager.register(MyPlugin())
```
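`should_run` lets a plugin opt out of pages it does not care about, so expensive work only runs where it is relevant. A hypothetical plugin that only processes blog posts:

```python
from spider.plugin import Plugin

class BlogWordCountPlugin(Plugin):
    """Hypothetical example: count words, but only on /blog/ URLs."""

    async def should_run(self, url: str, content: str) -> bool:
        # Skip everything outside the blog section
        return "/blog/" in url

    async def process(self, url: str, content: str) -> str:
        print(f"{url}: ~{len(content.split())} words")
        return content  # pass the content through unchanged
```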
📚 Plugin documentation: Plugin.md
## 🌐 Distributed Mode

Run Spider across multiple workers for large-scale crawling.

### Start Celery Worker

```bash
celery -A spider.tasks.celery_app worker --loglevel=info
```

### Queue Tasks

```python
from spider.tasks import crawl_task

result = crawl_task.delay("https://example.com")
print(f"Task ID: {result.id}")
```
## 📊 Usage Examples

### Basic Crawling

```python
import asyncio

from spider.spider import Spider
from spider.config import config
from spider.plugin import PluginManager
from spider.plugins.web_scraper_plugin import WebScraperPlugin

# Setup
plugin_manager = PluginManager()
plugin_manager.register(WebScraperPlugin())

# Create and run crawler
crawler = Spider(config['start_url'], config, plugin_manager)
asyncio.run(crawler.crawl())
```
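Since `config` is loaded from YAML and accessed like a mapping, settings can also be tuned in code before a crawl starts. A sketch under that assumption, using the keys shown in the Configuration section:

```python
import asyncio

from spider.spider import Spider
from spider.config import config
from spider.plugin import PluginManager
from spider.plugins.web_scraper_plugin import WebScraperPlugin

# Sketch: tune settings in code rather than editing config.yaml.
# Assumes config is a plain dict with the keys shown above.
config['rate_limit'] = 2  # slow down: 2 seconds between requests
config['threads'] = 4

plugin_manager = PluginManager()
plugin_manager.register(WebScraperPlugin())

crawler = Spider("https://example.org", config, plugin_manager)
asyncio.run(crawler.crawl())
```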
### Query Data

```python
from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()

# Get all pages
pages = query.get_all_pages(limit=10)
for page in pages:
    print(f"{page['url']}: {page['title']}")

# Find pages with forms
pages_with_forms = query.get_pages_with_forms()

# Export data
query.export_to_json("http://example.com", "output.json")
```
### SEO Analysis

```python
import json

from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()
pages = query.get_all_pages()
for page in pages:
    # Check for SEO issues
    if not page['description']:
        print(f"⚠️ Missing description: {page['url']}")

    headings = json.loads(page['headings'])
    if len(headings['h1']) == 0:
        print(f"⚠️ No H1 heading: {page['url']}")
```
## 🧪 Testing

```bash
# Run all tests
poetry run pytest

# Run web scraper plugin tests
poetry run python test_web_scraper_plugin.py

# Check coverage
poetry run pytest --cov=spider
```
## 📚 Documentation
- Web Scraper Plugin - Complete plugin documentation
- Quick Start Guide - Get started in 3 steps
- Quick Reference - Command cheat sheet
- Plugin System - Creating custom plugins
- Examples - Code examples and use cases
## 🤝 Contributing

We welcome contributions! Here's how to get started:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
Please read CONTRIBUTING.md for detailed guidelines.
## 📋 Requirements
- Python 3.11+
- PostgreSQL 12+
- Redis 6+
- Poetry (package manager)
See `pyproject.toml` for complete dependencies.
## 📄 License
MIT License - See LICENSE for details.
## 🙏 Acknowledgments
Built with:
- aiohttp - Async HTTP client/server
- BeautifulSoup - HTML parsing
- Celery - Distributed task queue
- SQLAlchemy - SQL toolkit and ORM
- Playwright - Browser automation
- spaCy - NLP library
## 🐛 Issues & Support
- Bug Reports: GitHub Issues
- Questions: GitHub Discussions
- Documentation: docs/
<p align="center"> Made with ❤️ by <a href="https://github.com/roshanlam">Roshan Lamichhaner</a> </p> <p align="center"> <a href="https://github.com/roshanlam/spider/stargazers">⭐ Star us on GitHub!</a> </p>
