# 🕷️ Spider

<p align="center"> <img src="spider.png" /> </p>

<p align="center"> <strong>A modern, scalable, and extensible web crawler for efficient distributed crawling and data extraction</strong> </p>

<p align="center"> Built with asynchronous I/O, plugin architecture, and distributed task processing </p>

## ✨ Features
### Core Capabilities

- 🚀 Asynchronous Crawling - Non-blocking I/O with `aiohttp` and `asyncio` for high performance
- 🌐 Distributed Processing - Scale across multiple workers using Celery and Redis
- 💾 Database Persistence - PostgreSQL storage with SQLAlchemy ORM
- 🔌 Plugin Architecture - Extensible system for custom data processing
- 📊 Robust Logging - Console, file, and database logging for diagnostics
- 🔗 URL Normalization - Smart deduplication and link management (see the sketch after this list)
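URL normalization is what keeps the crawl frontier free of duplicates. The exact rules live in `src/spider/utils.py`; the sketch below is illustrative only (not Spider's actual implementation) and shows the usual idea: resolve relative links, drop fragments, and canonicalize case before deduplicating.

```python
from urllib.parse import urldefrag, urljoin, urlparse

def normalize_url(base: str, href: str) -> str:
    """Illustrative normalization (spider's utils.py may differ):
    resolve relative links, strip #fragments, lowercase the
    scheme and host, and collapse trailing slashes."""
    url, _fragment = urldefrag(urljoin(base, href))
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"

# Three spellings of the same page collapse to one frontier entry
seen = set()
for href in ["/about", "/about/", "http://Example.com/about#team"]:
    seen.add(normalize_url("http://example.com", href))
print(seen)  # {'http://example.com/about'}
```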
### Included Plugins
- 🕸️ Web Scraper - Comprehensive webpage data extraction
- 📝 Title Logger - Extract and store page titles
- 🤖 Entity Extraction - NLP-based named entity recognition (spaCy)
- 🎭 Dynamic Scraper - JavaScript-rendered pages (Playwright)
- 📈 Real-time Metrics - Live crawl statistics via WebSocket
## 🚀 Quick Start

### Prerequisites
- Python 3.11+
- PostgreSQL
- Redis
### Installation

```bash
# Clone repository
git clone https://github.com/roshanlam/spider.git
cd spider

# Install with Poetry (recommended)
poetry install

# Install Playwright browsers
poetry run playwright install chromium

# Download spaCy model
poetry run python -m spacy download en_core_web_sm
```
### Configuration

Edit `src/spider/config.yaml`:

```yaml
start_url: "http://example.com"
rate_limit: 1  # seconds between requests
threads: 8
timeout: 10

database:
  url: "postgresql://username@localhost/crawlerdb"

celery:
  broker_url: "redis://localhost:6379/0"
  result_backend: "redis://localhost:6379/0"
```
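The `config` object used in the examples below comes from this file. The loader in `src/spider/config.py` may do more (defaults, env overrides, validation), but conceptually it is a YAML load; a minimal sketch, assuming PyYAML:

```python
import yaml

# Minimal sketch of what a config loader does; spider/config.py
# may add defaults, overrides, or validation on top of this.
with open("src/spider/config.yaml") as f:
    config = yaml.safe_load(f)

print(config["start_url"])             # "http://example.com"
print(config["celery"]["broker_url"])  # "redis://localhost:6379/0"
```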
### Run the Crawler

Simple way:

```bash
poetry run python run.py
```

Or as a module:

```bash
poetry run python -m spider.main
```

### Query Scraped Data

```bash
# View all scraped data
poetry run python query_data.py
```

Or programmatically, from a REPL (`poetry run python`):

```python
>>> from spider.plugins.scraper_utils import ScraperDataQuery
>>> query = ScraperDataQuery()
>>> page = query.get_page_data("http://example.com")
>>> print(page['title'])
```
## 📦 Project Structure

```
spider/
├── src/spider/
│   ├── spider.py                  # Core async crawler
│   ├── plugin.py                  # Plugin system
│   ├── storage.py                 # Database persistence
│   ├── link_finder.py             # HTML parsing and link extraction
│   ├── tasks.py                   # Celery distributed tasks
│   ├── config.py                  # Configuration loader
│   ├── utils.py                   # URL normalization and utilities
│   └── plugins/
│       ├── web_scraper_plugin.py  # Comprehensive web scraper
│       ├── scraper_utils.py       # Query utilities
│       ├── title_logger_plugin.py # Title extraction
│       ├── entity_extraction.py   # NLP entity extraction
│       ├── dynamic_scraper.py     # JavaScript rendering
│       └── real_time_metrics.py   # Live metrics
├── docs/                          # Documentation
├── examples/                      # Usage examples
├── tests/                         # Test suite
├── run.py                         # Simple runner script
├── query_data.py                  # Data query script
└── pyproject.toml                 # Poetry dependencies
```
## 🔌 Plugin System
Spider uses a powerful plugin architecture for extensibility.
### Using the Web Scraper Plugin

The comprehensive web scraper extracts structured data from every page:

```python
from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()

# Get page data
page = query.get_page_data("http://example.com")
print(f"Title: {page['title']}")
print(f"Words: {page['word_count']}")
print(f"Links: {len(page['links'])}")

# Search pages
results = query.search_by_title("python")

# Get statistics
stats = query.get_page_statistics()
print(f"Total pages: {stats['total_pages']}")
```
What gets extracted:
- Metadata (title, description, keywords, author, language)
- Content structure (headings, word count, text analysis)
- Links (internal/external with anchor text)
- Images (URLs, alt text, dimensions)
- Forms (actions, methods, input fields)
- Social metadata (OpenGraph, Twitter Card)
- Structured data (JSON-LD)
- Page structure (semantic HTML)
📚 Full documentation: docs/web-scraper/
### Creating Custom Plugins

```python
from spider.plugin import Plugin

class MyPlugin(Plugin):
    async def should_run(self, url: str, content: str) -> bool:
        return True  # Run on all pages

    async def process(self, url: str, content: str) -> str:
        # Your processing logic here
        print(f"Processing {url}")
        return content

# Register in main.py
plugin_manager.register(MyPlugin())
```
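`should_run` lets a plugin opt out of pages it does not care about, so expensive work only runs where it is relevant. A hypothetical plugin that only processes blog posts:

```python
from spider.plugin import Plugin

class BlogWordCountPlugin(Plugin):
    """Hypothetical example: count words, but only on /blog/ URLs."""

    async def should_run(self, url: str, content: str) -> bool:
        # Skip everything outside the blog section
        return "/blog/" in url

    async def process(self, url: str, content: str) -> str:
        print(f"{url}: ~{len(content.split())} words")
        return content  # pass the content through unchanged
```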
📚 Plugin documentation: Plugin.md
## 🌐 Distributed Mode

Run Spider across multiple workers for large-scale crawling.

### Start Celery Worker

```bash
celery -A spider.tasks.celery_app worker --loglevel=info
```

### Queue Tasks

```python
from spider.tasks import crawl_task

result = crawl_task.delay("https://example.com")
print(f"Task ID: {result.id}")
```
## 📊 Usage Examples

### Basic Crawling

```python
import asyncio

from spider.spider import Spider
from spider.config import config
from spider.plugin import PluginManager
from spider.plugins.web_scraper_plugin import WebScraperPlugin

# Setup
plugin_manager = PluginManager()
plugin_manager.register(WebScraperPlugin())

# Create and run crawler
crawler = Spider(config['start_url'], config, plugin_manager)
asyncio.run(crawler.crawl())
```
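Since `config` is loaded from YAML and accessed like a mapping, settings can also be tuned in code before a crawl starts. A sketch under that assumption, using the keys shown in the Configuration section:

```python
import asyncio

from spider.spider import Spider
from spider.config import config
from spider.plugin import PluginManager
from spider.plugins.web_scraper_plugin import WebScraperPlugin

# Sketch: tune settings in code rather than editing config.yaml.
# Assumes config is a plain dict with the keys shown above.
config['rate_limit'] = 2  # slow down: 2 seconds between requests
config['threads'] = 4

plugin_manager = PluginManager()
plugin_manager.register(WebScraperPlugin())

crawler = Spider("https://example.org", config, plugin_manager)
asyncio.run(crawler.crawl())
```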
### Query Data

```python
from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()

# Get all pages
pages = query.get_all_pages(limit=10)
for page in pages:
    print(f"{page['url']}: {page['title']}")

# Find pages with forms
pages_with_forms = query.get_pages_with_forms()

# Export data
query.export_to_json("http://example.com", "output.json")
```
### SEO Analysis

```python
import json

from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()
pages = query.get_all_pages()
for page in pages:
    # Check for SEO issues
    if not page['description']:
        print(f"⚠️ Missing description: {page['url']}")

    headings = json.loads(page['headings'])
    if len(headings['h1']) == 0:
        print(f"⚠️ No H1 heading: {page['url']}")
```
## 🧪 Testing

```bash
# Run all tests
poetry run pytest

# Run web scraper plugin tests
poetry run python test_web_scraper_plugin.py

# Check coverage
poetry run pytest --cov=spider
```
## 📚 Documentation
- Web Scraper Plugin - Complete plugin documentation
- Quick Start Guide - Get started in 3 steps
- Quick Reference - Command cheat sheet
- Plugin System - Creating custom plugins
- Examples - Code examples and use cases
## 🤝 Contributing

We welcome contributions! Here's how to get started:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
Please read CONTRIBUTING.md for detailed guidelines.
## 📋 Requirements
- Python 3.11+
- PostgreSQL 12+
- Redis 6+
- Poetry (package manager)
See `pyproject.toml` for complete dependencies.
## 📄 License
MIT License - See LICENSE for details.
## 🙏 Acknowledgments
Built with:
- aiohttp - Async HTTP client/server
- BeautifulSoup - HTML parsing
- Celery - Distributed task queue
- SQLAlchemy - SQL toolkit and ORM
- Playwright - Browser automation
- spaCy - NLP library
## 🐛 Issues & Support
- Bug Reports: GitHub Issues
- Questions: GitHub Discussions
- Documentation: docs/
<p align="center"> Made with ❤️ by <a href="https://github.com/roshanlam">Roshan Lamichhaner</a> </p> <p align="center"> <a href="https://github.com/roshanlam/spider/stargazers">⭐ Star us on GitHub!</a> </p>
