# Scrapion - Web Scraping Automation System

A Python library for automated web scraping with intelligent fallback mechanisms and accessibility handling.
## Features
- Dual Input Modes: Accept URLs directly or search queries
- Smart URL Management: Automatically split search results into main (1-5) and backup (6-10) lists
- Intelligent Fallback: Retry with backup URLs if primary URLs fail
- Content Extraction: Uses Playwright for robust web content retrieval
- Search Integration: DuckDuckGo search with human-like behavior to evade bot detection
- Structured Reports: JSON-formatted reports with success/failure tracking
- Flexible Output: Output to stdout or save to file
- Auto Browser Setup: Firefox browser automatically installs on first use
## Installation

### From PyPI (Recommended)

```bash
pip install scrapion
# Firefox browser will auto-install on first use
```
### Build from Source

```bash
# Clone the repository
git clone https://github.com/aula-id/scrapion
cd scrapion

# Install in editable mode
pip install -e .

# Or install dependencies manually
pip install -r requirements.txt
```
## Usage

### As a Library

```python
from scrapion import Client

# Create client (Firefox auto-installs if needed)
client = Client()

# Process a single URL - the report object contains all data
report = client.run("https://example.com")

# Access report data directly
print(f"Successful scrapes: {report.successful_scrapes}")
print(f"Results: {report.results}")
print(f"Report dict: {report.to_dict()}")

# Or output to stdout/file
client.output_report("stdio")

# Process a search query
report = client.run("python async programming")
client.output_report("file", "./report.json")

# Skip the browser check (useful in CI or when the browser is pre-installed)
client = Client(skip_browser_check=True)

# Or via environment variable:
# SCRAPION_SKIP_BROWSER_CHECK=1 python script.py
```
### As a CLI Tool

```bash
# Output to stdout (JSON)
scrapion "https://example.com" --report stdio
scrapion "rust tutorial" --report stdio

# Save to file
scrapion "machine learning" --report file --output ./results.json
```
## Architecture

### Core Modules
- input_handler.py: Parse and validate user input (URL vs search query)
- list_manager.py: Manage URL lists (main list 1-5, backup list 6-10)
- search_engine.py: DuckDuckGo search with Playwright
- web_access.py: Fetch and convert web content to markdown
- report_generator.py: Generate JSON reports with metadata
- orchestrator.py: Main workflow orchestrator exposing the Client class (follows CONCEPT.md)
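The first two modules above can be pictured with a minimal sketch; `classify_input` and `split_urls` are illustrative helpers, not Scrapion's actual API:

```python
from urllib.parse import urlparse

def classify_input(text: str) -> str:
    """Decide whether the input is a direct URL or a search query."""
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "query"

def split_urls(urls: list[str]) -> tuple[list[str], list[str]]:
    """Split up to 10 search results into main (1-5) and backup (6-10)."""
    return urls[:5], urls[5:10]
```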
### Workflow (CONCEPT.md)
```
User Input
    ↓
[Phase 1] Parse Input
    ├→ URL: Single URL mode
    └→ Query: Multi-URL mode
[Phase 2] Search (if query)
    ├→ Execute DuckDuckGo search
    ├→ Extract 10 URLs
    └→ Split into main (1-5) and backup (6-10)
[Phase 3] Scraping Loop
    ├→ Try main list (1-5)
    │   ├→ Success: Report and exit
    │   └→ Failure: Next from main
    └→ Try backup list (6-10)
        ├→ Success: Report and exit
        └→ Failure: Next from backup
[Phase 4] Report Generation
    └→ Compile results and output
```
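The Phase 3 fallback loop can be sketched as plain Python; `fetch` here is a stand-in for the Playwright-based retrieval, and the function is illustrative, not Scrapion's internals:

```python
def scrape_with_fallback(main_urls, backup_urls, fetch):
    """Try the main list (1-5) first, then the backup list (6-10).

    `fetch(url)` is assumed to return page content on success or
    None on failure. The loop stops at the first success, matching
    the "Success: Report and exit" branch above.
    """
    failed = []
    for source, urls in (("main_list", main_urls), ("backup_list", backup_urls)):
        for url in urls:
            content = fetch(url)
            if content is not None:
                return {"url": url, "source": source,
                        "content": content, "failed_urls": failed}
            failed.append(url)
    # Every URL failed
    return {"url": None, "source": None, "content": None, "failed_urls": failed}
```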
## Report Object

The `client.run()` method returns a `Report` object with the following attributes:
```python
# Directly access report data
report.query               # Original input (URL or query)
report.mode                # "single_url" or "multi_url"
report.successful_scrapes  # Number of successful scrapes
report.failed_scrapes      # Number of failed scrapes
report.results             # List of ScrapeResult objects
report.failed_urls         # List of failed URLs

# Convert to dict or JSON
report.to_dict()  # Returns dictionary
report.to_json()  # Returns JSON string

# Output methods
report.print_to_stdout()     # Print JSON to stdout
report.save_to_file("path")  # Save to JSON file
```
### JSON Structure

```json
{
  "query": "search query or URL",
  "mode": "single_url or multi_url",
  "total_urls_attempted": 10,
  "successful_scrapes": 3,
  "failed_scrapes": 7,
  "results": [
    {
      "url": "https://example.com",
      "status": "success or failed",
      "accessible": true,
      "content": "scraped content...",
      "source": "main_list, backup_list, or single_url",
      "timestamp": "2025-10-31T08:39:07Z"
    }
  ],
  "failed_urls": ["url1", "url2"],
  "generated_at": "2025-10-31T08:39:07Z"
}
```
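A saved report can be consumed with the standard `json` module; the snippet below uses made-up data in the shape shown above:

```python
import json

# Illustrative report matching the structure above
raw = """{
  "query": "python async programming",
  "mode": "multi_url",
  "total_urls_attempted": 2,
  "successful_scrapes": 1,
  "failed_scrapes": 1,
  "results": [
    {"url": "https://example.com", "status": "success",
     "accessible": true, "content": "scraped content...",
     "source": "main_list", "timestamp": "2025-10-31T08:39:07Z"}
  ],
  "failed_urls": ["https://example.org"],
  "generated_at": "2025-10-31T08:39:07Z"
}"""

report = json.loads(raw)
# Keep only the content of successful scrapes, keyed by URL
pages = {r["url"]: r["content"]
         for r in report["results"] if r["status"] == "success"}
```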
## Examples

See example.py in the source repository for detailed usage examples. If you've built from source:

```bash
python3 example.py
```
## Configuration

### Browser Setup

Firefox is installed automatically on first use. To skip the browser check:

```python
# Skip via constructor parameter
client = Client(skip_browser_check=True)
```

```bash
# Or via environment variable
export SCRAPION_SKIP_BROWSER_CHECK=1
```
### Module Customization

Edit the relevant modules to customize:
- Search engine (DuckDuckGo)
- Request timeouts
- Extraction rules
- Output formats
## License
See LICENSE file for details.
