SuperScrape
Web scraping + AI visual intelligence that just works -- anti-bot era edition.
SuperScrape uses Camoufox, a Firefox build with anti-detection patches applied at the C++ level, to scrape sites that block Playwright, Selenium, and curl. It then analyzes the scraped product images with GPT Vision to generate competitive intelligence reports.
# Scrape Amazon product images + run AI analysis
superscrape amazon visual "portable blender" --top 10
Features
- Anti-bot scraping -- Camoufox bypasses Cloudflare, DataDome, and other bot detection
- Amazon -- Product pages, search results, image extraction with hi-res upgrade
- Instagram -- Public profiles, recent posts, follower counts
- Reddit -- Subreddit posts with sorting and filtering
- eBay, Walmart, Etsy, Shopee -- Additional e-commerce platforms
- Visual Intelligence -- GPT Vision analyzes product images (type, angle, background, text, people)
- Reports -- Markdown + JSON reports with category-level insights and recommendations
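To make "category-level insights" concrete, here is a minimal sketch of how per-image attributes could be tallied across a category. The record fields (`type`, `angle`, `background`, `has_text`, `has_people`) are assumptions derived from the attribute list above, not SuperScrape's actual schema, and `summarize` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical per-image records; the field names are assumed for illustration.
analyses = [
    {"type": "lifestyle", "angle": "front", "background": "kitchen", "has_text": False, "has_people": True},
    {"type": "studio", "angle": "front", "background": "white", "has_text": True, "has_people": False},
    {"type": "studio", "angle": "side", "background": "white", "has_text": False, "has_people": False},
]


def summarize(records, field):
    """Return the share of records for each distinct value of `field`."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Share of each background style, and the fraction of images showing people.
background_share = summarize(analyses, "background")
people_share = sum(record["has_people"] for record in analyses) / len(analyses)
```

A tally like this is the kind of input a recommendation step could work from, for example noting that most top listings in a category use a white background.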
Prerequisites
- Python 3.10+
- An OpenAI API key (for Visual Intelligence features)
Installation
pip install superscrape
# Download the Camoufox browser binary
python -m camoufox fetch
# Confirm the Camoufox package imports
python -c "from camoufox.sync_api import Camoufox; print('ready')"
Or install from source:
git clone https://github.com/PHY041/superscrape.git
cd superscrape
pip install -e ".[dev]"
Quick Start
# 1. Set your OpenAI API key (needed for visual analysis)
export OPENAI_API_KEY="sk-..."
# 2. Scrape a single Amazon product
superscrape amazon product B0CX23V2ZK
# 3. Search Amazon
superscrape amazon search "wireless earbuds" --pages 2
# 4. Run full visual intelligence pipeline
superscrape amazon visual "boys dress shirt" --top 10 --output-dir ./reports
# 5. Scrape Instagram
superscrape instagram natgeo
# 6. Scrape Reddit
superscrape reddit SideProject --sort hot --limit 50
CLI Reference
superscrape
  amazon
    product <ASIN>        Scrape a single product
    search <KEYWORD>      Search results with pagination
    visual <KEYWORD>      Full visual intelligence pipeline
  instagram <USERNAME>    Public profile + recent posts
  reddit <SUBREDDIT>      Posts with sorting (hot/new/top)
Options
| Command | Flag | Description |
|---------|------|-------------|
| amazon product | --images-only | Only output image URLs |
| amazon search | --pages N | Number of search pages |
| amazon visual | --top N | Number of products to analyze |
| amazon visual | --no-cache | Bypass cached results |
| amazon visual | --output-dir DIR | Output directory |
| reddit | --sort hot\|new\|top | Sort order |
| reddit | --limit N | Max posts to fetch |
| All commands | --output json\|table | Output format |
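When driving the CLI from a script, it can help to assemble the argument list programmatically. The sketch below uses only the `amazon visual` flags from the table above; `build_visual_command` is a hypothetical helper, not part of SuperScrape.

```python
import shlex


def build_visual_command(keyword, top=None, output_dir=None, no_cache=False):
    """Build the argv list for `superscrape amazon visual` from the documented flags."""
    argv = ["superscrape", "amazon", "visual", keyword]
    if top is not None:
        argv += ["--top", str(top)]
    if output_dir is not None:
        argv += ["--output-dir", str(output_dir)]
    if no_cache:
        argv.append("--no-cache")
    return argv


argv = build_visual_command("portable blender", top=10, output_dir="./reports")
# A shell-safe rendering of the same command, handy for logging.
command_line = shlex.join(argv)
```

With SuperScrape installed, the list can be passed to `subprocess.run(argv, check=True)`. Passing a list rather than a string avoids shell-quoting problems with multi-word keywords.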
Python API
from superscrape.sites.amazon import Amazon
from superscrape.analyzers.vision import batch_analyze_first_images
from superscrape.reporters.visual_report import aggregate_report, render_markdown
# Scrape
products = Amazon.search_images("portable blender", top_n=10)
# Analyze with GPT Vision
analyses = batch_analyze_first_images(products)
# Generate report
report = aggregate_report("portable blender", products, analyses)
markdown = render_markdown(report)
Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| OPENAI_API_KEY | For visual analysis | OpenAI API key for GPT Vision |
| BYTEPLUSES_API_KEY | Optional | BytePlus API key for lifestyle image generation |
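A script that wraps the visual pipeline may want to fail early when the key is missing. This is a minimal sketch; `missing_variables` and `REQUIRED_FOR_VISUAL` are hypothetical names, and whether SuperScrape performs its own check is not stated here.

```python
import os

# Only OPENAI_API_KEY is documented as needed for visual analysis.
REQUIRED_FOR_VISUAL = ("OPENAI_API_KEY",)


def missing_variables(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping: an empty value counts as missing.
missing = missing_variables(REQUIRED_FOR_VISUAL, {"OPENAI_API_KEY": ""})
```

Calling `missing_variables(REQUIRED_FOR_VISUAL)` with no mapping checks the real process environment.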
API Server (Optional)
SuperScrape includes an optional FastAPI server with real-time job tracking:
# Install API dependencies
pip install "superscrape[api]"
# Start the server
uvicorn api.main:app --host 0.0.0.0 --port 8001
# Or use Docker
docker compose up --build
API endpoints:
- POST /jobs -- Submit a scraping + analysis job
- GET /jobs/{id} -- Job status
- GET /jobs/{id}/stream -- SSE real-time progress
- GET /reports -- List generated reports
- GET /health -- Health check
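The progress stream can be consumed with the standard library alone. The sketch below parses Server-Sent Events `data:` lines according to the SSE format. It assumes each event carries a JSON payload, and the `progress` field in the sample is invented for illustration, since the event schema is not documented here. `iter_sse_data` and `stream_job` are hypothetical helpers.

```python
import json
import urllib.request


def iter_sse_data(lines):
    """Yield the data payload of each event from an iterable of SSE text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format strips one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line terminates the event.
            yield "\n".join(buffer)
            buffer = []


def stream_job(base_url, job_id):
    """Yield decoded events from GET /jobs/{id}/stream (JSON payloads are assumed)."""
    url = f"{base_url}/jobs/{job_id}/stream"
    with urllib.request.urlopen(url) as response:
        text_lines = (chunk.decode("utf-8") for chunk in response)
        for payload in iter_sse_data(text_lines):
            yield json.loads(payload)


# Offline example with made-up events; lines starting with ":" are SSE comments.
sample = ['data: {"progress": 50}\n', "\n", ": keep-alive\n", 'data: {"progress": 100}\n', "\n"]
events = [json.loads(payload) for payload in iter_sse_data(sample)]
```

With the server from the command above running locally, `for event in stream_job("http://localhost:8001", job_id): print(event)` would print each event as it arrives.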
Architecture
     CLI / API Request
             |
             v
+---------------------------+
|  Scraping Layer           |
|  sites/amazon.py          |
|  sites/instagram.py       |
|  sites/reddit.py          |
+------------+--------------+
             |
             v
      Camoufox Browser
    (C++ anti-detection)
             |
             v
+---------------------------+
|  AI Analysis              |
|  analyzers/vision.py      |
|  (OpenAI GPT Vision)      |
+------------+--------------+
             |
             v
+---------------------------+
|  Reports                  |
|  reporters/visual_report  |
|  Markdown + JSON + HTML   |
+---------------------------+
Anti-Bot Test Results
Tested with Camoufox against major platforms:
| Platform | Status | Notes |
|----------|--------|-------|
| Amazon | Pass | Search, product pages, images |
| Instagram | Pass | Public profiles, no login required |
| Reddit | Pass | Playwright+stealth gets blocked, Camoufox passes |
| eBay | Pass | Product listings, prices |
| Walmart | Pass | Product pages |
| Etsy | Pass | Listings, prices |
| Cloudflare Challenge | Pass | Generic CF challenge page |
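To re-run a check like this yourself, one rough approach is to load a page in Camoufox and look for signs of a block. The sketch below is not how the results above were produced. The status codes and title markers are heuristic assumptions, and `looks_blocked` and `probe` are hypothetical helpers. `probe` requires the third-party `camoufox` package and its browser binary.

```python
# Heuristic assumptions: common "blocked" status codes and challenge-page titles.
BLOCK_STATUS_CODES = {403, 429, 503}
CHALLENGE_TITLES = ("just a moment", "attention required", "access denied")


def looks_blocked(status, title):
    """Guess whether a response is a bot-detection block rather than real content."""
    if status in BLOCK_STATUS_CODES:
        return True
    lowered = title.lower()
    return any(marker in lowered for marker in CHALLENGE_TITLES)


def probe(url):
    """Load `url` in Camoufox and return (status, title, blocked)."""
    # Imported lazily so the heuristic above works without Camoufox installed.
    from camoufox.sync_api import Camoufox

    with Camoufox(headless=True) as browser:
        page = browser.new_page()
        response = page.goto(url)
        status = response.status if response else 0
        title = page.title()
    return status, title, looks_blocked(status, title)
```

A heuristic this simple will miss blocks that return HTTP 200 with a normal-looking title, so treat a "not blocked" result as provisional.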
Contributing
See CONTRIBUTING.md for development setup and guidelines.
License
MIT License -- see LICENSE for details.
Powered by CanMarket.
