# GcrawlAI
Turn any website into clean, LLM-ready data. Open-source web crawler with stealth mode, distributed crawling, real-time WebSocket progress & Markdown output. Power your AI apps with GcrawlAI.
## ✨ Why GcrawlAI?
Most web crawlers dump raw HTML on your lap. GcrawlAI gives your LLM exactly what it needs — clean Markdown, structured metadata, and zero noise.
Here's what you can build with it:
- 🔍 **RAG Pipelines** — Feed your retrieval-augmented generation system with clean, structured web content instead of tag soup.
- 🤖 **AI Search Tools** — Index the web semantically. GcrawlAI extracts what matters, so your search understands context, not just keywords.
- 📄 **Document Intelligence Systems** — Turn web-based reports, filings, and articles into structured data your models can actually reason over.
- 💰 **Price Monitoring Engines** — Track competitor pricing across e-commerce platforms in real time, without a single broken XPath selector.
- 📊 **Competitor Intelligence Dashboards** — Continuously extract product updates, hiring signals, and announcements from competitor websites automatically.
- 🌐 **Market Research Aggregators** — Collect and synthesize data from hundreds of sources into clean, analysis-ready datasets.
- 🎯 **Lead Generation Pipelines** — Scrape company directories, job boards, and industry listings to build targeted, enriched prospect lists.
- 📰 **News & Regulatory Trackers** — Monitor policy changes, regulatory updates, and industry news without the noise of irrelevant content.
- 🛍️ **Product Catalog Enrichers** — Pull product descriptions, specs, and images from supplier sites and normalize them into your schema automatically.
No brittle CSS selectors. No HTML parsing headaches. No maintenance nightmares when a site redesigns overnight.
GcrawlAI handles the messy web so you don't have to.
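For the RAG use case above, a typical first step is chunking the crawler's Markdown output before embedding it. A minimal sketch, assuming each crawled page arrives as one Markdown string; the `chunk_markdown` helper below is illustrative and not part of GcrawlAI:

```python
import re

def chunk_markdown(md: str, max_chars: int = 800) -> list:
    """Split Markdown into chunks at heading boundaries, then by size."""
    # Split before every ATX heading (#, ##, ...).
    sections = re.split(r"(?m)^(?=#{1,6} )", md)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Further split oversized sections at blank-line boundaries.
        while len(section) > max_chars:
            cut = section.rfind("\n\n", 0, max_chars)
            if cut <= 0:
                cut = max_chars
            chunks.append(section[:cut].strip())
            section = section[cut:].strip()
        if section:
            chunks.append(section)
    return chunks

page = "# Pricing\nPlans start at $10.\n\n## FAQ\nIs there a trial? Yes."
print(chunk_markdown(page))
```

Each resulting chunk keeps its heading for context, which tends to help retrieval quality in a vector store.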
- ⚡ **Instant or Deep** — Single page real-time extraction or full-site distributed crawling at scale
- 🧹 **LLM-Native Output** — Auto Markdown conversion, clean enough to feed directly into your vector store
- 🥷 **Stealth by Default** — Playwright stealth mode + automatic browser fallback to bypass bot detection
- 📊 **Real-Time Visibility** — Live WebSocket progress tracking and an interactive dashboard
- 🔐 **Secure Auth** — JWT + Email OTP, production-ready from day one
- 🌍 **Fully Open Source** — MIT licensed. Fork it, extend it, ship it
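The real-time progress stream can be consumed by any WebSocket client. As a sketch of rendering one progress message: the payload fields used here (`pages_done`, `pages_total`, `current_url`) are assumptions for illustration, so check the messages your GcrawlAI server actually emits.

```python
import json

def parse_progress(message: str) -> str:
    """Render one crawl-progress update as a one-line status string.

    NOTE: the message schema (pages_done / pages_total / current_url)
    is assumed for illustration, not taken from the GcrawlAI source.
    """
    data = json.loads(message)
    done = data.get("pages_done", 0)
    total = data.get("pages_total", 0)
    pct = 100 * done // total if total else 0
    return f"[{pct:3d}%] {done}/{total} {data.get('current_url', '')}"

print(parse_progress(
    '{"pages_done": 5, "pages_total": 20, "current_url": "https://example.com/a"}'
))
```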
## 🚀 Features

| Feature | Description |
| --- | --- |
| Single Page Crawl | Direct, real-time extraction from any individual URL — instant results |
| Full Site Crawl | Distributed crawling of entire websites via Celery workers — handles thousands of pages |
| LLM-Ready Markdown | Auto-converts web content into clean Markdown optimized for LLM consumption |
| HTML & Screenshot Capture | Captures raw HTML and full-page screenshots for visual and structural analysis |
| SEO Metadata Extraction | Extracts title, description, keywords, and Open Graph tags automatically |
| Stealth & Anti-Bot | Playwright with stealth plugins; auto-fallback (Chromium → Firefox/Camoufox) |
| Real-Time Progress | Live crawl updates via WebSockets with an interactive dashboard |
| Secure Auth | JWT-based auth, Email OTP signup/verification, and password reset flow |
## 🛠️ Technology Stack
- **Backend:** FastAPI, Python 3.9+
- **Frontend:** Angular
- **Database:** PostgreSQL
- **Task Queue:** Celery + Redis
- **Browser Automation:** Playwright
- **Authentication:** JWT, BCrypt
## 📋 Prerequisites
- Python 3.9+
- PostgreSQL (running on default port 5432)
- Redis (running on default port 6379)
- Git
### Linux System Dependencies

If you are running on Linux (Debian/Ubuntu), install the following system packages so the automated browsers can function correctly:

```bash
sudo apt update
sudo apt install -y \
  libnss3 \
  libatk1.0-0t64 \
  libatk-bridge2.0-0t64 \
  libcups2t64 \
  libxcomposite1 \
  libxdamage1 \
  libxrandr2 \
  libgbm1 \
  libasound2t64 \
  libpangocairo-1.0-0 \
  libgtk-3-0t64
```
## ⚙️ Installation

1. **Clone the repository**

   ```bash
   git clone https://github.com/GramosoftAI/GcrawlAI.git
   cd GcrawlAI
   ```

2. **Create and activate a virtual environment**

   ```bash
   python -m venv venv
   source venv/bin/activate   # Linux/Mac
   venv\Scripts\activate      # Windows
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Install Playwright browsers**

   ```bash
   playwright install
   ```
## 🔧 Configuration

1. **Database config:** update `config.yaml` with your PostgreSQL credentials:

   ```yaml
   postgres:
     host: "localhost"
     port: 5432
     database: "crawlerdb"
     user: "postgres"
     password: "your_password"
   ```

2. **Initialize database tables:**

   ```bash
   python -m api.db_setup
   # OR
   python api/db_setup.py
   ```
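Once `config.yaml` is filled in, ad-hoc scripts can reuse the same credentials. A small sketch that builds a PostgreSQL connection string from the `postgres` block shown above; the `postgres_dsn` helper is illustrative, not part of the project:

```python
def postgres_dsn(cfg: dict) -> str:
    """Build a libpq-style URL from the `postgres` block of config.yaml
    (which you would normally load with yaml.safe_load)."""
    pg = cfg["postgres"]
    return (
        f"postgresql://{pg['user']}:{pg['password']}"
        f"@{pg['host']}:{pg['port']}/{pg['database']}"
    )

# Dict mirroring the config.yaml example above.
config = {
    "postgres": {
        "host": "localhost",
        "port": 5432,
        "database": "crawlerdb",
        "user": "postgres",
        "password": "your_password",
    }
}
print(postgres_dsn(config))
# postgresql://postgres:your_password@localhost:5432/crawlerdb
```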
## 🏃‍♂️ Running the Application
You need to run 4 separate processes. It's recommended to use separate terminal windows.
### 1. Start Redis Server (if not running as a service)

```bash
redis-server
```

> ⚠️ **Windows users:** Redis does not run natively on Windows. Use WSL (Windows Subsystem for Linux) or Docker instead.
### 2. Start Celery Worker

```bash
# Linux (recommended)
celery -A web_crawler.celery_config worker -l info

# Windows
celery -A web_crawler.celery_config.celery_app worker --loglevel=info --pool=solo
```
### 3. Start Backend API

```bash
# Windows / development
uvicorn api.api:app --port 8000

# Linux / production (recommended)
uvicorn api.api:app --host 0.0.0.0 --port 8000 --workers 4 --timeout-keep-alive 120
```

API docs will be available at http://localhost:8000/docs.
### 4. Start Frontend Dashboard

See the [README for the Angular frontend](https://github.com/GramosoftAI/GcrawlAI/blob/main/frontend/README.md).
## Project Structure

```
.
├── api/                   # FastAPI backend
│   ├── api.py             # Main API entry point
│   ├── auth_manager.py    # Authentication logic
│   └── db_setup.py        # Database initialization
├── web_crawler/           # Crawler logic
│   ├── web_crawler.py     # Core crawler orchestrator
│   ├── page_crawler.py    # Individual page processing
│   └── celery_config.py   # Celery configuration
├── config.yaml            # Application configuration
└── requirements.txt       # Python dependencies
```
## 🔐 API Endpoints

- `POST /crawler`: Start a new crawl job (single page or full site).
- `GET /crawler/status/{task_id}`: Check Celery task status.
- `GET /crawl/get/content`: Retrieve generated content.
- `POST /auth/signup/send-otp`: Send a signup verification OTP by email.
- `POST /auth/signup/verify-otp`: Verify the signup OTP.
- `POST /auth/signin`: Sign in with email and password.
- `POST /auth/forgot-password`: Request a password-reset OTP by email.
- `POST /auth/reset-password`: Set a new password.
Full interactive API docs available at http://localhost:8000/docs when running locally.
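As a sketch of driving these endpoints from a script: the snippet below POSTs a crawl job and polls the status endpoint until the task settles. The request fields (`url`, `mode`) and response fields (`task_id`, `state`) are assumptions for illustration; confirm the real schemas at `/docs`.

```python
import json
import time
import urllib.request

API = "http://localhost:8000"

def post_json(path, payload, token=None):
    """POST a JSON body to the API and decode the JSON response."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = "Bearer " + token
    req = urllib.request.Request(
        API + path, data=json.dumps(payload).encode(), headers=headers
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def wait_for_task(fetch_status, task_id, interval=2.0, timeout=600.0):
    """Poll a Celery task until it leaves the PENDING/STARTED states.

    fetch_status is injected so the polling logic is easy to test; in
    real use, pass a function that GETs /crawler/status/{task_id} and
    returns the decoded JSON (field names assumed here).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(task_id)
        if status.get("state") not in ("PENDING", "STARTED"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")

# Real usage might look like this (request/response shapes assumed):
#   job = post_json("/crawler", {"url": "https://example.com", "mode": "single"}, token=jwt)
#   done = wait_for_task(lambda tid: json.load(
#       urllib.request.urlopen(f"{API}/crawler/status/{tid}")), job["task_id"])
```

Injecting the status fetcher keeps `wait_for_task` free of network dependencies, so the polling behaviour can be unit-tested with a stub.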
## 🤝 Contributing

Contributions are welcome and appreciated! Here's how to get involved:

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/YourFeature`
3. Commit your changes: `git commit -m 'Add YourFeature'`
4. Push to your branch: `git push origin feature/YourFeature`
5. Open a Pull Request
Please ensure your code follows the existing style and includes relevant tests. For large changes, open an issue first to discuss your proposal.
## 🙌 Credits & Inspiration

GcrawlAI was built by the team at Gramosoft Private Limited, inspired by the incredible open-source web scraping and AI ecosystem. We stand on the shoulders of giants:

| Project | What We Learned |
| --- | --- |
| 🔥 Firecrawl | LLM-ready markdown output, distributed crawling architecture, and benchmark-driven quality |
| 🕷️ ScrapeGraphAI | Graph-based pipeline design and LLM-powered structured extraction |
| 🎭 Playwright | Browser automation, stealth crawling, and anti-bot bypass strategies |
| ⚡ FastAPI | High-performance async API design |