# GcrawlAI
Turn any website into clean, LLM-ready data. Open-source web crawler with stealth mode, distributed crawling, real-time WebSocket progress & Markdown output. Power your AI apps with GcrawlAI.
## ✨ Why GcrawlAI?
Most web crawlers dump raw HTML on your lap. GcrawlAI gives your LLM exactly what it needs — clean Markdown, structured metadata, and zero noise.
Here's what you can build with it:
- 🔍 **RAG Pipelines** — Feed your retrieval-augmented generation system with clean, structured web content instead of tag soup.
- 🤖 **AI Search Tools** — Index the web semantically. GcrawlAI extracts what matters, so your search understands context, not just keywords.
- 📄 **Document Intelligence Systems** — Turn web-based reports, filings, and articles into structured data your models can actually reason over.
- 💰 **Price Monitoring Engines** — Track competitor pricing across e-commerce platforms in real time, without a single broken XPath selector.
- 📊 **Competitor Intelligence Dashboards** — Continuously extract product updates, hiring signals, and announcements from competitor websites automatically.
- 🌐 **Market Research Aggregators** — Collect and synthesize data from hundreds of sources into clean, analysis-ready datasets.
- 🎯 **Lead Generation Pipelines** — Scrape company directories, job boards, and industry listings to build targeted, enriched prospect lists.
- 📰 **News & Regulatory Trackers** — Monitor policy changes, regulatory updates, and industry news without the noise of irrelevant content.
- 🛍️ **Product Catalog Enrichers** — Pull product descriptions, specs, and images from supplier sites and normalize them into your schema automatically.
No brittle CSS selectors. No HTML parsing headaches. No maintenance nightmares when a site redesigns overnight.
GcrawlAI handles the messy web so you don't have to.
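For the RAG use case above, a typical first step is chunking the crawler's Markdown output before embedding it. A minimal sketch, assuming each crawled page arrives as one Markdown string; the `chunk_markdown` helper below is illustrative and not part of GcrawlAI:

```python
import re

def chunk_markdown(md: str, max_chars: int = 800) -> list:
    """Split Markdown into chunks at heading boundaries, then by size."""
    # Split before every ATX heading (#, ##, ...).
    sections = re.split(r"(?m)^(?=#{1,6} )", md)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Further split oversized sections at blank-line boundaries.
        while len(section) > max_chars:
            cut = section.rfind("\n\n", 0, max_chars)
            if cut <= 0:
                cut = max_chars
            chunks.append(section[:cut].strip())
            section = section[cut:].strip()
        if section:
            chunks.append(section)
    return chunks

page = "# Pricing\nPlans start at $10.\n\n## FAQ\nIs there a trial? Yes."
print(chunk_markdown(page))
```

Each resulting chunk keeps its heading for context, which tends to help retrieval quality in a vector store.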
- ⚡ **Instant or Deep** — Single page real-time extraction or full-site distributed crawling at scale
- 🧹 **LLM-Native Output** — Auto Markdown conversion, clean enough to feed directly into your vector store
- 🥷 **Stealth by Default** — Playwright stealth mode + automatic browser fallback to bypass bot detection
- 📊 **Real-Time Visibility** — Live WebSocket progress tracking and an interactive dashboard
- 🔐 **Secure Auth** — JWT + Email OTP, production-ready from day one
- 🌍 **Fully Open Source** — MIT licensed. Fork it, extend it, ship it
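The real-time progress stream can be consumed by any WebSocket client. As a sketch of rendering one progress message: the payload fields used here (`pages_done`, `pages_total`, `current_url`) are assumptions for illustration, so check the messages your GcrawlAI server actually emits.

```python
import json

def parse_progress(message: str) -> str:
    """Render one crawl-progress update as a one-line status string.

    NOTE: the message schema (pages_done / pages_total / current_url)
    is assumed for illustration, not taken from the GcrawlAI source.
    """
    data = json.loads(message)
    done = data.get("pages_done", 0)
    total = data.get("pages_total", 0)
    pct = 100 * done // total if total else 0
    return f"[{pct:3d}%] {done}/{total} {data.get('current_url', '')}"

print(parse_progress(
    '{"pages_done": 5, "pages_total": 20, "current_url": "https://example.com/a"}'
))
```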
## 🚀 Features

| Feature | Description |
| --- | --- |
| Single Page Crawl | Direct, real-time extraction from any individual URL — instant results |
| Full Site Crawl | Distributed crawling of entire websites via Celery workers — handles thousands of pages |
| LLM-Ready Markdown | Auto-converts web content into clean Markdown optimized for LLM consumption |
| HTML & Screenshot Capture | Captures raw HTML and full-page screenshots for visual and structural analysis |
| SEO Metadata Extraction | Extracts title, description, keywords, and Open Graph tags automatically |
| Stealth & Anti-Bot | Playwright with stealth plugins; auto-fallback (Chromium → Firefox/Camoufox) |
| Real-Time Progress | Live crawl updates via WebSockets with an interactive dashboard |
| Secure Auth | JWT-based auth, Email OTP signup/verification, and password reset flow |
## 🛠️ Technology Stack
- **Backend:** FastAPI, Python 3.9+
- **Frontend:** Angular
- **Database:** PostgreSQL
- **Task Queue:** Celery + Redis
- **Browser Automation:** Playwright
- **Authentication:** JWT, BCrypt
## 📋 Prerequisites
- Python 3.9+
- PostgreSQL (running on default port 5432)
- Redis (running on default port 6379)
- Git
### Linux System Dependencies

If you are running on Linux (Debian/Ubuntu), install the following system packages so the automated browsers can function correctly:

```bash
sudo apt update
sudo apt install -y \
  libnss3 \
  libatk1.0-0t64 \
  libatk-bridge2.0-0t64 \
  libcups2t64 \
  libxcomposite1 \
  libxdamage1 \
  libxrandr2 \
  libgbm1 \
  libasound2t64 \
  libpangocairo-1.0-0 \
  libgtk-3-0t64
```
## ⚙️ Installation

1. **Clone the repository**

   ```bash
   git clone https://github.com/GramosoftAI/GcrawlAI.git
   cd GcrawlAI
   ```

2. **Create and activate a virtual environment**

   ```bash
   python -m venv venv
   source venv/bin/activate   # Linux/Mac
   venv\Scripts\activate      # Windows
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Install Playwright browsers**

   ```bash
   playwright install
   ```
## 🔧 Configuration

1. **Database config:** update `config.yaml` with your PostgreSQL credentials:

   ```yaml
   postgres:
     host: "localhost"
     port: 5432
     database: "crawlerdb"
     user: "postgres"
     password: "your_password"
   ```

2. **Initialize database tables:**

   ```bash
   python -m api.db_setup
   # OR
   python api/db_setup.py
   ```
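Once `config.yaml` is filled in, ad-hoc scripts can reuse the same credentials. A small sketch that builds a PostgreSQL connection string from the `postgres` block shown above; the `postgres_dsn` helper is illustrative, not part of the project:

```python
def postgres_dsn(cfg: dict) -> str:
    """Build a libpq-style URL from the `postgres` block of config.yaml
    (which you would normally load with yaml.safe_load)."""
    pg = cfg["postgres"]
    return (
        f"postgresql://{pg['user']}:{pg['password']}"
        f"@{pg['host']}:{pg['port']}/{pg['database']}"
    )

# Dict mirroring the config.yaml example above.
config = {
    "postgres": {
        "host": "localhost",
        "port": 5432,
        "database": "crawlerdb",
        "user": "postgres",
        "password": "your_password",
    }
}
print(postgres_dsn(config))
# postgresql://postgres:your_password@localhost:5432/crawlerdb
```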
## 🏃‍♂️ Running the Application
You need to run 4 separate processes. It's recommended to use separate terminal windows.
### 1. Start Redis Server (if not running as a service)

```bash
redis-server
```

> ⚠️ **Windows users:** Redis does not run natively on Windows. Use WSL (Windows Subsystem for Linux) or Docker instead.
### 2. Start Celery Worker

```bash
# Linux (recommended)
celery -A web_crawler.celery_config worker -l info

# Windows
celery -A web_crawler.celery_config.celery_app worker --loglevel=info --pool=solo
```
### 3. Start Backend API

```bash
# Windows / development
uvicorn api.api:app --port 8000

# Linux / production (recommended)
uvicorn api.api:app --host 0.0.0.0 --port 8000 --workers 4 --timeout-keep-alive 120
```

API docs will be available at http://localhost:8000/docs.
### 4. Start Frontend Dashboard

See the [README for the Angular frontend](https://github.com/GramosoftAI/GcrawlAI/blob/main/frontend/README.md).
## Project Structure

```
.
├── api/                   # FastAPI backend
│   ├── api.py             # Main API entry point
│   ├── auth_manager.py    # Authentication logic
│   └── db_setup.py        # Database initialization
├── web_crawler/           # Crawler logic
│   ├── web_crawler.py     # Core crawler orchestrator
│   ├── page_crawler.py    # Individual page processing
│   └── celery_config.py   # Celery configuration
├── config.yaml            # Application configuration
└── requirements.txt       # Python dependencies
```
## 🔐 API Endpoints

- `POST /crawler`: Start a new crawl job (single page or full site).
- `GET /crawler/status/{task_id}`: Check Celery task status.
- `GET /crawl/get/content`: Retrieve generated content.
- `POST /auth/signup/send-otp`: Send a signup verification OTP by email.
- `POST /auth/signup/verify-otp`: Verify the signup OTP.
- `POST /auth/signin`: Sign in with email and password.
- `POST /auth/forgot-password`: Request a password-reset OTP by email.
- `POST /auth/reset-password`: Set a new password.
Full interactive API docs available at http://localhost:8000/docs when running locally.
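As a sketch of driving these endpoints from a script: the snippet below POSTs a crawl job and polls the status endpoint until the task settles. The request fields (`url`, `mode`) and response fields (`task_id`, `state`) are assumptions for illustration; confirm the real schemas at `/docs`.

```python
import json
import time
import urllib.request

API = "http://localhost:8000"

def post_json(path, payload, token=None):
    """POST a JSON body to the API and decode the JSON response."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = "Bearer " + token
    req = urllib.request.Request(
        API + path, data=json.dumps(payload).encode(), headers=headers
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def wait_for_task(fetch_status, task_id, interval=2.0, timeout=600.0):
    """Poll a Celery task until it leaves the PENDING/STARTED states.

    fetch_status is injected so the polling logic is easy to test; in
    real use, pass a function that GETs /crawler/status/{task_id} and
    returns the decoded JSON (field names assumed here).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(task_id)
        if status.get("state") not in ("PENDING", "STARTED"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")

# Real usage might look like this (request/response shapes assumed):
#   job = post_json("/crawler", {"url": "https://example.com", "mode": "single"}, token=jwt)
#   done = wait_for_task(lambda tid: json.load(
#       urllib.request.urlopen(f"{API}/crawler/status/{tid}")), job["task_id"])
```

Injecting the status fetcher keeps `wait_for_task` free of network dependencies, so the polling behaviour can be unit-tested with a stub.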
## 🤝 Contributing

Contributions are welcome and appreciated! Here's how to get involved:

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/YourFeature`
3. Commit your changes: `git commit -m 'Add YourFeature'`
4. Push to your branch: `git push origin feature/YourFeature`
5. Open a Pull Request
Please ensure your code follows the existing style and includes relevant tests. For large changes, open an issue first to discuss your proposal.
## 🙌 Credits & Inspiration

GcrawlAI was built by the team at Gramosoft Private Limited, inspired by the incredible open-source web scraping and AI ecosystem. We stand on the shoulders of giants:

| Project | What We Learned |
| --- | --- |
| 🔥 Firecrawl | LLM-ready markdown output, distributed crawling architecture, and benchmark-driven quality |
| 🕷️ ScrapeGraphAI | Graph-based pipeline design and LLM-powered structured extraction |
| 🎭 Playwright | Browser automation, stealth crawling, and anti-bot bypass strategies |
| ⚡ FastAPI | High-performance async API design |