Crawl4ai
Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN
<div align="center"><a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
Crawl4AI Cloud API: Closed Beta (Launching Soon)
Reliable, large-scale web extraction, now built to be drastically more cost-effective than any of the existing solutions.
Apply here for early access
We'll be onboarding in phases and working closely with early users.
Limited slots.
<p align="center"> <a href="https://x.com/crawl4ai"> <img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" /> </a> <a href="https://www.linkedin.com/company/crawl4ai"> <img src="https://img.shields.io/badge/Follow%20on%20LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="Follow on LinkedIn" /> </a> <a href="https://discord.gg/jP8KfhDhyN"> <img src="https://img.shields.io/badge/Join%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord" /> </a> </p> </div>
Crawl4AI turns the web into clean, LLM-ready Markdown for RAG, agents, and data pipelines. Fast, controllable, and battle-tested by a 50k+ star community.
Check out the latest update: v0.8.5
New in v0.8.5: Anti-Bot Detection, Shadow DOM & 60+ Bug Fixes! Automatic 3-tier anti-bot detection with proxy escalation, Shadow DOM flattening, deep crawl cancellation, config defaults API, consent popup removal, and critical security patches. Release notes →
Recent v0.8.0: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with resume_state and on_state_change callbacks for long-running crawls. New prefetch=True mode for 5-10x faster URL discovery. Release notes →
Previous v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates. Release notes →
Previous v0.7.7: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, and smart browser pool management. Release notes →
<details> <summary><strong>My Personal Story</strong></summary>I grew up on an Amstrad, thanks to my dad, and never stopped building. In grad school I specialized in NLP and built crawlers for research. That's where I learned how much extraction matters.
In 2023, I needed web-to-Markdown. The "open source" option wanted an account, API token, and $16, and still under-delivered. I went turbo anger mode, built Crawl4AI in days, and it went viral. Now it's the most-starred crawler on GitHub.
I made it open source for availability: anyone can use it without a gate. Now I'm building the platform for affordability: anyone can run serious crawls without breaking the bank. If that resonates, join in, send feedback, or just crawl something amazing.
</details> <details> <summary>Why developers pick Crawl4AI</summary>
- LLM-ready output: smart Markdown with headings, tables, code, citation hints
- Fast in practice: async browser pool, caching, minimal hops
- Full control: sessions, proxies, cookies, user scripts, hooks
- Adaptive intelligence: learns site patterns, explores only what matters
- Deploy anywhere: zero keys, CLI and Docker, cloud friendly
</details>
Quick Start
- Install Crawl4AI:
# Install the package
pip install -U crawl4ai
# For pre release versions
pip install crawl4ai --pre
# Run post-installation setup
crawl4ai-setup
# Verify your installation
crawl4ai-doctor
If you encounter any browser-related issues, you can install the browser manually:
python -m playwright install --with-deps chromium
- Run a simple web crawl with Python:
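import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Open a crawler session; it is cleaned up when the block exits
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        # Print the page content converted to Markdown
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())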
- Or use the new command-line interface:
# Basic crawl with markdown output
crwl https://www.nbcnews.com/business -o markdown
# Deep crawl with BFS strategy, max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
# Use LLM extraction with a specific question
crwl https://www.example.com/products -q "Extract all product prices"
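The single-page Python crawl in the quick start extends naturally to a list of URLs. The sketch below reuses one crawler session and collects each page's Markdown in a dictionary. It relies only on the calls shown above (`AsyncWebCrawler`, `arun(url=...)`, `result.markdown`). The helper name `crawl_to_markdown` and its `crawler_factory` parameter are hypothetical, added so the loop can be tried with a stub crawler, and wrapping the result in `str()` assumes the Markdown result converts cleanly to a string.

```python
import asyncio


async def crawl_to_markdown(urls, crawler_factory=None):
    """Crawl each URL in turn and return a {url: markdown} dict.

    `crawler_factory` is a hypothetical hook (not part of Crawl4AI): any
    callable returning an async context manager with an `arun(url=...)`
    coroutine. It defaults to Crawl4AI's AsyncWebCrawler.
    """
    if crawler_factory is None:
        # Imported lazily so the helper also works with a stub crawler.
        from crawl4ai import AsyncWebCrawler
        crawler_factory = AsyncWebCrawler

    pages = {}
    # One session for all URLs, closed automatically on exit.
    async with crawler_factory() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            # Assumption: result.markdown converts cleanly to str.
            pages[url] = str(result.markdown)
    return pages


# Example call (needs crawl4ai installed and network access):
# pages = asyncio.run(crawl_to_markdown(["https://docs.crawl4ai.com"]))
```

Crawling sequentially keeps the sketch simple and gentle on the target site; whether concurrent calls on one session are appropriate depends on the library version and is not assumed here.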
Support Crawl4AI
Sponsorship Program Now Open! After powering 51K+ developers and 1 year of growth, Crawl4AI is launching dedicated support for startups and enterprises. Be among the first 50 Founding Sponsors for permanent recognition in our Hall of Fame.
Crawl4AI is the #1 trending open-source web crawler on GitHub. Your support keeps it independent, innovative, and free for the community, while giving you direct access to premium benefits.
Sponsorship Tiers
- Believer ($5/mo): Join the movement for data democratization
- Builder ($50/mo): Priority support & early access to features
- Growing Team ($500/mo): Bi-weekly syncs & optimization help
- Data Infrastructure Partner ($2000/mo): Full partnership with dedicated support
Custom arrangements available - see SPONSORS.md for details & contact
Why sponsor?
No rate-limited APIs. No lock-in. Build and own your data pipeline with direct guidance from the creator of Crawl4AI.
Features
<details> <summary><strong>Markdown Generation</strong></summary>
- Clean Markdown: Generates clean, structured Markdown with accurate formatting.
- Fit Markdown: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- Citations and References: Converts page links into a numbered reference list with clean citations.
- Custom Strategies: Users can create their own Markdown generation strategies tailored to specific needs.
- BM25 Algorithm: Employs BM25-based filtering for extracting core information and removing irrelevant content.
- LLM-Driven Extraction: Supports all LLMs (open-source and proprietary) for structured data extraction.
- Chunking Strategies: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
- Cosine Similarity: Find relevant content chunks based on user queries for semantic extraction.
- CSS-Based Extraction: Fast schema-based data extraction using XPath and CSS selectors.
- Schema Definition: Define custom schemas for extracting structured JSON from repetitive patterns.
- Managed Browser: Use user-owned browsers with full control, avoiding bot detection.
- Remote Browser Control: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
- Browser Profiler: Create and manage persistent profiles with saved authentication states, cookies, and settings.
- Session Management: Preserve browser states and reuse them for multi-step crawling.
- Proxy Support: Seamlessly connect to proxies with authentication for secure access.
- Full Browser Control: Modify headers, cookies, user agents, and more for tailored crawling setups.
- Multi-Browser Support: Compatible with Chromium, Firefox, and WebKit.
- Dynamic Viewport Adjustment: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
- Media Support: Extract images, audio, videos, and responsive image formats like srcset and picture.
- Dynamic Crawling: Execute JS and wait for async or sync for dynamic content extraction.
- Screenshots: Capture page screenshots during crawling for debugging or analysis.
- Raw Data Crawling: Directly process raw HTML (raw:) or local files (file://).
- Comprehensive Link Extraction: Ext
</details>
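Two of the ideas in the feature list, sentence-level chunking and BM25-based filtering, can be illustrated in a few lines of plain Python. This is a standalone sketch of the general technique, not Crawl4AI's implementation: the function names, the regex sentence splitter, and the parameter defaults (`k1=1.5`, `b=0.75`, a zero score threshold) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence-level chunking: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with Okapi BM25."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many chunks each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalisation penalises unusually long chunks.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


def filter_chunks(query, text, threshold=0.0):
    """Keep only the sentences whose BM25 score exceeds the threshold."""
    chunks = split_sentences(text)
    scores = bm25_scores(query, chunks)
    return [c for c, s in zip(chunks, scores) if s > threshold]
```

Calling `filter_chunks("inflation markets", page_text)` keeps only the sentences that mention at least one query term; raising `threshold` makes the filter stricter.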
