
Crawl4ai

🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN

Install / Use

/learn @unclecode/Crawl4ai
README

🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.

<div align="center">

<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>



🚀 Crawl4AI Cloud API — Closed Beta (Launching Soon)

Reliable, large-scale web extraction, built to be drastically more cost-effective than existing solutions.

👉 Apply here for early access
We'll be onboarding in phases and working closely with early users. Limited slots.


<p align="center"> <a href="https://x.com/crawl4ai"> <img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" /> </a> <a href="https://www.linkedin.com/company/crawl4ai"> <img src="https://img.shields.io/badge/Follow%20on%20LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="Follow on LinkedIn" /> </a> <a href="https://discord.gg/jP8KfhDhyN"> <img src="https://img.shields.io/badge/Join%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord" /> </a> </p> </div>

Crawl4AI turns the web into clean, LLM-ready Markdown for RAG, agents, and data pipelines. Fast, controllable, and battle-tested by a 50k+ star community.

✨ Check out the latest update: v0.8.5

✨ New in v0.8.5: Anti-Bot Detection, Shadow DOM & 60+ Bug Fixes! Automatic 3-tier anti-bot detection with proxy escalation, Shadow DOM flattening, deep crawl cancellation, config defaults API, consent popup removal, and critical security patches. Release notes →

✨ Recent v0.8.0: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with resume_state and on_state_change callbacks for long-running crawls. New prefetch=True mode for 5-10x faster URL discovery. Release notes →

✨ Previous v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates. Release notes →

✨ Previous v0.7.7: Complete Self-Hosting Platform with Real-time Monitoring! Enterprise-grade monitoring dashboard, comprehensive REST API, WebSocket streaming, and smart browser pool management. Release notes →

<details> <summary>🤓 <strong>My Personal Story</strong></summary>

I grew up on an Amstrad, thanks to my dad, and never stopped building. In grad school I specialized in NLP and built crawlers for research. That's where I learned how much extraction matters.

In 2023, I needed web-to-Markdown. The "open source" option wanted an account, an API token, and $16, and it still under-delivered. I went turbo anger mode, built Crawl4AI in days, and it went viral. Now it's the most-starred crawler on GitHub.

I made it open source for availability: anyone can use it without a gate. Now I'm building the platform for affordability: anyone can run serious crawls without breaking the bank. If that resonates, join in, send feedback, or just crawl something amazing.

</details> <details> <summary>Why developers pick Crawl4AI</summary>
  • LLM-ready output: smart Markdown with headings, tables, code, and citation hints
  • Fast in practice: async browser pool, caching, minimal hops
  • Full control: sessions, proxies, cookies, user scripts, hooks
  • Adaptive intelligence: learns site patterns, explores only what matters
  • Deploy anywhere: zero keys, CLI and Docker, cloud friendly
</details>

🚀 Quick Start

1. Install Crawl4AI:

```bash
# Install the package
pip install -U crawl4ai

# For pre-release versions
pip install crawl4ai --pre

# Run post-installation setup
crawl4ai-setup

# Verify your installation
crawl4ai-doctor
```

If you encounter any browser-related issues, you can install the browsers manually:

```bash
python -m playwright install --with-deps chromium
```
2. Run a simple web crawl with Python:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
3. Or use the new command-line interface:

```bash
# Basic crawl with markdown output
crwl https://www.nbcnews.com/business -o markdown

# Deep crawl with BFS strategy, max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# Use LLM extraction with a specific question
crwl https://www.example.com/products -q "Extract all product prices"
```
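The `--deep-crawl bfs` mode above discovers pages level by level under a page budget. A minimal sketch of the idea in plain Python, using a hypothetical in-memory link graph in place of real page fetches (this is an illustration of BFS discovery, not Crawl4AI's actual implementation):

```python
from collections import deque

# Hypothetical link graph standing in for real fetches; in a real crawl
# each entry would come from parsing a fetched page's <a href> targets.
LINKS = {
    "https://docs.example.com/": ["https://docs.example.com/a", "https://docs.example.com/b"],
    "https://docs.example.com/a": ["https://docs.example.com/b", "https://docs.example.com/c"],
    "https://docs.example.com/b": [],
    "https://docs.example.com/c": ["https://docs.example.com/"],
}

def bfs_crawl(start, max_pages=10):
    """Breadth-first URL discovery with a visited set and a page budget."""
    seen, order = {start}, []
    queue = deque([start])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)  # "crawl" the page
        for nxt in LINKS.get(url, []):
            if nxt not in seen:  # the visited set prevents revisits and cycles
                seen.add(nxt)
                queue.append(nxt)
    return order
```

The visited set keeps cycles (like `/c` linking back to the root) from looping forever, and `max_pages` plays the role of the `--max-pages 10` budget.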

💖 Support Crawl4AI

🎉 Sponsorship Program Now Open! After powering 51K+ developers and 1 year of growth, Crawl4AI is launching dedicated support for startups and enterprises. Be among the first 50 Founding Sponsors for permanent recognition in our Hall of Fame.

Crawl4AI is the #1 trending open-source web crawler on GitHub. Your support keeps it independent, innovative, and free for the community, while giving you direct access to premium benefits.

<div align="">

Become a Sponsor
Current Sponsors

</div>

🤝 Sponsorship Tiers

  • 🌱 Believer ($5/mo): Join the movement for data democratization
  • 🚀 Builder ($50/mo): Priority support & early access to features
  • 💼 Growing Team ($500/mo): Bi-weekly syncs & optimization help
  • 🏢 Data Infrastructure Partner ($2000/mo): Full partnership with dedicated support
    Custom arrangements available; see SPONSORS.md for details & contact

Why sponsor?
No rate-limited APIs. No lock-in. Build and own your data pipeline with direct guidance from the creator of Crawl4AI.

See All Tiers & Benefits →

✨ Features

<details> <summary>📝 <strong>Markdown Generation</strong></summary>
  • 🧹 Clean Markdown: Generates clean, structured Markdown with accurate formatting.
  • 🎯 Fit Markdown: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
  • 🔗 Citations and References: Converts page links into a numbered reference list with clean citations.
  • 🛠️ Custom Strategies: Users can create their own Markdown generation strategies tailored to specific needs.
  • 📚 BM25 Algorithm: Employs BM25-based filtering for extracting core information and removing irrelevant content.
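BM25-based filtering ranks text blocks against a query so low-scoring noise can be dropped. A self-contained sketch of Okapi BM25 scoring (an illustration of the algorithm, not Crawl4AI's internal code):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    terms = query.lower().split()
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n  # average document length
    # Number of documents containing each query term.
    df = {t: sum(1 for d in tokenized if t in d) for t in terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for t in terms:
            if df[t] == 0:
                continue  # term appears nowhere, contributes nothing
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation, normalized by document length.
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Blocks scoring below some threshold relative to the query can then be discarded, which is the essence of query-aware noise filtering.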
</details> <details> <summary>📊 <strong>Structured Data Extraction</strong></summary>
  • 🤖 LLM-Driven Extraction: Supports all LLMs (open-source and proprietary) for structured data extraction.
  • 🧱 Chunking Strategies: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
  • 🌌 Cosine Similarity: Find relevant content chunks based on user queries for semantic extraction.
  • 🔎 CSS-Based Extraction: Fast schema-based data extraction using XPath and CSS selectors.
  • 🔧 Schema Definition: Define custom schemas for extracting structured JSON from repetitive patterns.
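Cosine-similarity chunk selection can be sketched with plain term-frequency vectors; real pipelines would typically use embeddings, so this is an illustrative stand-in, not Crawl4AI's implementation:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    return sorted(chunks, key=lambda c: cosine(query, c), reverse=True)[:k]
```

Chunks with near-zero similarity to the query are simply never selected, which is how query-driven semantic extraction narrows a page down to the relevant parts.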
</details> <details> <summary>🌐 <strong>Browser Integration</strong></summary>
  • 🖥️ Managed Browser: Use user-owned browsers with full control, avoiding bot detection.
  • 🔄 Remote Browser Control: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
  • 👤 Browser Profiler: Create and manage persistent profiles with saved authentication states, cookies, and settings.
  • 🔒 Session Management: Preserve browser states and reuse them for multi-step crawling.
  • 🧩 Proxy Support: Seamlessly connect to proxies with authentication for secure access.
  • ⚙️ Full Browser Control: Modify headers, cookies, user agents, and more for tailored crawling setups.
  • 🌍 Multi-Browser Support: Compatible with Chromium, Firefox, and WebKit.
  • 📏 Dynamic Viewport Adjustment: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
</details> <details> <summary>🔎 <strong>Crawling & Scraping</strong></summary>
  • 🖼️ Media Support: Extract images, audio, videos, and responsive image formats like srcset and picture.
  • 🚀 Dynamic Crawling: Execute JS and wait on async or sync conditions to extract dynamic content.
  • 📸 Screenshots: Capture page screenshots during crawling for debugging or analysis.
  • 📂 Raw Data Crawling: Directly process raw HTML (raw:) or local files (file://).
  • 🔗 Comprehensive Link Extraction: Ext

GitHub Stars: 62.7k · Forks: 6.4k · Category: Development · Language: Python

Security Score: 95/100 (audited on Mar 26, 2026; no findings)