LexiCrawler - Your LLM-Ready Web Content Harvester

License: MIT

LexiCrawler is a powerful Go-based web crawling API meticulously designed to extract, clean, and transform web page content into a pristine Markdown format, perfectly tailored for Large Language Models (LLMs). Stop feeding your LLMs messy HTML – start giving them the clear, structured text they crave with LexiCrawler!


✨ Key Features - Supercharge Your LLM Data Pipeline

LexiCrawler isn't just another web crawler; it's a content refinement engine built for the AI era. Here's what makes it stand out:

  • 📝 LLM-Optimized Markdown Output: Delivers content in clean, well-formatted Markdown, the ideal input for optimal LLM performance. Say goodbye to HTML parsing headaches in your AI workflows.

  • 📖 Intelligent Readability Enhancement: Powered by go-shiori/go-readability, LexiCrawler expertly strips away website clutter – navigation, ads, sidebars – focusing on the core, readable article content. Maximize the signal, minimize the noise for your models.

  • ▶️ Dynamic Content Mastery with JavaScript Rendering: Utilizing chromedp, LexiCrawler handles modern, script-heavy web pages. It executes JavaScript before extraction, so you capture dynamically loaded content that static scrapers miss.

  • 🕷️ Efficient & Configurable Web Crawling: Built on gocolly/colly, LexiCrawler offers robust, asynchronous crawling. Define allowed domains, set crawl depths, and respect robots.txt – all with Go speed and efficiency.

  • Essential Metadata Extraction: Automatically extracts crucial metadata like page titles and descriptions, providing valuable context alongside the content for richer LLM understanding.

  • Structured Data Snippets (Example Included): Demonstrates the power to extract structured information. The included example extracts blog post titles and links, showcasing the potential for tailored data harvesting.

  • 📸 Optional Screenshot Capture: Need visual documentation? LexiCrawler can capture screenshots of crawled pages, providing a visual record alongside the text content.

  • 📦 Smart Content Caching: Reduces redundant crawling and speeds up development with built-in in-memory caching. Get faster iterations and save on network resources.

  • Basic Heuristics Filtering: Includes initial heuristics to filter out very short paragraphs, further refining content quality and focusing on substantial text.

  • ⚙️ Highly Configurable: Tailor LexiCrawler to your specific needs with a comprehensive configuration:

    • Target URL and Allowed Domains
    • Maximum Crawl Depth
    • JavaScript Execution Control
    • Screenshotting Toggle
    • Caching Enable/Disable
    • Readability Feature Switch
    • Heuristics On/Off
  • 🔌 Simple REST API Interface: Exposed as a straightforward REST API using gofiber, making integration into your existing applications and data pipelines effortless. Just send a URL and receive clean Markdown!

  • 🚀 Built with Go Performance: Leverage the speed, concurrency, and efficiency of the Go programming language for rapid and scalable web crawling.
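
To make the heuristics feature above more concrete, here is a minimal sketch of a short-paragraph filter. It is illustrative only: the function name filterShortParagraphs, the word-count threshold, and the rule that headings are always kept are assumptions, not LexiCrawler's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// filterShortParagraphs drops paragraphs with fewer than minWords words.
// Paragraphs are blocks separated by blank lines. Headings (blocks starting
// with '#') are always kept. Both rules are assumptions for illustration.
func filterShortParagraphs(markdown string, minWords int) string {
	var kept []string
	for _, block := range strings.Split(markdown, "\n\n") {
		block = strings.TrimSpace(block)
		if block == "" {
			continue
		}
		if strings.HasPrefix(block, "#") || len(strings.Fields(block)) >= minWords {
			kept = append(kept, block)
		}
	}
	return strings.Join(kept, "\n\n")
}

func main() {
	page := "# Example Domain\n\nShare this\n\nThis domain is for use in illustrative examples in documents."
	// "Share this" has only two words, so it is filtered out.
	fmt.Println(filterShortParagraphs(page, 5))
}
```

A word count is only one possible signal. A real filter might also look at link density or punctuation, which is why the threshold is a parameter here.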


🚀 Getting Started - Crawl in Minutes

Ready to unleash LexiCrawler? Follow these simple steps:

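Prerequisites

  • Go Installation: Ensure you have Go installed on your system. Download from https://go.dev/dl/.

  • Chrome or Chromium: Needed only if you enable JavaScript rendering, since chromedp drives a real browser.

Run the API

From the directory containing main.go, start the server:

    go run main.go

The LexiCrawler API will now be running at http://localhost:3000.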

Basic Usage

Send a GET request to the /crawl endpoint with the url query parameter to crawl a specific webpage and receive its Markdown content:

curl "http://localhost:3000/crawl?url=https://www.example.com"

Example Response:

# Example Domain

This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.

[More information...](http://www.iana.org/domains/example)
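
The same request can be made from Go. The sketch below assumes the server is running at http://localhost:3000 as described above and that the response body is the Markdown text. The helper names crawlEndpoint and fetchMarkdown, and the 60-second timeout, are hypothetical choices for this example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// crawlEndpoint builds the /crawl request URL for a LexiCrawler instance.
// The target URL is query-escaped so it travels as a single parameter.
func crawlEndpoint(baseURL, target string) string {
	q := url.Values{}
	q.Set("url", target)
	return baseURL + "/crawl?" + q.Encode()
}

// fetchMarkdown requests the page and returns the response body as text.
// Treating any non-200 status as an error is an assumption of this sketch.
func fetchMarkdown(client *http.Client, baseURL, target string) (string, error) {
	resp, err := client.Get(crawlEndpoint(baseURL, target))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("crawl failed: %s: %s", resp.Status, body)
	}
	return string(body), nil
}

func main() {
	// Rendering a page can be slow, so allow a generous timeout.
	client := &http.Client{Timeout: 60 * time.Second}
	md, err := fetchMarkdown(client, "http://localhost:3000", "https://www.example.com")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(md)
}
```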

⚙️ Configuration - Fine-Tune Your Crawls

LexiCrawler offers several configuration options to tailor its behavior. You can control these via:

  • API Query Parameters: For on-the-fly adjustments per request.
  • Modifying CrawlerConfig in main.go: For setting default crawler behaviors.

API Query Parameters

| Parameter | Description | Type | Default |
|-------------------|--------------------------------------------------------------------|---------|-------------|
| url | Required. The URL to crawl. | String | - |
| readability | Enable/disable readability enhancement. | Boolean | false |
| js | Enable/disable JavaScript rendering (dynamic content handling). | Boolean | false |
| screenshots | Enable/disable screenshot capture. | Boolean | false |
| cache | Enable/disable content caching. | Boolean | false |
| heuristics | Enable/disable basic heuristics filtering. | Boolean | false |
| content_selectors | Comma-separated CSS selectors to target specific content sections. | String | (Full page) |

Example API Request with Parameters:

curl "http://localhost:3000/crawl?url=https://blog.example.com/article-title&readability=true&js=true&screenshots=false"
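
One thing to watch when combining parameters: if the target URL carries its own query string, an unescaped & splits it, and the remainder is read as parameters of /crawl. The sketch below shows the effect with Go's standard query parser, which is assumed here to behave like the server's. The helpers crawlQuery and urlParam are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// crawlQuery encodes the target URL and boolean flags into a query string.
// url.Values.Encode percent-escapes every value, including '&' and '?'.
func crawlQuery(target string, flags map[string]bool) string {
	q := url.Values{}
	q.Set("url", target)
	for name, on := range flags {
		q.Set(name, strconv.FormatBool(on))
	}
	return q.Encode()
}

// urlParam returns the value a standard query parser sees for "url".
func urlParam(rawQuery string) string {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return ""
	}
	return q.Get("url")
}

func main() {
	target := "https://shop.example.com/search?q=go&page=2"

	// Naive concatenation: "page=2" becomes a parameter of /crawl itself.
	naive := "url=" + target + "&js=true"
	fmt.Println(urlParam(naive)) // https://shop.example.com/search?q=go

	// Escaped: the whole target survives as one value.
	escaped := crawlQuery(target, map[string]bool{"js": true})
	fmt.Println(urlParam(escaped)) // https://shop.example.com/search?q=go&page=2
}
```

With curl, the equivalent is to pass -G together with --data-urlencode "url=..." so the target is escaped for you.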

CrawlerConfig Options (in main.go)

config := CrawlerConfig{
    StartURL:          "",         // Set via API parameter
    AllowedDomains:    []string{}, // Dynamically set from URL
    MaxDepth:          2,          // Default crawl depth
    EnableJS:          false,      // Default JS rendering off
    EnableScreenshots: false,      // Default screenshots off
    CacheEnabled:      false,      // Default caching off
    HeuristicsEnabled: false,      // Default heuristics off
    EnableReadability: false,      // Default readability off
    // ContentSelectors: []string{}, // Can be set here or via API parameter
}

Modify these values in main.go to set the default behavior of your crawler. API query parameters will override these defaults for individual requests.
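
The override behaviour can be pictured as "start from the defaults, then apply whatever the request supplied". The sketch below is one way to express that, not the code in main.go. The struct repeats the field names shown above with assumed types, and applyQuery is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// CrawlerConfig mirrors the fields shown above. The field types are
// assumptions; the real struct in main.go may differ.
type CrawlerConfig struct {
	StartURL          string
	AllowedDomains    []string
	MaxDepth          int
	EnableJS          bool
	EnableScreenshots bool
	CacheEnabled      bool
	HeuristicsEnabled bool
	EnableReadability bool
	ContentSelectors  []string
}

// applyQuery returns a copy of defaults with recognised query parameters
// applied on top. Parameters that are absent leave the default untouched.
func applyQuery(defaults CrawlerConfig, rawQuery string) (CrawlerConfig, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return defaults, err
	}
	cfg := defaults
	flags := map[string]*bool{
		"js":          &cfg.EnableJS,
		"screenshots": &cfg.EnableScreenshots,
		"cache":       &cfg.CacheEnabled,
		"heuristics":  &cfg.HeuristicsEnabled,
		"readability": &cfg.EnableReadability,
	}
	for name, field := range flags {
		if !q.Has(name) {
			continue
		}
		v, err := strconv.ParseBool(q.Get(name))
		if err != nil {
			return defaults, fmt.Errorf("parameter %q: %w", name, err)
		}
		*field = v
	}
	if target := q.Get("url"); target != "" {
		cfg.StartURL = target
		// Restrict the crawl to the host of the start URL.
		if u, err := url.Parse(target); err == nil {
			cfg.AllowedDomains = []string{u.Hostname()}
		}
	}
	return cfg, nil
}

func main() {
	defaults := CrawlerConfig{MaxDepth: 2}
	cfg, err := applyQuery(defaults, "url=https%3A%2F%2Fblog.example.com%2Fpost&readability=true&js=true")
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Checking whether a parameter is present, rather than reading its value directly, is what lets a request turn a feature off even when the default in main.go has been changed to on.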


📚 Usage Examples - Unlocking Web Content for LLMs

Here are a few examples to illustrate LexiCrawler's versatility:

1. Crawling a Blog Post with Readability and Markdown Output:

curl "http://localhost:3000/crawl?url=https://example-blog.com/great-article&readability=true"

Sample Markdown Output (cleaned and readable):

## The Greatness of Example Blog Articles

This is the main content of a fantastic blog article...

... more insightful paragraphs ...

**Key Takeaways:**

*   Point 1
*   Point 2
*   Point 3

[Read the full article on Example Blog](https://example-blog.com/great-article)

2. Crawling a Dynamic Web Application with JavaScript Rendering:

curl "http://localhost:3000/crawl?url=https://dynamic-webapp.com/dashboard&js=true"

LexiCrawler will use chromedp to render the page, ensuring content loaded by JavaScript is captured.

3. Targeting Specific Content Sections with CSS Selectors:

Let's say you only want to extract the main article body from a news website, identified by the CSS class .article-body:

curl "http://localhost:3000/crawl?url=https://news-site.com/latest-news&content_selectors=.article-body"

LexiCrawler will only process and return the Markdown content found within elements matching the .article-body selector.
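
Since content_selectors is a comma-separated list, the server has to split it into individual selectors at some point. A minimal sketch of that step follows. The function name parseSelectors and the choice to trim whitespace and skip empty entries are assumptions, not confirmed behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSelectors splits a comma-separated content_selectors value into
// individual CSS selectors, trimming whitespace and skipping empty entries.
func parseSelectors(raw string) []string {
	var selectors []string
	for _, part := range strings.Split(raw, ",") {
		if s := strings.TrimSpace(part); s != "" {
			selectors = append(selectors, s)
		}
	}
	return selectors
}

func main() {
	fmt.Printf("%q\n", parseSelectors(".article-body, #main ,, .post-content"))
	// prints [".article-body" "#main" ".post-content"]
}
```

When passing an ID selector such as #main in a URL, percent-encode the # as %23. Otherwise the client treats everything after it as a fragment and never sends it to the server.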


🤝 Contributing - Build the Future of LLM Data

LexiCrawler is open source and thrives on community contributions! We welcome:

  • Feature Requests: Have a great idea to enhance LexiCrawler? Open an issue!
  • Bug Reports: Found a bug? Please report it with clear steps to reproduce.
  • Pull Requests: Code contributions are highly appreciated! Please follow these guidelines:
    • Code Style: Adhere to standard Go coding conventions.
    • Testing: Include tests for new features or bug fixes whenever possible.
    • Clear Commit Messages: Write descriptive commit messages explaining your changes.

To contribute, please fork the repository, make your changes in a branch, and submit a pull request.


📜 License

LexiCrawler is released under the MIT License. Feel free to use, modify, and distribute it as you wish.


Author & Maintainer

Srinath Pulaverthi

Let's make web data truly LLM-ready together! Happy crawling!
