Supercrawler

A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

Node.js Web Crawler

Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.

When Supercrawler successfully crawls a page (which could be an image, a text document or any other file), it will fire your custom content-type handlers. Define your own custom handlers to parse pages, save data and do anything else you need.

Features

Link Detection. Supercrawler will parse crawled HTML documents, identify links and add them to the queue.
Robots Parsing. Supercrawler will request robots.txt and check the rules before crawling. It will also identify any sitemaps.
Sitemaps Parsing. Supercrawler will read links from XML sitemap files, and add links to the queue.
Concurrency Limiting. Supercrawler limits the number of requests sent out at any one time.
Rate Limiting. Supercrawler will add a delay between requests to avoid bombarding servers.
Exponential Backoff Retry. Supercrawler will retry failed requests after 1 hour, then 2 hours, then 4 hours, etc. To use this feature, you must use the database-backed or Redis-backed crawl queue.
Hostname Balancing. Supercrawler will fairly split requests between different hostnames. To use this feature, you must use the Redis-backed crawl queue.
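
To make the idea of hostname balancing concrete, here is a minimal sketch that interleaves URLs so each hostname gets a turn. It is only an illustration of the concept: `balanceByHostname` is a hypothetical helper, and nothing here describes how the Redis-backed queue actually implements balancing.

```javascript
// Illustration only: interleave URLs so that each hostname gets a turn.
// This is NOT the Redis-backed queue's implementation.
function balanceByHostname(urls) {
  var byHost = new Map();

  // Group URLs by hostname, preserving the order in which they were seen.
  urls.forEach(function (u) {
    var host = new URL(u).hostname;
    if (!byHost.has(host)) {
      byHost.set(host, []);
    }
    byHost.get(host).push(u);
  });

  // Take one URL from each hostname in turn until every group is empty.
  var groups = Array.from(byHost.values());
  var ordered = [];
  var remaining = true;
  while (remaining) {
    remaining = false;
    groups.forEach(function (group) {
      if (group.length > 0) {
        ordered.push(group.shift());
        remaining = true;
      }
    });
  }
  return ordered;
}

var ordered = balanceByHostname([
  "http://a.example/1",
  "http://a.example/2",
  "http://a.example/3",
  "http://b.example/1"
]);
console.log(ordered);
```

In this sketch the single URL on `b.example` is served second instead of waiting behind every URL on `a.example`.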

How It Works

Crawling is controlled by an instance of the Crawler object, which acts like a web client. It is responsible for coordinating with the priority queue, sending requests according to the concurrency and rate limits, checking the robots.txt rules and despatching content to the custom content handlers to be processed. Once started, it will automatically crawl pages until you ask it to stop.

The Priority Queue or UrlList keeps track of which URLs need to be crawled, and the order in which they are to be crawled. The Crawler will pass new URLs discovered by the content handlers to the priority queue. When the crawler is ready to crawl the next page, it will call the getNextUrl method. This method will work out which URL should be crawled next, based on implementation-specific rules. Any retry logic is handled by the queue.
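
The contract described above can be sketched with a small in-memory queue. `MemoryUrlList` is a hypothetical name, and several details are assumptions made for illustration: it stores plain strings rather than Url objects, it returns promises from both methods, and it rejects when the queue is empty. The real queues may behave differently.

```javascript
// Hypothetical in-memory queue that mirrors the method names used in this
// README. Plain strings stand in for Url objects (an assumption).
function MemoryUrlList() {
  this._queue = [];
  this._seen = new Set();
}

// Add a URL unless it has been seen before.
// Resolves to true if the URL was added, false if it was a duplicate.
MemoryUrlList.prototype.insertIfNotExists = function (url) {
  if (this._seen.has(url)) {
    return Promise.resolve(false);
  }
  this._seen.add(url);
  this._queue.push(url);
  return Promise.resolve(true);
};

// Resolve to the oldest queued URL (FIFO order).
// Rejecting on an empty queue is an assumption of this sketch.
MemoryUrlList.prototype.getNextUrl = function () {
  if (this._queue.length === 0) {
    return Promise.reject(new RangeError("The list has been exhausted."));
  }
  return Promise.resolve(this._queue.shift());
};

var list = new MemoryUrlList();
list.insertIfNotExists("http://example.com/");
list.insertIfNotExists("http://example.com/"); // duplicate, ignored
list.insertIfNotExists("http://example.com/about");
list.getNextUrl().then(function (url) {
  console.log("Next URL:", url);
});
```

A queue with different ordering rules (for example, retry scheduling) would change only `getNextUrl`; the crawler-facing method names stay the same.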

The Content Handlers are functions which take content buffers and do some further processing with them. You will almost certainly want to create your own content handlers to analyze pages or store data, for example. The content handlers tell the Crawler about new URLs that should be crawled in the future. Supercrawler provides content handlers to parse links from HTML pages, analyze robots.txt files for Sitemap: directives and parse sitemap files for URLs.
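
As a rough sketch of a custom content handler, the function below extracts links from an HTML buffer. It assumes that a handler reports new URLs by returning an array of absolute URL strings, which this section does not state explicitly, and it uses a simple regular expression in place of a real HTML parser. `hrefHandler` is a hypothetical name; the `context.body` and `context.url` fields follow the handler example in this README.

```javascript
// Sketch of a custom handler. Assumption: a handler reports new URLs by
// returning an array of absolute URL strings.
function hrefHandler(context) {
  var html = context.body.toString();
  // Simplified pattern: matches double-quoted href attributes on <a> tags.
  var pattern = /<a\s[^>]*href="([^"]+)"/gi;
  var found = [];
  var match;

  while ((match = pattern.exec(html)) !== null) {
    try {
      // Resolve relative links against the URL of the page being processed.
      found.push(new URL(match[1], context.url).href);
    } catch (err) {
      // Ignore values that cannot be parsed as URLs.
    }
  }
  return found;
}

// Exercise the handler with a hand-built context object.
var links = hrefHandler({
  url: "http://example.com/docs/",
  body: Buffer.from('<a href="/about">About</a> <a class="x" href="page.html">Page</a>')
});
console.log(links);
```

For real crawls, the provided HTML link parser is likely the better choice; this sketch only shows the shape of a handler.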

Get Started

First, install Supercrawler.

npm install supercrawler --save

Second, create an instance of Crawler.

var supercrawler = require("supercrawler");

// 1. Create a new instance of the Crawler object, providing configuration
// details. Note that configuration cannot be changed after the object is
// created.
var crawler = new supercrawler.Crawler({
  // By default, Supercrawler uses a simple FIFO queue, which doesn't support
  // retries or memory of crawl state. For any non-trivial crawl, you should
  // create a database. Provide your database config to the constructor of
  // DbUrlList.
  urlList: new supercrawler.DbUrlList({
    db: {
      database: "crawler",
      username: "root",
      password: secrets.db.password,
      sequelizeOpts: {
        dialect: "mysql",
        host: "localhost"
      }
    }
  }),
  // Time (ms) between requests.
  interval: 1000,
  // Maximum number of requests at any one time.
  concurrentRequestsLimit: 5,
  // Time (ms) to cache the results of robots.txt queries.
  robotsCacheTime: 3600000,
  // User agent string to use during the crawl.
  userAgent: "Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler)",
  // Custom options to be passed to request.
  request: {
    headers: {
      'x-custom-header': 'example'
    }
  }
});

Third, add some content handlers.

// Get "Sitemap:" directives from robots.txt
crawler.addHandler(supercrawler.handlers.robotsParser());

// Crawl sitemap files and extract their URLs.
crawler.addHandler(supercrawler.handlers.sitemapsParser());

// Pick up <a href> links from HTML documents
crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
  // Restrict discovered links to the following hostnames.
  hostnames: ["example.com"]
}));

// Match an array of content types. myCustomHandler is your own handler function.
crawler.addHandler(["text/plain", "text/html"], myCustomHandler);

// Custom content handler for HTML pages.
crawler.addHandler("text/html", function (context) {
  var sizeKb = Buffer.byteLength(context.body) / 1024;
  logger.info("Processed", context.url, "Size=", sizeKb, "KB");
});

Fourth, add a URL to the queue and start the crawl.

crawler.getUrlList()
  .insertIfNotExists(new supercrawler.Url("http://example.com/"))
  .then(function () {
    return crawler.start();
  });

That's it! Supercrawler will handle the crawling for you. You only have to define your custom behaviour in the content handlers.

Crawler

Each Crawler instance represents a web crawler. You can configure your crawler with the following options:

| Option | Description |
| --- | --- |
| urlList | Custom instance of UrlList type queue. Defaults to FifoUrlList, which processes URLs in the order that they were added to the queue; once they are removed from the queue, they cannot be recrawled. |
| interval | Number of milliseconds between requests. Defaults to 1000. |
| concurrentRequestsLimit | Maximum number of concurrent requests. Defaults to 5. |
| robotsEnabled | Indicates if the robots.txt is downloaded and checked. Defaults to true. |
| robotsCacheTime | Number of milliseconds that robots.txt should be cached for. Defaults to 3600000 (1 hour). |
| robotsIgnoreServerError | Indicates if 500 status code response for robots.txt should be ignored. Defaults to false. |
| userAgent | User agent to use for requests. This can be either a string or a function that takes the URL being crawled. Defaults to Mozilla/5.0 (compatible; supercrawler/1.0; +https://github.com/brendonboshell/supercrawler). |
| request | Object of options to be passed to request. Note that request does not support an asynchronous (and distributed) cookie jar. |

Example usage:

var crawler = new supercrawler.Crawler({
  interval: 1000,
  concurrentRequestsLimit: 1
});
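
Because the userAgent option can also be a function, a per-URL user agent might look like the sketch below. `userAgentFor` and the hostname `internal.example` are made up for illustration, and the sketch assumes the URL is passed to the function as a string.

```javascript
// Hypothetical helper: choose a user agent string based on the URL's hostname.
// Assumption: the crawler passes the URL as a string.
function userAgentFor(url) {
  var host = new URL(url).hostname;
  var base = "Mozilla/5.0 (compatible; supercrawler/1.0; " +
    "+https://github.com/brendonboshell/supercrawler)";

  // "internal.example" is a made-up hostname used only for illustration.
  if (host === "internal.example") {
    return base + " internal-audit";
  }
  return base;
}

// It would then be passed as an option, for example:
//   new supercrawler.Crawler({ userAgent: userAgentFor });
console.log(userAgentFor("http://internal.example/page"));
```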

The following methods are available:

| Method | Description |
| --- | --- |
| getUrlList | Get the UrlList type instance. |
| getInterval | Get the interval setting. |
| getConcurrentRequestsLimit | Get the maximum number of concurrent requests. |
| getUserAgent | Get the user agent. |
| start | Start crawling. |
| stop | Stop crawling. |
| addHandler(handler) | Add a handler for all content types. |
| addHandler(contentType, handler) | Add a handler for a specific content type. If contentType is a string, then (for example) 'text' will match 'text/html', 'text/plain', etc. If contentType is an array of strings, the page content type must match exactly. |
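
The content-type matching rule for addHandler can be illustrated with a small standalone function. `matchesContentType` is a hypothetical helper, not part of the library, and treating the string case as a simple prefix match is an assumption based on the 'text' example above.

```javascript
// Hypothetical helper illustrating the matching rule described for addHandler.
// - Array of strings: the page content type must equal one entry exactly.
// - Single string: treated as a prefix (an assumption of this sketch).
function matchesContentType(handlerType, pageType) {
  if (Array.isArray(handlerType)) {
    return handlerType.indexOf(pageType) !== -1;
  }
  return pageType.indexOf(handlerType) === 0;
}

console.log(matchesContentType("text", "text/html"));          // prefix match
console.log(matchesContentType(["text"], "text/html"));        // no exact match
console.log(matchesContentType(["text/html"], "text/html"));   // exact match
```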

The Crawler object fires the following events:

| Event | Description |
| --- | --- |
| crawlurl(url) | Fires when crawling starts with a new URL. |
| crawledurl(url, errorCode, statusCode, errorMessage) | Fires when crawling of a URL is complete. errorCode is null if no error occurred. statusCode is set if and only if the request was successful. errorMessage is null if no error occurred. |
| urllistempty | Fires when the URL list is (intermittently) empty. |
| urllistcomplete | Fires when the URL list is permanently empty, barring URLs added by external sources. This only makes sense when running Supercrawler in non-distributed fashion. |
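
A sketch of listening for these events follows. It assumes the Crawler exposes the standard Node.js EventEmitter `on` method, which this section does not state, so a plain EventEmitter stands in for the crawler to keep the example runnable on its own. `attachLogging` and the error code `EXAMPLE_ERROR` are made up for illustration.

```javascript
var EventEmitter = require("events");

// Hypothetical helper: log the start and outcome of each crawled URL.
// Assumption: the emitter has the standard Node.js `on` method.
function attachLogging(emitter, log) {
  emitter.on("crawlurl", function (url) {
    log("start " + url);
  });
  emitter.on("crawledurl", function (url, errorCode, statusCode, errorMessage) {
    if (errorCode === null) {
      log("done " + url + " status=" + statusCode);
    } else {
      log("failed " + url + " " + errorCode + ": " + errorMessage);
    }
  });
}

// Stand-in for a Crawler instance; events are emitted by hand here.
var fakeCrawler = new EventEmitter();
var lines = [];
attachLogging(fakeCrawler, function (line) {
  lines.push(line);
});

fakeCrawler.emit("crawlurl", "http://example.com/");
fakeCrawler.emit("crawledurl", "http://example.com/", null, 200, null);
// "EXAMPLE_ERROR" is a made-up code, not one the library is known to emit.
fakeCrawler.emit("crawledurl", "http://example.com/x", "EXAMPLE_ERROR", undefined, "example failure");
console.log(lines.join("\n"));
```

Under the same assumption, a non-distributed crawl could call `crawler.stop()` from a `urllistcomplete` listener to end the crawl once the list is permanently empty.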

DbUrlList

DbUrlList is a queue backed with a database, such as MySQL, Postgres or SQLite. You can use any database engine supported by Sequelize.

If a request fails, this queue will ensure the request gets retried at some point in the future. The next request is scheduled 1 hour into the future. After that, the period of delay doubles for each failure.
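
The retry schedule described above works out to 1 hour, 2 hours, 4 hours, and so on. The arithmetic can be sketched as follows; `retryDelayMs` is a hypothetical helper, not a library function, and the queue's internal bookkeeping may differ.

```javascript
var HOUR_MS = 60 * 60 * 1000;

// Hypothetical helper: delay before the next retry, given how many times
// the request has failed so far (1 for the first failure).
function retryDelayMs(failureCount) {
  if (!Number.isInteger(failureCount) || failureCount < 1) {
    throw new RangeError("failureCount must be a positive integer");
  }
  // 1 hour after the first failure, doubling for each further failure.
  return HOUR_MS * Math.pow(2, failureCount - 1);
}

[1, 2, 3, 4].forEach(function (n) {
  console.log("failure " + n + ": retry in " + retryDelayMs(n) / HOUR_MS + " hour(s)");
});
```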

Options:

| Option | Description |
| --- | --- |
| opts.db.database | Database name. |
| opts.db.username | Database username. |
| opts.db.password | Database password. |
| opts.db.sequelizeOpts | Options to pass to sequelize. |
| opts.db.table | Table name to store URL queue. Default = 'url' |
| opts.recrawlInMs | Number of milliseconds to recrawl a URL. Default = 31536000000 (1 year) |

Example usage:

new supercrawler.DbUrlList({
  db: {
    database: "crawler",
    username: "root",
    password: "password",
    sequelizeOpts: {
      dialect: "mysql",
      host: "localhost"
    }
  }
})

The following methods are available:

| Method | Description |
| --- | --- |
| insertIfNotExists(url) | Insert a Url object. |
| upsert(url) | Upsert Url object. |
| getNextUrl() | Get the next Url to be crawled. |

RedisUrlList

RedisUrlList is a queue backed with Redis.

It also balances requests between different hostnames. So, for example, if you crawl a sitemap file with 10,000 URLs, the next 10,000 URLs will not be stuck in th
