# Silkworm

Async web scraping framework built on Rust. Works with free-threaded Python (`PYTHON_GIL=0`).
## silkworm-rs

Async-first web scraping framework built on wreq (HTTP with browser impersonation) and scraper-rs (fast HTML parsing). Silkworm gives you a minimal Spider/Request/Response model, middlewares, and pipelines so you can script quick scrapes or build larger crawlers without boilerplate.

NEW: Use silkworm-mcp to build scrapers.
## Features

- Async engine with configurable concurrency, bounded queue backpressure (defaults to `concurrency * 10`), and per-request timeouts.
- wreq-powered HTTP client: browser impersonation, redirect following with loop detection, query merging, and proxy support via `request.meta["proxy"]`.
- Typed spiders and callbacks that can return items or `Request` objects; `HTMLResponse` ships helper methods plus `Response.follow` to reuse callbacks.
- Middlewares: User-Agent rotation/default, proxy rotation, retry with exponential backoff + optional sleep codes, flexible delays (fixed/random/custom), `SkipNonHTMLMiddleware` to drop non-HTML callbacks, and `CloudflareCrawlMiddleware` for Browser Rendering crawl jobs.
- Pipelines: JSON Lines, SQLite, XML (nested data preserved), and CSV (flattens dicts and lists) out of the box.
- Structured logging via logly (`SILKWORM_LOG_LEVEL=DEBUG`), plus periodic/final crawl statistics (requests/sec, queue size, memory, seen URLs).
## Installation

From PyPI with pip:

```bash
pip install silkworm-rs
```

From PyPI with uv (recommended for faster installs):

```bash
uv pip install silkworm-rs
# or if using uv's project management:
uv add silkworm-rs
```

From source:

```bash
uv venv                    # install uv from https://docs.astral.sh/uv/getting-started/ if needed
source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install -e .
```

Targets Python 3.13+; dependencies are pinned in pyproject.toml.
## Quick start

Define a spider by subclassing `Spider`, implementing `parse`, and yielding items or follow-up `Request` objects. This example writes quotes to `data/quotes.jl` and enables basic user-agent, retry, and non-HTML filtering middlewares.
```python
from silkworm import HTMLResponse, Response, Spider, run_spider
from silkworm.middlewares import (
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import JsonLinesPipeline


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ("https://quotes.toscrape.com/",)

    async def parse(self, response: Response):
        if not isinstance(response, HTMLResponse):
            return
        html = response
        for quote in await html.select(".quote"):
            text_el = await quote.select_first(".text")
            author_el = await quote.select_first(".author")
            if text_el is None or author_el is None:
                continue
            tags = await quote.select(".tag")
            yield {
                "text": text_el.text,
                "author": author_el.text,
                "tags": [t.text for t in tags],
            }
        if next_link := await html.select_first("li.next > a"):
            yield html.follow(next_link.attr("href"), callback=self.parse)


if __name__ == "__main__":
    run_spider(
        QuotesSpider,
        request_middlewares=[UserAgentMiddleware()],
        response_middlewares=[
            SkipNonHTMLMiddleware(),
            RetryMiddleware(max_times=3, sleep_http_codes=[429, 503]),
        ],
        item_pipelines=[JsonLinesPipeline("data/quotes.jl")],
        concurrency=16,
        request_timeout=10,
        log_stats_interval=30,
    )
```
`run_spider`/`crawl` knobs:

- `concurrency`: number of concurrent HTTP requests; default 16.
- `max_pending_requests`: queue bound to avoid unbounded memory use (defaults to `concurrency * 10`).
- `request_timeout`: per-request timeout (seconds).
- `keep_alive`: reuse HTTP connections when supported by the underlying client (sends `Connection: keep-alive`).
- `html_max_size_bytes`: limit HTML parsed into `AsyncDocument` to avoid huge payloads.
- `log_stats_interval`: seconds between periodic stats logs; final stats are always emitted.
- `request_middlewares`/`response_middlewares`/`item_pipelines`: plug-ins run on every request/response/item.
- Use `run_spider_rsloop(...)` instead of `run_spider(...)` to run under rsloop (requires `pip install silkworm-rs[rsloop]`).
- Use `run_spider_uvloop(...)` instead of `run_spider(...)` to run under uvloop (requires `pip install silkworm-rs[uvloop]`).
- Use `run_spider_winloop(...)` instead of `run_spider(...)` to run under winloop on Windows (requires `pip install silkworm-rs[winloop]`).
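The `max_pending_requests` bound can be pictured with a plain `asyncio.Queue`: once the queue is full, producers suspend on `put()` until workers drain it, so link discovery can never outrun fetching. This is a conceptual sketch of bounded backpressure, not Silkworm's actual scheduler:

```python
import asyncio

CONCURRENCY = 16
MAX_PENDING = CONCURRENCY * 10  # mirrors Silkworm's default queue bound


async def main() -> int:
    # Bounded queue: put() suspends when MAX_PENDING items are pending,
    # which keeps memory use flat even for very "wide" crawls.
    queue: asyncio.Queue[str] = asyncio.Queue(maxsize=MAX_PENDING)
    done = 0

    async def worker() -> None:
        nonlocal done
        while True:
            url = await queue.get()
            await asyncio.sleep(0)  # stand-in for the actual HTTP fetch
            done += 1
            queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(CONCURRENCY)]
    for i in range(500):
        # Suspends here whenever the queue is full (500 > MAX_PENDING).
        await queue.put(f"https://example.com/page/{i}")
    await queue.join()  # wait until every queued URL was processed
    for w in workers:
        w.cancel()
    return done
```

Running `asyncio.run(main())` processes all 500 queued URLs even though at most 160 are ever pending at once.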
## Built-in middlewares and pipelines
```python
from silkworm.middlewares import (
    CloudflareCrawlMiddleware,
    DelayMiddleware,
    ProxyMiddleware,
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import (
    CallbackPipeline,       # invoke a custom callback function on each item
    CSVPipeline,
    JsonLinesPipeline,
    MsgPackPipeline,        # requires: pip install silkworm-rs[msgpack]
    SQLitePipeline,
    XMLPipeline,
    TaskiqPipeline,         # requires: pip install silkworm-rs[taskiq]
    PolarsPipeline,         # requires: pip install silkworm-rs[polars]
    ExcelPipeline,          # requires: pip install silkworm-rs[excel]
    YAMLPipeline,           # requires: pip install silkworm-rs[yaml]
    AvroPipeline,           # requires: pip install silkworm-rs[avro]
    ElasticsearchPipeline,  # requires: pip install silkworm-rs[elasticsearch]
    MongoDBPipeline,        # requires: pip install silkworm-rs[mongodb]
    MySQLPipeline,          # requires: pip install silkworm-rs[mysql]
    PostgreSQLPipeline,     # requires: pip install silkworm-rs[postgresql]
    S3JsonLinesPipeline,    # requires: pip install silkworm-rs[s3]
    VortexPipeline,         # requires: pip install silkworm-rs[vortex]
    WebhookPipeline,        # sends items to webhook endpoints using wreq
    GoogleSheetsPipeline,   # requires: pip install silkworm-rs[gsheets]
    SnowflakePipeline,      # requires: pip install silkworm-rs[snowflake]
    FTPPipeline,            # requires: pip install silkworm-rs[ftp]
    SFTPPipeline,           # requires: pip install silkworm-rs[sftp]
    CassandraPipeline,      # requires: pip install silkworm-rs[cassandra]
    CouchDBPipeline,        # requires: pip install silkworm-rs[couchdb]
    DynamoDBPipeline,       # requires: pip install silkworm-rs[dynamodb]
    DuckDBPipeline,         # requires: pip install silkworm-rs[duckdb]
)
```
```python
run_spider(
    QuotesSpider,
    request_middlewares=[
        UserAgentMiddleware(),  # rotate/custom user agent
        DelayMiddleware(min_delay=0.3, max_delay=1.2),  # polite throttling
        # ProxyMiddleware with round-robin selection (default)
        # ProxyMiddleware(proxies=["http://user:pass@proxy1:8080", "http://proxy2:8080"]),
        # ProxyMiddleware with random selection
        # ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True),
        # ProxyMiddleware from file with random selection
        # ProxyMiddleware(proxy_file="proxies.txt", random_selection=True),
    ],
    response_middlewares=[
        RetryMiddleware(max_times=3, sleep_http_codes=[403, 429]),  # backoff + retry
        SkipNonHTMLMiddleware(),  # drop callbacks for images/APIs/etc.
    ],
    item_pipelines=[
        JsonLinesPipeline("data/quotes.jl"),
        SQLitePipeline("data/quotes.db", table="quotes"),
        XMLPipeline("data/quotes.xml", root_element="quotes", item_element="quote"),
        CSVPipeline("data/quotes.csv", fieldnames=["author", "text", "tags"]),
        MsgPackPipeline("data/quotes.msgpack"),
    ],
)
```
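The "backoff + retry" behavior is an instance of exponential backoff: the wait doubles with each failed attempt, usually with a cap and random jitter to avoid thundering-herd retries. A minimal sketch of the general technique, with full jitter; the base delay, cap, and jitter strategy here are illustrative assumptions, not `RetryMiddleware`'s actual parameters:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    The ceiling grows as base * 2**attempt (capped), and the actual sleep is
    a uniform random value in [0, ceiling] so concurrent retries spread out.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)

# Per-attempt ceilings: 0.5, 1.0, 2.0, 4.0, ... capped at 30.0 seconds.
```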
`DelayMiddleware` strategies: `delay=1.0` (fixed), `min_delay`/`max_delay` (random), or `delay_func` (custom).

`ProxyMiddleware` supports three modes:

- Round-robin (default): `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"])` cycles through proxies in order.
- Random selection: `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True)` randomly selects a proxy for each request.
- From file: `ProxyMiddleware(proxy_file="proxies.txt")` loads proxies from a file (one proxy per line, blank lines ignored). Combine with `random_selection=True` for random selection from the file.
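The two selection modes map onto standard-library primitives; this sketch shows how each mode picks a proxy per request (an illustration of the concept, not `ProxyMiddleware`'s actual code):

```python
import itertools
import random

proxies = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]

# Round-robin: deterministic cycle through the list, wrapping at the end.
rr = itertools.cycle(proxies)
picked = [next(rr) for _ in range(4)]
# -> proxy1, proxy2, proxy3, proxy1

# Random selection: an independent uniform choice for each request.
rand_pick = random.choice(proxies)
```

Round-robin spreads load evenly across proxies; random selection makes per-request proxy assignment harder to predict.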
- `RetryMiddleware` backs off with `asyncio.sleep`; any status in `sleep_http_codes` is retried even if not in `retry_http_codes`.
- `SkipNonHTMLMiddleware` checks `Content-Type` and optionally sniffs the body (`sniff_bytes`) to avoid running HTML callbacks on binary/API responses.
- `CloudflareCrawlMiddleware` is opt-in per request via `request.meta["cloudflare_crawl"]`; it submits a Cloudflare Browser Rendering crawl job, polls until completion, and hands your callback a synthetic `JSONResponse` with the final API payload.
- `JsonLinesPipeline` writes items to a local JSON Lines file and, when opendal is installed, appends asynchronously via the filesystem backend (`use_opendal=False` to stick to a regular file handle).
- `CSVPipeline` flattens nested dicts (e.g., `{"user": {"name": "Alice"}}` -> `user_name`) and joins lists with commas; `XMLPipeline` preserves nesting.
- `MsgPackPipeline` writes items in binary MessagePack format using ormsgpack for fast and compact serialization (requires `pip install silkworm-rs[msgpack]`).
- `TaskiqPipeline` sends items to a Taskiq queue.
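The flattening that `CSVPipeline` is described as doing (underscore-joined keys for nested dicts, comma-joined lists) can be sketched as a small recursive helper; this is a model of the documented behavior, not the library's implementation:

```python
def flatten(item: dict, parent: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into underscore-joined column names and join
    list values with commas, so every value fits in a flat CSV row."""
    flat: dict = {}
    for key, value in item.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))     # recurse into sub-dicts
        elif isinstance(value, list):
            flat[name] = ",".join(str(v) for v in value)  # lists -> "a,b,c"
        else:
            flat[name] = value
    return flat
```

For example, `flatten({"user": {"name": "Alice"}, "tags": ["a", "b"]})` yields `{"user_name": "Alice", "tags": "a,b"}`, which a plain `csv.DictWriter` can serialize directly.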
