# Silkworm

Async web scraping framework built on Rust. Works with free-threaded Python (`PYTHON_GIL=0`).
## silkworm-rs

Async-first web scraping framework built on wreq (HTTP with browser impersonation) and scraper-rs (fast HTML parsing). Silkworm gives you a minimal Spider/Request/Response model, middlewares, and pipelines so you can script quick scrapes or build larger crawlers without boilerplate.

NEW: Use silkworm-mcp to build scrapers.
## Features

- Async engine with configurable concurrency, bounded queue backpressure (defaults to `concurrency * 10`), and per-request timeouts.
- wreq-powered HTTP client: browser impersonation, redirect following with loop detection, query merging, and proxy support via `request.meta["proxy"]`.
- Typed spiders and callbacks that can return items or `Request` objects; `HTMLResponse` ships helper methods plus `Response.follow` to reuse callbacks.
- Middlewares: User-Agent rotation/default, proxy rotation, retry with exponential backoff + optional sleep codes, flexible delays (fixed/random/custom), `SkipNonHTMLMiddleware` to drop non-HTML callbacks, and `CloudflareCrawlMiddleware` for Browser Rendering crawl jobs.
- Pipelines: JSON Lines, SQLite, XML (nested data preserved), and CSV (flattens dicts and lists) out of the box.
- Structured logging via logly (`SILKWORM_LOG_LEVEL=DEBUG`), plus periodic/final crawl statistics (requests/sec, queue size, memory, seen URLs).
## Installation

From PyPI with pip:

```bash
pip install silkworm-rs
```

From PyPI with uv (recommended for faster installs):

```bash
uv pip install silkworm-rs
# or if using uv's project management:
uv add silkworm-rs
```

From source:

```bash
uv venv                    # install uv from https://docs.astral.sh/uv/getting-started/ if needed
source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install -e .
```

Targets Python 3.13+; dependencies are pinned in pyproject.toml.
## Quick start

Define a spider by subclassing `Spider`, implementing `parse`, and yielding items or follow-up `Request` objects. This example writes quotes to `data/quotes.jl` and enables basic user-agent, retry, and non-HTML filtering middlewares.
```python
from silkworm import HTMLResponse, Response, Spider, run_spider
from silkworm.middlewares import (
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import JsonLinesPipeline


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ("https://quotes.toscrape.com/",)

    async def parse(self, response: Response):
        if not isinstance(response, HTMLResponse):
            return
        html = response
        for quote in await html.select(".quote"):
            text_el = await quote.select_first(".text")
            author_el = await quote.select_first(".author")
            if text_el is None or author_el is None:
                continue
            tags = await quote.select(".tag")
            yield {
                "text": text_el.text,
                "author": author_el.text,
                "tags": [t.text for t in tags],
            }
        if next_link := await html.select_first("li.next > a"):
            yield html.follow(next_link.attr("href"), callback=self.parse)


if __name__ == "__main__":
    run_spider(
        QuotesSpider,
        request_middlewares=[UserAgentMiddleware()],
        response_middlewares=[
            SkipNonHTMLMiddleware(),
            RetryMiddleware(max_times=3, sleep_http_codes=[429, 503]),
        ],
        item_pipelines=[JsonLinesPipeline("data/quotes.jl")],
        concurrency=16,
        request_timeout=10,
        log_stats_interval=30,
    )
```
`run_spider`/`crawl` knobs:

- `concurrency`: number of concurrent HTTP requests; default 16.
- `max_pending_requests`: queue bound to avoid unbounded memory use (defaults to `concurrency * 10`).
- `request_timeout`: per-request timeout (seconds).
- `keep_alive`: reuse HTTP connections when supported by the underlying client (sends `Connection: keep-alive`).
- `html_max_size_bytes`: limit HTML parsed into `AsyncDocument` to avoid huge payloads.
- `log_stats_interval`: seconds between periodic stats logs; final stats are always emitted.
- `request_middlewares`/`response_middlewares`/`item_pipelines`: plug-ins run on every request/response/item.
- Use `run_spider_rsloop(...)` instead of `run_spider(...)` to run under rsloop (requires `pip install silkworm-rs[rsloop]`).
- Use `run_spider_uvloop(...)` instead of `run_spider(...)` to run under uvloop (requires `pip install silkworm-rs[uvloop]`).
- Use `run_spider_winloop(...)` instead of `run_spider(...)` to run under winloop on Windows (requires `pip install silkworm-rs[winloop]`).
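The `max_pending_requests` bound can be pictured with a plain `asyncio.Queue`: once the queue is full, producers suspend on `put()` until workers drain it, so link discovery can never outrun fetching. This is a conceptual sketch of bounded backpressure, not Silkworm's actual scheduler:

```python
import asyncio

CONCURRENCY = 16
MAX_PENDING = CONCURRENCY * 10  # mirrors Silkworm's default queue bound


async def main() -> int:
    # Bounded queue: put() suspends when MAX_PENDING items are pending,
    # which keeps memory use flat even for very "wide" crawls.
    queue: asyncio.Queue[str] = asyncio.Queue(maxsize=MAX_PENDING)
    done = 0

    async def worker() -> None:
        nonlocal done
        while True:
            url = await queue.get()
            await asyncio.sleep(0)  # stand-in for the actual HTTP fetch
            done += 1
            queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(CONCURRENCY)]
    for i in range(500):
        # Suspends here whenever the queue is full (500 > MAX_PENDING).
        await queue.put(f"https://example.com/page/{i}")
    await queue.join()  # wait until every queued URL was processed
    for w in workers:
        w.cancel()
    return done
```

Running `asyncio.run(main())` processes all 500 queued URLs even though at most 160 are ever pending at once.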
## Built-in middlewares and pipelines
```python
from silkworm.middlewares import (
    CloudflareCrawlMiddleware,
    DelayMiddleware,
    ProxyMiddleware,
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import (
    CallbackPipeline,       # invoke a custom callback function on each item
    CSVPipeline,
    JsonLinesPipeline,
    MsgPackPipeline,        # requires: pip install silkworm-rs[msgpack]
    SQLitePipeline,
    XMLPipeline,
    TaskiqPipeline,         # requires: pip install silkworm-rs[taskiq]
    PolarsPipeline,         # requires: pip install silkworm-rs[polars]
    ExcelPipeline,          # requires: pip install silkworm-rs[excel]
    YAMLPipeline,           # requires: pip install silkworm-rs[yaml]
    AvroPipeline,           # requires: pip install silkworm-rs[avro]
    ElasticsearchPipeline,  # requires: pip install silkworm-rs[elasticsearch]
    MongoDBPipeline,        # requires: pip install silkworm-rs[mongodb]
    MySQLPipeline,          # requires: pip install silkworm-rs[mysql]
    PostgreSQLPipeline,     # requires: pip install silkworm-rs[postgresql]
    S3JsonLinesPipeline,    # requires: pip install silkworm-rs[s3]
    VortexPipeline,         # requires: pip install silkworm-rs[vortex]
    WebhookPipeline,        # sends items to webhook endpoints using wreq
    GoogleSheetsPipeline,   # requires: pip install silkworm-rs[gsheets]
    SnowflakePipeline,      # requires: pip install silkworm-rs[snowflake]
    FTPPipeline,            # requires: pip install silkworm-rs[ftp]
    SFTPPipeline,           # requires: pip install silkworm-rs[sftp]
    CassandraPipeline,      # requires: pip install silkworm-rs[cassandra]
    CouchDBPipeline,        # requires: pip install silkworm-rs[couchdb]
    DynamoDBPipeline,       # requires: pip install silkworm-rs[dynamodb]
    DuckDBPipeline,         # requires: pip install silkworm-rs[duckdb]
)
```
```python
run_spider(
    QuotesSpider,
    request_middlewares=[
        UserAgentMiddleware(),  # rotate/custom user agent
        DelayMiddleware(min_delay=0.3, max_delay=1.2),  # polite throttling
        # ProxyMiddleware with round-robin selection (default)
        # ProxyMiddleware(proxies=["http://user:pass@proxy1:8080", "http://proxy2:8080"]),
        # ProxyMiddleware with random selection
        # ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True),
        # ProxyMiddleware from file with random selection
        # ProxyMiddleware(proxy_file="proxies.txt", random_selection=True),
    ],
    response_middlewares=[
        RetryMiddleware(max_times=3, sleep_http_codes=[403, 429]),  # backoff + retry
        SkipNonHTMLMiddleware(),  # drop callbacks for images/APIs/etc.
    ],
    item_pipelines=[
        JsonLinesPipeline("data/quotes.jl"),
        SQLitePipeline("data/quotes.db", table="quotes"),
        XMLPipeline("data/quotes.xml", root_element="quotes", item_element="quote"),
        CSVPipeline("data/quotes.csv", fieldnames=["author", "text", "tags"]),
        MsgPackPipeline("data/quotes.msgpack"),
    ],
)
```
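The "backoff + retry" behavior is an instance of exponential backoff: the wait doubles with each failed attempt, usually with a cap and random jitter to avoid thundering-herd retries. A minimal sketch of the general technique, with full jitter; the base delay, cap, and jitter strategy here are illustrative assumptions, not `RetryMiddleware`'s actual parameters:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    The ceiling grows as base * 2**attempt (capped), and the actual sleep is
    a uniform random value in [0, ceiling] so concurrent retries spread out.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)

# Per-attempt ceilings: 0.5, 1.0, 2.0, 4.0, ... capped at 30.0 seconds.
```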
`DelayMiddleware` strategies: `delay=1.0` (fixed), `min_delay`/`max_delay` (random), or `delay_func` (custom).

`ProxyMiddleware` supports three modes:

- Round-robin (default): `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"])` cycles through proxies in order.
- Random selection: `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True)` randomly selects a proxy for each request.
- From file: `ProxyMiddleware(proxy_file="proxies.txt")` loads proxies from a file (one proxy per line, blank lines ignored). Combine with `random_selection=True` for random selection from the file.
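The two selection modes map onto standard-library primitives; this sketch shows how each mode picks a proxy per request (an illustration of the concept, not `ProxyMiddleware`'s actual code):

```python
import itertools
import random

proxies = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]

# Round-robin: deterministic cycle through the list, wrapping at the end.
rr = itertools.cycle(proxies)
picked = [next(rr) for _ in range(4)]
# -> proxy1, proxy2, proxy3, proxy1

# Random selection: an independent uniform choice for each request.
rand_pick = random.choice(proxies)
```

Round-robin spreads load evenly across proxies; random selection makes per-request proxy assignment harder to predict.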
- `RetryMiddleware` backs off with `asyncio.sleep`; any status in `sleep_http_codes` is retried even if not in `retry_http_codes`.
- `SkipNonHTMLMiddleware` checks `Content-Type` and optionally sniffs the body (`sniff_bytes`) to avoid running HTML callbacks on binary/API responses.
- `CloudflareCrawlMiddleware` is opt-in per request via `request.meta["cloudflare_crawl"]`; it submits a Cloudflare Browser Rendering crawl job, polls until completion, and hands your callback a synthetic `JSONResponse` with the final API payload.
- `JsonLinesPipeline` writes items to a local JSON Lines file and, when opendal is installed, appends asynchronously via the filesystem backend (`use_opendal=False` to stick to a regular file handle).
- `CSVPipeline` flattens nested dicts (e.g., `{"user": {"name": "Alice"}}` -> `user_name`) and joins lists with commas; `XMLPipeline` preserves nesting.
- `MsgPackPipeline` writes items in binary MessagePack format using ormsgpack for fast and compact serialization (requires `pip install silkworm-rs[msgpack]`).
- `TaskiqPipeline` sends items to a Taskiq queue.
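The flattening that `CSVPipeline` is described as doing (underscore-joined keys for nested dicts, comma-joined lists) can be sketched as a small recursive helper; this is a model of the documented behavior, not the library's implementation:

```python
def flatten(item: dict, parent: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into underscore-joined column names and join
    list values with commas, so every value fits in a flat CSV row."""
    flat: dict = {}
    for key, value in item.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))     # recurse into sub-dicts
        elif isinstance(value, list):
            flat[name] = ",".join(str(v) for v in value)  # lists -> "a,b,c"
        else:
            flat[name] = value
    return flat
```

For example, `flatten({"user": {"name": "Alice"}, "tags": ["a", "b"]})` yields `{"user_name": "Alice", "tags": "a,b"}`, which a plain `csv.DictWriter` can serialize directly.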
