Brcrawl
Crawl and index existing indieweb/smallweb-adjacent blogs by Brazilian authors.
I swear I'll do a better write-up sometime soon.
Adding new URLs from a list
Duplicate feeds are ignored. Domains that block scraping via robots.txt are skipped.
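Scrapy can enforce the robots.txt rule out of the box; a minimal sketch, assuming the project relies on the standard setting:
# scraper/settings.py (sketch -- ROBOTSTXT_OBEY is a standard Scrapy setting,
# but whether this project enables it is an assumption)
ROBOTSTXT_OBEY = True  # fetch each domain's robots.txt and skip disallowed pages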
If it's a list of URLs
Any page from the site will work; it doesn't need to be the home page.
Use the ExternalUrlsSpider crawler. The output is a .jsonl file with the
rss_url and domain of each website listed.
# in the scraper/ directory
uv run scrapy crawl rss -a urls_file=urls.txt -o rss.jsonl
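The spider's internals aren't documented here, but a feed-autodiscovery spider matching the command above could look roughly like this (the selectors, file handling, and output values are assumptions, not the actual implementation):
# scraper/spiders/external_urls.py -- illustrative sketch, not the real spider
from urllib.parse import urlparse

import scrapy


class ExternalUrlsSpider(scrapy.Spider):
    name = "rss"  # matches `scrapy crawl rss` above

    def __init__(self, urls_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a urls_file=urls.txt arrives here as a keyword argument
        with open(urls_file) as f:
            self.start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # feed autodiscovery: RSS/Atom <link> tags in the page head
        for href in response.css(
            'link[rel="alternate"][type*="rss"]::attr(href), '
            'link[rel="alternate"][type*="atom"]::attr(href)'
        ).getall():
            # yields e.g. {"rss_url": "https://example.com/feed.xml",
            #              "domain": "example.com"} (hypothetical values)
            yield {
                "rss_url": response.urljoin(href),
                "domain": urlparse(response.url).netloc,
            }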
Import the resulting .jsonl file into the backend's database using
the flask import-feeds command.
# in the backend/ directory
uv run flask import-feeds rss.jsonl
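A hypothetical sketch of what such a command can look like, assuming a UNIQUE constraint on feed_url (which is what would make duplicate feeds a no-op) and the scraper's rss_url key mapping onto the feeds.feed_url column:
# backend CLI sketch -- the actual import-feeds implementation may differ
import json
import sqlite3

import click
from flask import Flask

app = Flask(__name__)


@app.cli.command("import-feeds")
@click.argument("jsonl_path")
def import_feeds(jsonl_path):
    """Load rss_url entries from a .jsonl file into the feeds table."""
    conn = sqlite3.connect("brcrawl.sqlite3")
    with open(jsonl_path) as f:
        for line in f:
            row = json.loads(line)
            # INSERT OR IGNORE assumes feed_url is UNIQUE, so duplicates are skipped
            conn.execute(
                "INSERT OR IGNORE INTO feeds (feed_url) VALUES (?)",
                (row["rss_url"],),
            )
    conn.commit()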
If it's a list of valid RSS feeds
Format it to .jsonl before importing with flask import-feeds:
# in the backend/ directory
jq -R -n -c '[inputs] | map({rss_url: .}) | .[]' rss_urls.txt > rss.jsonl
uv run flask import-feeds rss.jsonl
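If jq isn't available, the same conversion in Python (same filename assumptions as above):
# convert one URL per line into one {"rss_url": ...} JSON object per line
import json

with open("rss_urls.txt") as src, open("rss.jsonl", "w") as dst:
    for line in src:
        url = line.strip()
        if url:  # skip blank lines, which the jq one-liner would keep
            dst.write(json.dumps({"rss_url": url}) + "\n")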
Generating the website
Always use the full list of imported feed_urls, ordered randomly to reduce the chance of hammering any single small provider.
# in the backend/ directory
sqlite3 brcrawl.sqlite3
.output ../website/feeds.txt
SELECT feed_url FROM feeds WHERE status_id = 1 ORDER BY RANDOM();
.output stdout
.quit
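The same export, non-interactively (a sketch assuming the paths used above):
# write one feed_url per line to website/feeds.txt, in random order
import sqlite3

conn = sqlite3.connect("brcrawl.sqlite3")
rows = conn.execute(
    "SELECT feed_url FROM feeds WHERE status_id = 1 ORDER BY RANDOM()"
)
with open("../website/feeds.txt", "w") as out:
    for (feed_url,) in rows:
        out.write(feed_url + "\n")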
Now run the build.sh script from the website/ directory, passing the generated
feeds.txt file.
# in the website/ directory
./build.sh feeds.txt
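build.sh's internals aren't covered here; conceptually, a feed-driven build step does something like the following (illustrative Python, assuming the feedparser library and a single-page output, neither of which the real script necessarily matches):
# fetch every feed in feeds.txt and render recent entries into one page
import html

import feedparser

entries = []
with open("feeds.txt") as f:
    for url in (line.strip() for line in f):
        if not url:
            continue
        feed = feedparser.parse(url)
        for entry in feed.entries[:5]:  # a few recent posts per blog
            entries.append((entry.get("title", ""), entry.get("link", "")))

with open("index.html", "w") as out:
    out.write("<ul>\n")
    for title, link in entries:
        out.write(
            f'<li><a href="{html.escape(link, quote=True)}">'
            f"{html.escape(title)}</a></li>\n"
        )
    out.write("</ul>\n")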
The resulting .html files can be deployed anywhere static files are served (e.g. GitHub Pages or a VPS running nginx).
Recently approved feeds
Filter by the verified status (status_id = 1) and add a date cutoff on created_at, e.g.:
SELECT feed_url FROM feeds WHERE status_id = 1 AND created_at > '2026-02-12 10:00:00';
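Programmatically, with the cutoff as a bound parameter (a sketch; the random ordering from the previous section presumably still applies):
# export only feeds approved after a given cutoff
import sqlite3

cutoff = "2026-02-12 10:00:00"  # example value from the query above
conn = sqlite3.connect("brcrawl.sqlite3")
rows = conn.execute(
    "SELECT feed_url FROM feeds"
    " WHERE status_id = 1 AND created_at > ?"
    " ORDER BY RANDOM()",
    (cutoff,),
)
for (feed_url,) in rows:
    print(feed_url)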
