Crawler
An open source example of the Count Love crawler.
Count Love Crawler
Installation
To isolate the crawler and its dependencies, we recommend installing them in a Python virtual environment. The crawler has been tested with Python 3.9, but it should be compatible with a range of versions.
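For example, a virtual environment can be created and activated like this (the directory name `venv` is arbitrary):

```shell
# Create a virtual environment in ./venv and activate it.
python3 -m venv venv
source venv/bin/activate
```

Once activated, `pip` and `python` refer to the environment's own copies, so the crawler's dependencies stay out of your system Python.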
Install dependencies
To install dependencies, run:

```shell
pip install -r requirements.txt
```
Set up the SQLite database
The SQLite3 database stores the source list, the crawler queue, and the content extracted from pages. To create the database, run:

```shell
sqlite3 data.db < schema.sql
```
Running the crawl
To start the crawl, run:

```shell
python crawler.py
```
While the crawl is running, details and diagnostic information are logged to
"crawl.log". Because the Sources table is initially empty, running `python crawler.py`
has no effect until a source is added. Here's an example of adding a source by
interacting directly with the database:

```
sqlite3 data.db
INSERT INTO Sources VALUES (NULL, 'https://nytimes.com', 'New York, NY', 1, datetime('now'), NULL);
```
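The same insert can also be issued non-interactively. So that it runs on its own, the sketch below targets a throwaway `demo.db` with a stand-in table definition; the column layout is inferred from the INSERT above, and the authoritative schema lives in `schema.sql`. Against the real database, drop the CREATE TABLE line and use `data.db`:

```shell
# Stand-in table, for illustration only; the real schema is defined in schema.sql.
sqlite3 demo.db "CREATE TABLE IF NOT EXISTS Sources (id INTEGER PRIMARY KEY, url TEXT, location TEXT, enabled INTEGER, created TEXT, last_crawled TEXT);"
# Non-interactive insert, mirroring the interactive session above.
sqlite3 demo.db "INSERT INTO Sources VALUES (NULL, 'https://nytimes.com', 'New York, NY', 1, datetime('now'), NULL);"
```

Passing the SQL as an argument this way is convenient for scripting, e.g. seeding the database from a list of source URLs.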
Rerunning `python crawler.py` will now print a list of potential articles with protest
keywords to the console.
