Croquemort
A microservice to check for dead links efficiently and asynchronously. In use at https://www.data.gouv.fr/
Vision
The aim of this project is to provide a way to check HTTP resources: hunting 404s, updating redirections and so on.
For instance, given a website that stores a list of external resources (HTML pages, images or documents), this product allows the owner to submit those URLs in bulk and retrieve information for each URL fetched in the background (status code and useful headers for now). This way they can be informed of dead links or outdated resources and act accordingly.
The name comes from the French term for funeral director.
Language
The development language is English. All comments and documentation should be written in English, so that we don't end up with “franglais” methods, and so we can share our learnings with developers around the world.
History
We started this project in May 2015 for data.gouv.fr.
It has been open source from the beginning because we want to design things in the open and involve citizens and hackers in our developments.
Installation
We’re using these technologies: RabbitMQ and Redis. You have to install and launch both dependencies before installing and running the Python packages.
Once installed, run these commands to setup the project:
$ python3 -m venv ~/.virtualenvs/croquemort
$ source ~/.virtualenvs/croquemort/bin/activate
$ pip3 install -r requirements/develop.pip
You're good to go!
Usage
First you have to run the http service in order to receive incoming HTTP calls. You can run it with this command:
$ nameko run croquemort.http
starting services: http_server
Connected to amqp://guest:**@127.0.0.1:5672//
Then, in a new shell, launch the crawler, which fetches submitted URLs in the background:
$ nameko run croquemort.crawler
starting services: url_crawler
Connected to amqp://guest:**@127.0.0.1:5672//
You can optionally use the provided configuration file (and tweak it) to get some logs (INFO level by default):
$ nameko run --config config.yaml croquemort.crawler
You can increase the number of crawler workers in the config file, from the default of 10 up to 50:
max_workers: 50
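For reference, here is a minimal config.yaml sketch. AMQP_URI and LOGGING are standard nameko configuration keys, while max_workers is the Croquemort-specific setting described above; the values shown are illustrative, so adjust them to your broker and logging needs:

```yaml
# Illustrative config.yaml; AMQP_URI and LOGGING are standard nameko keys,
# max_workers is Croquemort's own setting.
AMQP_URI: amqp://guest:guest@localhost:5672//
max_workers: 50
LOGGING:
  version: 1
  handlers:
    console:
      class: logging.StreamHandler
  root:
    level: INFO
    handlers: [console]
```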
Browsing your data
At any time, you can open http://localhost:8000/ and check the availability of your URL collections within a nice dashboard that lets you filter by status, content type, URL scheme, last update and/or domain. There is even a CSV export of the data you are currently viewing if you want to script something.
Fetching one URL
Now you can use your favorite HTTP client (mine is httpie) to issue a POST request against localhost:8000/check/one with the URL as a parameter:
$ http :8000/check/one url="https://www.data.gouv.fr/fr/"
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 28
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:21:50 GMT
{
"url-hash": "u:fc6040c5"
}
The service returns a URL hash that will be used to retrieve information related to that URL:
$ http :8000/url/u:fc6040c5
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 335
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:22:57 GMT
{
"etag": "",
"checked-url": "https://www.data.gouv.fr/fr/",
"final-url": "https://www.data.gouv.fr/fr/",
"content-length": "",
"content-disposition": "",
"content-md5": "",
"content-location": "",
"expires": "",
"final-status-code": "200",
"updated": "2015-06-03T16:21:52.569974",
"last-modified": "",
"content-encoding": "gzip",
"content-type": "text/html",
"charset": "utf-8"
}
Or you can pass the URL as a GET parameter (less error prone):
$ http GET :8000/url url=https://www.data.gouv.fr/fr/
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 335
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:23:35 GMT
{
"etag": "",
"checked-url": "https://www.data.gouv.fr/fr/",
"final-url": "https://www.data.gouv.fr/fr/",
"content-length": "",
"content-disposition": "",
"content-md5": "",
"content-location": "",
"expires": "",
"final-status-code": "200",
"updated": "2015-06-03T16:21:52.569974",
"last-modified": "",
"content-encoding": "gzip",
"content-type": "text/html",
"charset": "utf-8"
}
Both calls return the same information.
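From a script, the two calls above can be chained with the standard library. This is a minimal sketch under the assumption that the service listens on localhost:8000; the is_dead helper simply reads the final-status-code field shown in the responses above.

```python
import json
import urllib.request


def is_dead(info):
    """True when the recorded final status code is a 4xx or 5xx."""
    code = info.get("final-status-code") or "0"
    return int(code) >= 400


def call(base, path, payload=None):
    """POST payload as JSON when given, otherwise GET; decode the JSON reply."""
    data = json.dumps(payload).encode() if payload is not None else None
    request = urllib.request.Request(
        base + path, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode())


if __name__ == "__main__":
    base = "http://localhost:8000"
    url_hash = call(base, "/check/one",
                    {"url": "https://www.data.gouv.fr/fr/"})["url-hash"]
    # The crawler fetches in the background, so the record can take a
    # moment to be fully populated.
    info = call(base, "/url/" + url_hash)
    print(url_hash, "dead" if is_dead(info) else "alive")
```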
Fetching many URLs
You can also use your HTTP client to issue a POST request against localhost:8000/check/many with the URLs and a group name as parameters:
$ http :8000/check/many urls:='["https://www.data.gouv.fr/fr/","https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png"]' group="datagouvfr"
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 30
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:24:00 GMT
{
"group-hash": "g:efcf3897"
}
This time, the service returns a group hash that will be used to retrieve information related to that group:
$ http :8000/group/g:efcf3897
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 941
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:26:04 GMT
{
"u:179d104f": {
"content-encoding": "",
"content-disposition": "",
"group": "g:efcf3897",
"last-modified": "Tue, 31 Mar 2015 14:38:37 GMT",
"content-md5": "",
"checked-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
"final-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
"final-status-code": "200",
"expires": "",
"content-type": "image/png",
"content-length": "280919",
"updated": "2015-06-03T16:24:00.405636",
"etag": "\"551ab16d-44957\"",
"content-location": ""
},
"name": "datagouvfr",
"u:fc6040c5": {
"content-disposition": "",
"content-encoding": "gzip",
"group": "g:efcf3897",
"last-modified": "",
"content-md5": "",
"content-location": "",
"content-length": "",
"expires": "",
"content-type": "text/html",
"charset": "utf-8",
"final-status-code": "200",
"updated": "2015-06-03T16:24:02.398105",
"etag": "",
"checked-url": "https://www.data.gouv.fr/fr/"
"final-url": "https://www.data.gouv.fr/fr/"
}
}
Or you can pass the group name as a GET parameter (less error prone):
$ http GET :8000/group/ group=datagouvfr
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 335
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:23:35 GMT
{
"etag": "",
"checked-url": "https://www.data.gouv.fr/fr/",
"final-url": "https://www.data.gouv.fr/fr/",
"content-length": "",
"content-disposition": "",
"content-md5": "",
"content-location": "",
"expires": "",
"final-status-code": "200",
"updated": "2015-06-03T16:21:52.569974",
"last-modified": "",
"content-encoding": "gzip",
"content-type": "text/html",
"charset": "utf-8"
}
Both calls return the same information.
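Note that a group response mixes the "name" key with one entry per URL hash, so a client has to skip non-dict values when walking it. Here is a sketch of extracting dead links from a group payload shaped like the one above; the u:deadbeef entry and its URL are made up to show a 404.

```python
def dead_links(group_payload):
    """Collect checked URLs whose final status code is a 4xx or 5xx,
    skipping the "name" key (and any other non-dict value) that sits
    alongside the per-URL entries in a group response."""
    dead = []
    for info in group_payload.values():
        if not isinstance(info, dict):
            continue  # e.g. the "name" entry
        if int(info.get("final-status-code") or "0") >= 400:
            dead.append(info["checked-url"])
    return dead


# Trimmed-down group payload; u:deadbeef is a hypothetical dead link.
sample = {
    "name": "datagouvfr",
    "u:fc6040c5": {"checked-url": "https://www.data.gouv.fr/fr/",
                   "final-status-code": "200"},
    "u:deadbeef": {"checked-url": "https://example.com/gone",
                   "final-status-code": "404"},
}
print(dead_links(sample))  # ['https://example.com/gone']
```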
Redirect handling
When fetching one or many URLs, Croquemort has basic support for HTTP redirections. First, Croquemort follows any redirections to the final destination (the allow_redirects option of the requests library). Furthermore, Croquemort stores some information about the redirection: the first redirect status code and the final URL. When a redirection is encountered, the JSON response looks like this (note redirect-url and redirect-status-code):
{
"checked-url": "https://goo.gl/ovZB",
"final-url": "http://news.ycombinator.com",
"final-status-code": "200",
"redirect-url": "https://goo.gl/ovZB",
"redirect-status-code": "301",
"etag": "",
"content-length": "",
"content-disposition": "",
"content-md5": "",
"content-location": "",
"expires": "",
"updated": "2015-06-03T16:21:52.569974",
"last-modified": "",
"content-encoding": "gzip",
"content-type": "text/html",
"charset": "utf-8"
}
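One practical use of these two extra keys is updating a stored URL after a permanent redirect. The sketch below reuses the sample payload above; treating only a 301 as permanent is an assumption of this example, not a rule of the service.

```python
def permanent_target(info):
    """Return the final URL when the check went through a permanent (301)
    redirect, so the caller can update its stored URL; None otherwise."""
    if info.get("redirect-status-code") == "301":
        return info["final-url"]
    return None


redirected = {
    "checked-url": "https://goo.gl/ovZB",
    "final-url": "http://news.ycombinator.com",
    "final-status-code": "200",
    "redirect-url": "https://goo.gl/ovZB",
    "redirect-status-code": "301",
}
print(permanent_target(redirected))  # http://news.ycombinator.com
```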
Filtering results
You can filter results returned for a given group by header (or status) with the filter_ prefix:
$ http GET :8000/group/g:efcf3897 filter_content-type="image/png"
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 539
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:27:07 GMT
{
"u:179d104f": {
"content-encoding": "",
"content-disposition": "",
"group": "g:efcf3897",
"last-modified": "Tue, 31 Mar 2015 14:38:37 GMT",
"content-md5": "",
"checked-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
"final-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
"final-status-code": "200",
"expires": "",
"content-type": "image/png",
"content-length": "280919",
"updated": "2015-06-03T16:24:00.405636",
"etag": "\"551ab16d-44957\"",
"content-location": ""
},
"name": "datagouvfr"
}
You can exclude results returned for a given group by header (or status) with the exclude_ prefix:
$ http GET :8000/group/g:efcf3897 exclude_content-length=""
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 539
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:27:58 GMT
{
"u:179d104f": {
"content-encoding": "",
"content-disposition": "",
"group": "g:efcf3897",
"last-modified": "Tue, 31 Mar 2015 14:38:37 GMT",
"content-md5": "",
"checked-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
"final-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
"final-status-code": "200",
"expires": "",
"content-type": "image/png",
"content-length": "280919",
"updated": "2015-06-03T16:24:00.405636",
"etag": "\"551ab16d-44957\"",
"content-location": ""
},
"name": "datagouvfr"
}
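The same filtering can also be reproduced client-side on a full group payload. This sketch implements the filter_/exclude_ semantics as the examples above suggest them (exact match on a header or status value); the sample payload is trimmed down from the earlier responses.

```python
def apply_filters(group_payload, filters=None, excludes=None):
    """Client-side rendition of the filter_/exclude_ semantics: keep the
    URL entries matching every pair in `filters` and drop those matching
    any pair in `excludes`; non-dict values such as "name" are kept."""
    filters, excludes = filters or {}, excludes or {}
    kept = {}
    for key, info in group_payload.items():
        if not isinstance(info, dict):
            kept[key] = info
        elif (all(info.get(h) == v for h, v in filters.items())
              and not any(info.get(h) == v for h, v in excludes.items())):
            kept[key] = info
    return kept


# Trimmed-down group payload from the examples above.
sample = {
    "name": "datagouvfr",
    "u:179d104f": {"content-type": "image/png", "content-length": "280919"},
    "u:fc6040c5": {"content-type": "text/html", "content-length": ""},
}
print(sorted(apply_filters(sample, filters={"content-type": "image/png"})))
# ['name', 'u:179d104f']
```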
