Croquemort

A micro service to check dead links efficiently and asynchronously. In use at https://www.data.gouv.fr/

Vision

The aim of this project is to provide a way to check HTTP resources: hunting 404s, updating redirections and so on.

For instance, given a website that stores a list of external resources (HTML pages, images or documents), this product allows the owner to send their URLs in bulk and retrieve information for each URL, fetched in the background (status code and useful headers for now). This way the owner can be informed of dead links or outdated resources and act accordingly.

The name comes from the French term for funeral director.

Language

The development language is English. All comments and documentation should be written in English, so that we don't end up with “franglais” methods, and so we can share our learnings with developers around the world.

History

We started this project in May 2015 for data.gouv.fr.

It has been open source from the beginning because we want to design things in the open and involve citizens and hackers in our developments.

Installation

We’re using these technologies: RabbitMQ and Redis. You have to install and launch these dependencies prior to installing and running the Python packages.

Once installed, run these commands to set up the project:

$ python3 -m venv ~/.virtualenvs/croquemort
$ source ~/.virtualenvs/croquemort/bin/activate
$ pip3 install -r requirements/develop.pip

You're good to go!
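
Before starting the services, it can help to confirm that both dependencies are reachable. The following is a minimal sketch, not part of croquemort: it assumes RabbitMQ and Redis listen on their default ports (5672 and 6379) on the local machine, and the helper names `port_is_open` and `check_dependencies` are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, etc.
        return False


def check_dependencies(host="127.0.0.1"):
    """Map each dependency name to whether its port accepts connections."""
    # Assumption: default ports, 5672 for RabbitMQ (AMQP) and 6379 for Redis.
    services = {"RabbitMQ": 5672, "Redis": 6379}
    return {name: port_is_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, reachable in check_dependencies().items():
        print(f"{name}: {'reachable' if reachable else 'NOT reachable'}")
```

An open port only shows that something is listening there; it does not prove the broker or the store is correctly configured.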

Usage

First you have to run the HTTP service in order to receive incoming HTTP calls. You can run it with this command:

$ nameko run croquemort.http
starting services: http_server
Connected to amqp://guest:**@127.0.0.1:5672//

Then, in a new shell, launch the crawler that will fetch the submitted URLs in the background:

$ nameko run croquemort.crawler
starting services: url_crawler
Connected to amqp://guest:**@127.0.0.1:5672//

You can optionally use the proposed configuration (and tweak it) to get some logs (INFO level by default):

$ nameko run --config config.yaml croquemort.crawler

You can enable more workers for the crawler in the config file (for instance, from the default of 10 up to 50):

max_workers: 50

Browsing your data

At any time, you can open http://localhost:8000/ and check the availability of your URL collections within a nice dashboard that allows you to filter by statuses, content types, URL schemes, last updates and/or domains. There is even a CSV export of the data you are currently viewing if you want to script something.

Fetching one URL

Now you can use your favorite HTTP client (mine is httpie) to issue a POST request against localhost:8000/check/one with the URL as a parameter:

$ http :8000/check/one url="https://www.data.gouv.fr/fr/"
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 28
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:21:50 GMT

{
  "url-hash": "u:fc6040c5"
}

The service returns a URL hash that will be used to retrieve information related to that URL:

$ http :8000/url/u:fc6040c5
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 335
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:22:57 GMT

{
  "etag": "",
  "checked-url": "https://www.data.gouv.fr/fr/",
  "final-url": "https://www.data.gouv.fr/fr/",
  "content-length": "",
  "content-disposition": "",
  "content-md5": "",
  "content-location": "",
  "expires": "",
  "final-status-code": "200",
  "updated": "2015-06-03T16:21:52.569974",
  "last-modified": "",
  "content-encoding": "gzip",
  "content-type": "text/html",
  "charset": "utf-8"
}

Or you can use the URL passed as a GET parameter (less error prone):

$ http GET :8000/url url=https://www.data.gouv.fr/fr/
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 335
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:23:35 GMT

{
  "etag": "",
  "checked-url": "https://www.data.gouv.fr/fr/",
  "final-url": "https://www.data.gouv.fr/fr/",
  "content-length": "",
  "content-disposition": "",
  "content-md5": "",
  "content-location": "",
  "expires": "",
  "final-status-code": "200",
  "updated": "2015-06-03T16:21:52.569974",
  "last-modified": "",
  "content-encoding": "gzip",
  "content-type": "text/html",
  "charset": "utf-8"
}

Both return the same amount of information.
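
The submit-then-lookup flow above can also be scripted without httpie. The following is a minimal sketch using only the Python standard library. It assumes the service accepts the same JSON body that httpie sends by default for `key=value` items, and that it runs on localhost:8000 as in the examples. The helper names (`build_check_one_request`, `build_url_info_request`, `send`) are hypothetical and not part of croquemort.

```python
import json
import urllib.request

# Assumption: the http service runs locally on port 8000, as in the examples.
BASE_URL = "http://localhost:8000"


def build_check_one_request(url, base_url=BASE_URL):
    """Build (without sending) the POST request for /check/one."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/check/one",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_url_info_request(url_hash, base_url=BASE_URL):
    """Build (without sending) the GET request for /url/<url-hash>."""
    return urllib.request.Request(f"{base_url}/url/{url_hash}", method="GET")


def send(request, timeout=10):
    """Send a request and decode the JSON document in the response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With both services running, `send(build_check_one_request("https://www.data.gouv.fr/fr/"))["url-hash"]` should give the hash to pass to `build_url_info_request`. Since the crawler works in the background, the details may not be available immediately after submission.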

Fetching many URLs

You can also use your HTTP client to issue a POST request against localhost:8000/check/many with the URLs and the name of the group as parameters:

$ http :8000/check/many urls:='["https://www.data.gouv.fr/fr/","https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png"]' group="datagouvfr"
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 30
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:24:00 GMT

{
  "group-hash": "g:efcf3897"
}

This time, the service returns a group hash that will be used to retrieve information related to that group:

$ http :8000/group/g:efcf3897
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 941
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:26:04 GMT

{
  "u:179d104f": {
    "content-encoding": "",
    "content-disposition": "",
    "group": "g:efcf3897",
    "last-modified": "Tue, 31 Mar 2015 14:38:37 GMT",
    "content-md5": "",
    "checked-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
    "final-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
    "final-status-code": "200",
    "expires": "",
    "content-type": "image/png",
    "content-length": "280919",
    "updated": "2015-06-03T16:24:00.405636",
    "etag": "\"551ab16d-44957\"",
    "content-location": ""
  },
  "name": "datagouvfr",
  "u:fc6040c5": {
    "content-disposition": "",
    "content-encoding": "gzip",
    "group": "g:efcf3897",
    "last-modified": "",
    "content-md5": "",
    "content-location": "",
    "content-length": "",
    "expires": "",
    "content-type": "text/html",
    "charset": "utf-8",
    "final-status-code": "200",
    "updated": "2015-06-03T16:24:02.398105",
    "etag": "",
    "checked-url": "https://www.data.gouv.fr/fr/",
    "final-url": "https://www.data.gouv.fr/fr/"
  }
}

Or you can use the group name passed as a GET parameter (less error prone):

$ http GET :8000/group/ group=datagouvfr
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 335
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:23:35 GMT

{
  "etag": "",
  "checked-url": "https://www.data.gouv.fr/fr/",
  "final-url": "https://www.data.gouv.fr/fr/",
  "content-length": "",
  "content-disposition": "",
  "content-md5": "",
  "content-location": "",
  "expires": "",
  "final-status-code": "200",
  "updated": "2015-06-03T16:21:52.569974",
  "last-modified": "",
  "content-encoding": "gzip",
  "content-type": "text/html",
  "charset": "utf-8"
}

Both return the same amount of information.
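
For scripting, the bulk submission and the group response shown above can be handled with a few lines of Python. This is a sketch under assumptions, not croquemort code: it assumes the service accepts the JSON body that httpie builds from `urls:=...` and `group=...`, and that a group response maps URL hashes to per-URL dictionaries next to a plain `name` entry, as in the example output. The helper names are hypothetical.

```python
import json
import urllib.request


def build_check_many_request(urls, group, base_url="http://localhost:8000"):
    """Build (without sending) the POST request for /check/many."""
    body = json.dumps({"urls": list(urls), "group": group}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/check/many",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_group(payload):
    """Separate the group name from the per-URL entries of a group response."""
    name = payload.get("name")
    entries = {key: value for key, value in payload.items() if isinstance(value, dict)}
    return name, entries


def failing_urls(payload):
    """Return the checked URLs whose final status code is 400 or above."""
    _, entries = split_group(payload)
    failing = []
    for entry in entries.values():
        # Status codes are strings in the responses above ("200").
        code = entry.get("final-status-code", "")
        # Entries without a numeric status are skipped in this sketch.
        if code.isdigit() and int(code) >= 400:
            failing.append(entry["checked-url"])
    return sorted(failing)
```

Treating every status of 400 or above as a dead link is a choice made for this sketch; a real report might want to separate client errors from server errors.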

Redirect handling

Whether fetching one URL or many, croquemort has basic support for HTTP redirections. First, croquemort follows any redirections to the final destination (the allow_redirects option of the requests library). Furthermore, croquemort stores some information about the redirection: the first redirect code and the final URL. When a redirection is encountered, the JSON response looks like this (note redirect-url and redirect-status-code):

{
  "checked-url": "https://goo.gl/ovZB",
  "final-url": "http://news.ycombinator.com",
  "final-status-code": "200",
  "redirect-url": "https://goo.gl/ovZB",
  "redirect-status-code": "301",
  "etag": "",
  "content-length": "",
  "content-disposition": "",
  "content-md5": "",
  "content-location": "",
  "expires": "",
  "updated": "2015-06-03T16:21:52.569974",
  "last-modified": "",
  "content-encoding": "gzip",
  "content-type": "text/html",
  "charset": "utf-8"
}
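
To illustrate where these two extra fields can come from, here is a minimal sketch (not croquemort's actual code). It assumes a response object shaped like `requests.Response`, whose `history` attribute lists the intermediate responses from oldest to most recent. The function name `summarize_redirect` is hypothetical, and a stand-in object replaces a real network call.

```python
from types import SimpleNamespace


def summarize_redirect(response):
    """Summarize a followed redirect chain from a requests-style response."""
    summary = {
        "final-url": response.url,
        "final-status-code": str(response.status_code),
    }
    if response.history:
        # The first hop carries the originally requested URL and the
        # first redirect status code.
        first_hop = response.history[0]
        summary["redirect-url"] = first_hop.url
        summary["redirect-status-code"] = str(first_hop.status_code)
    return summary


# Stand-in for a response obtained with allow_redirects=True.
fake_response = SimpleNamespace(
    url="http://news.ycombinator.com",
    status_code=200,
    history=[SimpleNamespace(url="https://goo.gl/ovZB", status_code=301)],
)
print(summarize_redirect(fake_response))
```

With the requests library installed, the same function could be given the result of `requests.get(url, allow_redirects=True)`. A response that was not redirected has an empty `history`, so the two redirect keys are simply absent.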

Filtering results

You can filter results returned for a given group by header (or status) with the filter_ prefix:

$ http GET :8000/group/g:efcf3897 filter_content-type="image/png"
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 539
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:27:07 GMT

{
  "u:179d104f": {
    "content-encoding": "",
    "content-disposition": "",
    "group": "g:efcf3897",
    "last-modified": "Tue, 31 Mar 2015 14:38:37 GMT",
    "content-md5": "",
    "checked-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
    "final-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
    "final-status-code": "200",
    "expires": "",
    "content-type": "image/png",
    "content-length": "280919",
    "updated": "2015-06-03T16:24:00.405636",
    "etag": "\"551ab16d-44957\"",
    "content-location": ""
  },
  "name": "datagouvfr"
}
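
The semantics of the filter_ prefix (and of the exclude_ prefix described next) can be sketched as a small function over a group response. This is an illustration only, not the service's implementation: `apply_filters` is a hypothetical name, and treating a missing header as an empty string is an assumption.

```python
def apply_filters(group, params):
    """Apply filter_<header> and exclude_<header> parameters to a group response."""
    filters = {
        key[len("filter_"):]: value
        for key, value in params.items()
        if key.startswith("filter_")
    }
    excludes = {
        key[len("exclude_"):]: value
        for key, value in params.items()
        if key.startswith("exclude_")
    }

    def keep(entry):
        # Assumption: a header missing from an entry counts as "".
        if any(entry.get(header, "") != value for header, value in filters.items()):
            return False
        if any(entry.get(header, "") == value for header, value in excludes.items()):
            return False
        return True

    # Non-dictionary values such as "name" are always kept.
    return {
        key: value
        for key, value in group.items()
        if not isinstance(value, dict) or keep(value)
    }
```

For example, `apply_filters(group, {"filter_content-type": "image/png"})` keeps only the entries whose content-type is exactly image/png, plus the group name.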

You can exclude results returned for a given group by header (or status) with the exclude_ prefix:

$ http GET :8000/group/g:efcf3897 exclude_content-length=""
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 539
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Jun 2015 14:27:58 GMT

{
  "u:179d104f": {
    "content-encoding": "",
    "content-disposition": "",
    "group": "g:efcf3897",
    "last-modified": "Tue, 31 Mar 2015 14:38:37 GMT",
    "content-md5": "",
    "checked-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
    "final-url": "https://www.data.gouv.fr/s/images/2015-03-31/d2eb53b14c5f4e6690e150ea7be40a88/cover-datafrance-retina.png",
    "final-status-code": "200",
    "expires": "",
    "content-ty