Escargot - a Symfony HttpClient based Crawler framework
<picture>
  <source media="(prefers-color-scheme: dark)" srcset="assets/logo/logo_neg.svg">
  <source media="(prefers-color-scheme: light)" srcset="assets/logo/logo.svg">
  <img alt="The Escargot logo which consists of an illustration of a snail shell" src="assets/logo/logo.svg" width="300">
</picture>

A library that provides everything you need to crawl anything based on HTTP, and to process the responses in whatever way you prefer, based on Symfony components.
Why yet another crawler?
There are so many different implementations in so many programming languages, right? Well, the ones I found in PHP did not really live up to my personal quality standards, and I also wanted something that's built on top of the [Symfony HttpClient][Symfony_HTTPClient] component and is not bound to crawling websites (HTML) only, but can be used as the foundation for anything you may want to crawl. Hence, yet another library.
What about that name «Escargot»?
When I created this library I didn't want to name it «crawler» or «spider» or anything similar that's been used hundreds of times before. So I started to think about things that actually crawl and one thing that came to my mind immediately were snails. But «snail» doesn't really sound super beautiful and so I just went with the French translation for it which is «escargot». There you go! Also French is a beautiful language anyway and in case you didn't know: tons of libraries in the PHP ecosystem were invented and are still maintained by French people so it's also some kind of tribute to the French PHP community (and Symfony one for that matter).
By the way: thanks to the Symfony HttpClient, Escargot is actually not slow at all ;-)
Installation
```bash
composer require terminal42/escargot
```
Usage
Everything in Escargot is assigned to a job ID. The reason for this design is that crawling huge amounts of URIs can take a very long time, and chances are high that you'll want to stop at some point and pick up where you left off. For that matter, every Escargot instance also needs a queue plus a base URI collection that defines where to start crawling.
Instantiating Escargot
When you do not have a job ID yet, use the factory method as follows:
```php
<?php

use Nyholm\Psr7\Uri;
use Terminal42\Escargot\BaseUriCollection;
use Terminal42\Escargot\Escargot;
use Terminal42\Escargot\Queue\InMemoryQueue;

$baseUris = new BaseUriCollection();
$baseUris->add(new Uri('https://www.terminal42.ch'));

$queue = new InMemoryQueue();

$escargot = Escargot::create($baseUris, $queue);
```
In case you already have a job ID because you initiated crawling previously, you do not need a base URI collection anymore but the job ID instead (a custom HTTP client remains optional):
```php
<?php

use Terminal42\Escargot\Escargot;
use Terminal42\Escargot\Queue\InMemoryQueue;

$queue = new InMemoryQueue();

$escargot = Escargot::createFromJobId($jobId, $queue);
```
The different queue implementations
As explained before, the queue is an essential part of Escargot because it keeps track of all the URIs that have already been requested, but it is also responsible for picking up where you left off, based on a given job ID.
You can create your own queue and store the information wherever you like by implementing the QueueInterface.
This library ships with the following implementations for you to use:
- `InMemoryQueue` - an in-memory queue. Mostly useful for testing or CLI usages. Once the process ends, data will be lost.
- `DoctrineQueue` - a Doctrine DBAL queue. Stores the data in your Doctrine/PDO compatible database, so it's persistent.
- `LazyQueue` - a queue that takes two `QueueInterface` implementations as arguments. It will work on the primary queue as long as possible and fall back to the second queue only when needed. The result can be transferred from the first queue to the second queue by using the `commit()` method. The main use case is to prevent e.g. the database from being hammered, by using `$queue = new LazyQueue(new InMemoryQueue(), new DoctrineQueue())`. That way you get persistence (by calling `$queue->commit($jobId)` once done) combined with efficiency.
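The `LazyQueue` pattern described above can be sketched as follows. This is a minimal illustration only: a second `InMemoryQueue` stands in for the persistent fallback queue, so no assumptions are made about the `DoctrineQueue` constructor, and `$jobId` is assumed to come from an earlier crawl setup.

```php
<?php

use Terminal42\Escargot\Queue\InMemoryQueue;
use Terminal42\Escargot\Queue\LazyQueue;

// Fast primary queue; in a real setup the fallback would typically be a
// persistent DoctrineQueue (a second InMemoryQueue stands in for it here).
$queue = new LazyQueue(new InMemoryQueue(), new InMemoryQueue());

// ... create an Escargot instance with $queue and crawl ...

// Once done, transfer everything gathered in the primary queue
// to the fallback queue ($jobId is the job being committed).
$queue->commit($jobId);
```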
Start crawling
Once we have our Escargot instance, we can start crawling by calling the `crawl()` method:

```php
<?php

$escargot->crawl();
```
Subscribers
You might be wondering how you can access the results of your crawl process. In Escargot, `crawl()` does not return anything; instead, everything is passed on to subscribers, which lets you decide exactly what you want to do with the results collected along the way.
The flow of every request executed by Escargot is as follows, and it maps to the corresponding methods of the subscribers:

- Decide whether a request should be sent at all (if no subscriber requests it, none is executed):
  `SubscriberInterface::shouldRequest()`
- If a request was sent, wait for the first response chunk and decide whether the whole response body should be loaded:
  `SubscriberInterface::needsContent()`
- If the body was requested, the data is passed on to the subscribers with the last response chunk that arrives:
  `SubscriberInterface::onLastChunk()`
Adding a subscriber is accomplished by implementing the SubscriberInterface and registering it
using Escargot::addSubscriber():
```php
<?php

$escargot->addSubscriber(new MySubscriber());
```
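A minimal subscriber might look like the following sketch. It follows the method signatures shown in this README; the decision logic (requesting everything, loading the body only for 200 responses) is purely illustrative.

```php
<?php

use Symfony\Contracts\HttpClient\ChunkInterface;
use Symfony\Contracts\HttpClient\ResponseInterface;
use Terminal42\Escargot\CrawlUri;
use Terminal42\Escargot\Subscriber\SubscriberInterface;

class MySubscriber implements SubscriberInterface
{
    public function shouldRequest(CrawlUri $crawlUri, string $currentDecision): string
    {
        // Ask Escargot to execute every request it discovers.
        return SubscriberInterface::DECISION_POSITIVE;
    }

    public function needsContent(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk, string $currentDecision): string
    {
        // Only ask for the response body on successful responses.
        return 200 === $response->getStatusCode()
            ? SubscriberInterface::DECISION_POSITIVE
            : SubscriberInterface::DECISION_NEGATIVE;
    }

    public function onLastChunk(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): void
    {
        // The complete response body is available here:
        // process $response->getContent() however you like.
    }
}
```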
According to the flow of every request, the SubscriberInterface asks you to implement 3 methods:
- `shouldRequest(CrawlUri $crawlUri, string $currentDecision): string;`

  This method is called before a request is executed. Note that the logic is inverted: if none of the registered subscribers asks Escargot to execute the request, it is not going to be requested. That allows for a lot more flexibility: if it were the other way around, one subscriber could cancel the request and thus cause another subscriber to not get any results. You may return one of the following 3 constants:

  - `SubscriberInterface::DECISION_POSITIVE` - returning a positive decision will cause the request to be executed no matter what other subscribers return. It will also cause `needsContent()` to be called on this subscriber.
  - `SubscriberInterface::DECISION_ABSTAIN` - returning an abstain decision will not cause the request to be executed. However, if any other subscriber returns a positive decision, `needsContent()` will still be called on this subscriber.
  - `SubscriberInterface::DECISION_NEGATIVE` - returning a negative decision makes sure `needsContent()` is not called on this subscriber, no matter whether another subscriber returns a positive decision and thus causes the request to be executed.

- `needsContent(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk, string $currentDecision): string;`

  This method is called when the first chunk of a response arrives. You have access to all the headers now, but the content of the response might not be loaded yet. Note that the logic is inverted here as well: if none of the registered subscribers asks Escargot to provide the content, it cancels the request and thus aborts early. Again, you may return one of the following 3 constants:

  - `SubscriberInterface::DECISION_POSITIVE` - returning a positive decision will cause the request to be finished (the whole response content is loaded) no matter what other subscribers return. It will also cause `onLastChunk()` to be called on this subscriber.
  - `SubscriberInterface::DECISION_ABSTAIN` - returning an abstain decision will not cause the request to be finished. However, if any other subscriber returns a positive decision, `onLastChunk()` will still be called on this subscriber.
  - `SubscriberInterface::DECISION_NEGATIVE` - returning a negative decision makes sure `onLastChunk()` is not called on this subscriber, no matter whether another subscriber returns a positive decision and thus causes the request to be completed.

- `onLastChunk(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): void;`

  If one of the subscribers returned a positive decision during the `needsContent()` phase, all subscribers that either abstained or replied positively during that phase receive the content of the response.
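To illustrate the difference between the decision constants, here is a hypothetical `shouldRequest()` implementation (a fragment of a `SubscriberInterface` class) that keeps the crawler on a single host; the host name is an assumption for illustration only.

```php
public function shouldRequest(CrawlUri $crawlUri, string $currentDecision): string
{
    // Never request URIs outside our own host. A negative decision also
    // prevents needsContent() from being called on this subscriber for them.
    if ('www.terminal42.ch' !== $crawlUri->getUri()->getHost()) {
        return SubscriberInterface::DECISION_NEGATIVE;
    }

    // Don't force the request ourselves; let another subscriber decide
    // whether it is worth executing.
    return SubscriberInterface::DECISION_ABSTAIN;
}
```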
There are 2 other interfaces which you may want to implement, but you don't have to:

- `ExceptionSubscriberInterface`

  There are two methods to implement in this interface:

  - `onTransportException(CrawlUri $crawlUri, ExceptionInterface $exception, ResponseInterface $response): void;`

    These types of exceptions are thrown if there's something wrong with the transport (timeouts etc.).

  - `onHttpException(CrawlUri $crawlUri, ExceptionInterface $exception, ResponseInterface $response, ChunkInterface $chunk): void;`

    These types of exceptions are thrown if the status code is in the 300-599 range.

  For more information, also see the [Symfony HttpClient docs][Symfony_HTTPClient].

- `FinishedCrawlingSubscriberInterface`

  There's only one method to implement in this interface:

  - `finishedCrawling(): void;`

    Once crawling has finished, all subscribers implementing this interface will be called. Note that "finished" does not necessarily mean there are no pending queue items; you may also have reached the maximum number of requests.
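A sketch combining both optional interfaces might look like this. It assumes the interfaces live in the same `Subscriber` namespace as `SubscriberInterface`; in practice such a class would typically also implement `SubscriberInterface` itself, which is omitted here for brevity.

```php
<?php

use Symfony\Contracts\HttpClient\ChunkInterface;
use Symfony\Contracts\HttpClient\Exception\ExceptionInterface;
use Symfony\Contracts\HttpClient\ResponseInterface;
use Terminal42\Escargot\CrawlUri;
use Terminal42\Escargot\Subscriber\ExceptionSubscriberInterface;
use Terminal42\Escargot\Subscriber\FinishedCrawlingSubscriberInterface;

class ErrorReportingSubscriber implements ExceptionSubscriberInterface, FinishedCrawlingSubscriberInterface
{
    /** @var array<string> */
    private array $errors = [];

    public function onTransportException(CrawlUri $crawlUri, ExceptionInterface $exception, ResponseInterface $response): void
    {
        // Transport problems: timeouts, DNS failures etc.
        $this->errors[] = sprintf('Transport error for %s: %s', (string) $crawlUri->getUri(), $exception->getMessage());
    }

    public function onHttpException(CrawlUri $crawlUri, ExceptionInterface $exception, ResponseInterface $response, ChunkInterface $chunk): void
    {
        // Status codes in the 300-599 range.
        $this->errors[] = sprintf('HTTP error %d for %s', $response->getStatusCode(), (string) $crawlUri->getUri());
    }

    public function finishedCrawling(): void
    {
        // Crawling ended (queue empty or maximum number of requests reached).
        foreach ($this->errors as $error) {
            echo $error . PHP_EOL;
        }
    }
}
```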
Tags
Sometimes you may want to add meta information to any CrawlUri instance so you can let other subscribers decide
what they want to do with this information, or it may be relevant during another request.
The
