Tarantula
Another PHP crawler based on Guzzle.
Tarantula is a web crawler written in PHP. It utilizes the amazing work of the people behind Guzzle and Symfony's DomCrawler.
Installation
Global tool
Make sure ~/.composer/bin is in your $PATH and then simply execute:
composer global require mihaeu/tarantula:1.*
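If the tarantula command is not found after the install, the Composer bin directory is probably missing from $PATH. Below is a minimal sketch of a check, assuming a POSIX shell. path_contains is a hypothetical helper name, and ~/.composer/bin is the directory named above; some Composer setups use a different one, which `composer global config bin-dir --absolute` prints.

```shell
# Hypothetical helper: succeeds if directory $1 is an entry of $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Assumption: Composer's global bin directory is ~/.composer/bin, as stated above.
composer_bin="$HOME/.composer/bin"

if ! path_contains "$composer_bin"; then
    # Appends for the current session only; put the export line in your
    # shell profile to make it permanent.
    export PATH="$PATH:$composer_bin"
fi
```

Wrapping $PATH in colons before matching avoids false positives on partial directory names such as ~/.composer/bin-old.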
Library
Assuming you are using Composer, add the following to your composer.json file:
{
    "require": {
        "mihaeu/tarantula": "1.*"
    }
}
Alternatively, use Composer's CLI: composer require mihaeu/tarantula:1.*
Usage
Global tool
Right now the only available command is crawl. Some usage examples:
# most basic use case
tarantula crawl "http://google.com"
# go deeper
tarantula crawl "http://products.com/categories" --depth=4
# mirror
tarantula crawl "http://myblog.com" --mirror=/tmp/blog-backup
# filters
tarantula crawl "http://myblog.com" --contains=yolo
tarantula crawl "http://myblog.com" --regex="(post)\|(\d+)"
# dump crawled pages into hashed files
tarantula crawl "http://myblog.com" --save-hashed=/tmp/blog-backup --minify-html
# HTTP basic auth
tarantula crawl "http://secure.com" --user=admin --password=admin
# search for "Avatar" on imdb
bin/tarantula crawl "http://www.imdb.com/find?q=avatar&s=all" --depth=0 --quiet --css=".findSection td.result_text"
# today's weather in Seattle
bin/tarantula crawl --depth=0 "http://www.weather.com/weather/today/Seattle+WA+USWA0395:1:US" --css=".wx-first" | head -n 2
For all arguments and options use the help command:
tarantula help # displays all available commands
tarantula help crawl # all arguments and options for the crawler
tarantula crawl "..." --verbose # switch on debugging output
Library
Have a look at the tests to see what's possible or just try the following in your code:
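require 'vendor/autoload.php'; // Composer's autoloader
use Mihaeu\Tarantula\Crawler;
use Mihaeu\Tarantula\HttpClient;
// Start crawling at the given URL; the argument to go() presumably
// corresponds to the --depth option of the command line tool.
$crawler = new Crawler(new HttpClient('http://google.com'));
$links = $crawler->go(1);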
All HTTP requests go through Guzzle, so any configuration accepted by Guzzle's request object can also be passed to Tarantula's HttpClient.
Tests
Test coverage is not at 100%: this was an afternoon project, and testing a crawler takes a lot of time because of the setup it requires.
If you want to get a quick overview of the project, I recommend running the test suite with the --testdox flag:
vendor/bin/phpunit --testdox
To Do
- [ ] filters (url, filetype, etc.)
- [ ] allow for Guzzle to be configured via command line
- [ ] more actions (save plain result, crawl via DOM/XPath, ...)
Troubleshooting
Composer global install fails
This is most likely due to a conflict with the requirements of other global installs. Unfortunately, Composer's architecture doesn't offer a solution for this yet. I tried to keep Tarantula's requirements loose to avoid this problem.
If you want to have Tarantula available throughout your system, install it into a separate directory (e.g. using composer create-project) and symlink bin/tarantula into a folder in your $PATH.
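The workaround above can be sketched as follows, assuming a POSIX shell. link_into_path is a hypothetical helper, and ~/tools/tarantula and ~/bin are placeholder directories. The helper only creates the symlink; the create-project step is shown as a comment because it needs network access.

```shell
# Hypothetical helper: symlink executable $1 into directory $2,
# which should be a directory listed in $PATH.
link_into_path() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    # -f replaces a stale link left over from an earlier install
    ln -sf "$src" "$dest_dir/$(basename "$src")"
}

# Usage (placeholder paths):
#   composer create-project mihaeu/tarantula "$HOME/tools/tarantula"
#   link_into_path "$HOME/tools/tarantula/bin/tarantula" "$HOME/bin"
```

Because the project lives in its own directory with its own vendor folder, its dependencies cannot clash with those of other globally installed tools.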
Thanks to
- Symfony/SensioLabs and especially Fabien Potencier for what he does for PHP (for this particular project the DomCrawler)
- the Guzzle team for their awesome HTTP client
- Aha Soft for the logo
- the Composer team for revolutionizing the way I and many others write PHP
- GitHub for redefining collaboration
- Travis CI for improving the quality and compatibility of thousands of open source projects
- Sebastian Bergmann for PHPUnit and many other awesome QA tools
License
MIT, see LICENSE file.