HQuery.php
An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
hQuery.php 
An extremely fast and efficient web scraper that can parse megabytes of invalid HTML in a blink of an eye.
You can use the familiar jQuery/CSS selector syntax to easily find the data you need.
My unit tests require it to be at least 10 times faster than Symfony's DOMCrawler on a 3 MB HTML document. In my own informal tests it is two to three orders of magnitude faster than DOMCrawler in some cases, especially when selecting thousands of elements, and on average it uses about half as much RAM.
See tests/README.md.
💡 Features
- Very fast parsing and lookup
- Parses broken HTML
- jQuery-like style of DOM traversal
- Low memory usage
- Can handle big HTML documents (tested up to 20 MB; the practical limit is the amount of RAM you have)
- Doesn't require cURL to be installed and automatically handles redirects (see hQuery::fromUrl())
- Caches response for multiple processing tasks
- PSR-7 friendly (see hQuery::fromHTML($message))
- PHP 5.3+
- No dependencies
Requirements
- PHP 5.3 or newer (PHP 7.4+ recommended)
- The mbstring extension is recommended for reliable charset handling and conversions
- Ensure a sufficient memory_limit when working with very large documents
🛠 Install
Add the library to your project and include it, or install via Composer/npm.
Using Composer (recommended):
composer require duzun/hquery
Or include manually:
include_once '/path/to/hquery.php/hquery.php';
Or via npm:
npm install hquery.php
Then require the file from node_modules if needed.
⚙ Usage
Basic setup:
// Optionally use namespaces
use duzun\hQuery;
// Either use composer, or include this file:
include_once '/path/to/libs/hquery.php';
// Set the cache path - must be a writable folder
// If not set, hQuery::fromURL() would make a new request on each call
hQuery::$cache_path = "/path/to/cache";
// Time to keep request data in cache, seconds
// A value of 0 disables cache
hQuery::$cache_expires = 3600; // default one hour
I would recommend using php-http/cache-plugin with a PSR-7 client for better flexibility.
Load HTML from a file
hQuery::fromFile( string $filename, boolean $use_include_path = false, resource $context = NULL )
// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');
// Remote
$doc = hQuery::fromFile('https://example.com/', false, $context);
Where $context is created with stream_context_create().
For an example of using $context to make an HTTP request through a proxy, see #26.
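A minimal sketch of building such a $context with PHP's standard HTTP context options. The proxy address, timeout and header values are placeholders chosen for illustration, and the hQuery::fromFile() call is left commented out so the snippet runs without the library:

```php
<?php
// Build an HTTP stream context. The proxy address and header are placeholders.
$context = stream_context_create(array(
    'http' => array(
        'method'          => 'GET',
        'timeout'         => 7,                      // seconds
        'header'          => "Accept: text/html\r\n",
        'proxy'           => 'tcp://127.0.0.1:8080', // hypothetical proxy
        'request_fulluri' => true,                   // many proxies expect the full URI
    ),
));

// With hQuery loaded, the context is passed as the third argument:
// $doc = hQuery::fromFile('https://example.com/', false, $context);

// Inspect what was configured
$options = stream_context_get_options($context);
echo $options['http']['proxy'], PHP_EOL; // tcp://127.0.0.1:8080
```

The same $context can be reused for several fromFile() calls that should share the proxy and timeout settings.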
Load HTML from a string
hQuery::fromHTML( string $html, string $url = NULL )
$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');
// Set base_url, in case the document is loaded from local source.
// Note: The base_url property is used to retrieve absolute URLs from relative ones.
$doc->base_url = 'http://desired-host.net/path';
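To illustrate what "absolute from relative" means here, below is a simplified sketch of URL resolution using only PHP built-ins. The function resolveUrl() is hypothetical and is not hQuery's implementation, which may behave differently. It covers only already-absolute URLs, root-relative paths and plain relative paths, and ignores protocol-relative URLs and dot segments:

```php
<?php
// Simplified illustration only; hQuery's own resolution logic may differ.
function resolveUrl($baseUrl, $relative)
{
    // Already absolute (has a scheme): return as is
    if (preg_match('~^[a-z][a-z0-9+.-]*://~i', $relative)) {
        return $relative;
    }

    $base   = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');

    // Root-relative path: attach to the origin
    if (substr($relative, 0, 1) === '/') {
        return $origin . $relative;
    }

    // Otherwise resolve against the directory part of the base path
    $path = isset($base['path']) ? $base['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $relative;
}

echo resolveUrl('http://desired-host.net/path/page.html', 'img/logo.png'), PHP_EOL;
// http://desired-host.net/path/img/logo.png
echo resolveUrl('http://desired-host.net/path/page.html', '/about'), PHP_EOL;
// http://desired-host.net/about
```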
Load a remote HTML document
hQuery::fromUrl( string $url, array $headers = NULL, array|string $body = NULL, array $options = NULL )
use duzun\hQuery;
// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);
var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request
// with POST
$doc = hQuery::fromUrl(
'http://example.com/someDoc.html', // url
['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);
For building advanced requests (POST, parameters, etc.) see hQuery::http_wr().
That said, I recommend using a specialized library (ideally PSR-7 compatible) for making requests
and hQuery::fromHTML($html, $url=NULL) for processing the results.
See Guzzle, for example.
PSR-7 example:
composer require php-http/message php-http/discovery php-http/curl-client
If you don't have cURL PHP extension,
just replace php-http/curl-client with php-http/socket-client in the above command.
use duzun\hQuery;
use Http\Discovery\HttpClientDiscovery;
use Http\Discovery\MessageFactoryDiscovery;
$client = HttpClientDiscovery::find();
$messageFactory = MessageFactoryDiscovery::find();
$request = $messageFactory->createRequest(
'GET',
'http://example.com/someDoc.html',
['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
);
$response = $client->sendRequest($request);
$doc = hQuery::fromHTML($response, $request->getUri());
Another option is to use stream_context_create()
to create a $context, then call hQuery::fromFile($url, false, $context).
Processing the results
hQuery::find( string $sel, array|string $attr = NULL, hQuery\Node $ctx = NULL )
// Find all banners (images inside anchors)
$banners = $doc->find('a[href] > img[src]:parent');
// Extract links and images
$links = array();
$images = array();
$titles = array();
// If the result of find() is not empty
// $banners is a collection of elements (hQuery\Element)
if ( $banners ) {
    // Iterate over the result
    foreach($banners as $id => $a) {
        // $a->href property is the resolved $a->attr('href') relative to the
        // document's <base href=...>, if present, or $doc->baseURL.
        $links[$id] = $a->href; // get absolute URL from href property
        $titles[$id] = trim($a->text()); // strip all HTML tags and leave just text
        // Filter the result
        if ( !$a->hasClass('logo') ) {
            // $a->style property is the parsed $a->attr('style'), same as $a->attr('style', true)
            if ( strtolower($a->style['position']) == 'fixed' ) continue;
            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$id] = $img->src; // short for $img->attr('src', true)
        }
    }
    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is .home button!', PHP_EOL;
        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}
// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;
// Get the size of the document ( strlen($html) )
$size = $doc->size;
// The URL at which the document was requested
$requestUri = $doc->href;
// <base href=...>, if present, or the origin + dir path part from $doc->href.
// The .href and .src props are resolved using this value.
$baseURL = $doc->baseURL;
Charset and positions:
- The document is converted internally to UTF-8 for parsing.
- Element positions (the numeric offsets used internally and returned by APIs that expose byte offsets) refer to the internal UTF-8 string bytes.
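Because offsets are counted in UTF-8 bytes, they can differ from character positions whenever the text contains multi-byte characters. The sketch below uses only PHP built-ins (not hQuery) to show the difference. The sample string and the helper byteToCharOffset() are made up for illustration:

```php
<?php
// Convert a byte offset in a UTF-8 string to a character offset.
// Hypothetical helper, not part of hQuery.
function byteToCharOffset($utf8, $byteOffset)
{
    // Count the UTF-8 characters that precede the byte offset
    return preg_match_all('/./su', substr($utf8, 0, $byteOffset), $m);
}

$html = "<p>caf\xC3\xA9</p><b>x</b>"; // "é" takes 2 bytes in UTF-8

$bytePos = strpos($html, '<b>');              // offset in bytes
$charPos = byteToCharOffset($html, $bytePos); // offset in characters

echo $bytePos, PHP_EOL; // 12
echo $charPos, PHP_EOL; // 11
```

If you need character positions, for example to highlight a match in a UI, convert the byte offsets in this way rather than using them directly.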
Note: In case the charset meta attribute has a wrong value or the internal conversion fails for any other reason, hQuery will continue processing with the original HTML, but will register an error message on $doc->html_errors['convert_encoding'].
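A small sketch of checking for that error after loading a document. The helper encodingError() is hypothetical and reads only the html_errors property described above. A stdClass stand-in replaces the real document so the snippet runs without the library:

```php
<?php
// Return the conversion error message, or NULL when none was registered.
// $doc is expected to be an hQuery document; only html_errors is read.
function encodingError($doc)
{
    $errors = $doc->html_errors;
    if (is_array($errors) && isset($errors['convert_encoding'])) {
        return $errors['convert_encoding'];
    }
    return NULL;
}

// Stand-in object (assumption) so the example is self-contained
$doc = new stdClass;
$doc->html_errors = array('convert_encoding' => 'sample message');

$error = encodingError($doc);
if ($error !== NULL) {
    // The document was parsed in its original encoding; text may be garbled
    echo 'Encoding conversion failed: ', $error, PHP_EOL;
}
```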
🖧 Live Demo
On DUzun.Me
A lot of people ask for sources of my Live Demo page. Here we go:
view-source:https://duzun.me/playground/hquery
🏃 Run the playground
You can easily run any of the examples/ on your local machine.
All you need is PHP installed in your system.
After you clone the repo with git clone https://github.com/duzun/hQuery.php.git,
you have several options to start a web-server.
Option 1:
cd hQuery.php/examples
php -S localhost:8000
# open browser http://localhost:8000/
Option 2 (browser-sync):
This option starts a live-reload server and is good for playing with the code.
npm install
gulp
# open browser http://localhost:8080/
Option 3 (VSCode):
If you are using VSCode, simply open the project and run debugger (F5).
🔧 TODO
- Unit test everything
- Document everything
- ~~Cookie support~~ (implemented in mem for redirects)
- ~~Improve selectors to be able to select by attributes~~
- Add more selectors
- Use HTTPlug internally
💖 Support my projects
I love Open Source. Whenever possible I share cool things with the world (check out NPM and GitHub).
If you like what I'm doing and this project helps you reduce development time, please:
- ★ Star and Share the projects you like (and use)
