XRay

XRay parses structured content from a URL.

Discovering Content

XRay will parse content in the following formats. First the URL is checked against known services:

GitHub
XKCD
Hackernews

If the content at the URL is XML or JSON, XRay will parse it as one of the Atom, RSS or JSONFeed formats.

Finally, XRay looks for Microformats on the page and determines the content from them.

h-card
h-entry
h-event
h-review
h-recipe
h-product
h-item
h-feed

Library

XRay can be used as a library in your PHP project. The easiest way to install it and its dependencies is via Composer.

composer require p3k/xray

You can also download a release, which is a zip file with all dependencies already installed.

Parsing

$xray = new p3k\XRay();
$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/');

If you already have an HTML or JSON document you want to parse, you can pass it as a string in the second parameter.

$xray = new p3k\XRay();
$html = '<html>....</html>';
$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/', $html);

$xray = new p3k\XRay();
$jsonfeed = '{"version":"https://jsonfeed.org/version/1","title":"Manton Reece", ... ';
// Note that the JSON document must be passed in as a string in this case
$parsed = $xray->parse('https://manton.micro.blog/feed.json', $jsonfeed);

In both cases, you can pass an additional parameter, an array of options that configure how XRay behaves. The options are listed below.

timeout - The timeout in seconds to wait for any HTTP requests
max_redirects - The maximum number of redirects to follow
include_original - Will also return the full document fetched
target - Specify a target URL, and XRay will first check if that URL is on the page, and only if it is, will continue to parse the page. This is useful when you're using XRay to verify an incoming webmention.
expect=feed - If you know the thing you are parsing is a feed, include this parameter which will avoid running the autodetection rules and will provide better results for some feeds.
accept - (options: html, json, activitypub, xml) - Without this parameter, XRay sends a default Accept header to prioritize getting the most likely best result from a page. If you are parsing a page for a specific purpose and expect to find only one type of content (e.g. webmentions will probably only be from HTML pages), you can include this parameter to adjust the Accept header XRay sends.

Additional parameters are supported when making requests that use the GitHub API. See the Authentication section below for details.

The XRay constructor can optionally be passed an array of default options, which will be applied in addition to (and can be overridden by) the options passed to individual parse() calls.

$xray = new p3k\XRay([
  'timeout' => 30 // Time-out all requests which take longer than 30s
]);

$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/', [
  'timeout' => 40 // Override the default 30s timeout for this specific request
]);

$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/', $html, [
  'target' => 'http://example.com/'
]);
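The precedence described above (constructor defaults, overridden by per-call options) behaves like a plain array merge. The sketch below only illustrates that rule with PHP's array_merge. The effectiveOptions helper and the max_redirects value are hypothetical, and XRay's own internals may differ.

```php
<?php
// Hypothetical helper illustrating option precedence; not part of XRay.
function effectiveOptions(array $defaults, array $perCall): array
{
    // For string keys, array_merge lets later arrays win,
    // so per-call options override constructor defaults.
    return array_merge($defaults, $perCall);
}

$defaults = ['timeout' => 30, 'max_redirects' => 5]; // illustrative values
$perCall  = ['timeout' => 40, 'target' => 'http://example.com/'];

$options = effectiveOptions($defaults, $perCall);
// timeout now comes from the call; max_redirects still comes from the defaults
print_r($options);
```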

The $parsed return value looks like the following. See "Primary Data" below for an explanation of the vocabularies returned.

$parsed = Array
(
    [data] => Array
        (
            [type] => card
            [name] => Aaron Parecki
            [url] => https://aaronparecki.com/
            [photo] => https://aaronparecki.com/images/profile.jpg
        )

    [url] => https://aaronparecki.com/
    [code] => 200
    [source-format] => mf2+html
)
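As a rough illustration of consuming this return value, the sketch below reads the keys shown in the example array. The describeResult helper is hypothetical, not part of XRay. Because this section does not describe the shape of an error result, the sketch assumes only that a result without data.type has nothing usable.

```php
<?php
// Hypothetical helper (not part of XRay): one-line summary of a result array
// shaped like the example above.
function describeResult(array $parsed): string
{
    // Assumption: a result without data.type carries no usable content.
    if (!isset($parsed['data']['type'])) {
        return 'no structured data';
    }

    return sprintf(
        '%s "%s" from %s (HTTP %d, %s)',
        $parsed['data']['type'],
        $parsed['data']['name'] ?? '(unnamed)',
        $parsed['url'] ?? '(unknown URL)',
        $parsed['code'] ?? 0,
        $parsed['source-format'] ?? 'unknown format'
    );
}

// Same values as the example output above, written as a PHP array literal.
$parsed = [
    'data' => [
        'type'  => 'card',
        'name'  => 'Aaron Parecki',
        'url'   => 'https://aaronparecki.com/',
        'photo' => 'https://aaronparecki.com/images/profile.jpg',
    ],
    'url'           => 'https://aaronparecki.com/',
    'code'          => 200,
    'source-format' => 'mf2+html',
];

$summary = describeResult($parsed);
echo $summary, "\n";
```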

Processing Microformats2 JSON

If you already have a parsed Microformats2 document as an array, you can use a special function to process it into XRay's native format. Make sure you pass the entire parsed document, not just the single item.

$html = '<div class="h-entry"><p class="p-content p-name">Hello World</p><img src="/photo.jpg"></div>';
$mf2 = Mf2\parse($html, 'http://example.com/entry');

$xray = new p3k\XRay();
$parsed = $xray->process('http://example.com/entry', $mf2); // note the use of `process` not `parse`

Array
(
    [data] => Array
        (
            [type] => entry
            [post-type] => photo
            [photo] => Array
                (
                    [0] => http://example.com/photo.jpg
                )

            [content] => Array
                (
                    [text] => Hello World
                )

        )

    [url] => http://example.com/entry

    [source-format] => mf2+json
)

Rels

You can also use XRay to fetch all the rel values on a page. The rel values from HTTP Link headers are merged with the rel values found in the page's HTML.

$xray = new p3k\XRay();
$rels = $xray->rels('https://aaronparecki.com/');

This returns a response similar to the parser's, but instead of a data key containing the parsed page, there is a rels key holding an associative array. Each key is a rel value, and its entry is an array of all the URLs that use that rel value.

Array
(
    [url] => https://aaronparecki.com/
    [code] => 200
    [rels] => Array
        (
            [hub] => Array
                (
                    [0] => https://switchboard.p3k.io/
                )

            [authorization_endpoint] => Array
                (
                    [0] => https://aaronparecki.com/auth
                )
            ...
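Since every rel maps to an array, callers often want just the first URL for a given rel. A minimal sketch, assuming a response shaped like the output above; firstRel is a hypothetical helper and not part of XRay.

```php
<?php
// Hypothetical helper: first URL for a rel value, or null when the rel is absent.
function firstRel(array $response, string $rel): ?string
{
    return $response['rels'][$rel][0] ?? null;
}

// Array literal mirroring the part of the example output shown above.
$response = [
    'url'  => 'https://aaronparecki.com/',
    'code' => 200,
    'rels' => [
        'hub'                    => ['https://switchboard.p3k.io/'],
        'authorization_endpoint' => ['https://aaronparecki.com/auth'],
    ],
];

$hub        = firstRel($response, 'hub');
$webmention = firstRel($response, 'webmention'); // not in this sample, so null
```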

Feed Discovery

You can use XRay to discover the types of feeds available at a URL.

$xray = new p3k\XRay();
$feeds = $xray->feeds('http://percolator.today');

This will fetch the URL, check for a Microformats feed, and check for rel=alternate links pointing to Atom, RSS or JSONFeed URLs. The response looks like the following.

Array
(
    [url] => https://percolator.today/
    [code] => 200
    [feeds] => Array
        (
            [0] => Array
                (
                    [url] => https://percolator.today/
                    [type] => microformats
                )

            [1] => Array
                (
                    [url] => https://percolator.today/podcast.xml
                    [type] => rss
                )

        )

)
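When several feeds are discovered, the caller has to choose one. The sketch below picks a feed by a caller-supplied order of preference. preferredFeed is a hypothetical helper, and the type strings "jsonfeed" and "atom" are assumptions, since only "microformats" and "rss" appear in the example output above.

```php
<?php
// Hypothetical helper: return the first feed whose type appears earliest
// in the caller's preference list, or null if none match.
function preferredFeed(array $response, array $preferredTypes): ?array
{
    foreach ($preferredTypes as $type) {
        foreach ($response['feeds'] ?? [] as $feed) {
            if (($feed['type'] ?? null) === $type) {
                return $feed;
            }
        }
    }
    return null;
}

// Array literal mirroring the example output above.
$response = [
    'url'   => 'https://percolator.today/',
    'code'  => 200,
    'feeds' => [
        ['url' => 'https://percolator.today/', 'type' => 'microformats'],
        ['url' => 'https://percolator.today/podcast.xml', 'type' => 'rss'],
    ],
];

// "jsonfeed" and "atom" are assumed type names; neither is in this sample.
$feed = preferredFeed($response, ['jsonfeed', 'atom', 'rss', 'microformats']);
```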

Customizing the User Agent

Some websites require a user agent to be set. To set a unique user agent, set the http property of the XRay object to a p3k\HTTP object.

$xray = new p3k\XRay();
$xray->http = new p3k\HTTP('MyProject/1.0.0 (http://example.com/)');
$xray->parse('http://example.com/');

API

XRay can also be used as an API to provide its parsing capabilities over an HTTP service.

To parse a page and return structured data for the contents of the page, simply pass a url to the /parse route.

GET /parse?url=https://aaronparecki.com/2016/01/16/11/

To conditionally parse the page after first checking if it contains a link to a target URL, also include the target URL as a parameter. This is useful when using XRay to verify an incoming webmention.

GET /parse?url=https://aaronparecki.com/2016/01/16/11/&target=http://example.com
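The request lines above show the parameters unencoded for readability. When building the request in code, the values should be URL-encoded. A minimal sketch using PHP's http_build_query; the service base URL https://xray.example.com is a placeholder for wherever your XRay instance is deployed.

```php
<?php
// Placeholder base URL of a deployed XRay service (an assumption).
$service = 'https://xray.example.com';

// http_build_query percent-encodes each value so the nested URLs
// cannot be confused with the outer query string.
$query = http_build_query([
    'url'    => 'https://aaronparecki.com/2016/01/16/11/',
    'target' => 'http://example.com',
], '', '&');

$requestUrl = $service . '/parse?' . $query;
echo $requestUrl, "\n";
```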

In both cases, the response will be a JSON object containing a key of "type". If there was an error, "type" will be set to the string "error", otherwise it will refer to the kind of content that was found at the URL, most often "entry".

You can also make a POST request with the same parameter names.

If you already have an HTML or JSON document you want to parse, you can include it in the POST parameter "body". The POST request would look like this:

POST /parse
Content-type: application/x-www-form-urlencoded

url=https://aaronparecki.com/2016/01/16/11/
&body=<html>....</html>

or for GitHub where you might have JSON,

POST /parse
Content-type: application/x-www-form-urlencoded

url=https://github.com/aaronpk/XRay
&body={"repo":......}
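Because the document travels as a form field, it has to be form-encoded. Otherwise characters such as & inside the HTML would split the field. The sketch below encodes a small, made-up HTML document and decodes it again to show the round trip. It builds the body only and sends nothing.

```php
<?php
// Made-up sample document; note the "&" that must not split the form field.
$html = '<html><body class="h-entry"><p class="e-content">Hi & bye</p></body></html>';

// Form-encode the parameters for an application/x-www-form-urlencoded POST.
$postBody = http_build_query([
    'url'  => 'https://aaronparecki.com/2016/01/16/11/',
    'body' => $html,
], '', '&');

// Decoding shows what a form parser on the receiving side would recover.
parse_str($postBody, $decoded);

echo $postBody, "\n";
```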

Parameters

XRay accepts the following parameters when calling /parse:

url - the URL of the page to parse
target - Specify a target URL, and XRay will first check if that URL is on the page, and only if it is, will continue to parse the page. This is useful when you're using XRay to verify an incoming webmention.
timeout - The timeout in seconds to wait for any HTTP requests
max_redirects - The maximum number of redirects to follow
include_original - Will also return the full document fetched
expect=feed - If you know the thing you are parsing is a feed, include this parameter which will avoid running the autodetection rules and will provide better results for some feeds.

Authentication

If the URL you are fetching requires authentication, include the access token in the parameter "token", and it will be included in an "Authorization" header when fetching the URL. (It is recommended to use a POST request in this case, to avoid the access token potentially being logged as part of the query string.) This is useful for Private Webmention verification.

POST /parse

url=https://aaronparecki.com/2016/01/16/11/
&target=http://example.com
&token=12341234123412341234
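One way to send such a POST from PHP is a stream context, which keeps the token in the request body rather than the query string. In the sketch below, parseRequestContext is a hypothetical helper, the 10-second timeout is an arbitrary choice, and the service URL in the commented-out call is a placeholder.

```php
<?php
// Hypothetical helper: build stream-context options for a POST to /parse.
function parseRequestContext(array $params): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            // Parameters go in the body, so the token stays out of the URL.
            'content' => http_build_query($params, '', '&'),
            'timeout' => 10, // seconds to wait for the XRay service itself
        ],
    ];
}

$options = parseRequestContext([
    'url'    => 'https://aaronparecki.com/2016/01/16/11/',
    'target' => 'http://example.com',
    'token'  => '12341234123412341234',
]);

// To send the request (needs a running XRay service at this placeholder URL):
// $json = file_get_contents('https://xray.example.com/parse', false, stream_context_create($options));
// $response = json_decode($json, true);
```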

API Authentication

XRay uses the GitHub API to fetch posts, and that API requires authentication. To keep XRay stateless, you must pass the credentials in with each parse call.

You should only send the credentials when the URL you are trying to parse is a GitHub URL, so check whether the hostname is github.com, etc. before including credentials in the call.

GitHub Authentication

XRay uses the GitHub API to fetch GitHub URLs, which provides higher rate limits when used with authentication. You can pass a GitHub access token along with the request and XRay will use it when making requests to the API.

github_access_token - A GitHub access token
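The hostname check recommended above can be sketched as follows. optionsForUrl is a hypothetical helper, and the allow-list contains only github.com, so extend it if you need to cover other GitHub hosts.

```php
<?php
// Hypothetical helper: include the GitHub token only for GitHub URLs,
// so the credential is never sent along with requests for other sites.
function optionsForUrl(string $url, string $githubToken): array
{
    // Assumption: only this exact host counts as GitHub.
    $githubHosts = ['github.com'];

    // parse_url returns null or false when no host is found; cast to ''.
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));

    return in_array($host, $githubHosts, true)
        ? ['github_access_token' => $githubToken]
        : [];
}

$forGitHub = optionsForUrl('https://github.com/aaronpk/XRay', 'example-token');
$forOther  = optionsForUrl('https://aaronparecki.com/', 'example-token');
```

The returned array can then be passed as the options argument of parse(), or merged into the parameters of a /parse request.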
