
Metascraper

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.


<div align="center"> <a href="https://metascraper.js.org"> <img style="width: 500px; margin:3rem 0 1.5rem;" src="https://metascraper.js.org/static/logo-banner.png" alt="metascraper"> </a> <br><br> <a href="https://microlink.io"><img src="https://img.shields.io/badge/powered_by-microlink.io-blue?style=flat-square&color=%23EA407B" alt="Powered by microlink.io"></a> <img alt="Last version" src="https://img.shields.io/github/tag/microlinkhq/metascraper.svg?style=flat-square"> <a href="https://coveralls.io/github/microlinkhq/metascraper"><img alt="Coverage Status" src="https://img.shields.io/coveralls/microlinkhq/metascraper.svg?style=flat-square"></a> <a href="https://www.npmjs.org/package/metascraper"><img alt="NPM Status" src="https://img.shields.io/npm/dm/metascraper.svg?style=flat-square"></a> <br><br> </div>

A library to easily extract unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.


What is it

The metascraper library allows you to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.

It follows a few principles:

  • Ensure a high accuracy for online articles by default.
  • Make it simple to add new rules or override existing ones.
  • Don't restrict rules to CSS selectors or text accessors.

Getting started

Below is a real example of extracting metadata from a live website. The same logic shown here is running online and can be tested directly at microlink.io/meta:

<div align="center"> <a href="https://microlink.io/meta" target="_blank" rel="noopener"> <img align="center" src="/static/demo1.jpeg" style="margin-top: 1rem; margin-bottom: 1.5rem;"> </a> <br><br> </div>

metascraper requires two inputs: the target URL and the HTML markup behind that URL.

There are multiple ways to retrieve the HTML markup, but it needs to be as accurate as possible.

For that reason, we developed html-get, which uses a headless browser to retrieve HTML in a way that works seamlessly with metascraper.

const getHTML = require('html-get')

/**
 * `browserless` will be passed to `html-get`
 * as driver for getting the rendered HTML.
 */
const browserless = require('browserless')()

const getContent = async url => {
  // create a browser context inside the main Chromium process
  const browserContext = browserless.createContext()
  const promise = getHTML(url, { getBrowserless: () => browserContext })
  // close browser resources before returning the result
  promise.then(() => browserContext).then(browser => browser.destroyContext())
  return promise
}

/**
 * `metascraper` is a collection of tiny packages,
 * so you can just use what you actually need.
 */
const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

/**
 * The main logic
 */
getContent('https://microlink.io')
  .then(metascraper)
  .then(metadata => console.log(metadata))
  .then(browserless.close)
  .then(process.exit)

The output will be something like:

{
  "author": "Microlink HQ",
  "date": "2022-07-10T22:53:04.856Z",
  "description": "Enter a URL, receive information. Normalize metadata. Get HTML markup. Take a screenshot. Identify tech stack. Generate a PDF. Automate web scraping. Run Lighthouse",
  "image": "https://cdn.microlink.io/logo/banner.jpeg",
  "logo": "https://cdn.microlink.io/logo/logo.png",
  "publisher": "Microlink",
  "title": "Turns websites into data — Microlink",
  "url": "https://microlink.io/"
}
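If the target page is fully server-rendered, a headless browser may be unnecessary. A minimal sketch, assuming Node 18+ (which ships a global fetch); `getStaticContent` is a hypothetical helper, not part of metascraper:

```javascript
// Minimal sketch: retrieve HTML for static pages without a headless browser.
// Assumes Node 18+ (global fetch). `getStaticContent` is a hypothetical
// helper name; it returns the `{ url, html }` pair that metascraper expects.
const getStaticContent = async targetUrl => {
  const res = await fetch(targetUrl)
  // use the final URL after redirects, so relative metadata resolves correctly
  return { url: res.url, html: await res.text() }
}
```

Keep in mind that pages rendering their metadata client-side still need a headless-browser approach such as html-get.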

What data it detects

Note: Custom metadata detection can be defined using a rule bundle.

Here is an example of the metadata that metascraper can detect:

  • audio — e.g. https://cf-media.sndcdn.com/U78RIfDPV6ok.128.mp3<br/> An audio URL that best represents the article.

  • author — e.g. Noah Kulwin<br/> A human-readable representation of the author's name.

  • date — e.g. 2016-05-27T00:00:00.000Z<br/> An ISO 8601 representation of the date the article was published.

  • description — e.g. Venture capitalists are raising money at the fastest rate...<br/> The publisher's chosen description of the article.

  • video — e.g. https://assets.entrepreneur.com/content/preview.mp4<br/> A video URL that best represents the article.

  • image — e.g. https://assets.entrepreneur.com/content/3x2/1300/20160504155601-GettyImages-174457162.jpeg<br/> An image URL that best represents the article.

  • lang — e.g. en<br/> An ISO 639-1 representation of the content language of the URL.

  • logo — e.g. https://entrepreneur.com/favicon180x180.png<br/> An image URL that best represents the publisher's brand.

  • publisher — e.g. Fast Company<br/> A human-readable representation of the publisher's name.

  • title — e.g. Meet Wall Street's New A.I. Sheriffs<br/> The publisher's chosen title of the article.

  • url — e.g. http://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion<br/> The URL of the article.

The cloud API solution

Running this at scale means operating headless browsers, proxies, and antibot workarounds.

If you don’t want to manage that infrastructure, you can use the fully managed Microlink API.

It automatically handles proxy rotation, paywalls, bot detection, and restricted platforms such as major social networks, while scaling on demand.

Pricing is pay-as-you-go, with a free tier to start.
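As a sketch of what calling the hosted API looks like (the endpoint shape and the `{ status, data }` response fields are assumptions based on Microlink's public documentation):

```javascript
// Hedged sketch: query the hosted Microlink API instead of self-hosting.
// Assumes the documented GET endpoint `https://api.microlink.io/?url=...`
// returning `{ status, data }`. Requires Node 18+ for global fetch.
const getMetadata = async targetUrl => {
  const res = await fetch(
    `https://api.microlink.io/?url=${encodeURIComponent(targetUrl)}`
  )
  const { status, data } = await res.json()
  if (status !== 'success') throw new Error(`Microlink API: ${status}`)
  return data // metadata fields such as title, image, publisher, etc.
}
```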

How it works

metascraper is built out of rules bundles.

It is designed to be extensible. You can compose your own transformation pipeline using existing rules or create your own.

Rule bundles are collections of HTML selectors targeting a specific property. When you load the library, it implicitly loads the core rules, and each bundle contributes the set of selectors used to extract one value.

Rules are ordered by priority. The first rule to successfully resolve the value stops the process. The order goes from most specific to most generic.

Rules work as fallbacks for one another:

  • If the first rule fails, then it falls back on the second rule.
  • If the second rule fails, it is time for the third rule.
  • Etc.

metascraper keeps going until a rule resolves the value or all the rules have been tried.
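The fallback behavior described above can be illustrated without the library itself; `resolveValue` and the sample rules below are hypothetical, not metascraper internals:

```javascript
// Illustrative sketch of first-rule-wins resolution: run the rules in
// priority order and stop at the first one that yields a truthy value.
const resolveValue = async (rules, context) => {
  for (const rule of rules) {
    const value = await rule(context)
    if (value) return value
  }
}

// Hypothetical title rules, ordered most specific → most generic.
const titleRules = [
  context => context.ogTitle,   // e.g. from `og:title`
  context => context.htmlTitle  // e.g. from the <title> tag
]
```

With a context of `{ htmlTitle: 'Fallback title' }` and no `ogTitle`, the first rule fails and the second one resolves the value.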

Importing rules

metascraper exports a constructor that needs to be initialized with the collection of rules to load:

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

Again, the order in which rules are loaded is important: only the first rule that resolves the value will be applied.

Use the first parameter to pass custom options to each rules bundle:

const metascraper = require('metascraper')([
  require('metascraper-logo')({
    filter: url => url.endsWith('.png')
  })
])
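Custom bundles follow the same shape as the official ones. A hedged sketch, assuming a bundle is an object mapping a property name to an ordered array of rule functions, each receiving `{ htmlDom: $, url }` with a cheerio-like `$`; the `.byline` selector is purely illustrative:

```javascript
// Hedged sketch of a custom rules bundle (shape assumed from the official
// bundles): property name → ordered array of rules, most specific first.
const customAuthorBundle = {
  author: [
    // explicit author meta tag
    ({ htmlDom: $ }) => $('meta[name="author"]').attr('content'),
    // hypothetical fallback: a byline element in the page body
    ({ htmlDom: $ }) => $('.byline').text()
  ]
}

// It would then be passed alongside the official bundles, e.g.:
// const metascraper = require('metascraper')([customAuthorBundle])
```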

Rules bundles

?> Can't find the rules bundle you want? Open an issue to request it.

Official

Rules bundles maintained by metascraper maintainers.

Core essential
