Domp

Web scraping, crawling and DOM tree manipulation for Node.js.

It uses htmlparser2 for HTML parsing and robots-txt for robots.txt checking.

$ npm install domp

var domp = require('domp');

Usage

Get single page (`examples/single.js`)

domp(url, function(dom) {
  console.log(...dom.map(node => node.name));
  // html head meta title script ...
});

Get multiple pages (`examples/multiple.js`)

You can scrape an Array of urls in two ways.

By providing a callback:

domp(urls, function(dom) {
  // called once per url
});

By looping through an iterator of promises:

for (var page of domp(urls))
  page.then(function (dom) {
    // resolved
  }, function (error) {
    // rejected
  });
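When every page is needed before continuing, the promises from the iterator can be collected in one step. The following is a minimal sketch, assuming only that the iterator yields standard promises. `fakeDomp` and `collect` are hypothetical names, and `fakeDomp` stands in for `domp(urls)` so the example runs without network access:

```javascript
// Hypothetical stand-in for domp(urls): yields one promise per url.
// Urls containing 'bad' reject, so both outcomes are shown.
function* fakeDomp(urls) {
  for (const url of urls)
    yield url.includes('bad')
      ? Promise.reject(new Error('failed: ' + url))
      : Promise.resolve({ url: url });
}

// Wait for every page to settle, then split successes from failures.
function collect(pages) {
  // Promise.allSettled accepts any iterable of promises
  return Promise.allSettled(pages).then(function (results) {
    return {
      doms: results.filter(r => r.status === 'fulfilled').map(r => r.value),
      errors: results.filter(r => r.status === 'rejected').map(r => r.reason)
    };
  });
}

collect(fakeDomp(['http://a.example', 'http://bad.example']))
  .then(function (result) {
    console.log(result.doms.length, 'resolved,', result.errors.length, 'rejected');
  });
```

Unlike `Promise.all`, `Promise.allSettled` does not give up on the first rejection, so one failing page does not hide the others.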

Crawling (`examples/crawl.js`)

function resolve(next) {
  return function (dom) {
    var title = dom.find('title').next().value,
        links = [...dom.filter(node => node.href && node.href.indexOf('http') === 0)];

    // get random link
    var link = links[Math.floor(Math.random() * links.length)];

    console.log(title.text);

    // stop here if the page has no absolute links
    if (!link) return;

    console.log(link.href);

    // submit link(s) to be scraped next
    next(link.href);
  };
}

domp.crawl('https://en.wikipedia.org', function(requests, next) {
  for (var request of requests)
    request.then(resolve(next));
});
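The crawl above follows random links indefinitely and may revisit pages. One way to bound it is to remember visited urls and enforce a page budget. This is a sketch only: `createFrontier` and its `accept` method are hypothetical helpers, not part of domp.

```javascript
// Hypothetical helper, not part of domp: remembers visited urls and
// enforces a page budget so a crawl cannot loop or run forever.
function createFrontier(maxPages) {
  var seen = new Set();
  return {
    // returns true only for a new url while the budget is not used up
    accept: function (url) {
      if (seen.size >= maxPages || seen.has(url)) return false;
      seen.add(url);
      return true;
    },
    // number of urls accepted so far
    size: function () { return seen.size; }
  };
}

var frontier = createFrontier(2);
console.log(frontier.accept('https://en.wikipedia.org/wiki/A')); // true
console.log(frontier.accept('https://en.wikipedia.org/wiki/A')); // false (already seen)
console.log(frontier.accept('https://en.wikipedia.org/wiki/B')); // true
console.log(frontier.accept('https://en.wikipedia.org/wiki/C')); // false (budget used up)
```

Inside `resolve`, the submission would then be guarded, for example `if (frontier.accept(link.href)) next(link.href);`.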

DOM Tree traversal

Standard traversal using for ... of:

for (var node of dom)
  console.log(node);

Sibling (children with same parent) traversal using for ... of:

for (var sibling of node.siblings)
  console.log(sibling);

Tag name traversal using for ... of and find(name):

for (var node of dom.find('p'))
  console.log(node);
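The three traversal styles above all rely on JavaScript's iteration protocol. The sketch below shows one way a tree can support them with generators. `TreeNode` is a hypothetical class written for illustration, and domp's own node implementation may differ, in particular in whether `siblings` includes the node itself.

```javascript
// Hypothetical node type, written only to illustrate the idea;
// domp's own nodes may be implemented differently.
class TreeNode {
  constructor(name, children = []) {
    this.name = name;
    this.parent = null;
    this.children = children;
    for (const child of children) child.parent = this;
  }

  // depth-first, pre-order traversal: the node itself, then each subtree
  *[Symbol.iterator]() {
    yield this;
    for (const child of this.children) yield* child;
  }

  // nodes sharing this node's parent (assumed here to exclude the node itself)
  get siblings() {
    return this.parent ? this.parent.children.filter(n => n !== this) : [];
  }

  // lazily yields every node in the subtree with the given tag name
  *find(name) {
    for (const node of this)
      if (node.name === name) yield node;
  }
}

var tree = new TreeNode('html', [
  new TreeNode('head', [new TreeNode('title')]),
  new TreeNode('body', [new TreeNode('p'), new TreeNode('p')])
]);

console.log([...tree].map(n => n.name).join(' '));
// html head title body p p
console.log([...tree.find('p')].length);
// 2
```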

DOM Manipulation

DOM nodes (see `node.js`) implement map, which works like Array.prototype.map but returns an Iterable instead of an Array. The Iterable can either be unpacked into an Array with the spread operator (...) or consumed directly as an iterator.

var names = dom.map(node => node.name);

names = [...names];
// names = ['html', 'head', 'meta', 'title', ...]

for (var name of names)
  console.log(name);
// html
// head
// ...

Filtering works the same way and also returns an Iterable:

// get all 'p' tags
var paragraphs = dom.filter(node => node.name === 'p');

// traverse
for (var p of paragraphs)
  console.log(p);
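Returning an Iterable rather than an Array means the work can be done lazily, one node at a time. The sketch below illustrates that behaviour with generator functions. The standalone `map` and `filter` helpers are hypothetical and operate on a plain Array of objects; whether domp's Iterables are lazy or single-use in the same way is an assumption.

```javascript
// Generic lazy helpers over any iterable (hypothetical, not domp's code).
function* map(iterable, fn) {
  for (const item of iterable) yield fn(item);
}

function* filter(iterable, predicate) {
  for (const item of iterable)
    if (predicate(item)) yield item;
}

var nodes = [{ name: 'html' }, { name: 'p' }, { name: 'a' }, { name: 'p' }];
var visited = 0;

// nothing runs yet: generators do no work until they are iterated
var paragraphs = filter(nodes, node => { visited++; return node.name === 'p'; });
console.log(visited); // 0

// pulling one value only visits nodes up to the first match
var first = paragraphs.next().value;
console.log(first.name, visited); // p 2

// the rest can still be unpacked, but a generator can only be consumed once
console.log([...map(paragraphs, node => node.name)]); // [ 'p' ]
```

This is also why `dom.find('title').next().value` in the crawling example can return the first match without unpacking the whole result.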

There's also the shorthand find(name), which finds nodes by tag name in the tree:

for (var node of dom.find('p'))
  console.log(node);

Author: mateogianolio
Repository: mateogianolio/domp (GitHub)
Language: JavaScript
Category: Development
GitHub stars: 14
Forks: 0
Updated: 6y ago
Security score: 75/100 (audited on Jul 15, 2019; no findings)
