Claw
Simple Node.js web scraper chassis
CLAW
A very simple web scraper chassis. Claw takes a web page (or list of pages), scrapes some info from those pages, then dumps the results to a JSON or CSV file.
How to Use
Claw accepts the following parameters:
- a page URL, or an array of URLs
- a selector for the sections to scrape
- fields to pull out from within each selected section
- an output folder
- number of seconds to delay
var claw = require('claw');
var claw_config = {
    pages : [
        'http://www.bing.com/search?q=hello',
        'http://www.bing.com/search?q=goodbye'
    ],
    selector : 'h3',
    fields : {
        "text" : "$(sel).find('a').text()",
        "href" : "$(sel).find('a').attr('href')"
    },
    delay : 3
};
claw.init(claw_config);
Exported Data
Each page gets saved to a separate output file.
Here's what the exported JSON looks like:
[
    {
        "text": "HELLO! Online: celeb & royal news, magazine, babies, weddings",
        "href": "http://www.hellomagazine.com/"
    },
    {
        "text": "Hello - Wikipedia, the free encyclopedia",
        "href": "http://en.wikipedia.org/wiki/Hello"
    },
    {
        "text": "Hello | Define Hello at Dictionary.com",
        "href": "http://dictionary.reference.com/browse/hello"
    }
]
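Records like these can also be flattened into CSV: a header row of field names, then one quoted row per record. The sketch below illustrates that mapping. It is not Claw's actual exporter; `toCsv` is a hypothetical helper, and it assumes every record has the same keys and that embedded double quotes are escaped by doubling them.

```javascript
// Hypothetical helper, not part of Claw's API: converts an array of flat
// records into CSV text with an unquoted header row and quoted values.
function toCsv(records) {
  if (records.length === 0) return '';
  // Assumption: every record has the same keys as the first one.
  const columns = Object.keys(records[0]);
  // Wrap a value in double quotes and double any embedded quotes.
  const quote = (value) => '"' + String(value).replace(/"/g, '""') + '"';
  const rows = records.map((record) =>
    columns.map((column) => quote(record[column])).join(',')
  );
  return [columns.join(','), ...rows].join('\n');
}

// Illustrative record taken from the JSON sample above.
const sample = [
  {
    text: 'Hello - Wikipedia, the free encyclopedia',
    href: 'http://en.wikipedia.org/wiki/Hello'
  }
];
console.log(toCsv(sample));
```

Quoting every value keeps commas inside titles from being read as column separators.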
And here's the CSV:
text,href
"HELLO! Online: celeb & royal news, magazine, babies, weddings, …","http://www.hellomagazine.com/"
"Hello - Wikipedia, the free encyclopedia","http://en.wikipedia.org/wiki/Hello"
"Hello | Define Hello at Dictionary.com","http://dictionary.reference.com/browse/hello"
External Config
Claw can import its configuration from a JSON file:
claw('sample1.json');
The file looks like this:
{
    "pages" : [
        "http://www.bing.com/search?q=hello",
        "http://www.bing.com/search?q=goodbye"
    ],
    "selector" : "h3",
    "fields" : {
        "href" : "$(sel).find('a').attr('href')"
    },
    "delay" : 5
}
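One way to picture the two configuration styles (an in-memory object or a JSON file name) is a small loader like the one below. This is a sketch under assumptions, not Claw's internals: `loadConfig` is a hypothetical name, and it simply treats any string argument as a path to a JSON file.

```javascript
const fs = require('fs');

// Hypothetical helper, not part of Claw's API: accepts either a config
// object or the path to a JSON file and returns a config object.
function loadConfig(input) {
  if (typeof input === 'string') {
    // Treat a string as a file path and parse its contents as JSON.
    return JSON.parse(fs.readFileSync(input, 'utf8'));
  }
  // Anything else is assumed to already be a config object.
  return input;
}
```

Because the file is parsed with `JSON.parse`, it has to be strict JSON: keys and strings need double quotes, unlike the JavaScript object literal shown earlier.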
Command Line
You can also use claw from the command line. First, install it globally:
npm install -g claw
Then run it in the same folder as your config file:
claw sample1.json
This will create a folder called sample1, with your results.
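The folder name appears to come from the config file name with its extension removed. A minimal sketch of that derivation, assuming this is the rule (`outputFolderFor` is a hypothetical helper, not something Claw exports):

```javascript
const path = require('path');

// Hypothetical helper: derives an output folder name from a config file
// name by dropping the directory part and the extension.
function outputFolderFor(configFile) {
  return path.basename(configFile, path.extname(configFile));
}

console.log(outputFolderFor('sample1.json')); // prints "sample1"
```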
External Page List
Claw can read its page list from a JSON file that contains a list of URLs (or objects with an href property). Instead of an array, set "pages" to the file's path and name.
claw_config.pages = "pages.json";
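To make the two accepted shapes concrete, here is a rough sketch of how a parsed page list could be reduced to plain URL strings. It assumes the file holds an array whose entries are either URL strings or objects with an `href` property; `normalizePages` is a hypothetical helper, not part of Claw.

```javascript
// Hypothetical helper, not part of Claw's API: turns a parsed page list
// into a plain array of URL strings.
function normalizePages(list) {
  return list.map((entry) =>
    // Entries are either URL strings or objects carrying an href property.
    typeof entry === 'string' ? entry : entry.href
  );
}

// Illustrative contents of a pages.json file after JSON.parse.
const parsedPages = [
  'http://www.bing.com/search?q=hello',
  { href: 'http://www.bing.com/search?q=goodbye' }
];
console.log(normalizePages(parsedPages));
```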
Feedback
Questions? Ideas? Hit me up on Twitter: @dylanized
