SkillAgentSearch skills...

Spindel

A web crawler/spider

Install / Use

/learn @joakimbeng/Spindel
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

spindel

Build status NPM version XO code style

A web crawler/spider

"spindel" is the Swedish word for spider.

Installation

Install spindel using npm:

npm install --save spindel

Usage

Module usage

Start with single url

const spindel = require('spindel');

// Start a crawler at http://example.com:
const stream = spindel('http://example.com');

stream.on('data', res => {
	// see response object format below
});

Start with multiple urls

// Start a crawler with an initial queue consisting of two urls:
const stream = spindel([
	'http://example.com',
	'http://another.com'
]);

stream.on('data', res => {
	// see response object format below
});

Use a database as url queue

// Start a crawler with a custom queue:
const redisQueue = {
	popUrl() {
		return getNextUrlFromRedisAndReturnAPromise();
	},
	pushUrl(url) {
		return pushUrlToRedisAndReturnAPromise(url);
	}
};
const stream = spindel(redisQueue);

stream.on('data', res => {
	// see response object format below
});

API

spindel(urlsOrQueue, options)

| Name | Type | Description | |------|------|-------------| | urlsOrQueue | String, Array or Object | A single url, an array of urls or a queue implementation | | options | Object | The options object |

Returns: stream.Readable which emits response objects on the 'data' event.

Options

options.transformHtml

Type: Function
Default: noop

Params:

| Name | Type | Description | |------|------|-------------| | body | String | The response body | | url | String | The url for the page being crawled | res | Object | The full response object |

Return value: Any or Promise<Any>.

For responses containing HTML (i.e. having a content-type which begins with text/ and ends with html) this function will be run and its return value will be set to transformedHtml in the response object.

options.gotOptions

Type: Object
Default: {}

Options passed to got.

Streamed response objects

A response object has the format:

{
	url: String, // the crawled url
	statusCode: Number, // the HTTP status code
	statusMessage: String, // the HTTP status message
	body: String, // the response body
	headers: Object, // the HTTP response headers
	hrefs: Array(String), // found <a href /> urls in the body if content is HTML
	transformedHtml: String // if content is HTML this contains the `body` after applying the `transformHtml` option function
}

Queue implementation

A queue implementation consists of two functions popUrl and pushUrl.

queue.popUrl

Type: function

Params:

| Name | Type | Description | |------|------|-------------| | lastUrl | String | The last crawled url, or null for the first url |

Should return: String or Promise<String> to continue crawling or null or Promise<null> to stop crawling.

queue.pushUrl

Type: function

Params:

| Name | Type | Description | |------|------|-------------| | href | String | A found href in the currently crawled response body | | referral | String | The url for the current crawl |

Should return: nothing or Promise.

Example of the internal ArrayQueue
function arrayQueue(initialUrls) {
	const urls = initialUrls.slice();

	return {
		pushUrl(url) {
			urls.push(url);
		},
		popUrl() {
			return urls.pop();
		}
	};
}

The queue implementation above is used if spindel's urlsOrQueue parameter is a String or Array.

License

MIT © Joakim Carlstein

View on GitHub
GitHub Stars4
CategoryDevelopment
Updated6y ago
Forks0

Languages

JavaScript

Security Score

55/100

Audited on Mar 1, 2020

No findings