WebCollector
WebCollector is an open-source web crawler framework based on Java. It provides simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
HomePage
https://github.com/CrawlScript/WebCollector
Installation
Using Maven
<dependency>
<groupId>cn.edu.hfut.dmic.webcollector</groupId>
<artifactId>WebCollector</artifactId>
<version>2.73-alpha</version>
</dependency>
Without Maven
WebCollector jars are available on the HomePage.
- webcollector-version-bin.zip contains core jars.
Example Index
Annotation-style versions are named DemoAnnotatedxxxxxx.java.
Basic
- DemoAutoNewsCrawler.java | DemoAnnotatedAutoNewsCrawler.java
- DemoManualNewsCrawler.java | DemoAnnotatedManualNewsCrawler.java
- DemoExceptionCrawler.java
CrawlDatum and MetaData
- DemoMetaCrawler.java
- DemoAnnotatedMatchTypeCrawler.java
- DemoAnnotatedDepthCrawler.java
- DemoBingCrawler.java | DemoAnnotatedBingCrawler.java
Http Request and Javascript
- DemoCookieCrawler.java
- DemoRedirectCrawler.java | DemoAnnotatedRedirectCrawler.java
- DemoPostCrawler.java
- DemoRandomProxyCrawler.java
- AbuyunDynamicProxyRequester.java
- DemoSeleniumCrawler.java
NextFilter
Quickstart
Let's crawl some news from the GitHub blog. This demo prints the titles and contents extracted from its news pages.
Automatically Detecting URLs
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;
/**
 * Crawling news from the GitHub blog
 *
 * @author hu
 */
public class DemoAutoNewsCrawler extends BreadthCrawler {

    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically extract
     *                  links which match the regex rules from each page
     */
    public DemoAutoNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*start pages*/
        this.addSeed("https://blog.github.com/");
        for (int pageIndex = 2; pageIndex <= 5; pageIndex++) {
            String seedUrl = String.format("https://blog.github.com/page/%d/", pageIndex);
            this.addSeed(seedUrl);
        }

        /*fetch urls like "https://blog.github.com/2018-07-13-graphql-for-octokit/" */
        this.addRegex("https://blog.github.com/[0-9]{4}-[0-9]{2}-[0-9]{2}-[^/]+/");
        /*do not fetch jpg|png|gif*/
        //this.addRegex("-.*\\.(jpg|png|gif).*");
        /*do not fetch urls containing #*/
        //this.addRegex("-.*#.*");

        setThreads(50);
        getConf().setTopN(100);

        //enable resumable mode
        //setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        /*if the page is a news page*/
        if (page.matchUrl("https://blog.github.com/[0-9]{4}-[0-9]{2}-[0-9]{2}-[^/]+/")) {
            /*extract the title and content of the news by css selector*/
            String title = page.select("h1[class=lh-condensed]").first().text();
            String content = page.selectText("div.content.markdown-body");

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);

            /*If you want to add urls to crawl, add them to next*/
            /*WebCollector automatically filters links that have been fetched before*/
            /*If autoParse is true and a link added to next does not match the
              regex rules, the link will also be filtered.*/
            //next.add("http://xxxxxx.com");
        }
    }

    public static void main(String[] args) throws Exception {
        DemoAutoNewsCrawler crawler = new DemoAutoNewsCrawler("crawl", true);
        /*start crawl with depth of 4*/
        crawler.start(4);
    }
}
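The pattern passed to addRegex above is a plain Java regular expression. A quick standalone check (using java.util.regex with full-string matching; WebCollector's internal matching may differ slightly) shows which of the demo's URLs the rule would keep:

```java
import java.util.regex.Pattern;

public class RegexCheck {
    public static void main(String[] args) {
        // the same pattern passed to addRegex in the demo above
        Pattern news = Pattern.compile(
                "https://blog.github.com/[0-9]{4}-[0-9]{2}-[0-9]{2}-[^/]+/");

        // a dated news page matches the rule
        System.out.println(news.matcher(
                "https://blog.github.com/2018-07-13-graphql-for-octokit/").matches()); // true

        // a list page does not match; it is only fetched because it was added as a seed
        System.out.println(news.matcher(
                "https://blog.github.com/page/2/").matches()); // false
    }
}
```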
Manually Detecting URLs
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;
/**
 * Crawling news from the GitHub blog
 *
 * @author hu
 */
public class DemoManualNewsCrawler extends BreadthCrawler {

    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically extract
     *                  links which match the regex rules from each page
     */
    public DemoManualNewsCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        // add 5 start pages and set their type to "list"
        // "list" is not a reserved word; you can use another string instead
        this.addSeedAndReturn("https://blog.github.com/").type("list");
        for (int pageIndex = 2; pageIndex <= 5; pageIndex++) {
            String seedUrl = String.format("https://blog.github.com/page/%d/", pageIndex);
            this.addSeed(seedUrl, "list");
        }

        setThreads(50);
        getConf().setTopN(100);

        //enable resumable mode
        //setResumable(true);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        if (page.matchType("list")) {
            /*if type is "list",
              detect content pages by css selector and mark their type as "content"*/
            next.add(page.links("h1.lh-condensed>a")).type("content");
        } else if (page.matchType("content")) {
            /*if type is "content",
              extract the title and content of the news by css selector*/
            String title = page.select("h1[class=lh-condensed]").first().text();
            String content = page.selectText("div.content.markdown-body");

            //read title_prefix and content_length_limit from the configuration
            title = getConf().getString("title_prefix") + title;
            content = content.substring(0, getConf().getInteger("content_length_limit"));

            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);
        }
    }

    public static void main(String[] args) throws Exception {
        DemoManualNewsCrawler crawler = new DemoManualNewsCrawler("crawl", false);
        crawler.getConf().setExecuteInterval(5000);
        crawler.getConf().set("title_prefix", "PREFIX_");
        crawler.getConf().set("content_length_limit", 20);
        /*start crawl with depth of 4*/
        crawler.start(4);
    }
}
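One detail worth noting in the demo above: content.substring(0, limit) throws StringIndexOutOfBoundsException whenever the extracted content is shorter than content_length_limit. A defensive variant, shown as a standalone plain-Java sketch (not part of WebCollector):

```java
public class SafeTruncate {
    // truncate content to at most limit characters without risking
    // StringIndexOutOfBoundsException on pages shorter than the limit
    static String truncate(String content, int limit) {
        return content.substring(0, Math.min(content.length(), limit));
    }

    public static void main(String[] args) {
        System.out.println(truncate("short", 20));                 // prints "short"
        System.out.println(truncate("a much longer content string", 20)); // first 20 chars
    }
}
```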
CrawlDatum
CrawlDatum is an important data structure in WebCollector, which corresponds to the URL of a webpage. Both crawled URLs and detected URLs are maintained as CrawlDatums.
There are some differences between a CrawlDatum and a plain URL:
- A CrawlDatum contains a key and a url. The key is the url by default. You can set the key manually with CrawlDatum.key("xxxxx") so that CrawlDatums with the same url may have different keys. This is very useful in tasks such as crawling data through an API, which often requests different data from the same url with different POST parameters.
- A CrawlDatum may contain metadata, which maintains extra information besides the url.
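The key-defaulting rule above can be sketched as a minimal standalone class. This is illustrative only, not WebCollector's actual CrawlDatum implementation, and the URL is a placeholder:

```java
import java.util.HashMap;
import java.util.Map;

public class KeyedDatum {
    private final String url;
    private String key;
    private final Map<String, String> meta = new HashMap<>();

    public KeyedDatum(String url) {
        this.url = url;
        this.key = url; // the key defaults to the url
    }

    // fluent setters, analogous to CrawlDatum.key(...) and metadata
    public KeyedDatum key(String key) { this.key = key; return this; }
    public KeyedDatum meta(String k, String v) { meta.put(k, v); return this; }

    public String url() { return url; }
    public String key() { return key; }

    public static void main(String[] args) {
        // two datums share one url but carry different keys,
        // e.g. the same API endpoint queried with different POST parameters
        KeyedDatum page1 = new KeyedDatum("http://xxxxxx.com/api").key("api_page_1");
        KeyedDatum page2 = new KeyedDatum("http://xxxxxx.com/api").key("api_page_2");
        System.out.println(page1.url().equals(page2.url())); // true
        System.out.println(page1.key().equals(page2.key())); // false
    }
}
```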
Manually Detecting URLs
In both void visit(Page page, CrawlDatums next) and void execute(Page page, CrawlDatums next), the second parameter CrawlDatums next is a container into which you should put the detected URLs:
//add one detected URL
next.add("http://xxxxxx.com");