70 skills found · Page 1 of 3
apache / Nutch: Apache Nutch is an extensible and scalable web crawler
YahooArchive / Anthelion: Anthelion is a plugin for Apache Nutch to crawl semantic annotations within HTML pages.
USCDataScience / Sparkler: Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
CrawlScript / Nutcher: Nutcher is Chinese-language documentation for Nutch, covering Nutch configuration and source code analysis; it is continuously updated.
nasa-jpl-memex / Memex Explorer: Viewers for statistics and dashboarding of Domain Search Engine data
xautlx / Nutch Htmlunit: A plugin built on Apache Nutch and Htmlunit that extends the crawler to fetch and parse AJAX pages
daijiale / OCR FontsSearchEngine: An OCR search engine built with Tesseract, Nutch, Solr, and PHP
xautlx / Nutch Ajax: Apache Nutch Plugins for AJAX page fetch, parse, index
larroy / Mycelium: An open source information retrieval system written in C++11 and Python. Aspires to be an alternative to Nutch / Lucene. It uses MongoDB as a storage engine.
BayanGroup / Nutch Custom Search: No description available
arquivo / Pwa Technologies: The main goal of Arquivo.pt is to preserve and provide access to web content that is no longer available online. While developing the PWA IR (information retrieval) system, we faced limitations in search speed, quality of results, scalability, and usability. To cope with this, we modified the archive-access project (http://archive-access.sourceforge.net/) to support our web archive IR requirements. The code of Nutchwax, Nutch, and Wayback was adapted to meet these requirements. Several optimizations were added, such as simplifying how document versions are searched, and several bottlenecks were resolved. The PWA search engine is a public service at http://archive.pt and a research platform for web archiving. Like its predecessor Nutch, it runs on Hadoop clusters for distributed computing, following the map-reduce paradigm. Its major features include fast full-text search, URL search, phrase search, faceted search (date, format, site), and sorting by relevance and date. The PWA search engine is highly scalable, and its architecture is flexible enough to allow different configurations to be deployed for different needs. Currently, it serves an archive collection of 180 million documents, searchable by full text and dating from 1996 to 2010.
chrismattmann / Nutch Python: Nutch-Python is a Python binding to the Apache Nutch™ REST services allowing Nutch to be called natively in the Python community.
ATLANTBH / Nutch Plugins: Apache Nutch extensions
LQZYC / Nutch NewsClassify: A news classification system based on Nutch
ContinuumIO / Nutchpy: For interacting with nutch via Python
momer / Nutch Selenium: No description available
eleflow / Nutch Aws: No description available
yxjay / SearchEngine Base On ElasticSearch Nutch SSM: A simple search engine based on Nutch, ElasticSearch, MySQL, and SSM
ly16 / GooglePlay Web Crawler: MapReduce project using Hadoop, Nutch, AWS EMR, Pig, Tez, Hive
WING-NUS / Kairos: Kairos combines a focused crawler and an information extraction engine to convert a list of conference websites into an index filled with metadata fields that correspond to individual papers. Using event date metadata extracted from the conference website, Kairos proactively harvests metadata about the individual papers soon after they are made public. We use a Maximum Entropy classifier to classify uniform resource locators (URLs) as scientific conference websites and use Conditional Random Fields (CRF) to extract individual paper metadata from such websites. The crawler is built on top of the popular open-source crawler Nutch.