Results for "nutch"

70 skills found · Page 1 of 3

apache / Nutch

3.1k

Apache Nutch is an extensible and scalable web crawler

universal

apache, crawling, hadoop, +3

Updated 17h ago

YahooArchive / Anthelion

2.8k

Anthelion is a plugin for Apache Nutch to crawl semantic annotations within HTML pages.

universal

Updated 10d ago

USCDataScience / Sparkler

420

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

universal

big-data, distributed-systems, information-retrieval, +7

Updated 1mo ago

CrawlScript / Nutcher

130

Nutcher is Chinese-language documentation for Nutch, covering Nutch configuration and source-code analysis. It is updated continuously.

universal

Updated 1y ago

nasa-jpl-memex / Memex Explorer

127

Viewers for statistics and dashboarding of Domain Search Engine data

universal

ache, anaconda, apache, +7

Updated 11d ago

xautlx / Nutch Htmlunit

125

A plugin for crawling and parsing AJAX pages, implemented as an extension of Apache Nutch and Htmlunit

universal

Updated 9d ago

daijiale / OCR FontsSearchEngine

109

An OCR Search Engine With Tesseract, Nutch, Solr, and PHP

universal

font, mac-tesseract, nutch, +4

Updated 1mo ago

xautlx / Nutch Ajax

Apache Nutch Plugins for AJAX page fetch, parse, index

universal

Updated 1mo ago

larroy / Mycelium

An open source information retrieval system written in C++11 and Python. Aspires to be an alternative to Nutch / Lucene. It uses MongoDB as a storage engine.

universal

Updated 2y ago

BayanGroup / Nutch Custom Search

No description available

universal

Updated 1y ago

arquivo / Pwa Technologies

The main goal of Arquivo.pt is to preserve and provide access to web content that is no longer available online. While developing the PWA IR (information retrieval) system, we faced limitations in search speed, result quality, scalability, and usability. To address these, we modified the archive-access project (http://archive-access.sourceforge.net/) to support our web archive IR requirements, adapting code from Nutchwax, Nutch, and Wayback. We added several optimizations, such as simplifying how document versions are searched, and resolved several bottlenecks. The PWA search engine is a public service at http://archive.pt and a research platform for web archiving. Like its predecessor Nutch, it runs on Hadoop clusters for distributed computing, following the map-reduce paradigm. Its major features include fast full-text search, URL search, phrase search, faceted search (date, format, site), and sorting by relevance and date. The PWA search engine is highly scalable, and its architecture is flexible enough to support different deployment configurations for different needs. Currently, it serves a full-text-searchable archive collection of 180 million documents dating from 1996 to 2010.

universal

chrismattmann / Nutch Python

Nutch-Python is a Python binding to the Apache Nutch™ REST services allowing Nutch to be called natively in the Python community.

universal

Updated 1y ago

ATLANTBH / Nutch Plugins

Apache Nutch extensions

universal

Updated 4mo ago

LQZYC / Nutch NewsClassify

A news classification system based on Nutch

universal

Updated 7mo ago

ContinuumIO / Nutchpy

For interacting with nutch via Python

universal

Updated 5mo ago

momer / Nutch Selenium

No description available

universal

Updated 5y ago

eleflow / Nutch Aws

No description available

universal

Updated 1y ago

yxjay / SearchEngine Base On ElasticSearch Nutch SSM

A simple search engine based on Nutch + ElasticSearch + MySQL + SSM

universal

Updated 1y ago

ly16 / GooglePlay Web Crawler

MapReduce project using Hadoop, Nutch, AWS EMR, Pig, Tez, and Hive

universal

aws, emr, hadoop, +6

Updated 2y ago

WING-NUS / Kairos

Kairos combines a focused crawler and an information extraction engine to convert a list of conference websites into an index filled with metadata fields that correspond to individual papers. Using event date metadata extracted from the conference website, Kairos proactively harvests metadata about the individual papers soon after they are made public. We use a Maximum Entropy classifier to classify uniform resource locators (URLs) as scientific conference websites and use Conditional Random Fields (CRF) to extract individual paper metadata from such websites. The crawler is built on top of the popular open-source crawler Nutch.

universal

Updated 3y ago