getmaxun / Maxun: 🔥 The open-source no-code platform for web scraping, crawling, search, and AI data extraction • Turn websites into structured APIs in minutes 🔥
waditu / Tushare: TuShare is a utility for crawling historical data of Chinese stocks.
oxylabs / AI Crawler Py: Crawl a website starting from a URL, find relevant pages, and extract data – all guided by your natural language prompt.
oxylabs / Oxylabs AI Studio Py: Structured data gathering from any website using an AI-powered scraper, crawler, and browser automation. Scraping and crawling with natural language prompts. Equip your LLM agents with fresh data. AI Studio Python SDK for intelligent web data gathering.
fugary / Calibre Douban: A new Douban metadata source plugin for Calibre. Douban no longer provides book APIs to the public, so this plugin obtains its data by web crawling.
facebookresearch / Cc Net: Tools to download and clean up Common Crawl data.
arkadiyt / Bounty Targets: This project crawls bug bounty platform scopes (HackerOne, Bugcrowd, Intigriti, etc.) hourly and dumps them into the bounty-targets-data repo.
PhialsBasement / LibreCrawl: Free desktop SEO crawler - open source alternative to Screaming Frog and similar tools. Crawl websites, analyze links, extract SEO data, and export results without subscription fees. Fully customizable and extensible!
blackfireio / Player: Blackfire Player is a powerful web crawling, web testing, and web scraping application. It provides a nice DSL to crawl HTTP services, assert responses, and extract data from HTML/XML/JSON responses.
shaohua0116 / ICLR2020 OpenReviewData: Script that crawls metadata from the ICLR OpenReview webpage. Tutorials on installing and using Selenium and ChromeDriver on Ubuntu.
commoncrawl / Cc Pyspark: Process Common Crawl data with Python and Spark.
Florents-Tselai / WarcDB: Web crawl data as SQLite databases.
shaohua0116 / ICLR2019 OpenReviewData: Script that crawls metadata from the ICLR OpenReview webpage. Tutorials on installing and using Selenium and ChromeDriver on Ubuntu.
cloudfour / Lighthouse Parade: A Node.js command line tool that crawls a domain and gathers Lighthouse performance data for every page.
opensemanticsearch / Open Semantic Etl: Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index and a linked data graph database.
techenthusiast167 / WebRecon: WebRecon is an advanced Open Source Intelligence (OSINT) web reconnaissance tool designed for cybersecurity professionals, penetration testers, and security researchers. It automates the process of gathering intelligence from target websites through comprehensive crawling, data extraction, and analysis.
0xMassi / Webclaw: Fast, local-first web content extraction for LLMs. Scrape, crawl, and extract structured data, all from Rust. CLI, REST API, and MCP server.
duckduckgo / Tracker Radar Detector: Code used to build a Tracker Radar data set from raw crawl data.
vectara / Vectara Ingest: An open source framework to crawl data sources and ingest them into Vectara.
karust / Gogetcrawl: Extract web archive data using Wayback Machine and Common Crawl.