Webscraper
A multi-threaded web image scraper written in C that extracts and downloads all images from a webpage. Uses libcurl for HTTP requests, libxml2 for HTML parsing, and POSIX threads for parallel downloads.
Web Image Scraper
A fast, multi-threaded utility to extract and download all images from a webpage.
Features
- Extracts all image URLs from a target webpage
- Resolves relative URLs to absolute URLs
- Avoids duplicate downloads with O(1) lookup
- Uses multiple threads for parallel downloading
- Preserves original file extensions when possible
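The duplicate check listed above could be done with a small hash set of URLs already seen. The following is a minimal sketch, not the scraper's actual code: the names `UrlSet`, `url_set_add`, `url_set_free`, and the bucket count are hypothetical, and FNV-1a is just one reasonable choice of string hash. Lookup is O(1) on average as long as the buckets stay short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define URL_SET_BUCKETS 1024 /* hypothetical fixed bucket count */

/* One entry in a bucket's singly linked chain. */
typedef struct UrlNode {
    char *url;
    struct UrlNode *next;
} UrlNode;

/* Hypothetical set of seen URLs; initialize with `UrlSet set = {0};`. */
typedef struct {
    UrlNode *buckets[URL_SET_BUCKETS];
} UrlSet;

/* 32-bit FNV-1a hash of a NUL-terminated string. */
uint32_t url_hash(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Returns 1 if the URL was newly added, 0 if it was already present,
 * and -1 if memory allocation failed. */
int url_set_add(UrlSet *set, const char *url)
{
    assert(set != NULL && url != NULL);
    uint32_t idx = url_hash(url) % URL_SET_BUCKETS;

    /* Walk the chain for this bucket looking for an exact match. */
    for (UrlNode *n = set->buckets[idx]; n != NULL; n = n->next) {
        if (strcmp(n->url, url) == 0)
            return 0;
    }

    UrlNode *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    size_t len = strlen(url) + 1;
    node->url = malloc(len);
    if (node->url == NULL) {
        free(node);
        return -1;
    }
    memcpy(node->url, url, len);

    /* Push the new entry onto the front of the chain. */
    node->next = set->buckets[idx];
    set->buckets[idx] = node;
    return 1;
}

/* Frees every stored URL and leaves the set empty. */
void url_set_free(UrlSet *set)
{
    assert(set != NULL);
    for (size_t i = 0; i < URL_SET_BUCKETS; i++) {
        UrlNode *n = set->buckets[i];
        while (n != NULL) {
            UrlNode *next = n->next;
            free(n->url);
            free(n);
            n = next;
        }
        set->buckets[i] = NULL;
    }
}
```

A caller would queue a download only when `url_set_add` returns 1. If several threads add URLs at once, the set would also need a mutex around `url_set_add`, which this sketch leaves out.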
Requirements
- libcurl (HTTP requests)
- libxml2 (HTML parsing)
- POSIX threads
Installation
Ubuntu/Debian
sudo apt-get install libcurl4-openssl-dev libxml2-dev
Fedora/RHEL/CentOS
sudo dnf install libcurl-devel libxml2-devel
macOS (with Homebrew)
brew install curl libxml2
Compilation
gcc -o webscraper webscraper.c $(curl-config --cflags --libs) $(xml2-config --cflags --libs) -pthread
Usage
./webscraper <url>
Example:
./webscraper https://example.com
Downloaded images will be saved in the downloaded_images directory.
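As an illustration of how an output path inside downloaded_images might be built while keeping the original extension, here is a minimal sketch. It assumes the URL has already been resolved to an absolute one. The function names `url_extension` and `build_output_path`, the `image_NNN` naming scheme, and the `.jpg` fallback are all assumptions made for this example and may differ from what webscraper.c does.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Copies the file extension of an absolute URL (including the dot) into ext.
 * Query strings and fragments are ignored. Returns 1 if an extension was
 * found, or 0 if the hypothetical fallback ".jpg" was used instead. */
int url_extension(const char *url, char *ext, size_t ext_size)
{
    assert(url != NULL && ext != NULL && ext_size > 0);

    /* Skip "scheme://" so dots in the host name are not mistaken
     * for an extension. */
    const char *p = strstr(url, "://");
    p = (p != NULL) ? p + 3 : url;

    const char *path = strchr(p, '/'); /* start of the path, after the host */
    if (path != NULL) {
        const char *end = path + strcspn(path, "?#");
        const char *dot = NULL;

        /* Scan the last path segment backwards for a dot. */
        for (const char *c = end; c > path; ) {
            c--;
            if (*c == '/')
                break;
            if (*c == '.') {
                dot = c;
                break;
            }
        }

        if (dot != NULL) {
            size_t len = (size_t)(end - dot) - 1; /* characters after the dot */
            int ok = (len >= 1 && len <= 5 && len + 2 <= ext_size);
            for (size_t i = 1; ok && i <= len; i++) {
                if (!isalnum((unsigned char)dot[i]))
                    ok = 0;
            }
            if (ok) {
                memcpy(ext, dot, len + 1);
                ext[len + 1] = '\0';
                return 1;
            }
        }
    }

    snprintf(ext, ext_size, "%s", ".jpg"); /* assumed fallback */
    return 0;
}

/* Writes "<dir>/image_<index><ext>" into out. Returns 1 if the whole
 * path fit in the buffer, 0 if it was truncated or formatting failed. */
int build_output_path(const char *dir, size_t index, const char *url,
                      char *out, size_t out_size)
{
    char ext[8];
    url_extension(url, ext, sizeof ext);
    int n = snprintf(out, out_size, "%s/image_%03zu%s", dir, index, ext);
    return n >= 0 && (size_t)n < out_size;
}
```

With these assumptions, `build_output_path("downloaded_images", 7, "https://example.com/a/b.gif#x", buf, sizeof buf)` would produce `downloaded_images/image_007.gif`.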
License
MIT
