Scrapper

Scrapper is a web scraper tool designed to download web pages and extract articles in a structured format. The application combines functionality from several open-source projects to provide an effective solution for web content extraction.

Quick start

Start a Scrapper instance with:

docker run -d -p 3000:3000 --name scrapper amerkurev/scrapper:latest

Scrapper will be available at http://localhost:3000/. For more details, see Usage.

Demo

Watch a 30-second demo reel showcasing the web interface of Scrapper.

https://user-images.githubusercontent.com/28217522/225941167-633576fa-c9e2-4c63-b1fd-879be2d137fa.mp4

Features

Scrapper provides the following features:

Built-in headless browser - Integrates with Playwright to handle JavaScript-heavy websites, cookie consent forms, and other interactive elements.
Read mode parsing - Uses Mozilla's Readability.js library to extract article content similar to browser "Reader View" functionality.
Web interface - Provides a user-friendly interface for debugging queries and experimenting with parameters. Built with the Pico CSS framework with dark theme support.
Simple REST API - Features a straightforward API requiring minimal parameters for integration.
News link extraction - Identifies and extracts links to news articles from website main pages.

Additional capabilities include:

Result caching - Caches parsing results to disk for faster retrieval.
Page screenshots - Captures visual representation of pages as seen by the parser.
Session management - Configurable incognito mode or persistent sessions.
Proxy support - Compatible with HTTP, SOCKS4, and SOCKS5 proxies.
Customization options - Control over HTTP headers, viewport settings, Readability parser parameters, and more.
Docker delivery - Packaged as a Docker image for simple deployment.
Open-source license - Available under the MIT license.

Usage

Getting Scrapper

The Scrapper Docker image includes Playwright and all necessary browser dependencies, resulting in an image size of approximately 2 GB. Ensure sufficient disk space is available, particularly if storing screenshots.

To download the latest version:

docker pull amerkurev/scrapper:latest

Creating directories

Scrapper requires two directories:

user_data: Stores browser session data and caches parsing results
user_scripts: Contains custom JavaScript scripts that can be injected into pages

Scrapper runs under UID 1001 rather than root. Set appropriate permissions on mounted directories:

mkdir -p user_data user_scripts
chown 1001:1001 user_data/ user_scripts/
ls -l

The output should show:

drwxr-xr-x 2 1001 1001 4096 Mar 17 23:23 user_data
drwxr-xr-x 2 1001 1001 4096 Mar 17 23:23 user_scripts
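
The ownership check can also be scripted, for example in a deployment script. The following is a minimal sketch; the helper name owned_by is hypothetical and not part of Scrapper, and the default IDs are the 1001:1001 values used above.

```python
import os


def owned_by(path, uid=1001, gid=1001):
    """Return True if path is owned by the given numeric user and group IDs."""
    info = os.stat(path)
    return info.st_uid == uid and info.st_gid == gid
```

Calling owned_by("user_data") and owned_by("user_scripts") from the directory that contains both folders should return True after the chown step on Linux.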

Important note for macOS users

If you are running Scrapper on macOS, do not change the ownership of the directories to 1001:1001. Create the folders and Scrapper will work with your current user permissions:

mkdir -p user_data user_scripts

Running chown 1001:1001 on macOS will prevent Scrapper from writing to these directories!

Managing Scrapper Cache

The Scrapper cache is stored in the user_data/_res directory. For automated cache management, configure periodic cleanup:

find /path/to/user_data/_res -ctime +7 -delete

This example deletes cache files older than 7 days.
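
Where find is not available, or the cleanup needs to run from an existing Python job, the same idea can be sketched as below. The function name purge_old_cache and its parameters are assumptions made for illustration. It compares each file's status change time (st_ctime, the same timestamp that find -ctime inspects on Unix) against a cutoff and removes only regular files.

```python
import time
from pathlib import Path


def purge_old_cache(cache_dir, max_age_days=7, now=None):
    """Delete regular files under cache_dir whose ctime is older than
    max_age_days. Returns the list of deleted paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    deleted = []
    # Materialize the listing first so deletions do not disturb the walk.
    for path in list(Path(cache_dir).rglob("*")):
        # Directories are left in place; only files are removed.
        if path.is_file() and path.stat().st_ctime < cutoff:
            path.unlink()
            deleted.append(path)
    return deleted
```

For example, purge_old_cache("/path/to/user_data/_res") could be called from a scheduled job. The now argument exists only to make the cutoff easy to test.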

Using Scrapper

After preparing directories, run Scrapper:

docker run -d -p 3000:3000 -v $(pwd)/user_data:/home/pwuser/user_data -v $(pwd)/user_scripts:/home/pwuser/user_scripts --name scrapper amerkurev/scrapper:latest

Access the web interface at http://localhost:3000/

Monitor logs with:

docker logs -f scrapper

Configuration Options

Scrapper can be configured using environment variables. You can set these either directly when running the container or through an environment file passed with --env-file=.env.

| Environment Variable | Description | Default |
| --- | --- | --- |
| HOST | Interface address to bind the server to | 0.0.0.0 |
| PORT | Web interface port number | 3000 |
| LOG_LEVEL | Logging detail level (debug, info, warning, error, critical) | info |
| BASIC_HTPASSWD | Path to the htpasswd file for basic authentication | /.htpasswd |
| BROWSER_TYPE | Browser type to use (chromium, firefox, webkit) | chromium |
| BROWSER_CONTEXT_LIMIT | Maximum number of browser contexts (tabs) | 20 |
| SCREENSHOT_TYPE | Screenshot type (jpeg or png) | jpeg |
| SCREENSHOT_QUALITY | Screenshot quality (0-100) | 80 |
| UVICORN_WORKERS | Number of web server worker processes | 2 |
| DEBUG | Enable debug mode | false |

Example .env file

LOG_LEVEL=error
BROWSER_TYPE=firefox
SCREENSHOT_TYPE=jpeg
SCREENSHOT_QUALITY=90
UVICORN_WORKERS=4
DEBUG=false

To use an environment file with Docker, include it when running the container:

docker run -d --name scrapper --env-file=.env -v $(pwd)/user_data:/home/pwuser/user_data -v $(pwd)/user_scripts:/home/pwuser/user_scripts -p 3000:3000 amerkurev/scrapper:latest
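
A typo in an environment file may only surface after the container has started, so it can help to check the file first. The sketch below is a hypothetical helper, not part of Scrapper. It parses simple KEY=VALUE lines and compares a few variables against the allowed values listed in the table above. How Scrapper itself reacts to invalid values is not covered here.

```python
# Allowed values taken from the configuration table above.
ALLOWED = {
    "LOG_LEVEL": {"debug", "info", "warning", "error", "critical"},
    "BROWSER_TYPE": {"chromium", "firefox", "webkit"},
    "SCREENSHOT_TYPE": {"jpeg", "png"},
}


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def check_env(env):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key in env and env[key] not in allowed:
            problems.append(f"{key}={env[key]!r} is not one of {sorted(allowed)}")
    quality = env.get("SCREENSHOT_QUALITY")
    if quality is not None and not (quality.isdigit() and 0 <= int(quality) <= 100):
        problems.append(f"SCREENSHOT_QUALITY={quality!r} is not an integer in 0-100")
    return problems
```

Reading the file with open(".env").read() and passing the result through parse_env and check_env gives a list of messages to review before starting the container.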

Basic Authentication

Scrapper supports HTTP basic authentication to secure access to the web interface. Follow these steps to enable it:

Create an htpasswd file with bcrypt-hashed passwords:

htpasswd -cbB .htpasswd admin yourpassword

Add additional users with:

htpasswd -bB .htpasswd another_user anotherpassword

Mount the htpasswd file when running Scrapper:

docker run -d --name scrapper \
    -v $(pwd)/user_data:/home/pwuser/user_data \
    -v $(pwd)/user_scripts:/home/pwuser/user_scripts \
    -v $(pwd)/.htpasswd:/.htpasswd \
    -p 3000:3000 \
    amerkurev/scrapper:latest

If you want to use a custom path for the htpasswd file, specify it with the BASIC_HTPASSWD environment variable:

docker run -d --name scrapper \
    -v $(pwd)/user_data:/home/pwuser/user_data \
    -v $(pwd)/user_scripts:/home/pwuser/user_scripts \
    -v $(pwd)/custom/path/.htpasswd:/auth/.htpasswd \
    -e BASIC_HTPASSWD=/auth/.htpasswd \
    -p 3000:3000 \
    amerkurev/scrapper:latest

Authentication will be required for all requests to Scrapper once enabled.
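
Once authentication is enabled, API clients need to send credentials with every request. As a rough sketch using only the Python standard library, the helper below attaches a standard HTTP Basic Authorization header to a request. The function name basic_auth_request is an assumption for illustration. The credentials are whatever was stored in the htpasswd file.

```python
import base64
from urllib.request import Request


def basic_auth_request(url, username, password):
    """Build a urllib Request carrying an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return Request(url, headers={"Authorization": f"Basic {token}"})
```

The returned object can be passed to urllib.request.urlopen in place of a plain URL string.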

HTTPS Support

Scrapper supports HTTPS connections with SSL certificates for secure access to the web interface. Follow these steps to enable it:

Prepare your SSL certificate and key files:

# Example of generating a self-signed certificate (for testing only)
openssl req -x509 -newkey rsa:4096 -nodes -keyout key.pem -out cert.pem -days 365 -subj '/CN=localhost'

Mount the SSL files when running Scrapper:

docker run -d --name scrapper \
    -v $(pwd)/user_data:/home/pwuser/user_data \
    -v $(pwd)/user_scripts:/home/pwuser/user_scripts \
    -v $(pwd)/cert.pem:/.ssl/cert.pem \
    -v $(pwd)/key.pem:/.ssl/key.pem \
    -p 3000:3000 \
    amerkurev/scrapper:latest

When SSL certificates are detected, Scrapper automatically enables HTTPS mode. You can then access the secure interface at https://localhost:3000/.

For production use, always use properly signed certificates from a trusted certificate authority.
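
A Python client connecting to an HTTPS-enabled instance needs a TLS context that trusts the server certificate. The sketch below assumes the self-signed cert.pem from the example above is available to the client. The helper name make_tls_context is hypothetical. Depending on the Python and OpenSSL versions, a certificate without a subjectAltName entry may still fail hostname verification.

```python
import ssl


def make_tls_context(cafile=None):
    """Build a verifying TLS context.

    If cafile is given, only certificates in that file are trusted;
    otherwise the system trust store is used.
    """
    return ssl.create_default_context(cafile=cafile)


# Example (requires a running HTTPS instance and a local cert.pem):
# from urllib.request import urlopen
# urlopen("https://localhost:3000/", context=make_tls_context("cert.pem"))
```

Keeping verification enabled and trusting the specific certificate is generally preferable to disabling certificate checks in the client.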

API Reference

GET /api/article?url=...

The Scrapper API provides a straightforward interface accessible through a single endpoint:

curl -X GET "localhost:3000/api/article?url=https://en.wikipedia.org/wiki/web_scraping"

Use the GET method on the /api/article endpoint with the required url parameter specifying the target webpage. Scrapper will load the page in a browser, extract the article text, and return it in JSON format.

All other parameters are optional with default values. The web interface provides a visual query builder to assist with parameter configuration.
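
The same request can be issued from Python. The sketch below uses only the standard library, and the function names are illustrative. It percent-encodes the target URL, which avoids ambiguity when the page URL contains its own query string. It also sends booleans as lowercase true/false, on the assumption that this matches what the API expects. The response is returned as parsed JSON without assuming any particular field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_article_url(base_url, page_url, params=None):
    """Build the /api/article request URL with properly encoded parameters."""
    query = {"url": page_url}
    for key, value in (params or {}).items():
        # Assumption: booleans are sent as lowercase "true"/"false".
        query[key] = str(value).lower() if isinstance(value, bool) else value
    return f"{base_url.rstrip('/')}/api/article?{urlencode(query)}"


def fetch_article(base_url, page_url, params=None, timeout=60):
    """Call a running Scrapper instance and return the decoded JSON body."""
    request_url = build_article_url(base_url, page_url, params)
    with urlopen(request_url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, fetch_article("http://localhost:3000", "https://en.wikipedia.org/wiki/web_scraping", {"cache": True}) would request the article and allow a cached result to be returned.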

Request Parameters

Scrapper settings

| Parameter | Description | Default |
| :--- | :--- | :--- |
| url | Page URL. The page should contain the text of the article that needs to be extracted. | |
| cache | All scraping results are always saved to disk. This parameter determines whether to retrieve results from cache or execute a new request. When set to true, existing cached results will be returned if available. By default, cache reading is disabled, so each request is processed anew. | false |
| full-content | If this option is set to true, the result will have the full HTML conte
