# Moviestills

A small CLI app to scrape high-quality movie snapshots from various websites.
## Installation

There are various ways to install or use the application:
### Binaries

Download the latest binary for your OS from the releases page. Then simply execute the binary:
```sh
# list all the implemented scrapers
./moviestills --list

# You can also use environment variables
# instead of CLI arguments.

# scrape the blubeaver website with default settings
WEBSITE=blubeaver ./moviestills
```
See Usage to check what settings you can pass through CLI arguments or environment variables.
### Docker images

Docker provides an easy way to run moviestills on most platforms.
GitHub Registry
The "latest" image is built from the master branch on every push. You can see all the other tags (releases) available here.
```sh
docker run \
    --name moviestills \
    --pull=always \
    --volume "${PWD}/cache:/app/cache" \
    --volume "${PWD}/data:/app/data" \
    --rm ghcr.io/kinoute/moviestills:latest \
    --website movie-screencaps \
    --async
```
#### Docker Hub

You can see all the image tags available on the Docker Hub here.
```sh
docker run \
    --name moviestills \
    --pull=always \
    --volume "${PWD}/cache:/app/cache" \
    --volume "${PWD}/data:/app/data" \
    --env WEBSITE=blubeaver \
    --env ASYNC=true \
    --rm hivacruz/moviestills:latest
```
As you can see, you can also use environment variables instead of CLI arguments.
By default, the docker run command above will always pull before running to get the latest image changes for the specified tag. If you don't like this behavior, remove the --pull=always flag from the command.
## Usage

Output of `./moviestills --help`:
```text
Usage: moviestills [--website WEBSITE] [--list] [--parallel PARALLEL] [--delay DELAY] [--async] [--timeout TIMEOUT] [--proxy PROXY] [--cache-dir CACHE-DIR] [--data-dir DATA-DIR] [--hash] [--debug] [--no-colors] [--no-style]

Options:
  --website WEBSITE, -w WEBSITE
                         Website to scrape movie stills on [env: WEBSITE]
  --list, -l             List all available scrapers implemented [default: false, env: LIST]
  --parallel PARALLEL, -p PARALLEL
                         Limit the maximum parallelism [default: 2, env: PARALLEL]
  --delay DELAY, -r DELAY
                         Add some random delay between requests [default: 1s, env: RANDOM_DELAY]
  --async, -a            Enable asynchronous running jobs [default: false, env: ASYNC]
  --timeout TIMEOUT, -t TIMEOUT
                         Set the default request timeout for the scraper [default: 15s, env: TIMEOUT]
  --proxy PROXY, -x PROXY
                         The proxy URL to use for scraping [env: PROXY]
  --cache-dir CACHE-DIR, -c CACHE-DIR
                         Where to cache scraped websites pages [default: cache, env: CACHE_DIR]
  --data-dir DATA-DIR, -f DATA-DIR
                         Where to store movie snapshots [default: data, env: DATA_DIR]
  --hash                 Hash image filenames with md5 [default: false, env: HASH]
  --debug, -d            Set Log Level to Debug to see everything [default: false, env: DEBUG]
  --no-colors            Disable colors from output [default: false, env: NO_COLORS]
  --no-style             Disable styling and colors entirely from output [default: false, env: NO_STYLE]
  --help, -h             display this help and exit
  --version              display version and exit
```
Note: CLI arguments always override environment variables. Therefore, if you set WEBSITE as an environment variable and also use `--website` as a CLI argument, only the latter is passed to the app.

For boolean arguments such as `--async` or `--debug`, the environment-variable equivalent is, for example, ASYNC=false or ASYNC=true.
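For instance, when both are set, the CLI flag wins (website names here are taken from the examples above):

```sh
# WEBSITE is set in the environment, but --website takes precedence:
# moviestills will scrape film-grab, not blubeaver.
WEBSITE=blubeaver ./moviestills --website film-grab
```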
## Examples

The simple goal of this CLI app is to scrape movie snapshots from a few supported websites. You can use either the binary or the Docker image to run moviestills.
### See supported websites

To list the supported websites from which movie snapshots can be downloaded:

```sh
# see all supported websites
./moviestills --list

# or
LIST=true ./moviestills
```
A more detailed list of the supported websites can be found below.
### Scrape a website with asynchronous jobs

Asynchronous jobs make scraping faster. You can also raise the maximum number of simultaneous requests with the `--parallel` flag or the PARALLEL environment variable (set to 2 by default).

```sh
# with CLI arguments
./moviestills --website film-grab --async

# with environment variables
WEBSITE=film-grab ASYNC=true ./moviestills

# increase the maximum number of simultaneous requests to 10
./moviestills --website dvdbeaver --async --parallel 10
```
### Docker, hashed filenames, random delays and no colors

A complete example that scrapes a website with:

- Docker;
- asynchronous jobs;
- random delays between requests;
- hashed filenames;
- no colors in the terminal output.

```sh
# docker with CLI arguments
docker run \
    --name moviestills \
    --pull=always \
    --volume "${PWD}/cache:/app/cache" \
    --volume "${PWD}/data:/app/data" \
    --rm ghcr.io/kinoute/moviestills:latest \
    --website movie-screencaps \
    --async \
    --delay 5s \
    --hash \
    --no-colors
```
With Docker, settings can also be passed as environment variables instead of CLI arguments, using the `--env` or `-e` flag.
## Proxies

You can set a proxy URL for scraping with the `--proxy` CLI argument or the PROXY environment variable. At the moment only one proxy can be set, but the app might support multiple proxies in a round-robin fashion later.
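As a sketch, assuming a proxy listening locally on port 8118 (the URL is only an example):

```sh
# with a CLI argument
./moviestills --website blubeaver --proxy http://127.0.0.1:8118

# or with environment variables
PROXY=http://127.0.0.1:8118 WEBSITE=blubeaver ./moviestills
```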
## Cache

By default, every scraped page is cached in the cache folder. You can change the name or path of this folder with the `--cache-dir` CLI argument or the CACHE_DIR environment variable. This is an important folder, as it stores everything that was scraped.

The cache avoids re-requesting website pages when there is no need to, so we don't flood these websites with thousands of useless requests. It is also handy for resuming a scraping job that was stopped early.

If you run moviestills through our Docker image and set a custom internal cache folder, don't forget to update the volume path accordingly. In practice you should not need to edit the internal cache folder at all: volumes already let you choose where the cache lives on your host machine.
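For illustration, a hypothetical run where the internal cache folder was changed to `/app/mycache` (both the volume target and CACHE_DIR must then point to it; the path is an example, not a recommendation):

```sh
docker run \
    --name moviestills \
    --volume "${PWD}/cache:/app/mycache" \
    --env CACHE_DIR=/app/mycache \
    --env WEBSITE=blubeaver \
    --rm ghcr.io/kinoute/moviestills:latest
```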
## Data

By default, each scraped website has its own subfolder in the data folder. Inside, every movie has its own folder containing the movie snapshots found on the website.

Example:
```text
data                        # where to store movie snapshots
├── blubeaver               # website's name
│   ├── 12 Angry Men        # movie's title
│   │   ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_1.jpg
│   │   ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_2.jpg
│   │   ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_3.jpg
```
You can change the default data folder with the `--data-dir` CLI argument or the DATA_DIR environment variable.

If you run moviestills through our Docker image and change the internal data folder, don't forget to update the volume path accordingly. Again, you should not need to edit the internal data folder's path or name at all: volumes already give you access to these files on the host machine.
## Hash filenames

For consistency, you can use the MD5 hash function to normalize image filenames. All images will then use 32 hexadecimal digits as filenames. To enable hashing, use the `--hash` CLI argument or the HASH=true environment variable.
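To illustrate what such filenames look like, here is a plain `md5sum` of one of the filenames from the Data example above. This only mimics the idea; the exact input the app hashes internally may differ.

```sh
# hashing a filename yields 32 hexadecimal digits
printf '%s' "film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_1.jpg" \
    | md5sum | cut -d' ' -f1
```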
## Supported Websites

As of today, scrapers have been implemented in moviestills for the following websites:
| Website | Simplified Name <sup>1</sup> | Description | Movies <sup>3</sup> |
| ------- | ---------------------------- | ----------- | ------------------- |
| BluBeaver | blubeaver | Extension of DVDBeaver, this time dedicated to Blu-Ray reviews only. Re | |
