# Moviestills

A small CLI app to scrape high-quality movie snapshots from various websites.
## Installation

There are various ways to install or use the application:
### Binaries

Download the latest binary for your OS from the releases page. Then simply execute the binary:
```sh
# list all the implemented scrapers
./moviestills --list

# You can also use environment variables
# instead of CLI arguments.

# scrape the blubeaver website with default settings
WEBSITE=blubeaver ./moviestills
```
See Usage to check what settings you can pass through CLI arguments or environment variables.
### Docker images

Docker provides an easy way to run moviestills on most platforms.
GitHub Registry
The "latest" image is built from the master branch on every push. You can see all the other tags (releases) available here.
```sh
docker run \
    --name moviestills \
    --pull=always \
    --volume "${PWD}/cache:/app/cache" \
    --volume "${PWD}/data:/app/data" \
    --rm ghcr.io/kinoute/moviestills:latest \
    --website movie-screencaps \
    --async
```
#### Docker Hub

You can see all the image tags available on the Docker Hub here.
```sh
docker run \
    --name moviestills \
    --pull=always \
    --volume "${PWD}/cache:/app/cache" \
    --volume "${PWD}/data:/app/data" \
    --env WEBSITE=blubeaver \
    --env ASYNC=true \
    --rm hivacruz/moviestills:latest
```
As you can see, you can also use environment variables instead of CLI arguments.
By default, the docker run command above will always pull before running to get the latest image changes for the specified tag. If you don't like this behavior, remove the --pull=always flag from the command.
## Usage

Output of `./moviestills --help`:
```text
Usage: moviestills [--website WEBSITE] [--list] [--parallel PARALLEL] [--delay DELAY] [--async] [--timeout TIMEOUT] [--proxy PROXY] [--cache-dir CACHE-DIR] [--data-dir DATA-DIR] [--hash] [--debug] [--no-colors] [--no-style]

Options:
  --website WEBSITE, -w WEBSITE
                         Website to scrape movie stills on [env: WEBSITE]
  --list, -l             List all available scrapers implemented [default: false, env: LIST]
  --parallel PARALLEL, -p PARALLEL
                         Limit the maximum parallelism [default: 2, env: PARALLEL]
  --delay DELAY, -r DELAY
                         Add some random delay between requests [default: 1s, env: RANDOM_DELAY]
  --async, -a            Enable asynchronous running jobs [default: false, env: ASYNC]
  --timeout TIMEOUT, -t TIMEOUT
                         Set the default request timeout for the scraper [default: 15s, env: TIMEOUT]
  --proxy PROXY, -x PROXY
                         The proxy URL to use for scraping [env: PROXY]
  --cache-dir CACHE-DIR, -c CACHE-DIR
                         Where to cache scraped websites pages [default: cache, env: CACHE_DIR]
  --data-dir DATA-DIR, -f DATA-DIR
                         Where to store movie snapshots [default: data, env: DATA_DIR]
  --hash                 Hash image filenames with md5 [default: false, env: HASH]
  --debug, -d            Set Log Level to Debug to see everything [default: false, env: DEBUG]
  --no-colors            Disable colors from output [default: false, env: NO_COLORS]
  --no-style             Disable styling and colors entirely from output [default: false, env: NO_STYLE]
  --help, -h             display this help and exit
  --version              display version and exit
```
Note: CLI arguments always override environment variables. Therefore, if you set WEBSITE as an environment variable and also use `--website` as a CLI argument, only the latter is passed to the app.

For boolean arguments such as `--async` or `--debug`, the environment-variable equivalent is, for example, ASYNC=false or ASYNC=true.
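For instance, when both are set, the CLI flag wins (website names here are taken from the examples above):

```sh
# WEBSITE is set in the environment, but --website takes precedence:
# moviestills will scrape film-grab, not blubeaver.
WEBSITE=blubeaver ./moviestills --website film-grab
```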
## Examples

The simple goal of this CLI app is to scrape movie snapshots from a few supported websites. You can use either the binary or the Docker image to run moviestills.
### See supported websites

To list the supported websites from which movie snapshots can be downloaded:

```sh
# see all supported websites
./moviestills --list

# or
LIST=true ./moviestills
```
A more detailed list of the supported websites can be found below.
### Scrape a website with asynchronous jobs

Asynchronous jobs make scraping faster. You can also raise the maximum number of simultaneous requests with the `--parallel` flag or the PARALLEL environment variable (set to 2 by default).

```sh
# with CLI arguments
./moviestills --website film-grab --async

# with environment variables
WEBSITE=film-grab ASYNC=true ./moviestills

# increase the maximum number of simultaneous requests to 10
./moviestills --website dvdbeaver --async --parallel 10
```
### Docker, hashed filenames, random delays and no colors

A complete example that scrapes a website with:

- Docker;
- asynchronous jobs;
- random delays between requests;
- hashed filenames;
- no colors in the terminal output.

```sh
# docker with CLI arguments
docker run \
    --name moviestills \
    --pull=always \
    --volume "${PWD}/cache:/app/cache" \
    --volume "${PWD}/data:/app/data" \
    --rm ghcr.io/kinoute/moviestills:latest \
    --website movie-screencaps \
    --async \
    --delay 5s \
    --hash \
    --no-colors
```
With Docker, settings can also be passed as environment variables instead of CLI arguments, using the `--env` or `-e` flag.
## Proxies

You can set a proxy URL for scraping with the `--proxy` CLI argument or the PROXY environment variable. At the moment only one proxy can be set, but the app might support multiple proxies in a round-robin fashion later.
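As a sketch, assuming a proxy listening locally on port 8118 (the URL is only an example):

```sh
# with a CLI argument
./moviestills --website blubeaver --proxy http://127.0.0.1:8118

# or with environment variables
PROXY=http://127.0.0.1:8118 WEBSITE=blubeaver ./moviestills
```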
## Cache

By default, every scraped page is cached in the cache folder. You can change the name or path of this folder with the `--cache-dir` CLI argument or the CACHE_DIR environment variable. This is an important folder, as it stores everything that was scraped.

The cache avoids re-requesting website pages when there is no need to, so we don't flood these websites with thousands of useless requests. It is also handy for resuming a scraping job that was stopped early.

If you run moviestills through our Docker image and set a custom internal cache folder, don't forget to update the volume path accordingly. In practice you should not need to edit the internal cache folder at all: volumes already let you choose where the cache lives on your host machine.
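For illustration, a hypothetical run where the internal cache folder was changed to `/app/mycache` (both the volume target and CACHE_DIR must then point to it; the path is an example, not a recommendation):

```sh
docker run \
    --name moviestills \
    --volume "${PWD}/cache:/app/mycache" \
    --env CACHE_DIR=/app/mycache \
    --env WEBSITE=blubeaver \
    --rm ghcr.io/kinoute/moviestills:latest
```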
## Data

By default, each scraped website has its own subfolder in the data folder. Inside, every movie has its own folder containing the movie snapshots found on the website.

Example:
```text
data                        # where to store movie snapshots
├── blubeaver               # website's name
│   ├── 12 Angry Men        # movie's title
│   │   ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_1.jpg
│   │   ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_2.jpg
│   │   ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_3.jpg
```
You can change the default data folder with the `--data-dir` CLI argument or the DATA_DIR environment variable.

If you run moviestills through our Docker image and change the internal data folder, don't forget to update the volume path accordingly. Again, you should not need to edit the internal data folder's path or name at all: volumes already give you access to these files on the host machine.
## Hash filenames

For consistency, you can use the MD5 hash function to normalize image filenames. All images will then use 32 hexadecimal digits as filenames. To enable hashing, use the `--hash` CLI argument or the HASH=true environment variable.
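To illustrate what such filenames look like, here is a plain `md5sum` of one of the filenames from the Data example above. This only mimics the idea; the exact input the app hashes internally may differ.

```sh
# hashing a filename yields 32 hexadecimal digits
printf '%s' "film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_1.jpg" \
    | md5sum | cut -d' ' -f1
```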
## Supported Websites

As of today, scrapers have been implemented in moviestills for the following websites:
| Website | Simplified Name <sup>1</sup> | Description | Movies <sup>3</sup> |
| ------- | ---------------------------- | ----------- | ------------------- |
| BluBeaver | blubeaver | Extension of DVDBeaver, this time dedicated to Blu-Ray reviews only. Re | |
