Img2dataset
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
Also supports saving captions for url+caption datasets.
If you believe in making reusable tools to make data easy to use for ML and you would like to contribute, please join the DataToML chat.
Install
pip install img2dataset
For better performance, it's highly recommended to set up a fast dns resolver, see this section
Opt-out directives
Websites can pass the http headers X-Robots-Tag: noai, X-Robots-Tag: noindex, X-Robots-Tag: noimageai and X-Robots-Tag: noimageindex.
By default img2dataset will ignore images with such headers.
To disable this behavior and download all images, you may pass --disallowed_header_directives '[]'
See AI use impact to understand better why you may decide to enable or disable this feature.
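The header check described above can be sketched in a few lines of Python. This is only an illustration, not img2dataset's actual implementation: the helper name `is_disallowed` is hypothetical, and the parsing rule (comma-separated directives, case-insensitive match) is an assumption.

```python
# Hypothetical helper, not part of img2dataset: decide whether a response
# opts out, based on the values of its X-Robots-Tag headers.
DEFAULT_DISALLOWED = ("noai", "noimageai", "noindex", "noimageindex")


def is_disallowed(x_robots_values, disallowed=DEFAULT_DISALLOWED):
    """Return True if any X-Robots-Tag value carries a disallowed directive.

    x_robots_values: list of raw header values, e.g. ["noai, noimageindex"].
    disallowed: directives to honour; an empty list disables the check.
    """
    wanted = {d.lower() for d in disallowed}
    for value in x_robots_values:
        # Assumption: directives inside one header value are comma-separated.
        for directive in value.split(","):
            if directive.strip().lower() in wanted:
                return True
    return False


print(is_disallowed(["noai"]))                 # True
print(is_disallowed(["max-snippet: 10"]))      # False
print(is_disallowed(["noai"], disallowed=[]))  # False, similar to passing '[]'
```

Passing an empty list mirrors the effect of --disallowed_header_directives '[]': no directive is treated as an opt-out, so every image would be kept.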
Examples
Examples of datasets to download, with example commands, are available in the dataset_examples folder. In particular:
- mscoco 600k image/text pairs that can be downloaded in 10min
- sbucaptions 860K image/text pairs can be downloaded in 20 mins.
- cc3m 3M image/text pairs that can be downloaded in one hour
- cc12m 12M image/text pairs that can be downloaded in five hours
- laion400m 400M image/text pairs that can be downloaded in 3.5 days
- laion5B 5B image/text pairs that can be downloaded in 7 days using 10 nodes
- laion-aesthetic Laion aesthetic is a 120M laion5B subset with aesthetic > 7 pwatermark < 0.8 punsafe < 0.5
- laion-art Laion art is an 8M laion5B subset with aesthetic > 8 pwatermark < 0.8 punsafe < 0.5
- laion-coco Laion-COCO is a 600M subset of LAION2B-EN, captioned with an ensemble of BLIP L/14 and 2 CLIP versions (L/14 and RN50x64).
- laion-high-resolution Laion high resolution is a 170M resolution >= 1024x1024 subset of laion5B
- laion-face Laion face is the human face subset of LAION-400M for large-scale face pretraining. It has 50M image-text pairs.
- coyo-700m COYO is a large-scale dataset that contains 747M image-text pairs as well as many other meta-attributes to increase the usability to train various models.
- commonpool CommonPool is a large-scale dataset collected from CommonCrawl containing 12.8B image-text pairs.
- datacomp-1b DataComp-1B is a large-scale dataset with 1.4B image-text pairs filtered from CommonPool.
For all these examples, you may want to tweak the resizing to your preferences. The default is 256x256 with white borders. See options below.
Usage
First, get a list of image urls. For example:
echo 'https://picsum.photos/200/305' >> myimglist.txt
echo 'https://picsum.photos/200/304' >> myimglist.txt
echo 'https://picsum.photos/200/303' >> myimglist.txt
Then, run the tool:
img2dataset --url_list=myimglist.txt --output_folder=output_folder --thread_count=64 --image_size=256
The tool will then automatically download the urls, resize them, and store them in this format:
- output_folder
  - 00000
    - 000000000.jpg
    - 000000001.jpg
    - 000000002.jpg
or in this format if choosing webdataset:
- output_folder
  - 00000.tar containing:
    - 000000000.jpg
    - 000000001.jpg
    - 000000002.jpg
with each number being the position in the list. The subfolders avoid having too many files in a single folder.
If captions are provided, they will be saved as 0.txt, 1.txt, ...
This can then easily be fed into machine learning training or any other use case.
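As a rough illustration of consuming the webdataset layout, the sketch below groups the members of one tar shard by sample key using only the standard library. It assumes that the image, caption and metadata files of a sample share the same base name; the demo shard is built in memory with placeholder bytes and is not real img2dataset output.

```python
import io
import tarfile
from collections import defaultdict


def group_tar_members(tar_bytes):
    """Map each sample key (file name without extension) to its extensions."""
    samples = defaultdict(list)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            key, _, ext = member.name.rpartition(".")
            samples[key].append(ext)
    return dict(samples)


def make_demo_shard():
    """Build a tiny in-memory tar that mimics the layout described above."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [
            ("000000000.jpg", b"placeholder-image-bytes"),
            ("000000000.txt", b"a caption"),
            ("000000000.json", b"{}"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


print(group_tar_members(make_demo_shard()))
# {'000000000': ['jpg', 'txt', 'json']}
```

For a real shard, the bytes could be read from a file such as output_folder/00000.tar before being passed to `group_tar_members`.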
Also .json files named 0.json, 1.json,... are saved with these keys:
- url
- caption
- key of the form 000010005 : the first 5 digits are the shard id, the last 4 are the index in the shard
- status : whether the download succeeded
- error_message
- width
- height
- original_width
- original_height
- exif
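The key layout can be made concrete with a small sketch. The helper names `split_key` and `make_key` are hypothetical. The 4-digit index width is taken from the description of the key above, and the 5-digit shard width is assumed to correspond to the default oom_shard_count of 5.

```python
def split_key(key, index_digits=4):
    """Split a sample key into (shard_id, index_in_shard)."""
    return int(key[:-index_digits]), int(key[-index_digits:])


def make_key(shard_id, index, shard_digits=5, index_digits=4):
    """Build a zero-padded key from a shard id and an index in the shard."""
    return f"{shard_id:0{shard_digits}d}{index:0{index_digits}d}"


print(split_key("000010005"))  # (1, 5)
print(make_key(1, 5))          # 000010005
```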
Also a .parquet file will be saved with the same name as the subfolder/tar files containing these same metadata. It can be used to analyze the results efficiently.
.json files will also be saved with the same name suffixed by _stats; they contain stats collected during downloading (download time, number of successes, ...)
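One way to inspect the per-sample metadata is to tally the `status` field across the .json files of a shard folder. The sketch below uses only the standard library. The helper name `tally_status` is hypothetical, and the status strings in the demo data are placeholders, since the exact values are not listed here.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def tally_status(shard_dir):
    """Count the `status` values found in the per-sample .json files."""
    counts = Counter()
    for path in sorted(Path(shard_dir).glob("*.json")):
        if path.name.endswith("_stats.json"):
            continue  # skip an aggregated stats file, if one is present
        with path.open(encoding="utf-8") as handle:
            meta = json.load(handle)
        counts[meta.get("status")] += 1
    return counts


# Demo on made-up metadata; the status strings are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    samples = [
        {"key": "000000000", "status": "success", "error_message": None},
        {"key": "000000001", "status": "failed", "error_message": "timeout"},
    ]
    for meta in samples:
        Path(tmp, meta["key"] + ".json").write_text(
            json.dumps(meta), encoding="utf-8"
        )
    print(tally_status(tmp))
```

For large runs, the .parquet metadata files are likely the more efficient source for this kind of analysis.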
Python examples
Check out these examples to call this as a lib:
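A minimal sketch of such a call, assuming img2dataset is installed and that myimglist.txt exists as in the Usage section. The wrapper names `build_download_kwargs` and `run_download` are hypothetical; only `download` and its argument names come from this README.

```python
def build_download_kwargs(url_list, output_folder):
    """Collect arguments equivalent to the command-line example in Usage."""
    return dict(
        url_list=url_list,
        output_folder=output_folder,
        thread_count=64,
        image_size=256,
        output_format="files",  # documented default
        input_format="txt",     # documented default
    )


def run_download(url_list="myimglist.txt", output_folder="output_folder"):
    """Start a download through the Python API instead of the CLI."""
    # Imported lazily so this sketch can be loaded without the package.
    from img2dataset import download

    download(**build_download_kwargs(url_list, output_folder))
```

Calling `run_download()` would then start the download and perform network requests, so it is left out of the sketch itself.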
API
This module exposes a single function download which takes the same arguments as the command line tool:
- url_list A file with the list of url of images to download. It can be a folder of such files. (required)
- image_size The size to resize image to (default 256)
- output_folder The path to the output folder. (default "images")
- processes_count The number of processes used for downloading the pictures. This is important to be high for performance. (default 1)
- thread_count The number of threads used for downloading the pictures. This is important to be high for performance. (default 256)
- resize_mode The way to resize pictures, can be no, border, keep_ratio, keep_ratio_largest or center_crop (default border)
  - no doesn't resize at all
  - border will make the image image_size x image_size and add a border
  - keep_ratio will keep the ratio and make the smallest side of the picture image_size
  - keep_ratio_largest will keep the ratio and make the largest side of the picture image_size
  - center_crop will keep the ratio and center crop the largest side so the picture is squared
- resize_only_if_bigger resize pictures only if bigger than the image_size (default False)
- upscale_interpolation kind of upscale interpolation used for resizing (default "lanczos")
- downscale_interpolation kind of downscale interpolation used for resizing (default "area")
- encode_quality encode quality from 0 to 100, when using png it is the compression factor from 0 to 9 (default 95)
- encode_format encode format (default jpg)
  - jpg jpeg format
  - png png format
  - webp webp format
- skip_reencode whether to skip reencoding if no resizing is done (default False)
- output_format decides how to save pictures (default files)
  - files saves as a set of subfolder containing pictures
  - webdataset saves as tars containing pictures
  - parquet saves as parquet containing pictures as bytes
  - tfrecord saves as tfrecord containing pictures as bytes
  - dummy does not save. Useful for benchmarks
- input_format decides how to load the urls (default txt)
  - txt loads the urls as a text file of url, one per line
  - txt.gz loads the urls as a compressed (gzip) txt.gz with a list of url, one per line
  - csv loads the urls and optional caption as a csv
  - csv.gz loads the urls and optional caption, as a compressed (gzip) csv.gz
  - tsv loads the urls and optional caption as a tsv
  - tsv.gz loads the urls and optional caption, as a compressed (gzip) tsv.gz
  - json loads the urls and optional caption as a json
  - json.gz loads the urls and optional caption, as a compressed (gzip) json.gz
  - jsonl loads the urls and optional caption as a jsonl. see jsonlines for more
  - jsonl.gz loads the urls and optional caption, as a compressed (gzip) jsonl.gz. see jsonlines for more
  - parquet loads the urls and optional caption as a parquet
- url_col the name of the url column for parquet and csv (default url)
- caption_col the name of the caption column for parquet and csv (default None)
- bbox_col the name of the bounding box column. Bounding boxes are assumed to have format [x_min, y_min, x_max, y_max], with all elements being floats in [0,1] (relative to the size of the image). If None, then no bounding box blurring is performed (default None)
- number_sample_per_shard the number of samples that will be downloaded in one shard (default 10000)
- extract_exif if true, extract the exif information of the images and save it to the metadata (default True)
- save_additional_columns list of additional columns to take from the csv/parquet files and save in metadata files (default None)
- timeout maximum time (in seconds) to wait when trying to download an image (default 10)
- enable_wandb whether to enable wandb logging (default False)
- wandb_project name of W&B project used (default img2dataset)
- oom_shard_count the order of magnitude of the number of shards, used only to decide what zero padding to use to name the shard files (default 5)
- compute_hash the hash of raw images to compute and store in the metadata, one of None, md5, sha256, sha512 (de
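To make the resize_mode options listed above more concrete, the sketch below computes the output dimensions each mode would be expected to produce from their descriptions. It is not img2dataset's resizing code: the helper name `output_size` is hypothetical, the rounding rule is an assumption, and resize_only_if_bigger is ignored.

```python
def output_size(width, height, image_size=256, resize_mode="border"):
    """Return the expected (width, height) after resizing, per resize_mode."""
    if resize_mode == "no":
        return width, height
    if resize_mode in ("border", "center_crop"):
        # Both modes end with a square image_size x image_size picture.
        return image_size, image_size
    if resize_mode == "keep_ratio":
        scale = image_size / min(width, height)  # smallest side -> image_size
    elif resize_mode == "keep_ratio_largest":
        scale = image_size / max(width, height)  # largest side -> image_size
    else:
        raise ValueError(f"unknown resize_mode: {resize_mode}")
    # Assumption: dimensions are rounded to the nearest pixel.
    return round(width * scale), round(height * scale)


for mode in ("no", "border", "keep_ratio", "keep_ratio_largest", "center_crop"):
    print(mode, output_size(200, 305, image_size=256, resize_mode=mode))
```

For a 200x305 input, keep_ratio scales the 200-pixel side up to 256, while keep_ratio_largest scales the 305-pixel side down to 256.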
