DiffusionDB <a href="https://huggingface.co/datasets/poloclub/diffusiondb"><picture><source media="(prefers-color-scheme: dark)" srcset="https://i.imgur.com/yGxUUlX.png"><img src="favicon.ico" align="right" src="favicon.ico" height="40px"></picture>

<img width="100%" src="https://user-images.githubusercontent.com/15007159/201762588-f24db2b8-dbb2-4a94-947b-7de393fc3d33.gif">

DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 14 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models.

Get Started

DiffusionDB is available at 🤗 Hugging Face Datasets.

Two Subsets

DiffusionDB provides two subsets (DiffusionDB 2M and DiffusionDB Large) to support different needs.

|Subset|Num of Images|Num of Unique Prompts|Size|Image Directory|Metadata Table| |:--|--:|--:|--:|--:|--:| |DiffusionDB 2M|2M|1.5M|1.6TB|images/|metadata.parquet| |DiffusionDB Large|14M|1.8M|6.5TB|diffusiondb-large-part-1/ diffusiondb-large-part-2/|metadata-large.parquet|

Key Differences

Two subsets have a similar number of unique prompts, but DiffusionDB Large has much more images. DiffusionDB Large is a superset of DiffusionDB 2M.
Images in DiffusionDB 2M are stored in png format; images in DiffusionDB Large use a lossless webp format.

Dataset Structure

We use a modularized file structure to distribute DiffusionDB. The 2 million images in DiffusionDB 2M are split into 2,000 folders, where each folder contains 1,000 images and a JSON file that links these 1,000 images to their prompts and hyperparameters. Similarly, the 14 million images in DiffusionDB Large are split into 14,000 folders.

# DiffusionDB 2M
./
├── images
│   ├── part-000001
│   │   ├── 3bfcd9cf-26ea-4303-bbe1-b095853f5360.png
│   │   ├── 5f47c66c-51d4-4f2c-a872-a68518f44adb.png
│   │   ├── 66b428b9-55dc-4907-b116-55aaa887de30.png
│   │   ├── [...]
│   │   └── part-000001.json
│   ├── part-000002
│   ├── part-000003
│   ├── [...]
│   └── part-002000
└── metadata.parquet

# DiffusionDB Large
./
├── diffusiondb-large-part-1
│   ├── part-000001
│   │   ├── 0a8dc864-1616-4961-ac18-3fcdf76d3b08.webp
│   │   ├── 0a25cacb-5d91-4f27-b18a-bd423762f811.webp
│   │   ├── 0a52d584-4211-43a0-99ef-f5640ee2fc8c.webp
│   │   ├── [...]
│   │   └── part-000001.json
│   ├── part-000002
│   ├── part-000003
│   ├── [...]
│   └── part-010000
├── diffusiondb-large-part-2
│   ├── part-010001
│   │   ├── 0a68f671-3776-424c-91b6-c09a0dd6fc2d.webp
│   │   ├── 0a0756e9-1249-4fe2-a21a-12c43656c7a3.webp
│   │   ├── 0aa48f3d-f2d9-40a8-a800-c2c651ebba06.webp
│   │   ├── [...]
│   │   └── part-010001.json
│   ├── part-010002
│   ├── part-010003
│   ├── [...]
│   └── part-014000
└── metadata-large.parquet

These sub-folders have names part-0xxxxx, and each image has a unique name generated by UUID Version 4. The JSON file in a sub-folder has the same name as the sub-folder. Each image is a PNG file (DiffusionDB 2M) or a lossless WebP file (DiffusionDB Large). The JSON file contains key-value pairs mapping image filenames to their prompts and hyperparameters. For example, below is the image of f3501e05-aef7-4225-a9e9-f516527408ac.png and its key-value pair in part-000001.json.

{
  "f3501e05-aef7-4225-a9e9-f516527408ac.png": {
    "p": "geodesic landscape, john chamberlain, christopher balaskas, tadao ando, 4 k, ",
    "se": 38753269,
    "c": 12.0,
    "st": 50,
    "sa": "k_lms"
  },
}

The data fields are:

key: Unique image name
p: Prompt
se: Random seed
c: CFG Scale (guidance scale)
st: Steps
sa: Sampler

Dataset Metadata

To help you easily access prompts and other attributes of images without downloading all the Zip files, we include two metadata tables metadata.parquet and metadata-large.parquet for DiffusionDB 2M and DiffusionDB Large, respectively.

The shape of metadata.parquet is (2000000, 13) and the shape of metatable-large.parquet is (14000000, 13). Two tables share the same schema, and each row represents an image. We store these tables in the Parquet format because Parquet is column-based: you can efficiently query individual columns (e.g., prompts) without reading the entire table.

Below are three random rows from metadata.parquet.

| image_name | prompt | part_id | seed | step | cfg | sampler | width | height | user_name | timestamp | image_nsfw | prompt_nsfw | |:-----------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------:|-----------:|-------:|------:|----------:|--------:|---------:|:-----------------------------------------------------------------|:--------------------------|-------------:|--------------:| | 0c46f719-1679-4c64-9ba9-f181e0eae811.png | a small liquid sculpture, corvette, viscous, reflective, digital art | 1050 | 2026845913 | 50 | 7 | 8 | 512 | 512 | c2f288a2ba9df65c38386ffaaf7749106fed29311835b63d578405db9dbcafdb | 2022-08-11 09:05:00+00:00 | 0.0845108 | 0.00383462 | | a00bdeaa-14eb-4f6c-a303-97732177eae9.png | human sculpture of lanky tall alien on a romantic date at italian restaurant with smiling woman, nice restaurant, photography, bokeh | 905 | 1183522603 | 50 | 10 | 8 | 512 | 768 | df778e253e6d32168eb22279a9776b3cde107cc82da05517dd6d114724918651 | 2022-08-19 17:55:00+00:00 | 0.692934 | 0.109437 | | 6e5024ce-65ed-47f3-b296-edb2813e3c5b.png | portrait of barbaric spanish conquistador, symmetrical, by yoichi hatakenaka, studio ghibli and dan mumford | 286 | 1713292358 | 50 | 7 | 8 | 512 | 640 | 1c2e93cfb1430adbd956be9c690705fe295cbee7d9ac12de1953ce5e76d89906 | 2022-08-12 03:26:00+00:00 | 0.0773138 | 0.0249675 |

Metadata Schema

metadata.parquet and metatable-large.parquet share the same schema.

|Column|Type|Description| |:---|:---|:---| |image_name|string|Image UUID filename.| |prompt|string|The text prompt used to generate this image.| |part_id|uint16|Folder ID of this image.| |seed|uint32| Random seed used to generate this image.| |step|uint16| Step count (hyperparameter).| |cfg|float32| Guidance scale (hyperparameter).| |sampler|uint8| Sampler method (hyperparameter). Mapping: {1: "ddim", 2: "plms", 3: "k_euler", 4: "k_euler_ancestral", 5: "k_heun", 6: "k_dpm_2", 7: "k_dpm_2_ancestral", 8: "k_lms", 9: "others"}. |width|uint16|Image width.| |height|uint16|Image height.| |user_name|string|The unique discord ID's SHA256 hash of the user who generated this image. For example, the hash for xiaohk#3146 is e285b7ef63be99e9107cecd79b280bde602f17e0ca8363cb7a0889b67f0b5ed0. "deleted_account" refer to users who have deleted their accounts. None means the image has been deleted before we scrape it for the second time

Diffusiondb

Install / Use

README