# DataComp: In search of the next generation of multimodal datasets
[ Paper ] [ Website ] [ Blog ]
Welcome to our competition. This repository contains the participant tooling necessary to download data from our pool, train CLIP models, evaluate them on downstream tasks and submit to our leaderboard.
## Overview
DataComp is a competition about designing datasets for pre-training CLIP models. Instead of iterating on model design and hyperparameter tuning like in traditional benchmarks, in DataComp your task is to curate a multimodal pre-training dataset with image-text pairs that yields high accuracy on downstream tasks. Model architecture and hyperparameters are fixed, allowing participants to innovate on the dataset design. As part of the benchmark, we provide a large collection of uncurated image-text pairs, crawled from the public internet.
Our benchmark offers two tracks: one where participants must use only samples from the pools we provide (filtering), and another where participants can use external data, including samples from our pool (Bring your own data, BYOD).
DataComp is structured to accommodate participants with diverse levels of computational resources: each track is broken down into four scales with varying compute requirements.
An overview of our benchmark and participant workflow can be found below. For more information, check out our paper and website.
<p align="center"> <img src="figs/workflow.png" alt="Participant workflow" width="100%"/> </p>

## Installing dependencies
Run:

```
bash create_env.sh
```

To activate the environment:

```
conda activate datacomp
```
If using cloud storage services (e.g. AWS S3), you'll need to install additional dependencies (e.g. `pip install 'cloudpathlib[s3]'`).
## Downloading CommonPool

To download, run the following command, replacing `$scale` with the competition scale (i.e. small, medium, large or xlarge) and `$data_dir` with the output directory where you want the data to be stored.

```
python download_upstream.py --scale $scale --data_dir $data_dir
```
There are four scales in our competition:
- small: 12.8M pool size, 12.8M examples seen
- medium: 128M pool size, 128M examples seen
- large: 1.28B pool size, 1.28B examples seen
- xlarge: 12.8B pool size, 12.8B examples seen
The script will create two directories inside $data_dir: metadata and shards.
Along with the images and captions, this script also downloads metadata, including .parquet files that contain the image urls, captions, and other potentially useful information such as image-caption similarities computed by trained OpenAI CLIP models.
If the flag --download_npz is used, the script will also download the .npz files with features extracted by the trained OpenAI CLIP models for each sample.
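The parquet files can be inspected with standard tools such as pandas. Below is a minimal sketch; the column names it references (`uid`, `clip_b32_similarity_score`) are assumptions about the metadata schema, so check `df.columns` against your download.

```python
# Minimal sketch: load one metadata shard and inspect its schema.
# Column names used below are assumptions; verify them with df.columns.
import glob

import pandas as pd

parquet_files = sorted(glob.glob("path/to/metadata/*.parquet"))
df = pd.read_parquet(parquet_files[0])
print(df.columns.tolist())
print(len(df), "samples in this shard")

# If a CLIP similarity column is present, summarize its distribution.
if "clip_b32_similarity_score" in df.columns:
    print(df["clip_b32_similarity_score"].describe())
```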
We download the image data using img2dataset, which stores it as .tar shards with the images and captions to be consumed by webdataset.
Once the download finishes, the data will be available at $data_dir/shards.
To download only metadata, use the --skip_shards flag.
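Once downloaded, the shards can be consumed directly with webdataset. The sketch below assumes the usual img2dataset key layout (`jpg` for images, `txt` for captions) and a placeholder shard pattern; adjust both to match your download.

```python
# Minimal sketch: stream (image, caption) pairs from downloaded shards.
# The brace-expanded shard range and the "jpg"/"txt" keys are assumptions
# based on img2dataset defaults.
import webdataset as wds

dataset = (
    wds.WebDataset("path/to/shards/{00000000..00000009}.tar")
    .decode("pil")            # decode images to PIL.Image
    .to_tuple("jpg", "txt")   # yield (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, caption[:80])
    break
```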
The disk requirements for each scale are shown below.
| | metadata (parquets) | metadata (npzs) | data (tars) |
| :-------------- | :-----------------: | :-------------: | :---------: |
| small scale | 3 GB | 75 GB | 450 GB |
| medium scale | 30 GB | 750 GB | 4.5 TB |
| large scale | 300 GB | 7.5 TB | 45 TB |
| xlarge scale | 3 TB | 75 TB | 450 TB |
## Downloading DataComp-1B
The script download_upstream.py can be used to download DataComp-1B, the dataset we release as our best-performing subset of the xlarge pool. To download it, run:

```
python download_upstream.py --scale datacomp_1b --data_dir $data_dir
```
The above command will create the same directory structure under $data_dir and can be modified as described above.
## Downloading external data
The script download_upstream.py can also be used to download other image-text datasets, using img2dataset.
Given parquet files containing the image urls and captions, you can use this script to download the images, by using the flag --metadata_dir to point to the directory where the parquet files are stored.
By default, we also download the parquet files corresponding to the pools we provide, and this metadata is stored in a subfolder of $data_dir.
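For example, a hypothetical invocation could look like the command below. Treat it as a sketch rather than the exact interface: the metadata path is a placeholder, and the script may require additional arguments beyond the two flags shown.

```
python download_upstream.py --metadata_dir path/to/external_parquets --data_dir $data_dir
```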
## Optimizing the download
When using img2dataset, there are several ways to optimize the download process such as using multiple nodes in a distributed environment or setting up a DNS resolver to increase the success rate of images being downloaded. See the img2dataset repository for further instructions on how to optimize the download process, as well as information on potential issues during the download.
## Selecting samples in the filtering track
Before training, you will need to select the subset of samples you wish to use. Given a set of chosen samples, we create new shards with only those samples, which the training code then consumes. For each scale, models are trained for a fixed number of steps, regardless of the size of the chosen subset of the provided pool.
Each sample in our pool has a unique identifier, which is present in the metadata parquets, and in the json files inside the .tar shards.
The format describing the subset of samples should be a numpy array of dtype numpy.dtype("u8,u8") (i.e. a structured array of pairs of unsigned 64-bit integers), with shape (subset_size,), containing a list of uids (128-bit hashes from the parquet files) in lexicographic sorted order, saved to disk in either npy format or memory-mapped format.
For instance, if you have a list of uids `uids = ['139e4a9b22a614771f06c700a8ebe150', '6e356964a967af455c8016b75d691203']`, you can store them by running the following Python code:

```python
import numpy as np

uids = ['139e4a9b22a614771f06c700a8ebe150', '6e356964a967af455c8016b75d691203']
out_filename = "subset_file.npy"  # destination for the subset file

# Split each 128-bit hex uid into a pair of unsigned 64-bit integers.
processed_uids = np.array([(int(uid[:16], 16), int(uid[16:32], 16)) for uid in uids], np.dtype("u8,u8"))
processed_uids.sort()
np.save(out_filename, processed_uids)
```
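Before resharding, it can be worth sanity-checking the subset file. A minimal sketch, assuming the subset_file.npy written above:

```python
# Minimal sketch: verify a subset file has the expected dtype and ordering.
import numpy as np

subset = np.load("subset_file.npy", mmap_mode="r")
assert subset.dtype == np.dtype("u8,u8"), "expected a structured (u8,u8) array"
# Sorting a structured array orders by the first field, then the second,
# which matches the lexicographic uid order described above.
assert np.array_equal(subset, np.sort(subset)), "uids must be sorted"
print(len(subset), "uids in subset")
```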
After creating a subset, you may invoke the resharder to build the subset shards in `$output_dir` like so:

```
python resharder.py -i $download_dir -o $output_dir -s $subset_file
```

If desired, the resharder can be run in parallel on multiple nodes. The easiest way to do so is to split the input directory into smaller subfolders with fewer shards and run a separate resharder job for each, writing to separate output directories, as sketched below.
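A minimal single-machine sketch of that pattern, assuming the shards have already been split into hypothetical chunk_0 through chunk_3 subfolders (on a cluster you would instead launch one such job per node):

```python
# Minimal sketch: run one resharder process per pre-split input subfolder.
# The chunk_* layout and all paths are hypothetical; adapt them to your setup.
import subprocess

jobs = [
    subprocess.Popen([
        "python", "resharder.py",
        "-i", f"download_dir/chunk_{i}",
        "-o", f"output_dir/chunk_{i}",
        "-s", "subset_file.npy",
    ])
    for i in range(4)
]
for job in jobs:
    job.wait()
```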
## Baselines
Here we provide the commands for the main filtering baselines found in Table 3 of our paper, along with short descriptions. Each baseline reads the .parquet metadata files (and the .npz files when needed), selects a subset of uids, sorts them, and saves them to a .npy subset file. This file can then be passed to the resharder described above to create a webdataset containing only the selected subset of the pool.

Note: the --num_workers flag controls the number of metadata files that are read into memory and processed in parallel. It defaults to the number of cores, which may be too many for machines with many cores but limited memory. For baselines other than image-based filtering, allow at least 256 MB of memory per worker.
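All baselines share the same output contract, so a custom filter only needs to produce a sorted .npy subset file. Below is a minimal sketch in that spirit; it is not one of the repository's baselines, and the `uid` and `clip_b32_similarity_score` column names are assumptions about the metadata schema.

```python
# Minimal sketch of a custom filter: read metadata parquets, select uids by a
# criterion, and save them in the sorted (u8,u8) format the resharder expects.
# The column names are assumed; verify them against your metadata files.
import glob

import numpy as np
import pandas as pd

selected = []
for path in sorted(glob.glob("path/to/metadata/*.parquet")):
    df = pd.read_parquet(path, columns=["uid", "clip_b32_similarity_score"])
    for uid in df.loc[df["clip_b32_similarity_score"] > 0.25, "uid"]:
        selected.append((int(uid[:16], 16), int(uid[16:32], 16)))

subset = np.array(selected, dtype=np.dtype("u8,u8"))
subset.sort()
np.save("path/to/custom_filter.npy", subset)
```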
### No filtering

Here we load all metadata uids without any additional filtering.

```
python baselines.py --metadata_dir path/to/metadata --save_path path/to/no_filter.npy --name no_filter
```
### Basic filtering

Applies simple checks on caption length, whether English is the detected caption language, image size, and image aspect ratio.

```
python baselines.py --metadata_dir path/to/metadata --save_path path/to/basic_filter.npy --name basic_filter
```
### CLIP score filtering

Retain the top k=0.3 fraction of the pool by L/14 CLIP score:

```
python baselines.py --metadata_dir path/to/metadata --save_path path/to/clip_score_l14_30_percent.npy --name clip_score --arch l14 --fraction 0.3
```

Retain all examples with B/32 CLIP score above 0.25:

```
python baselines.py --metadata_dir path/to/metadata --save_path path/to/clip_score_b32_25_threshold.npy --name clip_score --arch b32 --threshold 0.25
```
### LAION-2B filtering

Reproduces the filtering strategy used to create the LAION-2B dataset: applies a B/32 CLIP score filter on image-text pairs, retaining samples with score above 0.28, and an English filter using the gcld3 model to detect language.

```
python baselines.py --metadata_dir path/to/metadata --save_path path/to/laion.npy --name laion2b
```
### Text-based filtering

A text-based filter that retains samples whose captions contain words from the ImageNet-21k synsets.

```
python baselines.py --metadata_dir path/to/metadata --save_path path/to/text_based.npy --name text_based
```
Image-based filtering
A image clustering based method that retains samples whose images have content close to ImageNet-1k training images, as measured by the nearest-neighbor cluster center of the image's L/14 CLIP embedding.
Note: this baseline uses GPU resources. By default it will try to use all GPUs. To control which GPUs are used, set the CUDA_VISIBLE_DEVICES environment variable.
python baselines.py --metadata_dir path/to/metadata --save_path path/to/image_based.npy --name image_based --image_based_scale small --batch_size 512
Note: this baseline requires pre-computed image cluster centroids which will be downloaded automatical