# SLIP: Self-supervision meets Language-Image Pre-training

Code release for *SLIP: Self-supervision meets Language-Image Pre-training*.

<p align="center"><img src="slip.png" alt="SLIP framework" width="400"/></p>

What you can find in this repo:
- Pre-trained models (ViT-Small, Base, and Large) and code to reproduce results from our paper: SLIP: Self-supervision meets Language-Image Pre-training. Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie, arXiv 2021
- An improved CLIP baseline (31.3% → 34.6% ImageNet 0-shot with Modified ResNet-50) on the YFCC15M dataset.
- Zero-shot transfer and linear classification evaluation scripts for 26 downstream datasets.
## Updates

- Jan 18, 2022: Added support for training on RedCaps
- Jan 17, 2022: Released CC3M/CC12M CLIP/SLIP ViT-B checkpoints
## Results and Pre-trained Models
The following models are pre-trained on YFCC15M and evaluated on ImageNet-1K (ILSVRC2012).
### ViT-Small (MoCo v3 version w/ 12 vs. 6 heads)
<table><tbody> <!-- START TABLE --> <!-- TABLE HEADER --> <th valign="center">Method</th> <th valign="center">Epochs</th> <th valign="center">0-shot</th> <th valign="center">Linear</th> <th valign="center">Finetuned</th> <th valign="center">Weights</th> <!-- TABLE BODY --> <tr> <td align="center">CLIP</td> <td align="center">25</td> <td align="center">32.7</td> <td align="center">59.3</td> <td align="center">78.2</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/clip_small_25ep.pt">url</a></td> </tr> <tr> <td align="center">SimCLR</td> <td align="center">25</td> <td align="center">-</td> <td align="center">58.1</td> <td align="center">79.9</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/simclr_small_25ep.pt">url</a></td> </tr> <tr> <td align="center">SLIP</td> <td align="center">25</td> <td align="center">38.3</td> <td align="center">66.4</td> <td align="center">80.3</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/slip_small_25ep.pt">url</a></td> </tr> <tr> <td align="center">SLIP</td> <td align="center">50</td> <td align="center">39.3</td> <td align="center">67.6</td> <td align="center">80.7</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/slip_small_50ep.pt">url</a></td> </tr> <tr> <td align="center">SLIP</td> <td align="center">100</td> <td align="center">39.5</td> <td align="center">68.3</td> <td align="center">80.7</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/slip_small_100ep.pt">url</a></td> </tr> </tbody></table>

### ViT-Base
<table><tbody> <!-- START TABLE --> <!-- TABLE HEADER --> <th valign="center">Method</th> <th valign="center">Epochs</th> <th valign="center">0-shot</th> <th valign="center">Linear</th> <th valign="center">Finetuned</th> <th valign="center">Weights</th> <!-- TABLE BODY --> <tr> <td align="center">CLIP</td> <td align="center">25</td> <td align="center">37.6</td> <td align="center">66.5</td> <td align="center">80.5</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/clip_base_25ep.pt">url</a></td> </tr> <tr> <td align="center">SimCLR</td> <td align="center">25</td> <td align="center">-</td> <td align="center">64.0</td> <td align="center">82.5</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/simclr_base_25ep.pt">url</a></td> </tr> <tr> <td align="center">SLIP</td> <td align="center">25</td> <td align="center">42.8</td> <td align="center">72.1</td> <td align="center">82.6</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/slip_base_25ep.pt">url</a></td> </tr> <tr> <td align="center">SLIP</td> <td align="center">50</td> <td align="center">44.1</td> <td align="center">73.0</td> <td align="center">82.9</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/slip_base_50ep.pt">url</a></td> </tr> <tr> <td align="center">SLIP</td> <td align="center">100</td> <td align="center">45.0</td> <td align="center">73.6</td> <td align="center">83.4</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/slip_base_100ep.pt">url</a></td> </tr> </tbody></table>

### ViT-Large
<table><tbody> <!-- START TABLE --> <!-- TABLE HEADER --> <th valign="center">Method</th> <th valign="center">Epochs</th> <th valign="center">0-shot</th> <th valign="center">Linear</th> <th valign="center">Finetuned</th> <th valign="center">Weights</th> <!-- TABLE BODY --> <tr> <td align="center">CLIP</td> <td align="center">25</td> <td align="center">40.4</td> <td align="center">70.5</td> <td align="center">81.0</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/clip_large_25ep.pt">url</a></td> </tr> <tr> <td align="center">SimCLR</td> <td align="center">25</td> <td align="center">-</td> <td align="center">66.7</td> <td align="center">84.0</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/simclr_large_25ep.pt">url</a></td> </tr> <tr> <td align="center">SLIP</td> <td align="center">25</td> <td align="center">46.2</td> <td align="center">76.0</td> <td align="center">84.2</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/slip_large_25ep.pt">url</a></td> </tr> <tr> <td align="center">SLIP</td> <td align="center">50</td> <td align="center">47.4</td> <td align="center">75.8</td> <td align="center">84.7</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/slip_large_50ep.pt">url</a></td> </tr> <tr> <td align="center">SLIP</td> <td align="center">100</td> <td align="center">47.9</td> <td align="center">75.1</td> <td align="center">84.8</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/slip_large_100ep.pt">url</a></td> </tr> </tbody></table>

### Additional Datasets and Models
<table><tbody> <!-- START TABLE --> <!-- TABLE HEADER --> <th valign="center">Dataset</th> <th valign="center">Method</th> <th valign="center">Model</th> <th valign="center">Epochs</th> <th valign="center">0-shot</th> <th valign="center">Linear</th> <th valign="center">Finetuned</th> <th valign="center">Weights</th> <!-- TABLE BODY --> <tr> <td align="center">CC3M</td> <td align="center">CLIP</td> <td align="center">ViT-B</td> <td align="center">40</td> <td align="center">17.1</td> <td align="center">53.3</td> <td align="center">79.5</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/clip_base_cc3m_40ep.pt">url</a></td> </tr> <tr> <td align="center">CC3M</td> <td align="center">SLIP</td> <td align="center">ViT-B</td> <td align="center">40</td> <td align="center">23.0</td> <td align="center">65.4</td> <td align="center">81.4</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/slip_base_cc3m_40ep.pt">url</a></td> </tr> <tr> <td align="center">CC12M</td> <td align="center">CLIP</td> <td align="center">ViT-B</td> <td align="center">35</td> <td align="center">36.5</td> <td align="center">69.0</td> <td align="center">82.1</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/clip_base_cc12m_35ep.pt">url</a></td> </tr> <tr> <td align="center">CC12M</td> <td align="center">SLIP</td> <td align="center">ViT-B</td> <td align="center">35</td> <td align="center">40.7</td> <td align="center">73.7</td> <td align="center">83.1</td> <td align="center"><a href="https://dl.fbaipublicfiles.com/slip/slip_base_cc12m_35ep.pt">url</a></td> </tr> </tbody></table>

## 1. Setup
Install PyTorch and timm. The code has been tested with CUDA 11.3 / cuDNN 8.2.0, PyTorch 1.10.0, and timm 0.5.0.
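A matching environment can be set up with pip (a sketch, assuming the tested versions above; the exact CUDA-enabled wheel and index URL depend on your system, so newer versions may also work):

```shell
# Tested configuration from above; adjust for your CUDA setup if needed.
pip install torch==1.10.0 timm==0.5.0
```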
### 1.1. YFCC15M Setup
Download the YFCC100M dataset.
Our dataloader expects the following dataset directory structure: 100 folders, each containing 1000 zip archives of 1000 images.
The concatenation of the folder, archive, and file names is the index of the image (i.e., image 12345678 is stored as `678.jpg` within `12/345.zip`):
```
/path/to/yfcc100m/
└── images/
    ├── 00/
    │   ├── 000.zip
    │   │   ├── 000.jpg
    │   │   ├── ...
    │   │   └── 999.jpg
    │   ├── ...
    │   └── 999.zip
    ├── ...
    └── 99/
        └── ...
```
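The index-to-path scheme above can be sketched as follows. This is a hypothetical helper for illustration, not part of the repo; the default `root` is a placeholder path:

```python
import zipfile
from pathlib import Path


def locate_image(index: int, root: str = "/path/to/yfcc100m"):
    """Map an image index to (zip archive path, member name).

    Image 12345678 lives at root/images/12/345.zip -> 678.jpg.
    """
    s = f"{index:08d}"  # zero-pad the index to 8 digits
    folder, archive, member = s[:2], s[2:5] + ".zip", s[5:] + ".jpg"
    return Path(root) / "images" / folder / archive, member


def load_image_bytes(index: int, root: str = "/path/to/yfcc100m") -> bytes:
    """Read the raw JPEG bytes for one image index."""
    archive, member = locate_image(index, root)
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member)
```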
Prepare the YFCC15M subset metadata pickle:
- Download and compile a list of downloaded images to `flickr_unique_ids.npy` (ours)
- Download OpenAI's list of captioned YFCC100M images according to the instructions here
- Run `python make_dataset.py` to create the `yfcc15m.pkl` metadata pickle
When pre-training with YFCC15M, set `--dataset yfcc15m --root /path/to/yfcc100m --metadata /path/to/yfcc15m.pkl`.
### 1.2. COCO Captions Setup
Download and unzip the 2017 Train images and annotations.
When pre-training on COCO, set `--dataset coco --root /path/to/coco --metadata /path/to/captions_train2017.json`.
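The annotation file follows the standard COCO captions schema, where `images` and `annotations` are joined on the image id. A minimal sketch of how a dataloader might pair files with their captions (toy values, not real COCO entries):

```python
# Toy illustration of the captions_train2017.json schema (illustrative values).
coco = {
    "images": [{"id": 9, "file_name": "000000000009.jpg"}],
    "annotations": [{"image_id": 9, "caption": "a bowl of broccoli and apples"}],
}

# Group captions by image id, as a caption dataloader typically does.
captions_by_image = {}
for ann in coco["annotations"]:
    captions_by_image.setdefault(ann["image_id"], []).append(ann["caption"])

# Resolve image ids back to file names under /path/to/coco.
id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
```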
### 1.3. Conceptual Captions Setup
CC3M and CC12M are published as TSV files listing original image URLs and processed captions.
Download the images and collect the captions of all available images (many will be missing due to broken links) into `cc3m.npy` and `cc12m.npy`.
For CC3M our dataloader expects `cc3m.npy` to contain a NumPy array of dicts in the following format:
```
{
  'image_id': 1510438788,  # local file path relative to root
  'captions': ['large field with pink tulips on a clear sunny summer day with a blue sky']
}
```
For CC12M our dataloader expects `cc12m.npy` to contain a NumPy array of dicts in the following format:

```
{
  'image_name': '0.jpg',  # local file path relative to root
  'image_id': 0,
  'captions': ['...']
}
```
