SEARLE (ICCV 2023)

Zero-shot Composed Image Retrieval With Textual Inversion

🔥🔥 [2024/05/07] The extended version of our ICCV 2023 paper is now public: iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval . The code will be released upon acceptance.

This is the official repository of the ICCV 2023 paper "Zero-Shot Composed Image Retrieval with Textual Inversion" and its extended version "iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval".

You are currently viewing the code and model repository. If you are looking for more information about the newly-proposed dataset CIRCO see the repository.

Overview

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO.

Workflow of our method. Top: in the pre-training phase, we generate pseudo-word tokens of unlabeled images with an optimization-based textual inversion and then distill their knowledge to a textual inversion network. Bottom: at inference time on ZS-CIR, we map the reference image to a pseudo-word $S_*$ and concatenate it with the relative caption. Then, we use CLIP text encoder to perform text-to-image retrieval.

Citation

@article{agnolucci2024isearle,
  title={iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval}, 
  author={Agnolucci, Lorenzo and Baldrati, Alberto and Bertini, Marco and Del Bimbo, Alberto},
  journal={arXiv preprint arXiv:2405.02951},
  year={2024},
}

@inproceedings{baldrati2023zero,
  title={Zero-Shot Composed Image Retrieval with Textual Inversion},
  author={Baldrati, Alberto and Agnolucci, Lorenzo and Bertini, Marco and Del Bimbo, Alberto},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={15338--15347},
  year={2023}
}

<details> <summary><h2>Getting Started</h2></summary>

We recommend using the Anaconda package manager to avoid dependency/reproducibility problems. For Linux systems, you can find a conda installation guide here.

Installation

Clone the repository

git clone https://github.com/miccunifi/SEARLE

Install Python dependencies

conda create -n searle -y python=3.8
conda activate searle
conda install -y -c pytorch pytorch=1.11.0 torchvision=0.12.0
pip install comet-ml==3.33.6 transformers==4.24.0 tqdm pandas==1.4.2
pip install git+https://github.com/openai/CLIP.git

Data Preparation

FashionIQ

Download the FashionIQ dataset following the instructions in the official repository.

After downloading the dataset, ensure that the folder structure matches the following:

├── FashionIQ
│   ├── captions
|   |   ├── cap.dress.[train | val | test].json
|   |   ├── cap.toptee.[train | val | test].json
|   |   ├── cap.shirt.[train | val | test].json

│   ├── image_splits
|   |   ├── split.dress.[train | val | test].json
|   |   ├── split.toptee.[train | val | test].json
|   |   ├── split.shirt.[train | val | test].json

│   ├── images
|   |   ├── [B00006M009.jpg | B00006M00B.jpg | B00006M6IH.jpg | ...]

CIRR

Download the CIRR dataset following the instructions in the official repository.

After downloading the dataset, ensure that the folder structure matches the following:

├── CIRR
│   ├── train
|   |   ├── [0 | 1 | 2 | ...]
|   |   |   ├── [train-10108-0-img0.png | train-10108-0-img1.png | ...]

│   ├── dev
|   |   ├── [dev-0-0-img0.png | dev-0-0-img1.png | ...]

│   ├── test1
|   |   ├── [test1-0-0-img0.png | test1-0-0-img1.png | ...]

│   ├── cirr
|   |   ├── captions
|   |   |   ├── cap.rc2.[train | val | test1].json
|   |   ├── image_splits
|   |   |   ├── split.rc2.[train | val | test1].json

CIRCO

Download the CIRCO dataset following the instructions in the official repository.

After downloading the dataset, ensure that the folder structure matches the following:

├── CIRCO
│   ├── annotations
|   |   ├── [val | test].json

│   ├── COCO2017_unlabeled
|   |   ├── annotations
|   |   |   ├──  image_info_unlabeled2017.json
|   |   ├── unlabeled2017
|   |   |   ├── [000000243611.jpg | 000000535009.jpg | ...]

ImageNet

Download ImageNet1K (ILSVRC2012) test set following the instructions in the official site.

After downloading the dataset, ensure that the folder structure matches the following:

├── ImageNet1K
│   ├── test
|   |   ├── [ILSVRC2012_test_[00000001 | ... | 00100000].JPEG]

</details> <details> <summary><h2>SEARLE Inference with Pre-trained Models</h2></summary>

Validation

To compute the metrics on the validation set of FashionIQ, CIRR or CIRCO using the SEARLE pre-trained models, simply run the following command:

python src/validate.py --eval-type [searle | searle-xl] --dataset <str> --dataset-path <str>

    --eval-type <str>               if 'searle', uses the pre-trained SEARLE model to predict the pseudo tokens;
                                    if 'searle-xl', uses the pre-trained SEARLE-XL model to predict the pseudo tokens, 
                                    options: ['searle', 'searle-xl']           
    --dataset <str>                 Dataset to use, options: ['fashioniq', 'cirr', 'circo']
    --dataset-path <str>            Path to the dataset root folder
     
    --preprocess-type <str>         Preprocessing type, options: ['clip', 'targetpad'] (default=targetpad)

Since we release the pre-trained models via torch.hub, the models will be automatically downloaded when running the inference script.

The metrics will be printed on the screen.

Test

To generate the predictions file for uploading on the CIRR Evaluation Server or the CIRCO Evaluation Server using the SEARLE pre-trained models, please execute the following command:

python src/generate_test_submission.py --submission-name <str>  --eval-type [searle | searle-xl] --dataset <str> --dataset-path <str>

    --submission-name <str>         Name of the submission file
    --eval-type <str>               if 'searle', uses the pre-trained SEARLE model to predict the pseudo tokens;
                                    if 'searle-xl', uses the pre-trained SEARLE-XL model to predict the pseudo tokens, 
                                    options: ['searle', 'searle-xl']           
    --dataset <str>                 Dataset to use, options: ['cirr', 'circo']
    --dataset-path <str>            Path to the dataset root folder
    
    --preprocess-type <str>         Preprocessing type, options: ['clip', 'targetpad'] (default=targetpad)

Since we release the pre-trained models via torch.hub, the models will be automatically downloaded when running the inference script.

The predictions file will be saved in the data/test_submissions/{dataset}/ folder.

</details> <details> <summary><h2>SEARLE Minimal Working Example</h2></summary>

import torch
import clip
from PIL import Image

# set device
device = "cuda" if torch.cuda.is_available() else "cpu"

image_path = "path to image to invert"  # TODO change with your image path
clip_model_name = "ViT-B/32"  # use ViT-L/14 for SEARLE-XL

# load SEARLE model and custom text encoding function
searle, encode_with_pseudo_tokens = torch.hub.load(repo_or_dir='miccunifi/SEARLE', source='github', model='searle',
                                                   backbone=clip_model_name)
searle.to(device)

# load CLIP model and preprocessing function
clip_model, preprocess = clip.load(clip_model_name)

# NOTE: the preprocessing function used to train SEARLE is different from the standard CLIP preprocessing function. Here,
# we use the standard one for simplicity, but if you want to reproduce the results of the paper you should use the one
# provided in the SEARLE repository (named t

SEARLE

Install / Use

README

SEARLE (ICCV 2023)

Zero-shot Composed Image Retrieval With Textual Inversion

Overview

Abstract

Citation

Installation

Data Preparation

FashionIQ

CIRR

CIRCO

ImageNet

Validation

Test