T-Food
[CVPRW22] Official implementation of the paper "Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval", accepted at the CVPR 2022 MULA Workshop.
Summary:
- Introduction
- Installation
- Download and Prepare the Recipe1M dataset
- Download Pretrained Models and Logs
- Train the model
- Evaluate a model on the test set
- Citation
- Acknowledgment
Introduction
<p align="center"> <img src="images/main.png" width="600"/> </p>
Cross-modal image-recipe retrieval has gained significant attention in recent years. Most work focuses on improving cross-modal embeddings using unimodal encoders, which allow efficient retrieval in large-scale databases, leaving aside cross-attention between modalities, which is more computationally expensive. We propose a new retrieval framework, T-Food (Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval), that exploits the interaction between modalities in a novel regularization scheme, while using only unimodal encoders at test time for efficient retrieval. We also capture the intra-dependencies between recipe entities with a dedicated recipe encoder, and propose new variants of triplet losses with dynamic margins that adapt to the difficulty of the task. Finally, we leverage the power of recent Vision and Language Pretraining (VLP) models such as CLIP for the image encoder. Our approach outperforms existing approaches by a large margin on the Recipe1M dataset. Specifically, we achieve absolute improvements of +8.1% (72.6 R@1) and +10.9% (44.6 R@1) on the 1k and 10k test sets, respectively.
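The idea behind a triplet loss with a dynamic margin can be sketched as follows. This is only an illustration, not the paper's exact formulation: the parameters `base` and `alpha`, and the specific rule of growing the margin with triplet difficulty, are assumptions made for the example.

```python
def dynamic_margin_triplet(d_ap, d_an, base=0.3, alpha=0.1):
    """Triplet loss whose margin adapts to triplet difficulty.

    Illustrative sketch only: `base`, `alpha`, and the adaptation rule
    are assumptions, not the paper's exact formulation.
    d_ap: distance(anchor, positive); d_an: distance(anchor, negative).
    """
    # Harder triplets (positive farther away than the negative) get a
    # larger margin, so the loss pushes on them more strongly.
    margin = base + alpha * max(0.0, d_ap - d_an)
    return max(0.0, d_ap - d_an + margin)

# Easy triplet: positive already much closer than the negative -> zero loss.
print(dynamic_margin_triplet(0.2, 0.9))  # 0.0
```

In a full training loop the distances would come from the image and recipe embeddings, and the loss would be averaged over the batch.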
Installation
Main requirements: PyTorch 1.7, clip, timm.
First create the conda environment from the env.yaml file:
conda env create --name tfood --file=env/env.yaml
source activate tfood
We use bootstrap.pytorch as a high-level framework. We made slight modifications to the original repo; to install the correct version in development mode:
cd bootstrap.pytorch
pip install -e .
You also need to install CLIP from the original repo:
pip install git+https://github.com/openai/CLIP.git
Download Pretrained Models and Logs
You can download the pretrained weights and logs to test the model from here. To test our pretrained models, you can also download the tokenized layer1.
Download and Prepare the Recipe1M dataset
Please create an account on http://im2recipe.csail.mit.edu/ to download the dataset.
Create a directory (root) to download the dataset.
First, download the layers into root and tokenize layer1:
python preprocess/create_tokenized_layer.py --path_layer root/layer1.json --output_path_vocab root/vocab_all.txt --output_path_layer1 root/tokenized_layer_1.json --output_path_tokenized_texts root/text
Download data.h5 into root, unzip it, and create the lmdb files:
python preprocess/h5py_to_lmdb.py --dir_data root
Optional: Download the images and put them in root/recipe1M/images.
Download classes1M.pkl, and place the file in root.
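After the preparation steps above, a quick sanity check on the dataset directory can catch missing files early. This is a convenience sketch, not part of the repo; the lmdb output names depend on preprocess/h5py_to_lmdb.py, so only the files named explicitly in this README are checked.

```python
from pathlib import Path

# Files explicitly named in the preparation steps above; the lmdb
# outputs are omitted because their exact names depend on the script.
EXPECTED = ["layer1.json", "vocab_all.txt",
            "tokenized_layer_1.json", "classes1M.pkl"]

def missing_files(root):
    """Return the expected Recipe1M files that are absent from `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]
```

Running `missing_files("root")` before launching training should return an empty list if the explicitly listed files are in place.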
Train the model
The boostrap/run.py file loads the options contained in a yaml file, creates the corresponding experiment directory (in exp/dir), and starts the training procedure.
For instance, you can train our best model with 2 A100 (40 GB) GPUs by running:
CUDA_VISIBLE_DEVICES=0,1 python -m bootstrap.run -o options/tfood_clip.yaml \
--misc.data_parrallel true
To train our model without CLIP, use the corresponding non-CLIP options file instead of tfood_clip.yaml.
Then, several files are going to be created:
- options.yaml (copy of options)
- logs.txt (history of printed output)
- logs.json (batch and epoch statistics)
- view.html (learning curves)
- ckpt_last_engine.pth.tar (checkpoints of last epoch)
- ckpt_last_model.pth.tar
- ckpt_last_optimizer.pth.tar
- ckpt_best_eval_epoch.metric.recall_at_1_im2recipe_mean_engine.pth.tar (checkpoints of best epoch)
- ckpt_best_eval_epoch.metric.recall_at_1_im2recipe_mean_model.pth.tar
- ckpt_best_eval_epoch.metric.recall_at_1_im2recipe_mean_optimizer.pth.tar
Resume training
You can resume from the last epoch by specifying the options file from the experiment directory while overwriting the exp.resume option (default is None):
python -m bootstrap.run -o logs/path_to_saved_config_file_in_the_log_dir.yaml \
--exp.resume last
Evaluate a model on the test set
You can evaluate your model on the test set. In this example, boostrap/run.py loads the options from your experiment directory, resumes the best checkpoint on the validation set, and starts an evaluation on the test set instead of the validation set, while skipping the training set (train_split is empty).
python -m bootstrap.run \
-o options/tfood_clip.yaml \
--exp.resume best_eval_epoch.metric.recall_at_1_im2recipe_mean \
--dataset.train_split \
--dataset.eval_split test \
--model.metric.nb_bags 10 \
--model.metric.nb_matchs_per_bag 1000 \
--model.metric.trijoint true \
--misc.data_parrallel false
By default, the model is evaluated on the 1k setup, to evaluate on the 10k setup:
python -m bootstrap.run \
-o options/tfood_clip.yaml \
--exp.resume best_eval_epoch.metric.recall_at_1_im2recipe_mean \
--dataset.train_split \
--dataset.eval_split test \
--model.metric.nb_bags 5 \
--model.metric.nb_matchs_per_bag 10000 \
--model.metric.trijoint true \
--misc.data_parrallel false
You can set model.metric.trijoint to false to obtain both metrics: from the dual encoders and from the multimodal module.
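For reference, the bagged recall metric selected by these options works roughly as follows. This is an independent sketch of the standard Recipe1M protocol (the true match of query i is candidate i), not the repo's implementation:

```python
def recall_at_k(sim, k=1):
    """Fraction of queries whose true match ranks in the top-k.

    sim[i][j]: similarity of query i to candidate j; the true match of
    query i is candidate i, as in Recipe1M evaluation.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidates by similarity, highest first.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# The reported score averages recall over nb_bags random subsets of
# nb_matchs_per_bag pairs (10 x 1000 for the 1k setup, 5 x 10000 for 10k).
```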
Citation
@inproceedings{shukor2022transformer,
title={Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval},
author={Shukor, Mustafa and Couairon, Guillaume and Grechka, Asya and Cord, Matthieu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={4567--4578},
year={2022}
}
Acknowledgment
The code is based on the original code of Adamine. Some code was borrowed from ALBEF, CLIP, and timm.