LAPS
Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment, CVPR, 2024
The official code for our paper "Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment", accepted by the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. We referred to the implementations of VSE++, SCAN, GPO, and HREM to build this repository.
Introduction
Cross-modal alignment aims to build a bridge connecting vision and language. It is an important multi-modal task that learns the semantic similarities between images and texts. Traditional fine-grained alignment methods rely heavily on pre-trained object detectors to extract region features for subsequent region-word alignment, which incurs substantial computational costs for region detection and suffers from error propagation in the two-stage training.
<div align=center> <img src="imgs/fig1-1.jpg" width="80%"> </div>

In this paper, we focus on the mainstream vision transformer, incorporating patch features for patch-word alignment, and address the resulting issues of visual patch redundancy and patch ambiguity for semantic alignment. We propose a novel Linguistic-Aware Patch Slimming (LAPS) framework for fine-grained alignment, which explicitly identifies redundant visual patches with language supervision and rectifies their semantic and spatial information to facilitate more effective and consistent patch-word alignment. Extensive experiments on various evaluation benchmarks and model backbones show that LAPS outperforms state-of-the-art fine-grained alignment methods.
<div align=center> <img src="imgs/fig1-2.jpg" width="100%"> </div>

Preparation
Environments
We recommend the following dependencies:
- python >= 3.8
- torch >= 1.12.0
- torchvision >= 0.13.0
- transformers >= 4.32.0
- opencv-python
- tensorboard
Datasets
We have prepared the caption files for the two datasets in the data/ folder, so you only need to download the images.
The Flickr30K (f30k) images can be downloaded from flickr30k-images. The MSCOCO (coco) images can be downloaded from train2014 and val2014.
The final data should be organized as follows:
data
├── coco # coco captions
│ ├── train_ids.txt
│ ├── train_caps.txt
│ ├── testall_ids.txt
│ ├── testall_caps.txt
│ └── id_mapping.json
│
├── f30k # f30k captions
│ ├── train_ids.txt
│ ├── train_caps.txt
│ ├── test_ids.txt
│ ├── test_caps.txt
│ └── id_mapping.json
│
├── flickr30k-images # f30k images
│
├── coco-images # coco images
│ ├── train2014
│ └── val2014
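How these files fit together is defined by the repository's data loading code; the sketch below is only a hypothetical illustration, assuming one caption per line in *_caps.txt, one matching image id per line in *_ids.txt, and id_mapping.json mapping each id to an image file name. Verify the exact format against the code.

```python
# Hypothetical sketch of reading the caption/id files; the per-line layout is an
# assumption, so check the repository's data loading code for the real format.
import json
from pathlib import Path

root = Path("data/f30k")
captions = root.joinpath("train_caps.txt").read_text().strip().split("\n")   # one caption per line (assumed)
image_ids = root.joinpath("train_ids.txt").read_text().strip().split("\n")   # matching image id per line (assumed)
id_mapping = json.loads(root.joinpath("id_mapping.json").read_text())        # id -> image file name (assumed)

print(captions[0], id_mapping[image_ids[0]])
```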
Model Weights
Our framework requires the pre-trained weights of the BERT-base, ViT-base, and Swin-base models.
You can also let transformers download the weights automatically (they will be cached under ~/.cache).
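As a reference, the backbone weights can be fetched ahead of time with transformers. The checkpoint identifiers below (bert-base-uncased, google/vit-base-patch16-224-in21k, microsoft/swin-base-patch4-window7-224) are common choices and assumptions here; check arguments.py and the model code for the identifiers this repository actually uses.

```python
# Hypothetical pre-download of backbone weights via Hugging Face transformers.
# The checkpoint names are assumptions; verify them against the repository's configuration.
from transformers import AutoModel, AutoTokenizer

for name in [
    "bert-base-uncased",                       # BERT-base text encoder (assumed)
    "google/vit-base-patch16-224-in21k",       # ViT-base visual encoder (assumed)
    "microsoft/swin-base-patch4-window7-224",  # Swin-base visual encoder (assumed)
]:
    AutoModel.from_pretrained(name)            # weights are cached under ~/.cache

# The text tokenizer is needed as well.
AutoTokenizer.from_pretrained("bert-base-uncased")
```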
Training
First, set up the arguments; detailed information about each argument can be found in arguments.py.
- `--dataset`: the chosen dataset, e.g., `f30k` and `coco`.
- `--data_path`: the root path of the datasets, e.g., `data/`.
- `--multi_gpu`: whether to use multiple GPUs (DDP) to train the models.
- `--gpu-id`: the chosen GPU number, e.g., 0-7.
- `--logger_name`: the path of the logger files, e.g., `runs/f30k_test` or `runs/coco_test`.
Then, run train.py to train the model.
The models need about 20 GB of GPU memory (a single RTX 3090) when the batch size is 64 and about 40 GB (a single A40) when the batch size is 108.
You may need to adjust the batch size according to your hardware; multi-GPU training is also supported.
Besides, considering GPU-memory limitations, we do not integrate Gumbel-softmax sampling for patch selection in this repository.
Performance is not affected much, while GPU memory is reduced considerably (see the paper for more details).
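For readers who want to experiment with it anyway, here is a minimal, self-contained sketch of differentiable patch selection with Gumbel-softmax. It is not the paper's exact formulation or this repository's code; the scoring head, keep/drop parameterization, and tensor shapes are illustrative assumptions.

```python
# Illustrative sketch of Gumbel-softmax patch selection (not the repository's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchSelector(nn.Module):
    """Scores each visual patch and samples a discrete keep/drop decision per patch."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 2)  # logits for (drop, keep) per patch

    def forward(self, patches: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # patches: (batch, num_patches, dim)
        logits = self.score(patches)                              # (B, N, 2)
        # hard=True yields one-hot decisions in the forward pass while gradients
        # flow through the soft sample (straight-through estimator).
        decision = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        keep_mask = decision[..., 1:]                             # (B, N, 1), 1 = keep
        return patches * keep_mask                                # zero out dropped patches

# Example: 64 images, 196 patches, 512-dim features (shapes are assumptions).
selector = PatchSelector(dim=512)
pruned = selector(torch.randn(64, 196, 512))
```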
```bash
## single GPU
### vit + f30k
python train.py --dataset f30k --gpu-id 0 --logger_name runs/f30k_vit --batch_size 64 --vit_type vit --embed_size 512 --sparse_ratio 0.5 --aggr_ratio 0.4
### swin + f30k
python train.py --dataset f30k --gpu-id 0 --logger_name runs/f30k_swin --batch_size 64 --vit_type swin --embed_size 512 --sparse_ratio 0.8 --aggr_ratio 0.6
### vit + coco
python train.py --dataset coco --gpu-id 0 --logger_name runs/coco_vit --batch_size 64 --vit_type vit --embed_size 512 --sparse_ratio 0.5 --aggr_ratio 0.4
### swin + coco
python train.py --dataset coco --gpu-id 0 --logger_name runs/coco_swin --batch_size 64 --vit_type swin --embed_size 512 --sparse_ratio 0.8 --aggr_ratio 0.6
## multiple GPUs
### vit + f30k
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node=2 train.py --dataset f30k --multi_gpu 1 --logger_name runs/f30k_vit --batch_size 64 --vit_type vit --embed_size 512 --sparse_ratio 0.5 --aggr_ratio 0.4
### swin + f30k
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node=2 train.py --dataset f30k --multi_gpu 1 --logger_name runs/f30k_swin --batch_size 64 --vit_type swin --embed_size 1024 --sparse_ratio 0.8 --aggr_ratio 0.6
### vit + coco
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 train.py --dataset coco --multi_gpu 1 --logger_name runs/coco_vit --batch_size 64 --vit_type vit --embed_size 512 --sparse_ratio 0.5 --aggr_ratio 0.4
### swin + coco
CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.run --nproc_per_node=3 train.py --dataset coco --multi_gpu 1 --logger_name runs/coco_swin --batch_size 72 --vit_type swin --embed_size 512 --sparse_ratio 0.8 --aggr_ratio 0.6
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 train.py --dataset coco --multi_gpu 1 --logger_name runs/coco_swin --batch_size 64 --vit_type swin --embed_size 512 --sparse_ratio 0.8 --aggr_ratio 0.6
```
Evaluation
Run eval.py to evaluate the trained models on the f30k or coco datasets; you need to specify the model paths.
```bash
python eval.py --dataset f30k --data_path data/ --gpu-id 0
python eval.py --dataset coco --data_path data/ --gpu-id 1
```
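For reference, the reported retrieval metrics (R@K) can be computed from an image-text similarity matrix roughly as sketched below. This is a generic sketch, not the repository's eval.py; it assumes one ground-truth pairing per query, whereas the real f30k/coco protocol pairs five captions per image, which eval.py accounts for.

```python
# Generic Recall@K sketch for cross-modal retrieval (not the repository's eval.py).
import numpy as np

def recall_at_k(sims: np.ndarray, k: int) -> float:
    # sims[i, j]: similarity between query i and candidate j; (i, i) is the ground truth (assumed).
    n = sims.shape[0]
    ranking = np.argsort(-sims, axis=1)          # candidates sorted from most to least similar
    hits = sum(1 for i in range(n) if i in ranking[i, :k])
    return 100.0 * hits / n

sims = np.random.rand(100, 100)                  # toy image-to-text similarity matrix
print(recall_at_k(sims, k=1), recall_at_k(sims, k=5))   # I2T R@1, R@5
# Text-to-image recall uses the transposed matrix: recall_at_k(sims.T, k)
```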
Performances
The following table shows the reproduced results of cross-modal retrieval on the MSCOCO and Flickr30K datasets. We provide the training logs, checkpoints, performances, and hyper-parameters.
| Datasets | Visual encoders | I2T R@1 | I2T R@5 | T2I R@1 | T2I R@5 | Model checkpoint |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Flickr30K | ViT | 75.8 | 93.8 | 62.5 | 87.5 | Link |
| Flickr30K | Swin | 84.5 | 97.7 | 72.3 | 92.7 | Link |
| MSCOCO-1K | ViT | 78.6 | 96.0 | 65.5 | 91.4 | Link |
| MSCOCO-1K | Swin | 83.9 | 97.9 | 51.2 | 79.3 | Link |
| MSCOCO-5K | ViT | 56.1 | 83.9 | 71.9 | 93.7 | Link |
| MSCOCO-5K | Swin | 65.1 | 90.2 | 51.2 | 79.3 | Link |
Reference
@inproceedings{fu2024linguistic,
title={Linguistic-aware patch slimming framework for fine-grained cross-modal alignment},
author={Fu, Zheren and Zhang, Lei and Xia, Hou and Mao, Zhendong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={26307--26316},
year={2024}
}
