BridgeTower
Open source code for AAAI 2023 Paper "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning"
This repo is the official PyTorch implementation of "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning".
Updates
- Feb. 2023: BridgeTower was integrated into Hugging Face Transformers (see the usage sketch after this list).
- Model Hub, Code and Documentation are available.
- Thanks to Anahita Bhiwandiwalla, Tiep Le and Shaoyen Tseng from Intel Labs for their great work!
- Nov. 2022: BridgeTower got accepted by AAAI'23. Code and checkpoints are released.
- Jun. 2022: We released the preprint version on arXiv.
- May. 2022: BridgeTower (single model, 4M data) achieved 78.73% and 81.15% (base and large) on the VQAv2 Challenge test-std set.
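For quick experimentation outside this repo, the Hugging Face integration can be used directly. The sketch below follows the Transformers documentation and scores captions against an image with the image-text matching head, using the BridgeTower/bridgetower-base-itm-mlm checkpoint hosted on the Hub:

```python
import requests
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

# Load the checkpoint hosted on the Hugging Face Hub.
processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["two cats lying on a couch", "a man riding a bicycle"]

# Score each caption against the image; index 1 of the logits is the "match" class.
for text in texts:
    encoding = processor(image, text, return_tensors="pt")
    outputs = model(**encoding)
    print(text, outputs.logits[0, 1].item())
```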
Abstract
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.
Architecture

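Conceptually, each bridge layer fuses the representation from a top uni-modal encoder layer into the corresponding layer of the cross-modal encoder. A minimal PyTorch sketch of a simple additive fusion variant (an illustration of the idea, not this repo's exact module) could look like:

```python
import torch
from torch import nn

class BridgeLayer(nn.Module):
    """Illustrative bridge layer: add-and-normalize fusion of a uni-modal
    representation with the cross-modal hidden state at the same depth."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, cross_modal_hidden: torch.Tensor,
                uni_modal_hidden: torch.Tensor) -> torch.Tensor:
        # Fuse before the hidden state enters the next cross-modal layer.
        return self.norm(cross_modal_hidden + uni_modal_hidden)

# Toy usage: batch of 2, sequence length 16, hidden size 768.
bridge = BridgeLayer(768)
fused = bridge(torch.randn(2, 16, 768), torch.randn(2, 16, 768))
print(fused.shape)  # torch.Size([2, 16, 768])
```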
Main Results


Deployment
- Run setup.sh to set up the environment.
- [Optional] We use wandb to track experiments! Please remember to run wandb login and paste your token before running the script.
Dataset Preparation
- We follow ViLT and use pyarrow to serialize the datasets. See here for details; a minimal reading example follows this list.
- For SNLI-VE dataset, we follow here.
- For the VG-QA dataset, besides the image-text pairs in VG obtained from here, the image metadata, question-answer data, and COCO split information also need to be downloaded.
- The final file structure of the datasets is shown in setup.sh.
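For a quick sanity check after serialization, a ViLT-style arrow file can be read back with pyarrow. The file path and column names below are placeholders and depend on the dataset that was written:

```python
import io
import pyarrow as pa
from PIL import Image

# Hypothetical path; actual file names are produced by the dataset writing scripts.
path = "datasets/coco_caption_karpathy_train.arrow"

# ViLT-style datasets are serialized as pyarrow IPC files.
table = pa.ipc.RecordBatchFileReader(pa.memory_map(path, "r")).read_all()
print(table.num_rows, table.column_names)

# Columns typically include raw image bytes and captions (names may differ per dataset).
image = Image.open(io.BytesIO(table["image"][0].as_py()))
captions = table["caption"][0].as_py()
```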
Checkpoints
- Fine-tuned checkpoints for
  - Visual Question Answering on VQAv2: BASE, BASE (w/ VGQA), LARGE, LARGE (w/ VGQA)
  - Image-Text Retrieval on Flickr30k: BASE
  - Visual Entailment on SNLI-VE: BASE
  - Visual Reasoning on NLVR^2: BASE
  - Image-Text Retrieval on MSCOCO: BASE
- Here is an example of downloading a checkpoint.
# download azcopy
wget https://aka.ms/downloadazcopy-v10-linux
tar -xvf downloadazcopy-v10-linux
sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
sudo chmod -R 777 /usr/bin/azcopy

# azcopy copy [remote path] [local path]
azcopy copy "https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_pt_base.ckpt?sv=2020-10-02&st=2022-11-24T12%3A18%3A49Z&se=2027-11-25T12%3A18%3A00Z&sr=b&sp=r&sig=BJigddAMHfNUtQuTGH8bJUrzAO3LfaeSm48AXUqZngY%3D" "./BridgeTower_pt_base.ckpt"
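A downloaded checkpoint can be inspected with plain PyTorch before use. This is only a sketch for examining the file, assuming the PyTorch-Lightning-style layout (weights under a state_dict key) used by this code base's ViLT/METER lineage; actual training and fine-tuning load checkpoints through the provided scripts and configs.

```python
import torch

# Inspect a downloaded checkpoint (path from the azcopy example above).
ckpt = torch.load("./BridgeTower_pt_base.ckpt", map_location="cpu")

# Assumption: Lightning-style checkpoints keep weights under "state_dict".
state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
print(f"{len(state_dict)} tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```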
Pre-training on Image-Text Datasets
# Pre-train BridgeTower Base Model
bash scripts/pre_train.sh
# Pre-train BridgeTower Large Model
bash scripts/pre_train_large.sh
Fine-tuning on Downstream VL Tasks
- VQAv2 evaluation requires submitting the json file in the logs/ directory to the eval.ai evaluation server to get the test-dev and/or test-std scores.
# Base Model on VQAv2 without VLP
bash scripts/ftfs_base_vqa.sh
# Large Model on VQAv2 without VLP
bash scripts/ftfs_large_vqa.sh
# Base Model on VQAv2 with VLP
bash scripts/ftfpt_base_vqa.sh
# Large Model on VQAv2 with VLP
bash scripts/ftfpt_large_vqa.sh
# Base Model on IRTR-Flickr30K with VLP (directly use ITM with multiple false texts)
bash scripts/ftfpt_base_irtr_f30k.sh
# Base Model on IRTR-Flickr30K with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_f30k.sh
# Base Model on SNLI-VE with VLP
bash scripts/ftfpt_base_snlive.sh
# Base Model on NLVR^2 with VLP
bash scripts/ftfpt_base_nlvr2.sh
# Base Model on IRTR-MSCOCO with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_coco.sh
Fine-tuning on Uni-Modal Tasks
# Base Model on CIFAR with VLP
bash scripts/ftfpt_base_cifar.sh
# Base Model on GLUE with VLP
bash scripts/ftfpt_base_glue.sh
Citation
@article{xu2022bridge,
title={BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning},
author={Xu, Xiao and Wu, Chenfei and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
journal={arXiv preprint arXiv:2206.08657},
year={2022}
}
Acknowledgement
We are highly grateful for the public code of the following papers; our code is partly based on them: