# LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Our servers broke again :(. I have updated the links so they should work fine now. Sorry for the inconvenience, and please let me know about any further issues. Thanks! --Hao, Dec 03
## Introduction
PyTorch code for the EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". Slides of our EMNLP 2019 talk are available here.
- To analyze the output of the pre-trained model (instead of fine-tuning on downstream tasks), please load the weight https://nlp.cs.unc.edu/data/github_pretrain/lxmert20/Epoch20_LXRT.pth, which is trained as described in the pre-training section. The default weight here is trained with a slightly different protocol from this code.
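For a quick sanity check of downloaded weights, the checkpoint can be inspected as an ordinary PyTorch state dict (a minimal sketch, assuming the `.pth` file is a plain state dict; `inspect_checkpoint` is a hypothetical helper, not part of this repository):

```python
import torch

def inspect_checkpoint(path):
    """Load an LXRT-style checkpoint on CPU and map each parameter
    name to its tensor shape, e.g. to verify the file is intact.

    Assumes the .pth file holds a plain state dict (name -> tensor)."""
    state_dict = torch.load(path, map_location="cpu")
    return {name: tuple(tensor.shape) for name, tensor in state_dict.items()}

# e.g. inspect_checkpoint("snap/pretrained/model_LXRT.pth")
```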
## Results (with this GitHub version)
| Split            | VQA    | GQA    | NLVR2           |
|------------------|:------:|:------:|:---------------:|
| Local Validation | 69.90% | 59.80% | 74.95%          |
| Test-Dev         | 72.42% | 60.00% | 74.45% (Test-P) |
| Test-Standard    | 72.54% | 60.33% | 76.18% (Test-U) |
All the results in the table were produced with exactly this code base. Since the VQA and GQA test servers only allow a limited number of 'Test-Standard' submissions, we used our remaining submission entries from the 2019 VQA/GQA challenges to obtain these results. For NLVR2, we only tested once on the unpublished test set (Test-U).

We used this code (with model ensembling) to participate in the VQA 2019 and GQA 2019 challenges in May 2019, and we were the only team ranked in the top 3 in both challenges.
## Pre-trained models
The pre-trained model (870 MB) is available at http://nlp.cs.unc.edu/data/model_LXRT.pth, and can be downloaded with:
```bash
mkdir -p snap/pretrained
wget https://nlp.cs.unc.edu/data/model_LXRT.pth -P snap/pretrained
```
If the download speed is slower than expected, the pre-trained model can also be downloaded from other sources. Please place the downloaded file at `snap/pretrained/model_LXRT.pth`.
We also provide data and commands to pre-train the model in pre-training. The default setup needs 4 GPUs and takes around a week to finish. The pre-trained weights from this code base can be downloaded from https://nlp.cs.unc.edu/data/github_pretrain/lxmert/EpochXX_LXRT.pth, with XX from 01 to 12. The model is pre-trained for 12 epochs (instead of 20 as in the EMNLP paper), so the fine-tuned results are about 0.3% lower on each dataset.
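Since the twelve intermediate checkpoints share one naming scheme, their download URLs can be generated programmatically (a small sketch based only on the URL pattern above):

```python
# Build the download URLs for the 12 intermediate pre-training
# checkpoints (Epoch01_LXRT.pth through Epoch12_LXRT.pth).
BASE_URL = "https://nlp.cs.unc.edu/data/github_pretrain/lxmert/Epoch{:02d}_LXRT.pth"
checkpoint_urls = [BASE_URL.format(epoch) for epoch in range(1, 13)]

for url in checkpoint_urls:
    print(url)
```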
## Fine-tune on Vision-and-Language Tasks
We fine-tune our LXMERT pre-trained model on each task with the following hyper-parameters:
| Dataset | Batch Size | Learning Rate | Epochs | Load Answers |
|---------|:----------:|:-------------:|:------:|:------------:|
| VQA     | 32         | 5e-5          | 4      | Yes          |
| GQA     | 32         | 1e-5          | 4      | Yes          |
| NLVR2   | 32         | 5e-5          | 4      | No           |
Although the fine-tuning process is almost the same for each task apart from the hyper-parameters, we provide per-dataset descriptions to cover all the details.
### General
The code requires Python 3. Please install the Python dependencies with:

```bash
pip install -r requirements.txt
```
Optionally, a Python 3 virtual environment can be set up and activated with:

```bash
virtualenv name_of_environment -p python3
source name_of_environment/bin/activate
```
### VQA
#### Fine-tuning
- Please make sure the LXMERT pre-trained model is either downloaded or pre-trained.
- Download the re-distributed json files for the VQA 2.0 dataset. The raw VQA 2.0 dataset can be downloaded from the official website.

  ```bash
  mkdir -p data/vqa
  wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/train.json -P data/vqa/
  wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/nominival.json -P data/vqa/
  wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/minival.json -P data/vqa/
  ```
- Download Faster R-CNN features for MS COCO train2014 (17 GB) and val2014 (8 GB) images (VQA 2.0 is collected on the MS COCO dataset). The image features are also available on Google Drive and Baidu Drive (see Alternative Download for details).

  ```bash
  mkdir -p data/mscoco_imgfeat
  wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/train2014_obj36.zip -P data/mscoco_imgfeat
  unzip data/mscoco_imgfeat/train2014_obj36.zip -d data/mscoco_imgfeat && rm data/mscoco_imgfeat/train2014_obj36.zip
  wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/val2014_obj36.zip -P data/mscoco_imgfeat
  unzip data/mscoco_imgfeat/val2014_obj36.zip -d data && rm data/mscoco_imgfeat/val2014_obj36.zip
  ```
- Before fine-tuning on the whole VQA 2.0 training set, verifying the script and model on a small training set (512 images) is recommended. The first argument `0` is the GPU id, and the second argument `vqa_lxr955_tiny` is the name of this experiment.

  ```bash
  bash run/vqa_finetune.bash 0 vqa_lxr955_tiny --tiny
  ```

- If no bugs come up, the model is ready to be trained on the whole VQA corpus:

  ```bash
  bash run/vqa_finetune.bash 0 vqa_lxr955
  ```
It takes around 8 hours (2 hours per epoch × 4 epochs) to converge. The logs and model snapshots will be saved under the folder `snap/vqa/vqa_lxr955`. The validation result after training will be around 69.7% to 70.2%.
#### Local Validation
The results on the validation set (our minival set) are printed while training.
The validation results are also saved to `snap/vqa/[experiment-name]/log.log`. If the log file is accidentally deleted, the validation results from training can be reproduced from the model snapshot:
```bash
bash run/vqa_test.bash 0 vqa_lxr955_results --test minival --load snap/vqa/vqa_lxr955/BEST
```
#### Submitted to VQA test server
- Download our re-distributed json file containing VQA 2.0 test data.

  ```bash
  wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/test.json -P data/vqa/
  ```

- Download the Faster R-CNN features for the MS COCO test2015 split (16 GB).

  ```bash
  wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/test2015_obj36.zip -P data/mscoco_imgfeat
  unzip data/mscoco_imgfeat/test2015_obj36.zip -d data && rm data/mscoco_imgfeat/test2015_obj36.zip
  ```

- Since the VQA submission system requires submitting the whole test data, we need to run inference over all test splits (i.e., test-dev, test-standard, test-challenge, and test held-out).
  It takes around 10-15 minutes to run test inference (448K instances).

  ```bash
  bash run/vqa_test.bash 0 vqa_lxr955_results --test test --load snap/vqa/vqa_lxr955/BEST
  ```

  The test results will be saved in `snap/vqa_lxr955_results/test_predict.json`.
The VQA 2.0 challenge for this year is hosted on EvalAI at https://evalai.cloudcv.org/web/challenges/challenge-page/163/overview, and it still accepts submissions after the challenge has ended. Please check the official website of the VQA Challenge for detailed information and follow the instructions on EvalAI to submit. In general, after registration, the only remaining step is to upload the `test_predict.json` file and wait for the results.
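Before uploading, the prediction file can be sanity-checked locally. The sketch below assumes the standard VQA 2.0 result format, a JSON list of objects with an integer `question_id` and a string `answer`; `check_vqa_predictions` is a hypothetical helper, not part of this repository:

```python
import json

def check_vqa_predictions(path):
    """Validate a VQA-style prediction file (a JSON list of
    {"question_id": int, "answer": str} objects) and return the
    number of predictions it contains."""
    with open(path) as f:
        predictions = json.load(f)
    assert isinstance(predictions, list), "submission must be a JSON list"
    for entry in predictions:
        assert isinstance(entry["question_id"], int), "question_id must be an int"
        assert isinstance(entry["answer"], str), "answer must be a string"
    return len(predictions)

# e.g. check_vqa_predictions("snap/vqa_lxr955_results/test_predict.json")
```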
The testing accuracy obtained with exactly this code is 72.42% for test-dev and 72.54% for test-standard. The results with this code base are also publicly shown on the VQA 2.0 leaderboard under the entry "LXMERT github version".
### GQA
#### Fine-tuning
- Please make sure the LXMERT pre-trained model is either downloaded or pre-trained.
- Download the re-distributed json files for the GQA balanced-version dataset. The original GQA dataset is available in the Download section of its website, and the script to preprocess these datasets is under `data/gqa/process_raw_data_scripts`.

  ```bash
  mkdir -p data/gqa
  wget https://nlp.cs.unc.edu/data/lxmert_data/gqa/train.json -P data/gqa/
  wget https://nlp.cs.unc.edu/data/lxmert_data/gqa/valid.json -P data/gqa/
  wget https://nlp.cs.unc.edu/data/lxmert_data/gqa/testdev.json -P data/gqa/
  ```
- Download Faster R-CNN features for Visual Genome and GQA testing images (30 GB). GQA's training and validation data are collected from Visual Genome, while its testing images come from the MS COCO test set (I have verified this with Drew A. Hudson, one of the GQA authors). The image features are also available on Google Drive and Baidu Drive (see Alternative Download for details).

  ```bash
  mkdir -p data/vg_gqa_imgfeat
  wget https://nlp.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/vg_gqa_obj36.zip -P data/vg_gqa_imgfeat
  unzip data/vg_gqa_imgfeat/vg_gqa_obj36.zip -d data && rm data/vg_gqa_imgfeat/vg_gqa_obj36.zip
  wget https://nlp.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/gqa_testdev_obj36.zip -P data/vg_gqa_imgfeat
  unzip data/vg_gqa_imgfeat/gqa_testdev_obj36.zip -d data && rm data/vg_gqa_imgfeat/gqa_testdev_obj36.zip
  ```
