# LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Our servers broke again :(. I have updated the links so they should work fine now. Sorry for the inconvenience, and please let me know about any further issues. Thanks! --Hao, Dec 03
## Introduction
PyTorch code for the EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". Slides of our EMNLP 2019 talk are available here.
- To analyze the output of the pre-trained model (instead of fine-tuning on downstream tasks), please load the weight https://nlp.cs.unc.edu/data/github_pretrain/lxmert20/Epoch20_LXRT.pth, which is trained as described in the pre-training section. The default weight here is trained with a slightly different protocol from this code.
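For a quick sanity check of downloaded weights, the checkpoint can be inspected as an ordinary PyTorch state dict (a minimal sketch, assuming the `.pth` file is a plain state dict; `inspect_checkpoint` is a hypothetical helper, not part of this repository):

```python
import torch

def inspect_checkpoint(path):
    """Load an LXRT-style checkpoint on CPU and map each parameter
    name to its tensor shape, e.g. to verify the file is intact.

    Assumes the .pth file holds a plain state dict (name -> tensor)."""
    state_dict = torch.load(path, map_location="cpu")
    return {name: tuple(tensor.shape) for name, tensor in state_dict.items()}

# e.g. inspect_checkpoint("snap/pretrained/model_LXRT.pth")
```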
## Results (with this GitHub version)
| Split            | VQA    | GQA    | NLVR2           |
|------------------|:------:|:------:|:---------------:|
| Local Validation | 69.90% | 59.80% | 74.95%          |
| Test-Dev         | 72.42% | 60.00% | 74.45% (Test-P) |
| Test-Standard    | 72.54% | 60.33% | 76.18% (Test-U) |
All the results in the table were produced with exactly this code base. Since the VQA and GQA test servers only allow a limited number of 'Test-Standard' submissions, we used our remaining submission entries from the 2019 VQA/GQA challenges to obtain these results. For NLVR2, we only tested once on the unpublished test set (Test-U).

We used this code (with model ensembling) to participate in the VQA 2019 and GQA 2019 challenges in May 2019, and we were the only team ranked in the top 3 in both challenges.
## Pre-trained models
The pre-trained model (870 MB) is available at http://nlp.cs.unc.edu/data/model_LXRT.pth, and can be downloaded with:
```bash
mkdir -p snap/pretrained
wget https://nlp.cs.unc.edu/data/model_LXRT.pth -P snap/pretrained
```
If the download speed is slower than expected, the pre-trained model can also be downloaded from other sources. Please place the downloaded file at `snap/pretrained/model_LXRT.pth`.
We also provide data and commands to pre-train the model in pre-training. The default setup needs 4 GPUs and takes around a week to finish. The pre-trained weights from this code base can be downloaded from https://nlp.cs.unc.edu/data/github_pretrain/lxmert/EpochXX_LXRT.pth, with XX from 01 to 12. The model is pre-trained for 12 epochs (instead of 20 as in the EMNLP paper), so the fine-tuned results are about 0.3% lower on each dataset.
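Since the twelve intermediate checkpoints share one naming scheme, their download URLs can be generated programmatically (a small sketch based only on the URL pattern above):

```python
# Build the download URLs for the 12 intermediate pre-training
# checkpoints (Epoch01_LXRT.pth through Epoch12_LXRT.pth).
BASE_URL = "https://nlp.cs.unc.edu/data/github_pretrain/lxmert/Epoch{:02d}_LXRT.pth"
checkpoint_urls = [BASE_URL.format(epoch) for epoch in range(1, 13)]

for url in checkpoint_urls:
    print(url)
```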
## Fine-tune on Vision-and-Language Tasks
We fine-tune our LXMERT pre-trained model on each task with the following hyper-parameters:
| Dataset | Batch Size | Learning Rate | Epochs | Load Answers |
|---------|:----------:|:-------------:|:------:|:------------:|
| VQA     | 32         | 5e-5          | 4      | Yes          |
| GQA     | 32         | 1e-5          | 4      | Yes          |
| NLVR2   | 32         | 5e-5          | 4      | No           |
Although the fine-tuning process is almost the same for each task apart from the hyper-parameters, we provide per-dataset descriptions to cover all the details.
### General
The code requires Python 3. Please install the Python dependencies with:

```bash
pip install -r requirements.txt
```
Optionally, a Python 3 virtual environment can be set up and activated with:

```bash
virtualenv name_of_environment -p python3
source name_of_environment/bin/activate
```
### VQA
#### Fine-tuning
- Please make sure the LXMERT pre-trained model is either downloaded or pre-trained.
- Download the re-distributed json files for the VQA 2.0 dataset. The raw VQA 2.0 dataset can be downloaded from the official website.

  ```bash
  mkdir -p data/vqa
  wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/train.json -P data/vqa/
  wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/nominival.json -P data/vqa/
  wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/minival.json -P data/vqa/
  ```
- Download Faster R-CNN features for MS COCO train2014 (17 GB) and val2014 (8 GB) images (VQA 2.0 is collected on the MS COCO dataset). The image features are also available on Google Drive and Baidu Drive (see Alternative Download for details).

  ```bash
  mkdir -p data/mscoco_imgfeat
  wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/train2014_obj36.zip -P data/mscoco_imgfeat
  unzip data/mscoco_imgfeat/train2014_obj36.zip -d data/mscoco_imgfeat && rm data/mscoco_imgfeat/train2014_obj36.zip
  wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/val2014_obj36.zip -P data/mscoco_imgfeat
  unzip data/mscoco_imgfeat/val2014_obj36.zip -d data && rm data/mscoco_imgfeat/val2014_obj36.zip
  ```
- Before fine-tuning on the whole VQA 2.0 training set, verifying the script and model on a small training set (512 images) is recommended. The first argument `0` is the GPU id, and the second argument `vqa_lxr955_tiny` is the name of this experiment.

  ```bash
  bash run/vqa_finetune.bash 0 vqa_lxr955_tiny --tiny
  ```

- If no bugs come up, the model is ready to be trained on the whole VQA corpus:

  ```bash
  bash run/vqa_finetune.bash 0 vqa_lxr955
  ```
It takes around 8 hours (2 hours per epoch × 4 epochs) to converge. The logs and model snapshots will be saved under the folder `snap/vqa/vqa_lxr955`. The validation result after training will be around 69.7% to 70.2%.
#### Local Validation
The results on the validation set (our minival set) are printed while training.
The validation results are also saved to `snap/vqa/[experiment-name]/log.log`. If the log file is accidentally deleted, the validation results from training can be reproduced from the model snapshot:
```bash
bash run/vqa_test.bash 0 vqa_lxr955_results --test minival --load snap/vqa/vqa_lxr955/BEST
```
#### Submitted to VQA test server
- Download our re-distributed json file containing VQA 2.0 test data.

  ```bash
  wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/test.json -P data/vqa/
  ```

- Download the Faster R-CNN features for the MS COCO test2015 split (16 GB).

  ```bash
  wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/test2015_obj36.zip -P data/mscoco_imgfeat
  unzip data/mscoco_imgfeat/test2015_obj36.zip -d data && rm data/mscoco_imgfeat/test2015_obj36.zip
  ```

- Since the VQA submission system requires submitting the whole test data, we need to run inference over all test splits (i.e., test-dev, test-standard, test-challenge, and test held-out).
  It takes around 10-15 minutes to run test inference (448K instances).

  ```bash
  bash run/vqa_test.bash 0 vqa_lxr955_results --test test --load snap/vqa/vqa_lxr955/BEST
  ```

  The test results will be saved in `snap/vqa_lxr955_results/test_predict.json`.
The VQA 2.0 challenge for this year is hosted on EvalAI at https://evalai.cloudcv.org/web/challenges/challenge-page/163/overview, and it still accepts submissions after the challenge has ended. Please check the official website of the VQA Challenge for detailed information and follow the instructions on EvalAI to submit. In general, after registration, the only remaining step is to upload the `test_predict.json` file and wait for the results.
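Before uploading, the prediction file can be sanity-checked locally. The sketch below assumes the standard VQA 2.0 result format, a JSON list of objects with an integer `question_id` and a string `answer`; `check_vqa_predictions` is a hypothetical helper, not part of this repository:

```python
import json

def check_vqa_predictions(path):
    """Validate a VQA-style prediction file (a JSON list of
    {"question_id": int, "answer": str} objects) and return the
    number of predictions it contains."""
    with open(path) as f:
        predictions = json.load(f)
    assert isinstance(predictions, list), "submission must be a JSON list"
    for entry in predictions:
        assert isinstance(entry["question_id"], int), "question_id must be an int"
        assert isinstance(entry["answer"], str), "answer must be a string"
    return len(predictions)

# e.g. check_vqa_predictions("snap/vqa_lxr955_results/test_predict.json")
```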
The testing accuracy obtained with exactly this code is 72.42% for test-dev and 72.54% for test-standard. The results with this code base are also publicly shown on the VQA 2.0 leaderboard under the entry "LXMERT github version".
### GQA
#### Fine-tuning
- Please make sure the LXMERT pre-trained model is either downloaded or pre-trained.
- Download the re-distributed json files for the GQA balanced-version dataset. The original GQA dataset is available in the Download section of its website, and the script to preprocess these datasets is under `data/gqa/process_raw_data_scripts`.

  ```bash
  mkdir -p data/gqa
  wget https://nlp.cs.unc.edu/data/lxmert_data/gqa/train.json -P data/gqa/
  wget https://nlp.cs.unc.edu/data/lxmert_data/gqa/valid.json -P data/gqa/
  wget https://nlp.cs.unc.edu/data/lxmert_data/gqa/testdev.json -P data/gqa/
  ```
- Download Faster R-CNN features for Visual Genome and GQA testing images (30 GB). GQA's training and validation data are collected from Visual Genome, while its testing images come from the MS COCO test set (I have verified this with Drew A. Hudson, one of the GQA authors). The image features are also available on Google Drive and Baidu Drive (see Alternative Download for details).

  ```bash
  mkdir -p data/vg_gqa_imgfeat
  wget https://nlp.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/vg_gqa_obj36.zip -P data/vg_gqa_imgfeat
  unzip data/vg_gqa_imgfeat/vg_gqa_obj36.zip -d data && rm data/vg_gqa_imgfeat/vg_gqa_obj36.zip
  wget https://nlp.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/gqa_testdev_obj36.zip -P data/vg_gqa_imgfeat
  unzip data/vg_gqa_imgfeat/gqa_testdev_obj36.zip -d data && rm data/vg_gqa_imgfeat/gqa_testdev_obj36.zip
  ```
