UnifER
A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA
Official implementation for the MM'22 paper.

There are two config files in cfgs, one for the OK-VQA dataset and one for the FVQA dataset. Note that we mainly test our method on OK-VQA.
Prerequisites
- python==3.7
- pytorch==1.10.0
Dataset
First of all, make sure all the data are in the right places according to the config file settings.
- Please download the OK-VQA dataset from the link in the original paper.
- The image features can be found at the LXMERT repo (if you only need the ViLT model, skip these features and download only the MSCOCO images). A quick path-check sketch follows this list.
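Since the loaders read every path straight from the config, a quick sanity check before training can save a failed run. Below is a minimal sketch, assuming YAML configs in cfgs; the file name cfgs/okvqa.yml and the idea of path-like string values are assumptions, not the repo's actual schema:

```python
import os
import yaml  # pip install pyyaml

# NOTE: 'cfgs/okvqa.yml' and the key layout are assumptions for illustration.
with open('cfgs/okvqa.yml') as f:
    cfg = yaml.safe_load(f)

def check_paths(node, prefix=''):
    """Recursively report every path-like string value that does not exist."""
    if isinstance(node, dict):
        for key, value in node.items():
            check_paths(value, f'{prefix}{key}.')
    elif isinstance(node, str) and ('/' in node or node.endswith(('.json', '.h5', '.tsv'))):
        status = 'ok' if os.path.exists(node) else 'MISSING'
        print(f'{status:8s} {prefix[:-1]} -> {node}')

check_paths(cfg)
```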
Pre-processing:
The last step is optional and applies to LXMERT and VisualBERT only (a minimal sketch of it appears after this list).
- Process answers:
  python tools/answer_parse_okvqa.py
- Extract the knowledge base with RoBERTa:
  python tools/kb_parse.py
- Convert image features to h5 (optional):
  python tools/detection_features_converter.py
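For reference, the optional conversion step amounts to re-packing the per-image detection features into one h5 file for fast random access during training. The sketch below illustrates the idea only; tools/detection_features_converter.py is the authoritative version, and the 36-region, 2048-d Faster R-CNN shapes are assumptions borrowed from the usual bottom-up-attention setup:

```python
import h5py
import numpy as np

NUM_BOXES, FEAT_DIM = 36, 2048  # typical bottom-up-attention shapes (assumption)

def write_h5(image_feats, out_path='features.h5'):
    """image_feats: {image_id: (boxes, features)} parsed from the LXMERT tsv dump
    (the tsv parsing itself is omitted here)."""
    ids = sorted(image_feats)
    with h5py.File(out_path, 'w') as h5:
        feats = h5.create_dataset('features', (len(ids), NUM_BOXES, FEAT_DIM), dtype='float32')
        boxes = h5.create_dataset('boxes', (len(ids), NUM_BOXES, 4), dtype='float32')
        for row, img_id in enumerate(ids):
            boxes[row], feats[row] = image_feats[img_id]
        # Keep the image_id -> row mapping so loaders can index by image id.
        h5.create_dataset('image_ids', data=np.asarray(ids, dtype='int64'))
```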
Model Training:
python main.py --name unifer --gpu 0
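For orientation, the paper's core idea is to train the retriever and reader jointly, so that answer supervision also teaches the retriever which knowledge is useful. The sketch below is our own paraphrase of one training step using a REALM-style marginalization over the top-k retrieved entries; it is not the repo's code, and the model signatures, the precomputed kb_embeds, and the loss form are all assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(retriever, reader, batch, kb_embeds, optimizer, top_k=5):
    """One joint retriever-reader update (conceptual sketch, not the repo's code)."""
    # Retriever: embed the question and score every knowledge entry.
    q_embed = retriever(batch['question'])            # (B, D)
    scores = q_embed @ kb_embeds.T                    # (B, |KB|); kb_embeds precomputed
    top_scores, top_idx = scores.topk(top_k, dim=-1)  # (B, K)
    p_retrieve = F.softmax(top_scores, dim=-1)        # (B, K)

    # Reader: answer once per retrieved entry, then marginalize the loss over
    # entries so answer supervision back-propagates into the retrieval scores.
    per_entry = []
    for k in range(top_k):
        logits = reader(batch['image'], batch['question'], top_idx[:, k])
        loss_k = F.binary_cross_entropy_with_logits(
            logits, batch['target'], reduction='none').mean(dim=-1)
        per_entry.append(loss_k)                      # (B,)
    loss = (p_retrieve * torch.stack(per_entry, dim=-1)).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```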
Model Evaluation:
python main.py --name unifer --test-only
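OK-VQA is typically scored with the standard soft VQA accuracy, where a predicted answer earns min(#matching annotators / 3, 1). A minimal sketch of that metric follows; the repo's evaluation may differ in answer normalization details:

```python
def vqa_accuracy(predicted, annotator_answers):
    """Soft VQA accuracy: min(#annotators giving this answer / 3, 1)."""
    matches = sum(answer == predicted for answer in annotator_answers)
    return min(matches / 3.0, 1.0)

# Example: 10 annotators, 2 of whom said "umbrella".
answers = ['umbrella'] * 2 + ['parasol'] * 8
print(vqa_accuracy('umbrella', answers))  # ~0.667 (2/3)
print(vqa_accuracy('parasol', answers))   # 1.0 (8/3 capped at 1)
```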
Citation:
If you find this repo helpful, please consider citing the following paper :+1:
@inproceedings{unifer,
  author    = {Yangyang Guo and Liqiang Nie and Yongkang Wong and Yibing Liu and Zhiyong Cheng and Mohan S. Kankanhalli},
  title     = {A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA},
  booktitle = {ACM Multimedia Conference},
  publisher = {ACM},
  year      = {2022}
}
