VLE
VLE: Vision-Language Encoder (a vision-language multimodal pre-trained model)
Multimodal pre-trained models are trained on massive multimodal data, and they can utilize information from different modalities and perform various cross-modal tasks.
In this repository, we introduce VLE (Vision-Language Encoder), an image-text multimodal understanding model built on pre-trained text and image encoders. It can be used for multimodal discriminative tasks such as visual question answering and image-text retrieval. In particular, on the visual commonsense reasoning (VCR) task, which demands high-level language understanding and reasoning skills, VLE achieves the best performance among publicly available methods.
Recently, LLMs (Large Language Models) have achieved great success and have been used for a wide range of text tasks, including translation, question answering, text summarization, etc. While LLMs are unimodal, their abilities can be leveraged for multimodal understanding tasks. We propose a VQA+LLM pipeline that integrates multimodal models with LLMs for the visual question answering task. It helps the VQA model generate more accurate and fluent answers.
We open-source VLE-related resources to promote academic research and better serve the community.
Try our VLE-based VQA Demo at 🤗Space 👇👇👇
<div align=center><a href="https://huggingface.co/spaces/hfl/VQA_VLE_LLM"><img src="pics/demo-banner.png" alt="VLE-based VQA Demo" width="800" /></a></div>

Chinese LERT | Chinese and English PERT | Chinese MacBERT | Chinese MiniRBT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge distillation tool TextBrewer | Model pruning tool TextPruner
More resources released by HFL: https://github.com/iflytek/HFL-Anthology
Table of Contents
| Section | Description |
| ------- | ----------- |
| Introduction | Introduction to VLE |
| Downloads | Download links for VLE |
| Comparison | Comparison of VLE with other models |
| VQA with LLM | Visual question answering with LLM |
| Usage | How to load VLE for different tasks |
Introduction
Structure
The structure of VLE is similar to METER, which consists of two unimodal encoders for text and image separately, followed by a crossmodal fusion module. However, there are several structural differences between VLE and METER:
- VLE uses DeBERTa-v3 as the text encoder, which is stronger than RoBERTa-base used in METER.
- In the large version of VLE (VLE-large), the hidden size of the crossmodal co-attention fusion module is scaled up to 1024 to increase capacity.
- During fine-tuning, VLE introduces additional token_type_embeddings.
Pre-training
VLE is pre-trained with image-caption pairs. There are four objectives applied during the pre-training stage:
- MLM (Masked Language Modeling): Given an image-caption pair, we randomly mask some input text tokens, and the model is trained to reconstruct the original tokens.
- ITM (Image-Text Matching): Given a batch of matched or mismatched image-caption pairs, the model needs to identify which images and captions correspond to each other.
- MPC (Masked Patch-box Classification): Given an image-caption pair with some patches masked, the model needs to predict the classes of the objects in the masked patches.
- PBC (Patch-box Classification): Given an image-caption pair, the model needs to identify which patches are related to the caption.
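As an illustration of the first objective, MLM-style token masking can be sketched in plain Python. This is a simplified sketch: the actual pre-training masks subword tokens with the tokenizer's mask token, and the masking rate and helper below are illustrative, not taken from the repository.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Randomly mask tokens for an MLM-style objective.

    Returns the masked sequence and the labels: the original token at
    each masked position, None elsewhere (positions the loss ignores).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)   # model must reconstruct this token
        else:
            masked.append(tok)
            labels.append(None)  # not scored by the MLM loss
    return masked, labels

caption = "a dog runs on the beach".split()
masked, labels = mask_tokens(caption)
```

During pre-training, the model sees the masked caption together with the image, so it can use visual context to recover the masked tokens.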
VLE models are pre-trained on 14M public English image-caption pairs for 25k steps with a batch size of 2048.
The following figure illustrates the VLE structure and the pre-training objectives (for simplicity, we omit the PBC objective in the figure).
<div align=center><img src="pics/model.png" alt="VLE structure and pre-training tasks" width="500" /></div>

Adaptation for downstream tasks
Visual Question Answering (VQA)
- We follow standard practice and train the models on VQA with both the training and validation data, then test them on the test-dev set. The pooler output from the last layer of the fusion module is used for classification.
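The classification setup above amounts to a linear head on top of the pooled fusion output. The following is a rough sketch with stand-in numbers: 768 for the VLE-base hidden size and 3129 for the answer-vocabulary size commonly used for VQAv2 are our assumptions, not values stated in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size = 768   # assumed fusion-module hidden size (VLE-base)
num_answers = 3129  # assumed answer-vocabulary size (typical for VQAv2)

# Pooler output for a batch of 2 image-question pairs (stand-in values).
pooled = rng.standard_normal((2, hidden_size))

# Classification head: a single linear layer over the answer vocabulary.
W = rng.standard_normal((hidden_size, num_answers)) * 0.02
b = np.zeros(num_answers)

logits = pooled @ W + b
pred = logits.argmax(axis=-1)  # index of the predicted answer per example
```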
Visual Commonsense Reasoning (VCR)
- We format VCR as a multiple-choice task similar to RACE. For each object in an example's image, we append the average of the patches that cover the object to the image feature embeddings before the fusion module. We also assign token_type_ids to the objects in the image and text to improve alignment between the modalities.
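The patch-averaging step can be sketched as follows. This is a simplified illustration assuming a 224x224 image split into 16x16 patches (a 14x14 grid, as in CLIP-ViT-base-patch16); the box-to-grid mapping and the helper name are ours, not the repository's.

```python
import numpy as np

def object_embedding(patch_embeds, box, image_size=224, patch_size=16):
    """Average the embeddings of the patches covered by an object box.

    patch_embeds: (grid*grid, dim) patch features from the image encoder,
                  in row-major order.
    box: (x1, y1, x2, y2) object bounding box in pixel coordinates.
    """
    grid = image_size // patch_size
    x1, y1, x2, y2 = box
    # Patch-grid indices of the cells the box overlaps (inclusive).
    c1, r1 = int(x1 // patch_size), int(y1 // patch_size)
    c2 = min(int((x2 - 1) // patch_size), grid - 1)
    r2 = min(int((y2 - 1) // patch_size), grid - 1)
    idx = [r * grid + c for r in range(r1, r2 + 1) for c in range(c1, c2 + 1)]
    return patch_embeds[idx].mean(axis=0)

embeds = np.random.default_rng(0).standard_normal((14 * 14, 768))
vec = object_embedding(embeds, box=(32, 32, 96, 96))
```

The resulting vector has the same dimensionality as a single patch embedding, so it can be appended to the image feature sequence before the fusion module.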
Downloads
The model weights are in PyTorch format and can be downloaded from the 🤗 transformers model hub. You can either download the weights and configurations manually or initialize a VLE model with the from_pretrained(model_name) method in your code. See Usage for details.
Pre-trained Checkpoints
| Model | Text Encoder | Image Encoder | # Params<sup>*</sup> | MODEL_NAME | Link |
| --------- | ---------------- | ---------------------- | -------------------- | ------------- | ---- |
| VLE-base | DeBERTa-v3-base | CLIP-ViT-base-patch16 | 378M | hfl/vle-base | link |
| VLE-large | DeBERTa-v3-large | CLIP-ViT-large-patch14 | 930M | hfl/vle-large | link |
<sup>*</sup> : We exclude task heads when counting the number of parameters.
Fine-tuned Checkpoints
| Model | Text Encoder | Image Encoder | MODEL_NAME | Link |
| ---------------------- | ---------------- | ---------------------- | -------------------------- | ---- |
| VLE-base-for-VQA | DeBERTa-v3-base | CLIP-ViT-base-patch16 | hfl/vle-base-for-vqa | link |
| VLE-large-for-VQA | DeBERTa-v3-large | CLIP-ViT-large-patch14 | hfl/vle-large-for-vqa | link |
| VLE-base-for-VCR-q2a | DeBERTa-v3-base | CLIP-ViT-base-patch16 | hfl/vle-base-for-vcr-q2a | link |
| VLE-large-for-VCR-q2a | DeBERTa-v3-large | CLIP-ViT-large-patch14 | hfl/vle-large-for-vcr-q2a | link |
| VLE-base-for-VCR-qa2r | DeBERTa-v3-base | CLIP-ViT-base-patch16 | hfl/vle-base-for-vcr-qa2r | link |
| VLE-large-for-VCR-qa2r | DeBERTa-v3-large | CLIP-ViT-large-patch14 | hfl/vle-large-for-vcr-qa2r | link |
Comparison
In the following table, we compare the performance of VLE with METER and other multimodal models. The VQA results are on the test-dev set, and the VCR results are on the dev set.
| Model | VQA | VCR (QA2R) | VCR (Q2A) | #Params | #PT data<sup>*</sup> |
| ----- | --- | ---------- | --------- | ------- | -------------------- |
| CoCa | 82.3 | - | - | 2.1 B | unknown |
| BeiT-3 | 84.2 | - | - | 1.9 B | 21M(I-T) + 14M(I) + 160G(T) |
| OFA | 82.0 | - | - | 930M | 20M(I-T) + 39M(I) + 140G(T) |
| BLIP | 78.3 | - | - | 385M | ~130M(I-T) |
| METER-base | 77.7 (76.8<sup>†‡</sup>) | 79.8<sup>§</sup> | 77.6<sup>§</sup> | 345M | 9M(I-T) |
| METER-Huge | 80.3 | - | - | 878M | 20M(I-T) |
| VLE-base | 77.6<sup>‡</sup> | 83.7<sup>§</sup> | 79.9<sup>§</sup> | 378M | 15M(I-T) |
| VLE-large | 79.3<sup>‡</sup> | 87.5<sup>§</sup> | 84.3<sup>§</sup> | 930M | 15M(I-T) |
<sup>†</sup> : Result from our reimplementation.
<sup>‡</sup> : Fine-tuning hyperparameters: lr=7e-6, batch_size={256, 512}, num_epochs=10
<sup>§</sup> : Fine-tuning hyperparameters: lr=1e-5, batch_size=128, num_epochs=5
<sup>*</sup> : Pre-training data. I-T: Image-caption pairs. I: Images. T: Text.
From the above results, we can see that:
- VLE is pre-training efficient. Compared to models of similar size, VLE achieves comparable or better performance on VQA with much less pre-training data.
- VLE shows stronger reasoning ability. In particular, it significantly outperforms METER on Visual Commonsense Reasoning (VCR), which requires higher-level language understanding and reasoning skills than VQA.
VQA with LLM
Generating Accurate and Fluent VQA Answers
LLMs have achieved great success on a wide range of text tasks, and their abilities can also be leveraged for multimodal understanding tasks.
