RagVL
This is the official repo for the paper: "MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training".

Updates
- [2024-09-20]: To better reflect the generality of our proposed method, we have renamed it RagVL.
- [2024-08-05]: Code for RagVL (RagLLaVA) released.
- [2024-07-31]: Paper for RagVL (RagLLaVA) is online.
Getting Started
Environment Setup
The libraries required to run RagVL are listed in requirements.txt. We recommend following LLaVA's instructions to configure your environment.
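A minimal setup sketch (the environment name and Python version below are assumptions, chosen to match LLaVA's recommended setup):

```bash
# Sketch only: the environment name and Python version are assumptions
# based on LLaVA's recommended setup.
conda create -n ragvl python=3.10 -y
conda activate ragvl
pip install -r requirements.txt
```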
Data Preparation
Before running RagVL, please:
- Download the datasets and checkpoints from Google Drive.
- Download the image files from WebQA and MultimodalQA.
- Unzip the files, then place `checkpoints/` and `datasets/` into `RagVL/`.
- Place `tasks/` into `RagVL/finetune/`.
- Place `MMQA_imgs/` and `train_img/` into `RagVL/finetune/tasks/`.
- Place `val_image/` into `RagVL/datasets/`.
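After these steps, the directory layout should look roughly like this (only the folders mentioned above are shown):

```
RagVL/
├── checkpoints/
├── datasets/
│   └── val_image/
└── finetune/
    └── tasks/
        ├── MMQA_imgs/
        └── train_img/
```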
Training
- Reranker
| Models | Global Batch Size | Epochs |
| --- | ---: | ---: |
| LLaVA-v1.5-13B | 16 | 2 (WebQA) / 1 (others) |
| Qwen-VL-Chat | 16 | 2 (WebQA) / 1 (others) |
| mPLUG-Owl2 | 16 | 2 (WebQA) / 1 (others) |
| InternVL2-1B | 16 | 1 |
| InternVL2-2B | 16 | 1 |
- Generator
| Models | Global Batch Size | Epochs |
| --- | ---: | ---: |
| LLaVA-v1.5-13B | 16 | 2 (WebQA) / 3 (MMQA) |
| InternVL2-1B | 16 | 1 |
| InternVL2-2B | 16 | 1 |
Except for the two hyperparameters above, all other hyperparameters follow the default settings of each model.
To finetune LLaVA-v1.5-13B, Qwen-VL-Chat, and mPLUG-Owl2, use the corresponding finetune script in `RagVL/finetune/scripts/`.
To finetune InternVL2-1B and InternVL2-2B, use the corresponding finetune script in `RagVL/internvl_chat/shell/internvl2.0/2nd_finetune/`.
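For example, a finetune run can be launched like this (the script name below is hypothetical; pick the actual script for your model and dataset from the directories above):

```bash
# Hypothetical script name for illustration only; use the actual script
# for your model/dataset from RagVL/finetune/scripts/.
bash finetune/scripts/finetune_lora_webqa.sh
```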
Evaluation
To evaluate RagVL on WebQA / MultimodalQA, run the following command:

```bash
# The same arguments apply to mmqa_pipeline.py.
#   --reranker_model : select the reranker
#   --generator_model: select the generator
#   --filter         : select the adaptive threshold
#   --clip_topk      : we first retrieve 20 candidates by default
python webqa_pipeline.py \
    --reranker_model caption_lora \
    --generator_model noise_injected_lora \
    --filter 0 \
    --clip_topk 20
```
To evaluate the oracle setting on WebQA / MultimodalQA, run the following command:

```bash
# The same arguments apply to mmqa_oracle.py.
python webqa_oracle.py
```
Citation
If you find this work interesting or inspiring, please cite it as:
```bibtex
@article{chen2024mllm,
  title={MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training},
  author={Chen, Zhanpeng and Xu, Chengjin and Qi, Yiyan and Guo, Jian},
  journal={arXiv preprint arXiv:2407.21439},
  year={2024}
}
```
Related Projects
- LLaVA: Large Language and Vision Assistant
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
- InternVL: A Pioneering Open-Source Alternative to GPT-4o
- Visualized BGE: A universal multi-modal embedding model
- VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- CAL: Prioritizing Visual Correlation by Contrastive Alignment
