RagVL
This is the official repo for the paper: "MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training".

Updates
- [2024-09-20]: To better reflect the generality of our proposed method, we have renamed it RagVL.
- [2024-08-05]: Code for RagVL (RagLLaVA) released.
- [2024-07-31]: Paper for RagVL (RagLLaVA) is online.
Getting Started
Environment Setup
The libraries required to run RagVL are listed in requirements.txt. We recommend following LLaVA's instructions to configure your environment.
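A minimal setup sketch (the environment name and Python version below are assumptions, chosen to match LLaVA's recommended setup):

```bash
# Sketch only: the environment name and Python version are assumptions
# based on LLaVA's recommended setup.
conda create -n ragvl python=3.10 -y
conda activate ragvl
pip install -r requirements.txt
```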
Data Preparation
Before running RagVL, please:
- Download the datasets and checkpoints from Google Drive.
- Download the image files from WebQA and MultimodalQA.
- Unzip the files, then place `checkpoints/` and `datasets/` into `RagVL/`.
- Place `tasks/` into `RagVL/finetune/`.
- Place `MMQA_imgs/` and `train_img/` into `RagVL/finetune/tasks/`.
- Place `val_image/` into `RagVL/datasets/`.
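After these steps, the directory layout should look roughly like this (only the folders mentioned above are shown):

```
RagVL/
├── checkpoints/
├── datasets/
│   └── val_image/
└── finetune/
    └── tasks/
        ├── MMQA_imgs/
        └── train_img/
```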
Training
- Reranker
| Models | Global Batch Size | Epochs |
| --- | ---: | ---: |
| LLaVA-v1.5-13B | 16 | 2 (WebQA) / 1 (others) |
| Qwen-VL-Chat | 16 | 2 (WebQA) / 1 (others) |
| mPLUG-Owl2 | 16 | 2 (WebQA) / 1 (others) |
| InternVL2-1B | 16 | 1 |
| InternVL2-2B | 16 | 1 |
- Generator
| Models | Global Batch Size | Epochs |
| --- | ---: | ---: |
| LLaVA-v1.5-13B | 16 | 2 (WebQA) / 3 (MMQA) |
| InternVL2-1B | 16 | 1 |
| InternVL2-2B | 16 | 1 |
Except for the two hyperparameters above, all other hyperparameters follow the default settings of each model.
To finetune LLaVA-v1.5-13B, Qwen-VL-Chat, and mPLUG-Owl2, use the corresponding finetune script in `RagVL/finetune/scripts/`.
To finetune InternVL2-1B and InternVL2-2B, use the corresponding finetune script in `RagVL/internvl_chat/shell/internvl2.0/2nd_finetune/`.
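For example, a finetune run can be launched like this (the script name below is hypothetical; pick the actual script for your model and dataset from the directories above):

```bash
# Hypothetical script name for illustration only; use the actual script
# for your model/dataset from RagVL/finetune/scripts/.
bash finetune/scripts/finetune_lora_webqa.sh
```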
Evaluation
To evaluate RagVL on WebQA / MultimodalQA, run the following command:

```bash
# The same arguments apply to mmqa_pipeline.py.
#   --reranker_model : select the reranker
#   --generator_model: select the generator
#   --filter         : select the adaptive threshold
#   --clip_topk      : we first retrieve 20 candidates by default
python webqa_pipeline.py \
    --reranker_model caption_lora \
    --generator_model noise_injected_lora \
    --filter 0 \
    --clip_topk 20
```
To evaluate the oracle setting on WebQA / MultimodalQA, run the following command:

```bash
# The same arguments apply to mmqa_oracle.py.
python webqa_oracle.py
```
Citation
If you find this work interesting or inspiring, please cite it as:
```bibtex
@article{chen2024mllm,
  title={MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training},
  author={Chen, Zhanpeng and Xu, Chengjin and Qi, Yiyan and Guo, Jian},
  journal={arXiv preprint arXiv:2407.21439},
  year={2024}
}
```
Related Projects
- LLaVA: Large Language and Vision Assistant
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
- InternVL: A Pioneering Open-Source Alternative to GPT-4o
- Visualized BGE: A universal multi-modal embedding model
- VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- CAL: Prioritizing Visual Correlation by Contrastive Alignment
