WikiSeeker
[ACL 2026] WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering.
🔍 Overview
<p align="center"><img src="figures/pipeline.png" alt="method" width="1000px" /></p>

We introduce WikiSeeker, a novel multi-modal RAG framework that proposes a multi-modal retriever and redefines the role of vision-language models (VLMs) in knowledge-based visual question answering. Rather than serving merely as answer generators, VLMs take on two specialized agent roles: a Refiner and an Inspector. The Refiner uses the VLM to rewrite the textual query according to the input image, significantly improving the performance of the multi-modal retriever. The Inspector enables a decoupled generation strategy: it selectively routes reliable retrieved context to a separate LLM for answer generation, and falls back on the VLM's internal knowledge when retrieval is unreliable.
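The control flow described above (Refiner → retriever → Inspector → generator) can be sketched as follows. This is a minimal illustration, not the released implementation: every function name and body here is a hypothetical placeholder, and only the routing logic mirrors the pipeline described in the overview.

```python
# Hypothetical sketch of the WikiSeeker pipeline. All components below are
# stand-in placeholders; the actual models, retriever, and reliability
# check are defined by the released code, not by this snippet.

def refine_query(vlm, image, query):
    # Refiner: the VLM rewrites the textual query conditioned on the image.
    # Placeholder: append a visual-context marker instead of a real rewrite.
    return f"{query} [refined with visual context of {image}]"

def retrieve(knowledge_base, refined_query, k=3):
    # Multi-modal retriever placeholder: top-k documents by word overlap.
    words = set(refined_query.split())
    return [doc for doc in knowledge_base if words & set(doc.split())][:k]

def inspect(vlm, image, query, passages):
    # Inspector: decide whether the retrieved context is reliable.
    # Placeholder reliability check: anything retrieved counts as reliable.
    return len(passages) > 0

def answer(vlm, llm, image, query, knowledge_base):
    refined = refine_query(vlm, image, query)
    passages = retrieve(knowledge_base, refined)
    if inspect(vlm, image, query, passages):
        # Reliable retrieval: route the context to a separate LLM.
        return llm(query, passages)
    # Unreliable retrieval: fall back on the VLM's internal knowledge.
    return vlm(image, query)
```

The key design point is the decoupling: the Inspector acts as a gate, so the external LLM only ever sees context the VLM has judged trustworthy.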
🎯 Todo List
- [ ] Release paper on arXiv.
- [ ] Publish the details of dataset processing.
- [ ] Release the multi-modal retrieval code along with the corresponding knowledge base.
- [ ] Release the RL training code for the Refiner.
🧭 Acknowledgements
Our code is built upon EchoSight and DeepRetrieval. We thank the authors for their great work.
