WikiSeeker
[ACL 2026] WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering.
🔍 Overview
<p align="center"><img src="figures/pipeline.png" alt="method" width="1000px" /></p>

We introduce WikiSeeker, a novel multi-modal RAG framework that proposes a multi-modal retriever and redefines the role of vision-language models (VLMs) in knowledge-based visual question answering. Rather than serving merely as answer generators, VLMs take on two specialized agent roles: a Refiner and an Inspector. The Refiner uses the VLM to rewrite the textual query according to the input image, significantly improving the performance of the multi-modal retriever. The Inspector enables a decoupled generation strategy: it selectively routes reliable retrieved context to a separate LLM for answer generation, and falls back on the VLM's internal knowledge when retrieval is unreliable.
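The control flow described above (Refiner → retriever → Inspector → generator) can be sketched as follows. This is a minimal illustration, not the released implementation: every function name and body here is a hypothetical placeholder, and only the routing logic mirrors the pipeline described in the overview.

```python
# Hypothetical sketch of the WikiSeeker pipeline. All components below are
# stand-in placeholders; the actual models, retriever, and reliability
# check are defined by the released code, not by this snippet.

def refine_query(vlm, image, query):
    # Refiner: the VLM rewrites the textual query conditioned on the image.
    # Placeholder: append a visual-context marker instead of a real rewrite.
    return f"{query} [refined with visual context of {image}]"

def retrieve(knowledge_base, refined_query, k=3):
    # Multi-modal retriever placeholder: top-k documents by word overlap.
    words = set(refined_query.split())
    return [doc for doc in knowledge_base if words & set(doc.split())][:k]

def inspect(vlm, image, query, passages):
    # Inspector: decide whether the retrieved context is reliable.
    # Placeholder reliability check: anything retrieved counts as reliable.
    return len(passages) > 0

def answer(vlm, llm, image, query, knowledge_base):
    refined = refine_query(vlm, image, query)
    passages = retrieve(knowledge_base, refined)
    if inspect(vlm, image, query, passages):
        # Reliable retrieval: route the context to a separate LLM.
        return llm(query, passages)
    # Unreliable retrieval: fall back on the VLM's internal knowledge.
    return vlm(image, query)
```

The key design point is the decoupling: the Inspector acts as a gate, so the external LLM only ever sees context the VLM has judged trustworthy.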
🎯 Todo List
- [ ] Release paper on arXiv.
- [ ] Publish the details of dataset processing.
- [ ] Release the multi-modal retrieval code along with the corresponding knowledge base.
- [ ] Release the RL training code for the Refiner.
🧭 Acknowledgements
Our code is built upon EchoSight and DeepRetrieval. We thank the authors for their great work.
