
WikiSeeker

[ACL 2026] WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering.

Install / Use

/learn @zhuyjan/WikiSeeker

README

<p align="center"> <img width="400" height="300" src="figures/WikiSeeker_Logo.png"> </p>

WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

License: Apache-2.0

🔍 Overview

<p align="center"><img src="figures/pipeline.png" alt="method" width="1000px" /></p>

We introduce WikiSeeker, a novel multi-modal RAG framework for knowledge-based visual question answering that proposes a multi-modal retriever and redefines the role of VLMs. Rather than treating VLMs merely as answer generators, we assign them two specialized agent roles: a Refiner and an Inspector. The Refiner uses the VLM to rewrite the textual query according to the input image, significantly improving the performance of the multi-modal retriever. The Inspector enables a decoupled generation strategy: it routes reliable retrieved context to a separate LLM for answer generation, and falls back on the VLM's internal knowledge when retrieval is unreliable.
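The Refiner → retriever → Inspector flow described above can be sketched as follows. This is a minimal illustrative sketch, not the WikiSeeker implementation: all function names, the string-overlap "retriever", and the score threshold are assumptions made for illustration.

```python
# Hypothetical sketch of the WikiSeeker-style pipeline.
# The real system uses a VLM for refine/inspect and a trained
# multi-modal retriever; stubs stand in for them here.

def refine_query(question: str, image_caption: str) -> str:
    """Refiner: rewrite the textual query using the image content.
    (A VLM would do this; we simulate it with string composition.)"""
    return f"{question} (image shows: {image_caption})"

def retrieve(query: str, knowledge_base: dict) -> tuple[str, float]:
    """Retriever stub: return the best-matching passage and a
    confidence score based on naive substring overlap."""
    def overlap(key: str) -> int:
        return sum(word in query for word in key.split())
    best_key = max(knowledge_base, key=overlap)
    score = overlap(best_key) / max(len(best_key.split()), 1)
    return knowledge_base[best_key], score

def inspect_and_answer(question: str, passage: str, score: float,
                       threshold: float = 0.5) -> str:
    """Inspector: route reliable context to an external LLM for
    generation; otherwise fall back to the VLM's internal knowledge."""
    if score >= threshold:
        return f"[LLM answer grounded in]: {passage}"
    return "[VLM internal-knowledge answer]"

if __name__ == "__main__":
    kb = {"golden gate bridge": "The Golden Gate Bridge opened in 1937."}
    q = refine_query("When did the bridge open?", "golden gate bridge")
    passage, score = retrieve(q, kb)
    print(inspect_and_answer("When did the bridge open?", passage, score))
```

The decoupling is the key design point: generation quality no longer hinges on the VLM alone, because the Inspector decides per-query whether retrieved evidence is trustworthy enough to hand to the LLM.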

🎯 Todo List

  • [ ] Release the paper on arXiv.
  • [ ] Publish the details of dataset processing.
  • [ ] Release the multi-modal retrieval code along with the corresponding knowledge base.
  • [ ] Release the RL training code for Refiner.

🧭 Acknowledgements

Our code is built upon EchoSight and DeepRetrieval. We thank the authors for their great work.
