MADQA: Multimodal Agentic Document QA

Code accompanying the paper "Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections".

Leaderboard: MADQA Leaderboard | Dataset: OxRML/MADQA | Paper: arXiv

MADQA is a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents, designed to evaluate agentic reasoning over document collections.

<img src="hero-process.jpg" width="30%" alt="MADQA reasoning process"/> Figure: MADQA reasoning workflow. Given a query q over a document corpus C, the agentic system iteratively retrieves pages, reasons over visual and textual content, and aggregates evidence across pages to produce an attributed answer

Repository Structure

├── baselines/
│   ├── bm25-mllm-agent/    # BM25 MLLM Agent — iterative text search + visual reasoning (§4, App. G.1)
│   ├── recursive-lm/       # Recursive Language Models — programmatic context decomposition (§4, App. G.5)
│   ├── heaven/              # HEAVEN — hybrid visual retrieval (§4, App. G.3)
│   ├── gemini-file-search/  # Gemini File Search — managed RAG service (§4, App. G.2)
│   ├── openai-file-search/  # OpenAI File Search — managed RAG service (§4, App. G.2)
│   └── utils/               # Shared utilities (OCR preprocessing)
├── eval/                    # Evaluation CLI: Accuracy, ANLS*, Semantic Accuracy, Page F1, Kuiper (§3)
├── examples/                # Dataset loading examples
└── LICENSE

Each subdirectory has its own README with setup and usage instructions.

Quick Start

Load the dataset:

from datasets import load_dataset

dataset = load_dataset("OxRML/MADQA")

Evaluate predictions:

pip install -r eval/requirements.txt
python eval/evaluate.py results.jsonl
python eval/evaluate.py results.jsonl --semantic          # with LLM judge
python eval/evaluate.py m1.jsonl m2.jsonl --compare       # compare systems

See examples/ for more.

Example MADQA Question

<img src="hero-example.jpg" width="70%" alt="Example MADQA question"/> Figure: Example MADQA question requiring reasoning across multiple documents. The answer requires retrieving evidence from different reports and aggregating values across years.

License

Apache 2.0

Citation

If you use MADQA in your research, please cite:

@misc{borchmann2026madqa,
  title={Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections},
  author={Łukasz Borchmann and Jordy Van Landeghem and Michał Turski and Shreyansh Padarha and Ryan Othniel Kearns and Adam Mahdi and Niels Rogge and Clémentine Fourrier and Siwei Han and Huaxiu Yao and Artemis Llabrés and Yiming Xu and Dimosthenis Karatzas and Hao Zhang and Anupam Datta},
  year={2026},
  eprint={2603.12180},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.12180},
}

MADQA

Install / Use