EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis (NeurIPS’25 D&B)

<p align="center"> <img src="./assets/logo.png" alt="" width="120" height="120"> </p> <!-- <i>The avatar is generated by DALLE-3.</i> -->

🤖 Homepage | 🤗 Dataset | 📖 Paper

Shengyuan Liu<sup>1*</sup> Boyun Zheng<sup>1*</sup> Wenting Chen<sup>2*</sup> Zhihao Peng<sup>1</sup> Zhenfei Yin<sup>3</sup> Jing Shao<sup>4</sup> Jiancong Hu<sup>5</sup> Yixuan Yuan<sup>1✉</sup>

<sup>1</sup>Chinese University of Hong Kong   <sup>2</sup>City University of Hong Kong   <sup>3</sup>University of Oxford  

<sup>4</sup>Shanghai AI Laboratory   <sup>5</sup>The Sixth Affiliated Hospital, Sun Yat-sen University  

<sup>*</sup> Equal Contributions. <sup>✉</sup> Corresponding Author.

This repository is the official implementation of the paper EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis.

✨ News

  • [01/03/2026] We release the EndoVQA-Instruct Dataset here.
  • [21/10/2025] We release a new open-set challenging VQA benchmark, EndoBench-extended.
  • [19/09/2025] 🎉🎉 Our EndoBench was accepted to the NeurIPS'25 D&B Track!
  • [29/05/2025] The manuscript can be found on arXiv.

🚀Overview

In this paper, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capacities.

EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets.
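The benchmark's organization into scenarios and tasks can be illustrated with a minimal sketch; note that the record fields below are illustrative assumptions, not the dataset's actual schema:

```python
from collections import Counter

# Hypothetical VQA records; the field names and values below are
# illustrative assumptions, not the actual EndoBench schema.
vqa_pairs = [
    {"scenario": "colonoscopy", "task": "polyp_detection", "answer": "A"},
    {"scenario": "colonoscopy", "task": "polyp_counting", "answer": "B"},
    {"scenario": "capsule_endoscopy", "task": "bleeding_detection", "answer": "A"},
]

# Tally pairs per endoscopic scenario and per clinical task.
by_scenario = Counter(p["scenario"] for p in vqa_pairs)
by_task = Counter(p["task"] for p in vqa_pairs)

print(by_scenario["colonoscopy"])  # → 2
```

In the real benchmark, such tallies would span the 4 scenarios and 12 tasks over all 6,832 pairs.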

EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning.

overview

🏥 EndoBench-Extended

EndoBench-extended is the extended version of EndoBench, comprising 48 images and 91 open-set VQA pairs carefully selected by medical experts to ensure clinical relevance and diagnostic difficulty. It specifically focuses on challenging and underrepresented cases, including rare pathologies, overlapping lesions, atypical anatomical presentations, and post-surgical endoscopic views.
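Open-set answers are free text, so they cannot be scored by option matching. The repository does not specify its open-set metric here, so the following is a generic token-overlap F1 sketch of the kind commonly used for free-text QA, not EndoBench-extended's official scorer:

```python
import re

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a free-text prediction and a reference answer.
    A generic open-set VQA metric; EndoBench-extended's actual scoring
    protocol may differ (e.g. an LLM-based judge)."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return float(pred == ref)
    # Count overlapping tokens, consuming each reference token at most once.
    common = 0
    ref_pool = list(ref)
    for tok in pred:
        if tok in ref_pool:
            ref_pool.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("a sessile polyp in the ascending colon",
                     "sessile polyp, ascending colon"), 2))  # → 0.73
```

Partial credit for near-miss answers is exactly what makes such metrics forgiving; an LLM judge would instead assess clinical correctness directly.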

📦Evaluation

We provide a comprehensive evaluation of existing MLLMs on EndoBench.

  1. This project is built upon VLMEvalKit. Visit the VLMEvalKit Quickstart Guide for installation instructions, or run the following commands for a quick start:

git clone https://github.com/CUHK-AIM-Group/EndoBench.git
cd EndoBench
pip install -e .

  2. Evaluate your model with the following command:

python run.py --data EndoBench --model Your_model_name

Demo: Qwen2.5-VL-7B-Instruct on EndoBench, Inference only

python run.py --data EndoBench --model Qwen2.5-VL-7B-Instruct --mode infer
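To queue up several models back-to-back, a plain shell loop over run.py works. The model names below are examples only; use identifiers registered in your VLMEvalKit installation. The echo makes this a dry run that prints each command (remove it to actually launch the jobs):

```shell
# Dry-run loop: print the inference command for each candidate model.
# Model names are examples; substitute any identifier VLMEvalKit knows.
for model in Qwen2.5-VL-7B-Instruct InternVL2-8B; do
    echo python run.py --data EndoBench --model "$model" --mode infer
done
```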

We provide a performance comparison among existing MLLMs across the 4 major categories in EndoBench:

comparison

For more details, please see our paper.

🔍 Insights

  1. Endoscopy remains a challenging domain for MLLMs, with significant gaps between models and human expertise. Human experts achieve an average accuracy of 74.12% in endoscopy tasks, while the top-performing model, Gemini-2.5-Pro, reaches only 49.53%—a gap of roughly 25 percentage points.

  2. Medical domain-specific Supervised Fine-Tuning markedly boosts model performance. Medical models that underwent domain-specific supervised fine-tuning, such as MedDr and HuatuoGPT-Vision-34B, perform exceptionally well in tasks like landmark identification and organ recognition, even outperforming all proprietary models.

  3. Model performance varies with visual prompt formats, exposing a gap between visual perception and medical comprehension. The ability of models to understand spatial information varies significantly based on how visual prompts are formatted.

  4. Polyp counting exposes dual challenges in lesion identification and numerical reasoning. Our findings highlight the importance of incorporating domain-specific medical knowledge into MLLMs to enhance their performance in tasks that combine visual analysis with clinical expertise.

🎈Acknowledgements

We greatly appreciate the tremendous effort behind the following projects!

Note: This dataset is built based on multiple public datasets. The sources of these datasets have been clearly indicated in the paper. Users should abide by the relevant licenses and terms of use of the original datasets: Kvasir, HyperKvasir, Kvasir-Capsule, GastroVision, KID, WCEBleedGen, SEE-AI, Kvasir-Seg, CVC-ColonDB, ETIS-Larib, CVC-ClinicDB, CVC-300, EDD2020, SUN-Database, LDPolypVideo, PolypGen, Cholec80, EndoVis-17, EndoVis-18, and PSI-AVA.

We greatly appreciate all the authors of these datasets for their contributions to the field of endoscopy analysis.

📜Citation

If you find this work helpful for your project, please consider citing our paper.

@article{liu2025endobench,
  author={Shengyuan Liu and Boyun Zheng and Wenting Chen and Zhihao Peng and Zhenfei Yin and Jing Shao and Jiancong Hu and Yixuan Yuan},
  title={EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis},
  journal={arXiv preprint arXiv:2505.23601},
  year={2025}
}
