EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis (NeurIPS’25 D&B)

<p align="center"> <img src="./assets/logo.png" alt="" width="120" height="120"> </p> <!-- <i>The avatar is generated by DALLE-3.</i> -->

🤖 Homepage | 🤗 Dataset | 📖 Paper

Shengyuan Liu<sup>1*</sup> Boyun Zheng<sup>1*</sup> Wenting Chen<sup>2*</sup> Zhihao Peng<sup>1</sup> Zhenfei Yin<sup>3</sup> Jing Shao<sup>4</sup> Jiancong Hu<sup>5</sup> Yixuan Yuan<sup>1✉</sup>

<sup>1</sup>Chinese University of Hong Kong   <sup>2</sup>City University of Hong Kong   <sup>3</sup>University of Oxford  

<sup>4</sup>Shanghai AI Laboratory   <sup>5</sup>The Sixth Affiliated Hospital, Sun Yat-sen University  

<sup>*</sup> Equal Contributions. <sup>✉</sup> Corresponding Author.

This repository is the official implementation of the paper EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis.

✨ News

  • [01/03/2026] We release the EndoVQA-Instruct Dataset here.
  • [21/10/2025] We release a new open-set challenging VQA benchmark, EndoBench-extended.
  • [19/09/2025] 🎉🎉 Our EndoBench was accepted to the NeurIPS'25 D&B Track!
  • [29/05/2025] The manuscript can be found on arXiv.

🚀Overview

In this paper, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capacities.

EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets.
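The benchmark's organization into scenarios and tasks can be illustrated with a minimal sketch; note that the record fields below are illustrative assumptions, not the dataset's actual schema:

```python
from collections import Counter

# Hypothetical VQA records; the field names and values below are
# illustrative assumptions, not the actual EndoBench schema.
vqa_pairs = [
    {"scenario": "colonoscopy", "task": "polyp_detection", "answer": "A"},
    {"scenario": "colonoscopy", "task": "polyp_counting", "answer": "B"},
    {"scenario": "capsule_endoscopy", "task": "bleeding_detection", "answer": "A"},
]

# Tally pairs per endoscopic scenario and per clinical task.
by_scenario = Counter(p["scenario"] for p in vqa_pairs)
by_task = Counter(p["task"] for p in vqa_pairs)

print(by_scenario["colonoscopy"])  # → 2
```

In the real benchmark, such tallies would span the 4 scenarios and 12 tasks over all 6,832 pairs.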

EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning.

overview

🏥 EndoBench-Extended

EndoBench-extended is the extended version of EndoBench, comprising 48 images and 91 open-set VQA pairs carefully selected by medical experts to ensure clinical relevance and diagnostic difficulty. It specifically focuses on challenging and underrepresented cases, including rare pathologies, overlapping lesions, atypical anatomical presentations, and post-surgical endoscopic views.
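Open-set answers are free text, so they cannot be scored by option matching. The repository does not specify its open-set metric here, so the following is a generic token-overlap F1 sketch of the kind commonly used for free-text QA, not EndoBench-extended's official scorer:

```python
import re

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a free-text prediction and a reference answer.
    A generic open-set VQA metric; EndoBench-extended's actual scoring
    protocol may differ (e.g. an LLM-based judge)."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return float(pred == ref)
    # Count overlapping tokens, consuming each reference token at most once.
    common = 0
    ref_pool = list(ref)
    for tok in pred:
        if tok in ref_pool:
            ref_pool.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("a sessile polyp in the ascending colon",
                     "sessile polyp, ascending colon"), 2))  # → 0.73
```

Partial credit for near-miss answers is exactly what makes such metrics forgiving; an LLM judge would instead assess clinical correctness directly.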

📦Evaluation

We provide a comprehensive evaluation of existing MLLMs on EndoBench.

  1. This project is built upon VLMEvalKit. Visit the VLMEvalKit Quickstart Guide for installation instructions, or run the following commands for a quick start:

git clone https://github.com/CUHK-AIM-Group/EndoBench.git
cd EndoBench
pip install -e .

  2. Evaluate your model with the following command:

python run.py --data EndoBench --model Your_model_name

Demo: Qwen2.5-VL-7B-Instruct on EndoBench, Inference only

python run.py --data EndoBench --model Qwen2.5-VL-7B-Instruct --mode infer
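To queue up several models back-to-back, a plain shell loop over run.py works. The model names below are examples only; use identifiers registered in your VLMEvalKit installation. The echo makes this a dry run that prints each command (remove it to actually launch the jobs):

```shell
# Dry-run loop: print the inference command for each candidate model.
# Model names are examples; substitute any identifier VLMEvalKit knows.
for model in Qwen2.5-VL-7B-Instruct InternVL2-8B; do
    echo python run.py --data EndoBench --model "$model" --mode infer
done
```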

We provide a performance comparison among existing MLLMs across the 4 major categories in EndoBench:

comparison

For more details, please see our paper.

🔍 Insights

  1. Endoscopy remains a challenging domain for MLLMs, with significant gaps between models and human expertise. Human experts achieve an average accuracy of 74.12% in endoscopy tasks, while the top-performing model, Gemini-2.5-Pro, reaches only 49.53%—a gap of roughly 25 percentage points.

  2. Medical domain-specific Supervised Fine-Tuning markedly boosts model performance. Medical models that underwent domain-specific supervised fine-tuning, such as MedDr and HuatuoGPT-Vision-34B, perform exceptionally well in tasks like landmark identification and organ recognition, even outperforming all proprietary models.

  3. Model performance varies with visual prompt formats, exposing a gap between visual perception and medical comprehension. The ability of models to understand spatial information varies significantly based on how visual prompts are formatted.

  4. Polyp counting exposes dual challenges in lesion identification and numerical reasoning. Our findings highlight the importance of incorporating domain-specific medical knowledge into MLLMs to enhance their performance in tasks that combine visual analysis with clinical expertise.

🎈Acknowledgements

We greatly appreciate the tremendous effort behind the following projects!

Note: This dataset is built based on multiple public datasets. The sources of these datasets have been clearly indicated in the paper. Users should abide by the relevant licenses and terms of use of the original datasets: Kvasir, HyperKvasir, Kvasir-Capsule, GastroVision, KID, WCEBleedGen, SEE-AI, Kvasir-Seg, CVC-ColonDB, ETIS-Larib, CVC-ClinicDB, CVC-300, EDD2020, SUN-Database, LDPolypVideo, PolypGen, Cholec80, EndoVis-17, EndoVis-18, and PSI-AVA.

We greatly appreciate all the authors of these datasets for their contributions to the field of endoscopy analysis.

📜Citation

If you find this work helpful for your project, please consider citing our paper.

@article{liu2025endobench,
  author={Shengyuan Liu and Boyun Zheng and Wenting Chen and Zhihao Peng and Zhenfei Yin and Jing Shao and Jiancong Hu and Yixuan Yuan},
  title={EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis},
  journal={arXiv preprint arXiv:2505.23601},
  year={2025}
}
