EndoViT

Large-scale Self-supervised Pre-training of Vision Transformers (ViT) on endoscopic images.


Official codebase of the paper: EndoViT: pretraining vision transformers on a large collection of endoscopic images

An earlier arXiv version (without the semantic segmentation experiments) can be found here: Whether and When does Endoscopy Domain Pretraining Make Sense?

Authors: Dominik Batić, Felix Holm, Ege Özsoy, Tobias Czempiel, Nassir Navab

@article{batic2023whether,
  title={Whether and When does Endoscopy Domain Pretraining Make Sense?},
  author={Bati{\'c}, Dominik and Holm, Felix and {\"O}zsoy, Ege and Czempiel, Tobias and Navab, Nassir},
  journal={arXiv preprint arXiv:2303.17636},
  year={2023}
}

Quick-Start

Check out our 🤗 <a href="https://huggingface.co/egeozsoy/EndoViT" target="_blank">Hugging Face</a> page for a guide on using EndoViT as a feature extractor (either frozen or as a backbone to be fine-tuned). Alternatively, you can take a look at endovit_demo.py
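The frozen-feature-extractor pattern mentioned above can be sketched as follows. This is a generic PyTorch sketch, not the repository's implementation: the backbone here is a small placeholder module standing in for a pre-trained EndoViT ViT-B/16 (see the Hugging Face page for how to load the real weights), and the 7-class head is a hypothetical example.

```python
import torch
import torch.nn as nn

# Placeholder backbone standing in for a pre-trained EndoViT ViT-B/16;
# in practice you would load the real weights as described on the
# Hugging Face page. 768 is the ViT-B embedding dimension.
backbone = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, 768),
)

# Frozen feature extractor: disable gradients for every backbone parameter.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

# A small task head trained on top of the frozen features
# (7 classes is a hypothetical choice, e.g. surgical phases).
head = nn.Linear(768, 7)

images = torch.randn(4, 3, 224, 224)  # a batch of endoscopic frames
with torch.no_grad():
    features = backbone(images)       # shape (4, 768)
logits = head(features)               # shape (4, 7)
```

In the fine-tuned variant, you would skip the `requires_grad = False` loop and optimize the backbone and head jointly, typically with a smaller learning rate on the backbone.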

Pre-trained EndoViT Checkpoints

To prevent data leakage during evaluation, we excluded the respective test set from backbone pre-training for each of our segmentation, action triplet recognition, and surgical phase recognition tasks. You can find the weights for each of these backbone versions below.

| Excluded Data (Test Sets) | Checkpoint |
|:--------------------------------------|:------------|
| CholecSeg8k (Segmentation) | EndoViT_Seg |
| CholecT45 (Action Triplet Detection) | EndoViT_ATD |
| Cholec80 (Surgical Phase Recognition) | EndoViT_SPR |

Use these checkpoints if you wish to skip EndoViT's pre-training.
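Loading one of these checkpoints typically follows the standard PyTorch pattern below. This is a hedged sketch: the model here is a tiny stand-in (not the real ViT), the file name is hypothetical, and the exact key layout of the released checkpoints may differ, which is why `strict=False` is useful.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn

# Hypothetical stand-in for the ViT backbone a downloaded checkpoint targets.
model = nn.Linear(16, 8)

# Simulate a downloaded checkpoint file; a real EndoViT checkpoint would be
# loaded the same way once downloaded from the links above.
ckpt_path = Path(tempfile.mkdtemp()) / "endovit_seg.pth"
torch.save({"state_dict": model.state_dict()}, ckpt_path)

ckpt = torch.load(ckpt_path, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # some checkpoints nest the weights

# strict=False tolerates keys used only during pre-training (e.g. an MAE
# decoder) and returns which keys were missing or unexpected.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
```

Inspecting the returned `missing` and `unexpected` lists is a quick sanity check that the checkpoint actually matches the backbone you built.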

Introduction

The development of novel Computer Vision (CV) methods in the medical field has been largely constrained by the lack of publicly available annotated data. Patient data and recorded surgical procedures are hard to obtain. They are considered highly sensitive information and therefore protected by numerous laws. Even the annotation procedure is complicated, often requiring the involvement of multiple medical experts.

Consequently, public medical datasets are scarce, and the existing ones contain far fewer annotated images than the CV datasets used for the same task. Pre-training has been shown as a viable strategy to mitigate the downsides of training on small datasets. However, most medical works use models pre-trained on natural images, creating a domain gap between pre-training and fine-tuning.

In this work, we explore the possibilities of pre-training models specifically for use in the endoscopic domain. To this end, we turn to Vision Transformers. Given the very large number of parameters they contain, a large amount of data is needed to train them properly. Therefore, self-supervised pre-training strategies were developed, splitting the use of Transformers into two stages. First, a Transformer is pre-trained on a large collection of raw, unlabelled data to produce a model with a general understanding of the underlying domain. Afterwards, the resulting model is fine-tuned for a specific downstream task, which can now be done with significantly less labelled data.

Project Description

The fact that Vision Transformers can be pre-trained on raw data alone prompted us to combine existing smaller medical datasets into a larger collection. To this end, we introduce Endo700k, a collection of 9 publicly available endoscopic datasets comprising more than 700,000 unlabelled images. An overview of the included datasets is given in the table below.

Endo700k dataset collection

| # | Dataset | # Images |
|:-:|:------------------:|---------:|
| 1 | HeiCo | 347,257 |
| 2 | Cholec80 | 184,498 |
| 3 | PSI-AVA | 73,618 |
| 4 | ESAD | 49,544 |
| 5 | LapGyn4 (v1.2) | 38,192 |
| 6 | hSDB-instrument | 35,576 |
| 7 | DSAD | 13,195 |
| 8 | GLENDA (v1.0) | 1,083 |
| 9 | SurgicalActions160 | 761 |
| - | Total | 743,724 |
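As a quick sanity check, the per-dataset image counts from the table above do add up to the stated total:

```python
# Image counts copied from the Endo700k table above.
endo700k = {
    "HeiCo": 347_257,
    "Cholec80": 184_498,
    "PSI-AVA": 73_618,
    "ESAD": 49_544,
    "LapGyn4 (v1.2)": 38_192,
    "hSDB-instrument": 35_576,
    "DSAD": 13_195,
    "GLENDA (v1.0)": 1_083,
    "SurgicalActions160": 761,
}

total = sum(endo700k.values())
print(total)  # 743724, matching the "Total" row
```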

Using Endo700k, we pre-train a Vision Transformer model following the Masked Autoencoder (MAE) approach. An input image is divided into equally-sized patches, and a large proportion of them (75%) is masked out. The transformer is then tasked with reconstructing the missing input. Although conceptually simple, this is a challenging self-supervised task that induces a comprehensive understanding of the observed objects and scenes. Afterwards, the pre-trained ViT model can be fine-tuned as a feature-extraction backbone on various downstream tasks. We visualize the pre-training and fine-tuning procedure in the following image.
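The random masking step described above can be sketched as follows (a simplified PyTorch version of MAE-style masking, not the repository's exact implementation):

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch embeddings, as in MAE's random masking.

    patches: (batch, num_patches, dim) sequence of patch embeddings.
    Returns the kept patches and a binary mask (1 = masked out, 0 = kept).
    """
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # Draw a random permutation per sample; the first num_keep indices survive.
    noise = torch.rand(batch, num_patches)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_keep = ids_shuffle[:, :num_keep]

    # Gather the surviving patches only; the rest are reconstructed later.
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))

    mask = torch.ones(batch, num_patches)
    mask.scatter_(1, ids_keep, 0.0)
    return kept, mask

# A 224x224 image with 16x16 patches yields 196 patches;
# masking 75% keeps only 49 of them as encoder input.
patches = torch.randn(2, 196, 768)
kept, mask = random_masking(patches)
```

Because the encoder only sees the ~25% of patches that survive, MAE pre-training is also substantially cheaper per image than training on full patch sequences.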

<p align="center"> <img src="assets/EndoViT.png" alt="EndoViT_model" width="85%"> </p>

Finally, we evaluated EndoViT's performance on three downstream tasks:

  • Semantic Segmentation on the CholecSeg8k dataset,
  • Action Triplet Detection on the CholecT45 dataset and
  • Surgical Phase Recognition on the Cholec80 dataset.

We primarily compare EndoViT's performance to its ImageNet pre-trained ViT counterpart.


Usage

1) Clone the repository:

git clone https://github.com/DominikBatic/EndoViT.git endovit
cd endovit
  • NOTE: We have organized the repo so that everything should be run from the root ("endovit") directory.

Requirements:

2) Copy our "endovit" conda environment:

conda env create -f conda_environment.yml
conda activate endovit

Download Endo700k:


  • NOTE: By downloading any of the datasets you agree to the terms of their use. Please check the corresponding LICENCEs.

3) Download Cholec80 (GitHub) (LICENSE) (Request Form)

  • We use a slightly modified download script from the original repository ("prepare.py").
python ./datasets/Cholec80/download_cholec80.py --data_rootdir ./datasets/
  • The dataset can now be found at: ./datasets/Cholec80.

4) Download and Prepare the Other Datasets

  • We provide a helper script to download and pre-process all other datasets. The following command downloads each dataset and removes everything except the raw images needed for EndoViT pre-training. If you need the full datasets, please read the usage instructions at the beginning of the script.
  • You will need at least 700 GB of free disk space to download the datasets. After pre-processing, the collection takes up around 150 GB.
  • To download the HeiCo dataset, you first need to create a Synapse account. Afterwards, pass your email and password as arguments to the command below.
python ./datasets/Endo700k/download_and_prepare_Endo700k.py --all --synapse_email YOUR_EMAIL --synapse_password YOUR_PASSWORD
  • The datasets can now be found at ./datasets/Endo700k.

Pre-train EndoViT:


  • Since Endo700k is a collection of diverse datasets, we wanted to specialize our pre-training towards one of them. For this reason, all our downstream tasks are conducted on Cholec80 data (CholecT45 and CholecSeg8k are subsets of Cholec80 with different annotations).
  • To avoid data leakage, validation and test images of the downstream datasets had to be removed from the pre-training. Since Cholec80, CholecT45 and CholecSeg8k have different train/val/test splits, we decided to pre-train three models: one for each task by removing validation and test images of the corresponding downstream dataset from EndoViT's Cholec80 section.
  • Additionally, we created a validation dataset consisting of only Cholec80 images. We frequently evaluate our pre-trained models on it and save the best-performing ones, which implicitly assigns a higher weight to the Cholec80 images during pre-training.
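The leakage-avoidance step described in the bullets above amounts to filtering held-out videos out of the pre-training file list. The sketch below illustrates the idea with hypothetical frame paths and an assumed split; the actual Cholec80 directory layout and split assignments are handled by the repository's scripts.

```python
from pathlib import Path

# Hypothetical frame paths; real Cholec80 frames are organized per video.
frames = [
    "video01/frame_000001.jpg",
    "video01/frame_000002.jpg",
    "video41/frame_000001.jpg",  # suppose video41 is in the downstream test split
    "video52/frame_000001.jpg",  # suppose video52 is in the downstream val split
]

# Videos whose frames must not appear in pre-training (assumed split).
held_out_videos = {"video41", "video52"}

# Keep only frames from videos outside the downstream val/test splits.
pretrain_frames = [
    f for f in frames if Path(f).parts[0] not in held_out_videos
]
```

Because the three downstream datasets use different splits, this filtering is run three times, once per task, which is why three pre-trained backbones are released.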

5) Prepare Cholec80

  • The following script prepares three subvariants of Cholec80, each used to pre-train EndoViT for a different downstream task.
  • Additionally, it creates the pre-training validation dataset.
python ./datasets/Cholec80/prepare_cholec80.py
  • The datasets can now be found at: ./datasets/Endo700k under: Cholec80_for_Segmentation, Cholec80_for_ActionTripletDetection and Cholec80_for_SurgicalPhaseRecognition.
  • The validation dataset can now be found at: ./datasets/validation_dataset/Cholec80_for_Validation.

6) Down
