# Endora: Video Generation Models as Endoscopy Simulators (MICCAI 2024)
<p align="center"> <img src="./assets/avatar.png" alt="" width="120" height="120"> </p>

Project Page | ArXiv Paper | Video Demo
Accepted by the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024)
Chenxin Li<sup>1*</sup> Hengyu Liu<sup>1*</sup> Yifan Liu<sup>1*</sup> Brandon Y. Feng<sup>2</sup> Wuyang Li<sup>1</sup> Xinyu Liu<sup>1</sup> Zhen Chen<sup>3</sup> Jing Shao<sup>4</sup> Yixuan Yuan<sup>1✉</sup>
<sup>1</sup>CUHK <sup>2</sup>MIT CSAIL <sup>3</sup>CAS CAIR <sup>4</sup>Shanghai AI Lab
<sup>*</sup> Equal Contributions. <sup>✉</sup> Corresponding Author.

## 💡 Key Features
- A high-fidelity medical video generation framework, tested on endoscopy scenes, laying the groundwork for further advancements in the field.
- The first public benchmark for endoscopy video generation, featuring a comprehensive collection of clinical videos and adapting existing general-purpose generative video models for this purpose.
- A novel technique to infuse generative models with features distilled from a 2D visual foundation model, ensuring consistency and quality across different scales.
- Versatile applicability, demonstrated through successful applications in video-based disease diagnosis and 3D surgical scene reconstruction, highlighting its potential for downstream medical tasks.
## 🛠 Setup
```bash
git clone https://github.com/XGGNet/Endora.git
cd Endora
conda create -n Endora python=3.10
conda activate Endora
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
Tip A: We tested the framework with PyTorch 2.1.2 compiled against CUDA 11.8. Other versions should also work, but are not fully verified.

Tip B: A GPU with 24 GB of memory (or more) is recommended for video sampling with <i>Endora</i> inference, and 48 GB (or more) for <i>Endora</i> training.
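As a quick sanity check after installation, the snippet below (a hypothetical helper, not part of the repository) reports the detected PyTorch/CUDA setup and degrades gracefully if `torch` is missing:

```python
import importlib.util

def check_env():
    """Report the installed PyTorch/CUDA setup, if any."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch
    cuda = torch.version.cuda if torch.cuda.is_available() else "unavailable"
    return f"torch {torch.__version__}, CUDA {cuda}"

print(check_env())
```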
## 📚 Data Preparation
- Colonoscopic: The dataset introduced by the paper can be found here. You can directly use the video data processed by Endo-FM without further processing.
- Kvasir-Capsule: The dataset introduced by the paper can be found here. You can directly use the video data processed by Endo-FM without further processing.
- CholecTriplet: The dataset introduced by the paper can be found here. You can directly use the video data processed by Endo-FM without further processing.

First, run `process_data.py` and `process_list.py` to obtain the split frames and the corresponding file lists:
```bash
CUDA_VISIBLE_DEVICES=gpu_id python process_data.py -s /path/to/datasets -t /path/to/save/video/frames
CUDA_VISIBLE_DEVICES=gpu_id python process_list.py -f /path/to/video/frames -t /path/to/save/text
```
The resulting file structure is as follows:
```
├── data
│   ├── CholecT45
│   │   ├── 00001.mp4
│   │   ├── ...
│   ├── Colonoscopic
│   │   ├── 00001.mp4
│   │   ├── ...
│   ├── Kvasir-Capsule
│   │   ├── 00001.mp4
│   │   ├── ...
│   ├── CholecT45_frames
│   │   ├── train_128_list.txt
│   │   ├── 00001
│   │   │   ├── 00000.jpg
│   │   │   ├── ...
│   │   ├── ...
│   ├── Colonoscopic_frames
│   │   ├── train_128_list.txt
│   │   ├── 00001
│   │   │   ├── 00000.jpg
│   │   │   ├── ...
│   │   ├── ...
│   ├── Kvasir-Capsule_frames
│   │   ├── train_128_list.txt
│   │   ├── 00001
│   │   │   ├── 00000.jpg
│   │   │   ├── ...
│   │   ├── ...
```
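The list-building step above essentially enumerates frame folders into a text file such as `train_128_list.txt`. A minimal stdlib-only sketch of that idea (the function name and the `"clip_name num_frames"` line format are assumptions, not the repository's exact implementation):

```python
from pathlib import Path

def build_frame_list(frames_root, out_file):
    """Write one line per clip folder, e.g. '00001 128'.

    Hypothetical sketch: scans immediate subfolders of frames_root,
    counts their .jpg frames, and writes the list to out_file.
    """
    lines = []
    for clip_dir in sorted(Path(frames_root).iterdir()):
        if not clip_dir.is_dir():
            continue
        n_frames = len(list(clip_dir.glob("*.jpg")))
        lines.append(f"{clip_dir.name} {n_frames}")
    Path(out_file).write_text("\n".join(lines))
    return lines
```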
## 🎇 Sampling Endoscopy Videos
You can directly sample endoscopy videos from a trained checkpoint. Here is a quick-start example using our pre-trained models:

- Download the pre-trained weights from here and put them at the path specified in the configs.
- Run `sample.py` via the scripts below; various arguments, such as the number of sampling steps, can be customized.
Simple sampling to generate a video:

```bash
bash sample/col.sh
bash sample/kva.sh
bash sample/cho.sh
```
Sampling with PyTorch DDP:

```bash
bash sample/col_ddp.sh
bash sample/kva_ddp.sh
bash sample/cho_ddp.sh
```
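The DDP scripts launch one sampling process per GPU via `torchrun`, which exports a `RANK` environment variable to each process. A common pattern in distributed sampling (a sketch of the general idea, not necessarily the repository's exact logic) is to offset the random seed by rank so each GPU generates distinct videos:

```python
import os

def rank_offset_seed(base_seed=0):
    """torchrun sets RANK per process; offsetting the seed by rank
    makes each GPU sample a different batch of videos."""
    rank = int(os.environ.get("RANK", "0"))
    return base_seed + rank
```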
## ⏳ Training Endora
The weights of the pretrained DINO model can be found here; our implementation uses ViT-B/8 when training Endora. The saved path needs to be set in `./configs`.
Train Endora at a resolution of 128x128 with `N` GPUs on the Colonoscopic dataset:

```bash
torchrun --nnodes=1 --nproc_per_node=N train.py \
  --config ./configs/col/col_train.yaml \
  --port PORT \
  --mode type_cnn \
  --prr_weight 0.5 \
  --pretrained_weights /path/to/pretrained/DINO
```
Alternatively, train Endora with the provided scripts in `./train_scripts`:

```bash
bash train_scripts/col/train_col.sh
bash train_scripts/kva/train_kva.sh
bash train_scripts/cho/train_cho.sh
```
## 📏 Metric Evaluation
We first split the generated videos into frames and use the code fr
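For video metrics such as FVD, the extracted frames are typically regrouped into fixed-length clips before feature extraction. A stdlib-only sketch of that grouping step (an illustration; the clip length of 16 is an assumption, not the repository's setting):

```python
def chunk_frames(frame_paths, clip_len=16):
    """Group an ordered list of frame paths into non-overlapping
    clips of clip_len frames, dropping any incomplete tail."""
    n_clips = len(frame_paths) // clip_len
    return [frame_paths[i * clip_len:(i + 1) * clip_len]
            for i in range(n_clips)]
```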
