HistoSSLscaling
Code associated with the publication: Scaling self-supervised learning for histopathology with masked image modeling, A. Filiot et al., MedRxiv (2023). We publicly release Phikon.
[MedRxiv] [Project page] [Paper]
Filiot, A., Ghermi, R., Olivier, A., Jacob, P., Fidon, L., Kain, A. M., Saillard, C., & Schiratti, J.-B. (2023). Scaling Self-Supervised Learning for Histopathology with Masked Image Modeling. MedRxiv.
@article{Filiot2023scalingwithMIM,
author = {Alexandre Filiot and Ridouane Ghermi and Antoine Olivier and Paul Jacob and Lucas Fidon and Alice Mac Kain and Charlie Saillard and Jean-Baptiste Schiratti},
title = {Scaling Self-Supervised Learning for Histopathology with Masked Image Modeling},
elocation-id = {2023.07.21.23292757},
year = {2023},
doi = {10.1101/2023.07.21.23292757},
publisher = {Cold Spring Harbor Laboratory Press},
url = {https://www.medrxiv.org/content/early/2023/07/26/2023.07.21.23292757v2},
eprint = {https://www.medrxiv.org/content/early/2023/07/26/2023.07.21.23292757v2.full.pdf},
journal = {medRxiv}
}
</details>
Update :tada: Phikon release on Hugging Face :tada:
We released our Phikon model on Hugging Face. Check out our community blog post! We also provide a Colab notebook to perform weakly-supervised learning on Camelyon16 and fine-tuning with LoRA on NCT-CRC-HE using Phikon.
Here is a code snippet to perform feature extraction using Phikon.
from PIL import Image
import torch
from transformers import AutoImageProcessor, ViTModel
# load an image
image = Image.open("assets/example.tif")
# load phikon
image_processor = AutoImageProcessor.from_pretrained("owkin/phikon")
model = ViTModel.from_pretrained("owkin/phikon", add_pooling_layer=False)
# process the image
inputs = image_processor(image, return_tensors="pt")
# get the features
with torch.no_grad():
    outputs = model(**inputs)
    features = outputs.last_hidden_state[:, 0, :]  # (1, 768) shape
Official PyTorch Implementation and pre-trained models for Scaling Self-Supervised Learning for Histopathology with Masked Image Modeling. This minimalist repository aims to:
- Publicly release the weights of our Vision Transformer Base (ViT-B) model Phikon pre-trained with iBOT on 40M pan-cancer histology tiles from TCGA. Phikon achieves state-of-the-art performance on a large variety of downstream tasks compared to other SSL frameworks available in the literature.
⚠️ Addendum :warning:
From 09.01.2023 to 10.30.2023, this repository instructed users to load the student backbone. Please use the teacher backbone instead.
# feature extraction snippet with `rl_benchmarks` repository
from PIL import Image
from rl_benchmarks.models import iBOTViT
# instantiate iBOT ViT-B Pancancer model, aka Phikon
# /!\ please use the "teacher" encoder which produces better results !
weights_path = "/<your_root_dir>/weights/ibot_vit_base_pancan.pth"
ibot_base_pancancer = iBOTViT(architecture="vit_base_pancan", encoder="teacher", weights_path=weights_path)
# load an image and transform it into a normalized tensor
image = Image.open("assets/example.tif") # (224, 224, 3), uint8
tensor = ibot_base_pancancer.transform(image) # (3, 224, 224), torch.float32
batch = tensor.unsqueeze(0) # (1, 3, 224, 224), torch.float32
# compute the 768-d features
features = ibot_base_pancancer(batch).detach().cpu().numpy()
assert features.shape == (1, 768)
- Publicly release the histology features of our ViT-based iBOT models (iBOT[ViT-S]COAD, iBOT[ViT-B]COAD, iBOT[ViT-B]PanCancer, iBOT[ViT-L]COAD) for i) 11 TCGA cohorts and Camelyon16 slides datasets; and ii) NCT-CRC and Camelyon17-Wilds patches datasets.
- Reproduce the results from our publication, including: features extraction and clinical data processing, cross-validation experiments, results generation.
Abstract
<details> <summary> Read full abstract from MedRxiv.
Data structure
Download
You can download the data necessary to use the present code and reproduce our results here:
- raw data: Google Drive
- preprocessed data: Google Drive
- weights: Google Drive
Please create weights, raw and preprocessed folders containing the content of the corresponding downloads. This step may take time depending on your network bandwidth (the folder takes 1.2 TB). You can use rclone to download the folder from a remote machine (preferably in a tmux session).
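As a rough sketch of the local layout described above, the snippet below creates the three top-level folders and reports which ones are still empty. The helper name `prepare_data_root` and the example root directory are hypothetical and not part of this repository.

```python
from pathlib import Path

# Top-level folders named in the download instructions.
EXPECTED_FOLDERS = ("weights", "raw", "preprocessed")


def prepare_data_root(root):
    """Create the expected folders under `root` and return the empty ones.

    `root` is a hypothetical location: pick any directory with enough
    free space for the downloads.
    """
    root = Path(root)
    empty = []
    for name in EXPECTED_FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        # A folder with no entries has not received its download yet.
        if not any(folder.iterdir()):
            empty.append(name)
    return empty
```

Calling `prepare_data_root("/<your_root_dir>")` before and after each download gives a quick view of which folders still need their content.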
Description
The bucket contains three main folders: weights, raw and preprocessed. The weights folder contains weights for iBOT[ViT-B]PanCancer (our best ViT-B iBOT model). Other models from the literature can be retrieved from the corresponding GitHub repositories:
- CTransPath: https://github.com/Xiyue-Wang/TransPath
- HIPT: https://github.com/mahmoodlab/HIPT
- Dino[ViT-S]BRCA: https://github.com/Richarizardd/Self-Supervised-ViT-Path
weights/
└── ibot_vit_base_pancan.pth  # Ours
The raw folder contains two subfolders, one for slide-level and one for tile-level downstream tasks.
- Slide-level: each cohort contains 2 folders, clinical and slides. We provide clinical data but not raw slides. No modification was made to the folder architecture and file names of raw slides and patches compared to the original source (i.e. TCGA, Camelyon16, NCT-CRC and Camelyon17-WILDS).
- Tile-level: each cohort contains 2 folders, clinical and patches. We only provide clinical data (i.e. labels), not patches datasets.
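The two-folder convention above can be checked with a few lines of Python. This is a minimal sketch under the assumption that each cohort directory sits directly under the slide-level or tile-level folder; the function name `missing_subfolders` and its arguments are illustrative only.

```python
from pathlib import Path

# Folder expected next to `clinical` for each task type.
DATA_FOLDER = {"slide": "slides", "tile": "patches"}


def missing_subfolders(cohort_dir, level):
    """Return the expected subfolders that are absent from `cohort_dir`.

    `level` is "slide" for slide-level cohorts or "tile" for
    tile-level cohorts.
    """
    cohort_dir = Path(cohort_dir)
    expected = ("clinical", DATA_FOLDER[level])
    return [name for name in expected if not (cohort_dir / name).is_dir()]
```

An empty list means the cohort directory has both expected folders. It does not check that the raw slides or patches themselves have been downloaded.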
[!WARNING] We don't provide raw slides or patches (the slides and patches folders are empty). You can download raw slides or patches here:
- PAIP: http://www.wisepaip.org/paip/guide/dataset
- TCGA: https://portal.gdc.cancer.gov/
- Camelyon16: http://gigadb.org/dataset/100439
- NCT-CRC: https://zenodo.org/record/1214456
- Camelyon17-WILDS: https://github.com/p-lambda/wilds/blob/main/wilds/download_datasets.py
Once you have downloaded the data, please follow the folder architecture indicated below, without modifying folder or file names compared to the original download.
raw/
├── slides_classification  # slides classification tasks
===============================================================================
│   ├── CAMELYON16_FULL  # cohort
│   │   ├── clinical  # clinical data (for labels)
│   │   │   ├── test_clinical_data.csv
│   │   │   └── train_clinical_data.csv
│   │   └── slides  # raw slides (not provided)
│   │       ├── Normal_001.tif
│   │       └── Normal_002.tif...
│   └── TCGA
│       ├── tcga_statistics.pk  # For each cohort and label, list (n_patients, n_slides, labels_distribution)
│       ├── clinical  # for TCGA, clinical data is divided into subfolders
│       │   ├── hrd
│       │   │   ├── hrd_labels_tcga_brca.csv
│       │   │   └── hrd_labels_tcga_ov.csv
│       │   ├── msi
│       │   │   ├── msi_labels_tcga_coad.csv
│       │   │   └── msi_labels_tcga_read.csv...
│       │   ├── subtypes
│       │   │   ├── brca_tcga_pan_can_atlas_2018_clinical_data.tsv.gz
│       │   │   └── coad_tcga_pan_can_atlas_2018_clinical_data.tsv.gz...
│       │   └── survival
│       │       ├── survival_labels_tcga_brca.csv
│       │       └── survival_labels_tcga_coad.csv...
│       └── slides
│           └── parafine
│               ├── TCGA_BRCA
│               │   ├── 03627311-e413-4218-b836-177abdfc3911
│               │   │   └── TCGA-XF-AAN7-01Z-00-DX1.B8EDF045-604C-48CB-8E54-A60564CAE2AD.svs
...
└── tiles_classific
