Snuffy: Efficient Whole Slide Image Classifier
Hossein Jafarinia, Alireza Alipanah, Danial Hamdi, Saeed Razavi, Nahal Mirzaie, Mohammad Hossein Rohban
[arXiv] [Project Page] [Demo] [BibTex]
PyTorch implementation of the Multiple Instance Learning framework described in the paper Snuffy: Efficient Whole Slide Image Classifier, accepted at ECCV 2024.
<p> <img src="figs/architecture.png"> </p>
Snuffy is a novel MIL-pooling method based on sparse transformers, designed to address the computational challenges in Whole Slide Image (WSI) classification for digital pathology. Our approach mitigates performance loss with limited pre-training and enables continual few-shot pre-training as a competitive option.
Key features:
- Tailored sparsity pattern for pathology
- Theoretically proven universal approximator with tight probabilistic sharp bounds
- Superior WSI and patch-level accuracies on CAMELYON16 and TCGA Lung cancer datasets
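As a rough, self-contained illustration of what a sparse attention pattern looks like, the mask below combines a few globally-attending tokens with a local sliding window. This generic global-plus-local mask is only a stand-in for exposition; Snuffy's actual pathology-tailored pattern is defined in the paper.

```python
import numpy as np

def sparse_attention_mask(n_tokens, n_global=2, window=3):
    """Boolean mask: True where attention is allowed.

    Combines a few 'global' tokens (attend to and are attended by all
    tokens) with a local sliding window. This is a generic sparse-
    transformer pattern for illustration, not Snuffy's exact one.
    """
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    idx = np.arange(n_tokens)
    # Local window: token i may attend to token j when |i - j| <= window.
    mask |= np.abs(idx[:, None] - idx[None, :]) <= window
    # Global tokens: first n_global rows and columns fully connected.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

mask = sparse_attention_mask(8, n_global=1, window=1)
```

Such a mask keeps the per-layer attention cost near-linear in the number of patches, which is what makes transformer-style pooling feasible for WSIs with tens of thousands of patches.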
Overview
This repository provides a complete, runnable implementation of the Snuffy framework, including code for the FROC metric, which, to the best of our knowledge, no other WSI classification framework provides.
- Slide Patching: WSIs are divided into manageable patches.
- Self-Supervised Learning: An SSL method is trained on the patches to create an embedder.
- Feature Extraction: The embedder computes features (embeddings) for each slide.
- MIL Training: The Snuffy MIL framework is applied to the computed features.
Each step in this pipeline can be executed independently, with intermediate results available for download to facilitate continued processing.
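In terms of array shapes, the four pipeline steps can be sketched with numpy stand-ins. The embedder and attention weights below are random placeholders, not the real models; patch counts and dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Slide patching: a WSI becomes a bag of patches (random stand-ins here).
patches = rng.random((50, 32, 32, 3))           # 50 small patches per slide

# 2-3. SSL embedder + feature extraction: each patch -> one embedding vector.
def embed(patch_batch, dim=384):                # stand-in for the SSL embedder
    flat = patch_batch.reshape(len(patch_batch), -1)
    proj = rng.standard_normal((flat.shape[1], dim)) / np.sqrt(flat.shape[1])
    return flat @ proj

embeddings = embed(patches)                     # shape: (50, 384)

# 4. MIL pooling: attention weights over patches -> one slide-level embedding,
# which a classifier head would then map to a slide label.
scores = embeddings @ rng.standard_normal(embeddings.shape[1])
weights = np.exp(scores - scores.max())
weights /= weights.sum()                        # softmax attention over patches
slide_embedding = weights @ embeddings          # shape: (384,)
```

The key point is that the slide-level prediction is differentiable with respect to the per-patch attention weights, which is what lets MIL training localize tumor patches from slide-level labels alone.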
<details> <summary>Table of Contents</summary> <ol> <li><a href="#requirements">Requirements</a></li> <li><a href="#dataset-download">Dataset Download</a></li> <li><a href="#train-val-test-split">Train/Val/Test Split</a></li> <li><a href="#slide-preparation-patching-and-n-shot-dataset-creation">Slide Preparation: Patching and N-Shot Dataset Creation</a></li> <li><a href="#training-the-embedder">Training the Embedder</a></li> <li><a href="#feature-extraction">Feature Extraction</a></li> <li><a href="#mil-training">MIL Training</a></li> <li><a href="#visualization">Visualization</a></li> <li><a href="#acknowledgement">Acknowledgement</a></li> <li><a href="#citation">Citation</a></li> </ol> </details>

Requirements
System Requirements
- Operating System: Ubuntu 20.04 LTS (or compatible Linux distribution)
- Python Version: 3.8 or later
- GPU: Recommended for faster processing (CUDA-compatible)
Notes
- Disk Space: Ensure you have sufficient disk space for dataset downloads and processing, especially if you intend to work with raw slides rather than pre-computed embeddings. Raw slide data can be very large.
- Hardware: The MIL training code can run on both GPU and CPU. For optimal performance, a GPU is strongly recommended.
Downloading and Preparing Datasets
- AWS CLI: To download the CAMELYON16 dataset's raw whole-slide images, you'll need the AWS CLI. Install it with:

  ```shell
  curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
  unzip awscliv2.zip
  ./aws/install
  ```
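If you want to check programmatically that the `aws` executable ended up on your PATH, a minimal stdlib-only check looks like this (the helper name is ours, not part of the repository):

```python
import shutil

def aws_cli_available():
    """Return True if the `aws` executable is on PATH after installation."""
    return shutil.which("aws") is not None

print("AWS CLI found:", aws_cli_available())
```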
- GDC Client: Used to download the TCGA dataset. It is automatically downloaded and installed when you run the `download_tcga_lung.sh` script.
- OpenSlide: Necessary if you intend to patch the slides yourself using the `deepzoom_tiler_camelyon16.py` or `deepzoom_tiler_tcga_lung_cancer.py` scripts. Install OpenSlide with:

  ```shell
  # Update package list and install OpenSlide
  apt-get update
  apt-get install openslide-tools
  ```
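After installing the system library, a quick sanity check from Python looks like the following. It assumes the `openslide-python` binding is installed (`pip install openslide-python`); the helper name is ours.

```python
def openslide_status():
    """Return a short status string describing the OpenSlide installation."""
    try:
        import openslide
        return f"OpenSlide library {openslide.__library_version__} available"
    except (ImportError, OSError) as exc:  # missing binding or missing C library
        return f"OpenSlide not usable: {exc}"

print(openslide_status())
```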
Running Snuffy
- ASAP: The ASAP package is required for calculating the FROC metric. Install ASAP and its `multiresolutionimageinterface` Python package as follows:

  ```shell
  # Download and install ASAP
  wget https://github.com/computationalpathologygroup/ASAP/releases/download/ASAP-2.1/ASAP-2.1-py38-Ubuntu2004.deb
  apt-get install -f "./ASAP-2.1-py38-Ubuntu2004.deb"
  ```
- Required Python packages can be installed with:

  ```shell
  # Install Python packages from requirements.txt
  pip install -r requirements.txt
  ```
Note: The requirements.txt file pins the specific package versions used and verified in our experiments; newer versions available in your environment may also be compatible.
Additional Components
- MAE with Adapter: Refer to the MAE repository for installation instructions. Important: if you use PyTorch 1.8 or later, follow the instructions in the MAE repository to fix the compatibility issue with the `timm` module. Alternatively, run the following script to fix the issue:

  ```shell
  chmod +x requirements_timm_patch.sh
  ./requirements_timm_patch.sh
  ```

  Note that we have also included a modified version of `timm` to support adapter functionality.
Download Data
CAMELYON16
- List and Download Dataset: Run the following commands to list and download the CAMELYON16 dataset:

  ```shell
  aws s3 ls --no-sign-request s3://camelyon-dataset/CAMELYON16/ --recursive
  aws s3 cp --no-sign-request s3://camelyon-dataset/CAMELYON16/ raw_data/camelyon16 --recursive
  ```
- Directory Structure: After downloading, your `raw_data/camelyon16` directory should look like this:

  ```
  camelyon16
  |-- README.md
  |-- annotations
  |-- background_tissue
  |-- checksums.md5
  |-- evaluation
  |-- images
  |-- license.txt
  |-- masks
  `-- pathology-tissue-background-segmentation.json
  ```
- Organize Files: Use the provided script to copy the necessary files into the `datasets/camelyon16` directory. If space is limited, modify the script to move files instead of copying them.

  ```shell
  python move_camelyon16_tifs.py
  ```
- Final Directory Structure:

  ```
  datasets/camelyon16
  |-- annotations
  |   |-- test_001.xml
  |   |-- tumor_001.xml
  |   |-- ...
  |-- masks
  |   |-- normal_001_mask.tif
  |   |-- test_001_mask.tif
  |   |-- tumor_001_mask.tif
  |   |-- ...
  |-- 0_normal
  |   |-- normal_004.tif
  |   |-- test_018.tif
  |   |-- ...
  |-- 1_tumor
  |   |-- test_046.tif
  |   |-- tumor_075.tif
  |   |-- ...
  |-- reference.csv
  |-- n_shot_dataset_maker.py
  |-- train_validation_test_reverse_camelyon.py
  `-- train_validation_test_splitter_camelyon.py
  ```
TCGA Lung Cancer
To download the TCGA Lung Cancer dataset, run the following script. It downloads the slides listed in the LUAD manifest and the LUSC manifest to the datasets/tcga/{luad, lusc} directory. Each slide is stored in its own directory, named after its ID in the manifest.

```shell
chmod +x download_tcga_lung.sh
./download_tcga_lung.sh
```
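Once the download finishes, you can sanity-check the layout described above with a small helper. The root path and `.svs` extension follow the description in this README; the helper itself is ours, not part of the repository.

```python
from pathlib import Path

def list_slides(root="datasets/tcga"):
    """Return all .svs slides under <root>/{luad,lusc}/<manifest-id>/."""
    return sorted(Path(root).glob("*/*/*.svs"))

# Example: report how many slides were downloaded.
# print(len(list_slides()), "slides found")
```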
MIL datasets
Download the MIL datasets (sourced from the DSMIL project) and unzip them into the datasets/ directory.
wget https://uwmadison.box.com/shared/static/arvv7f1k8c2m8e2hugqltxgt9zbbpbh2.zip
unzip mil-dataset.zip -d datasets/
Slide Preparation: Patching
CAMELYON16
This script processes TIFF slides located in datasets/camelyon16/{0_normal, 1_tumor}/. For each slide, it creates a directory at datasets/camelyon16/single/{0_normal, 1_tumor}/{slide_name} and saves the extracted patches as JPEG images.

```shell
python deepzoom_tiler_camelyon16.py
```
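A quick way to verify the patching output is to count extracted patches per slide directory. The layout follows the description above; the `.jpeg` extension is an assumption (adjust the glob if the tiler writes `.jpg`), and the helper name is ours.

```python
from collections import Counter
from pathlib import Path

def patches_per_slide(root="datasets/camelyon16/single"):
    """Count extracted patch images per {class}/{slide_name} directory."""
    counts = Counter()
    for img in Path(root).glob("*/*/*.jpeg"):
        counts[img.parent.name] += 1
    return counts

# Example: print the five slides with the fewest patches.
# print(patches_per_slide().most_common()[:-6:-1])
```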
TCGA Lung Cancer
This script processes SVS slides in datasets/tcga/{lusc, luad}/ and saves the extracted patches in datasets/tcga/single/{lusc, luad}/{slide_name} as JPEG images.

```shell
python deepzoom_tiler_tcga_lung_cancer.py
```
Both scripts accept command-line arguments; refer to each script's argument definitions for details on the available options.
Train/Val/Test Split and N-Shot Dataset Creation
CAMELYON16
To split the CAMELYON16 dataset:
```shell
cd datasets/camelyon16
python train_validation_test_splitter_camelyon.py
```
This script reorganizes the directory structure from:

```
datasets/camelyon16/single/{0_normal, 1_tumor}
```

to:

```
datasets/camelyon16/single/fold1/{train, validation,
```
