Snuffy: Efficient Whole Slide Image Classifier

Hossein Jafarinia, Alireza Alipanah, Danial Hamdi, Saeed Razavi, Nahal Mirzaie, Mohammad Hossein Rohban

[arXiv] [Project Page] [Demo] [BibTex]

PyTorch implementation for the Multiple Instance Learning framework described in the paper Snuffy: Efficient Whole Slide Image Classifier (ECCV 2024, accepted).


<p> <img src="figs/architecture.png"> </p>

Snuffy is a novel MIL-pooling method based on sparse transformers, designed to address the computational challenges in Whole Slide Image (WSI) classification for digital pathology. Our approach mitigates performance loss with limited pre-training and enables continual few-shot pre-training as a competitive option.

Key features:

  • Tailored sparsity pattern for pathology
  • Theoretically proven universal approximator with tight probabilistic sharp bounds
  • Superior WSI and patch-level accuracies on CAMELYON16 and TCGA Lung cancer datasets

Overview

This repository provides a complete, runnable implementation of the Snuffy framework, including code for the FROC metric, which, to the best of our knowledge, no other WSI classification framework provides. The pipeline has four steps:

  1. Slide Patching: WSIs are divided into manageable patches.
  2. Self-Supervised Learning: An SSL method is trained on the patches to create an embedder.
  3. Feature Extraction: The embedder computes features (embeddings) for each slide.
  4. MIL Training: The Snuffy MIL framework is applied to the computed features.

Each step in this pipeline can be executed independently, with intermediate results available for download to facilitate continued processing.
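To make step 4 more concrete, the sketch below shows the general shape of a MIL problem: a slide is a bag of patch embeddings, and a pooling function turns per-patch scores into a single slide-level prediction. It uses plain max-pooling over a hypothetical linear scorer, which is a generic MIL baseline and not Snuffy's sparse-transformer pooling. The names `score_patch` and `classify_slide`, the weights, and the toy embeddings are illustrative assumptions, not part of this repository.

```python
import math
from typing import List, Sequence, Tuple


def score_patch(embedding: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical linear scorer: map one patch embedding to a probability in (0, 1)."""
    logit = sum(x * w for x, w in zip(embedding, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def classify_slide(
    bag: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.5,
) -> Tuple[int, List[float]]:
    """Max-pooling MIL: the slide is positive if its highest-scoring patch is positive."""
    patch_scores = [score_patch(embedding, weights, bias) for embedding in bag]
    slide_score = max(patch_scores)
    return int(slide_score >= threshold), patch_scores


# Toy 2-dimensional embeddings (made up for illustration).
weights, bias = [2.0, -1.0], -1.0
normal_slide = [[0.1, 0.5], [0.0, 0.2]]
tumor_slide = [[0.1, 0.5], [2.0, 0.0]]  # one patch with a strong positive signal

print(classify_slide(normal_slide, weights, bias)[0])
print(classify_slide(tumor_slide, weights, bias)[0])
```

The per-patch scores returned alongside the slide label are what patch-level evaluation works from, which is why MIL methods can report both slide-level and patch-level results.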

<details> <summary>Table of Contents</summary> <ol> <li><a href="#requirements">Requirements</a></li> <li><a href="#dataset-download">Dataset Download</a></li> <li><a href="#train-val-test-split">Train/Val/Test Split</a></li> <li><a href="#slide-preparation-patching-and-n-shot-dataset-creation">Slide Preparation: Patching and N-Shot Dataset Creation</a></li> <li><a href="#training-the-embedder">Training the Embedder</a></li> <li><a href="#feature-extraction">Feature Extraction</a></li> <li><a href="#mil-training">MIL Training</a></li> <li><a href="#visualization">Visualization</a></li> <li><a href="#acknowledgement">Acknowledgement</a></li> <li><a href="#citation">Citation</a></li> </ol> </details>

Requirements

System Requirements

  • Operating System: Ubuntu 20.04 LTS (or compatible Linux distribution)
  • Python Version: 3.8 or later
  • GPU: Recommended for faster processing (CUDA-compatible)

Notes

  • Disk Space: Ensure you have sufficient disk space for dataset downloads and processing, especially if you intend to work with raw slides rather than pre-computed embeddings. Raw slide data can be very large.
  • Hardware: The MIL training code can run on both GPU and CPU. For optimal performance, a GPU is strongly recommended.

Downloading and Preparing Datasets

  1. AWS CLI: To download the CAMELYON16 dataset's raw whole-slide images, you'll need the AWS CLI. Install it with:

    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    ./aws/install

  2. GDC Client (for downloading the TCGA dataset): This is downloaded and installed automatically when you run the download_tcga_lung.sh script.

  3. OpenSlide: Required if you intend to patch the slides yourself using the deepzoom_tiler_camelyon16.py or deepzoom_tiler_tcga_lung_cancer.py scripts. Install OpenSlide with:

    # Update package list and install OpenSlide
    apt-get update
    apt-get install openslide-tools

Running Snuffy

  1. The ASAP package is required for calculating the FROC metric. Install ASAP and its multiresolutionimageinterface Python package as follows:

    # Download and install ASAP
    wget https://github.com/computationalpathologygroup/ASAP/releases/download/ASAP-2.1/ASAP-2.1-py38-Ubuntu2004.deb
    apt-get install -f "./ASAP-2.1-py38-Ubuntu2004.deb"

  2. Install the required Python packages with:

    # Install Python packages from requirements.txt
    pip install -r requirements.txt

Note: The requirements.txt file includes specific package versions used and verified in our experiments. However, newer versions available in your environment may also be compatible.
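Before moving on, it can help to confirm that the tools above are actually visible from your Python environment. The following is a minimal sketch, not a script shipped with this repository: the function name `check_environment` is hypothetical, and it assumes that `openslide-show-properties` (one of the binaries installed by openslide-tools) is a reasonable proxy for an OpenSlide installation.

```python
import importlib.util
import shutil
import sys
from typing import Dict, Tuple

# Command-line tools used in this README. Assumption: openslide-show-properties,
# which ships with openslide-tools, stands in for "OpenSlide is installed".
REQUIRED_TOOLS = ("aws", "openslide-show-properties")
# Python module that comes with ASAP and is needed for the FROC metric.
REQUIRED_MODULES = ("multiresolutionimageinterface",)


def check_environment(min_python: Tuple[int, int] = (3, 8)) -> Dict[str, bool]:
    """Return a name -> bool report of which prerequisites were found."""
    report = {"python>=%d.%d" % min_python: sys.version_info >= min_python}
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        report["tool:" + tool] = shutil.which(tool) is not None
    for module in REQUIRED_MODULES:
        # find_spec locates the module without importing it.
        report["module:" + module] = importlib.util.find_spec(module) is not None
    return report


for name, found in check_environment().items():
    print(("OK      " if found else "MISSING ") + name)
```

A `MISSING` entry for the ASAP module may only mean that its install location is not on your Python path. Likewise, the AWS CLI and OpenSlide are only needed if you download or patch raw slides yourself.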

Additional Components

  1. MAE with Adapter: Refer to the MAE repository for installation instructions.

    Important: If you are using PyTorch 1.8 or later, follow the instructions in the MAE repository to fix the compatibility issue with the timm module. Alternatively, run the following script:

    chmod +x requirements_timm_patch.sh
    ./requirements_timm_patch.sh
    

    Note that we've also included a modified version of timm, to support adapter functionality.

Download Data

CAMELYON16

  1. List and Download Dataset: Run the following commands to list and download the CAMELYON16 dataset:

    aws s3 ls --no-sign-request s3://camelyon-dataset/CAMELYON16/ --recursive
    aws s3 cp --no-sign-request s3://camelyon-dataset/CAMELYON16/ raw_data/camelyon16 --recursive
    
  2. Directory Structure: After downloading, your raw_data/camelyon16 directory should look like this:

    -- camelyon16
        |-- README.md
        |-- annotations
        |-- background_tissue
        |-- checksums.md5
        |-- evaluation
        |-- images
        |-- license.txt
        |-- masks
        `-- pathology-tissue-background-segmentation.json
    
  3. Organize Files:
    Use the provided script to copy the necessary files into the datasets/camelyon16 directory. If space is limited, modify the script to move files instead of copying them.

    python move_camelyon16_tifs.py
    
  4. Final Directory Structure:

    datasets/camelyon16
    |-- annotations
    |   |-- test_001.xml
    |   |-- tumor_001.xml
    |   |-- ...
    |-- masks
    |   |-- normal_001_mask.tif
    |   |-- test_001_mask.tif
    |   |-- tumor_001_mask.tif
    |   |-- ...
    |-- 0_normal
    |   |-- normal_004.tif
    |   |-- test_018.tif
    |   |-- ...
    |-- 1_tumor
    |   |-- test_046.tif
    |   |-- tumor_075.tif
    |   |-- ...
    |-- reference.csv
    |-- n_shot_dataset_maker.py
    |-- train_validation_test_reverse_camelyon.py
    `-- train_validation_test_splitter_camelyon.py
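To confirm that the organize step produced the layout shown above, a small check like the following can be run from the repository root. This is a sketch under assumptions: the helper names `missing_entries` and `slides_per_class` are hypothetical, and only the data entries from the tree (not the three helper scripts) are checked.

```python
from pathlib import Path
from typing import Dict, List

# Data entries taken from the directory tree above.
EXPECTED_DIRS = ("annotations", "masks", "0_normal", "1_tumor")
EXPECTED_FILES = ("reference.csv",)
CLASS_DIRS = ("0_normal", "1_tumor")


def missing_entries(root: Path) -> List[str]:
    """List the expected directories and files that are absent under root."""
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


def slides_per_class(root: Path) -> Dict[str, int]:
    """Count the .tif slides in each class directory that exists."""
    return {
        name: len(list((root / name).glob("*.tif")))
        for name in CLASS_DIRS
        if (root / name).is_dir()
    }


root = Path("datasets/camelyon16")
print("missing:", missing_entries(root))
print("slides:", slides_per_class(root))
```

An empty `missing` list and non-zero counts for both classes suggest the files were copied as expected.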
    

TCGA Lung Cancer

To download the TCGA Lung Cancer dataset, run the following script. This will download the slides listed in the LUAD manifest and LUSC manifest to the datasets/tcga/{luad, lusc} directory. Each slide will be stored in its own directory, named according to its ID in the manifest.

chmod +x download_tcga_lung.sh
./download_tcga_lung.sh
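Because each slide lands in its own ID-named directory, a quick inventory helps spot downloads that did not complete. The sketch below assumes the layout described above (`datasets/tcga/{luad, lusc}/<slide_id>/<file>.svs`). The function names are hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import Dict, List, Sequence


def list_tcga_slides(root: Path, subtypes: Sequence[str] = ("luad", "lusc")) -> Dict[str, List[Path]]:
    """Return the .svs files found under root/<subtype>/<slide_id>/ for each subtype."""
    return {subtype: sorted((root / subtype).glob("*/*.svs")) for subtype in subtypes}


def dirs_without_slide(root: Path, subtypes: Sequence[str] = ("luad", "lusc")) -> List[Path]:
    """Return per-slide directories that contain no .svs file (possibly failed downloads)."""
    empty = []
    for subtype in subtypes:
        for slide_dir in sorted((root / subtype).glob("*")):
            if slide_dir.is_dir() and not any(slide_dir.glob("*.svs")):
                empty.append(slide_dir)
    return empty


root = Path("datasets/tcga")
for subtype, slides in list_tcga_slides(root).items():
    print(subtype, len(slides), "slides")
print("directories without a slide:", len(dirs_without_slide(root)))
```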

MIL datasets

Download the MIL datasets (sourced from the DSMIL project) and unzip them into the datasets/ directory.

wget https://uwmadison.box.com/shared/static/arvv7f1k8c2m8e2hugqltxgt9zbbpbh2.zip -O mil-dataset.zip
unzip mil-dataset.zip -d datasets/
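If `wget` or `unzip` is not available, the same two steps can be done from Python with the standard library. This is a sketch rather than a script from this repository: the helper names `download` and `extract` are hypothetical, and it assumes the URL above serves the archive directly.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List

MIL_DATASET_URL = "https://uwmadison.box.com/shared/static/arvv7f1k8c2m8e2hugqltxgt9zbbpbh2.zip"


def download(url: str, archive: Path) -> Path:
    """Fetch url into archive unless the file already exists, then return its path."""
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    return archive


def extract(archive: Path, dest: Path) -> List[str]:
    """Unzip archive into dest (created if needed) and return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()


# Usage (performs a large download, so it is left commented out):
# archive = download(MIL_DATASET_URL, Path("mil-dataset.zip"))
# print(len(extract(archive, Path("datasets"))), "entries extracted")
```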

Slide Preparation: Patching

CAMELYON16

This script processes TIFF slides located in datasets/camelyon16/{0_normal, 1_tumor}/. For each slide, it creates a directory at datasets/camelyon16/single/{0_normal, 1_tumor}/{slide_name}, saving the extracted patches as JPEG images.

python deepzoom_tiler_camelyon16.py

TCGA Lung Cancer

This script processes SVS slides in datasets/tcga/{lusc, luad}/ and saves the extracted patches in datasets/tcga/single/{lusc, luad}/{slide_name} as JPEG images.

python deepzoom_tiler_tcga_lung_cancer.py

For both scripts, refer to the argument definitions in each script for the available options and what they do.
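After patching, a per-slide patch count is a cheap sanity check: slides with zero patches usually indicate a tiling problem. The sketch below assumes the output layout described above (`single/<class>/<slide_name>/`). The function name `patches_per_slide` is hypothetical, and because the exact file extension is not stated here, both `.jpeg` and `.jpg` are accepted.

```python
from pathlib import Path
from typing import Dict

# Assumption: patches are saved with one of these extensions.
PATCH_SUFFIXES = {".jpeg", ".jpg"}


def patches_per_slide(single_root: Path) -> Dict[str, int]:
    """Map "<class>/<slide_name>" to the number of patch images in that directory."""
    counts = {}
    for slide_dir in sorted(single_root.glob("*/*")):
        if slide_dir.is_dir():
            n_patches = sum(
                1 for p in slide_dir.iterdir() if p.suffix.lower() in PATCH_SUFFIXES
            )
            counts[slide_dir.parent.name + "/" + slide_dir.name] = n_patches
    return counts


counts = patches_per_slide(Path("datasets/camelyon16/single"))
empty = [name for name, n in counts.items() if n == 0]
print(len(counts), "slides patched;", len(empty), "with no patches")
```

The same function works for TCGA by pointing it at `datasets/tcga/single`.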

Train/Val/Test Split and N-Shot Dataset Creation

CAMELYON16

To split the CAMELYON16 dataset:

cd datasets/camelyon16
python train_validation_test_splitter_camelyon.py

This script reorganizes the directory structure from:

datasets/camelyon16/single/{0_normal, 1_tumor}

to:

datasets/camelyon16/single/fold1/{train, validation,
