Snuffy: Efficient Whole Slide Image Classifier
Hossein Jafarinia, Alireza Alipanah, Danial Hamdi, Saeed Razavi, Nahal Mirzaie, Mohammad Hossein Rohban
[arXiv] [Project Page] [Demo] [BibTex]
PyTorch implementation of the Multiple Instance Learning framework described in the paper Snuffy: Efficient Whole Slide Image Classifier, accepted at ECCV 2024.
<p> <img src="figs/architecture.png"> </p>
Snuffy is a novel MIL-pooling method based on sparse transformers, designed to address the computational challenges in Whole Slide Image (WSI) classification for digital pathology. Our approach mitigates performance loss with limited pre-training and enables continual few-shot pre-training as a competitive option.
Key features:
- Tailored sparsity pattern for pathology
- Theoretically proven universal approximator with tight probabilistic sharp bounds
- Superior WSI and patch-level accuracies on CAMELYON16 and TCGA Lung cancer datasets
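As a rough, self-contained illustration of what a sparse attention pattern looks like, the mask below combines a few globally-attending tokens with a local sliding window. This generic global-plus-local mask is only a stand-in for exposition; Snuffy's actual pathology-tailored pattern is defined in the paper.

```python
import numpy as np

def sparse_attention_mask(n_tokens, n_global=2, window=3):
    """Boolean mask: True where attention is allowed.

    Combines a few 'global' tokens (attend to and are attended by all
    tokens) with a local sliding window. This is a generic sparse-
    transformer pattern for illustration, not Snuffy's exact one.
    """
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    idx = np.arange(n_tokens)
    # Local window: token i may attend to token j when |i - j| <= window.
    mask |= np.abs(idx[:, None] - idx[None, :]) <= window
    # Global tokens: first n_global rows and columns fully connected.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

mask = sparse_attention_mask(8, n_global=1, window=1)
```

Such a mask keeps the per-layer attention cost near-linear in the number of patches, which is what makes transformer-style pooling feasible for WSIs with tens of thousands of patches.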
Overview
This repository provides a complete, runnable implementation of the Snuffy framework, including code for the FROC metric, which, to the best of our knowledge, no other WSI classification framework provides.
- Slide Patching: WSIs are divided into manageable patches.
- Self-Supervised Learning: An SSL method is trained on the patches to create an embedder.
- Feature Extraction: The embedder computes features (embeddings) for each slide.
- MIL Training: The Snuffy MIL framework is applied to the computed features.
Each step in this pipeline can be executed independently, with intermediate results available for download to facilitate continued processing.
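In terms of array shapes, the four pipeline steps can be sketched with numpy stand-ins. The embedder and attention weights below are random placeholders, not the real models; patch counts and dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Slide patching: a WSI becomes a bag of patches (random stand-ins here).
patches = rng.random((50, 32, 32, 3))           # 50 small patches per slide

# 2-3. SSL embedder + feature extraction: each patch -> one embedding vector.
def embed(patch_batch, dim=384):                # stand-in for the SSL embedder
    flat = patch_batch.reshape(len(patch_batch), -1)
    proj = rng.standard_normal((flat.shape[1], dim)) / np.sqrt(flat.shape[1])
    return flat @ proj

embeddings = embed(patches)                     # shape: (50, 384)

# 4. MIL pooling: attention weights over patches -> one slide-level embedding,
# which a classifier head would then map to a slide label.
scores = embeddings @ rng.standard_normal(embeddings.shape[1])
weights = np.exp(scores - scores.max())
weights /= weights.sum()                        # softmax attention over patches
slide_embedding = weights @ embeddings          # shape: (384,)
```

The key point is that the slide-level prediction is differentiable with respect to the per-patch attention weights, which is what lets MIL training localize tumor patches from slide-level labels alone.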
<details> <summary>Table of Contents</summary> <ol> <li><a href="#requirements">Requirements</a></li> <li><a href="#dataset-download">Dataset Download</a></li> <li><a href="#train-val-test-split">Train/Val/Test Split</a></li> <li><a href="#slide-preparation-patching-and-n-shot-dataset-creation">Slide Preparation: Patching and N-Shot Dataset Creation</a></li> <li><a href="#training-the-embedder">Training the Embedder</a></li> <li><a href="#feature-extraction">Feature Extraction</a></li> <li><a href="#mil-training">MIL Training</a></li> <li><a href="#visualization">Visualization</a></li> <li><a href="#acknowledgement">Acknowledgement</a></li> <li><a href="#citation">Citation</a></li> </ol> </details>

Requirements
System Requirements
- Operating System: Ubuntu 20.04 LTS (or compatible Linux distribution)
- Python Version: 3.8 or later
- GPU: Recommended for faster processing (CUDA-compatible)
Notes
- Disk Space: Ensure you have sufficient disk space for dataset downloads and processing, especially if you intend to work with raw slides rather than pre-computed embeddings. Raw slide data can be very large.
- Hardware: The MIL training code can run on both GPU and CPU. For optimal performance, a GPU is strongly recommended.
Downloading and Preparing Datasets
- AWS CLI: To download the CAMELYON16 dataset's raw whole-slide images, you'll need the AWS CLI. Install it with:

  ```shell
  curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
  unzip awscliv2.zip
  ./aws/install
  ```
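If you want to check programmatically that the `aws` executable ended up on your PATH, a minimal stdlib-only check looks like this (the helper name is ours, not part of the repository):

```python
import shutil

def aws_cli_available():
    """Return True if the `aws` executable is on PATH after installation."""
    return shutil.which("aws") is not None

print("AWS CLI found:", aws_cli_available())
```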
- GDC Client: Used to download the TCGA dataset. It is automatically downloaded and installed when you run the `download_tcga_lung.sh` script.
- OpenSlide: Necessary if you intend to patch the slides yourself using the `deepzoom_tiler_camelyon16.py` or `deepzoom_tiler_tcga_lung_cancer.py` scripts. Install OpenSlide with:

  ```shell
  # Update package list and install OpenSlide
  apt-get update
  apt-get install openslide-tools
  ```
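After installing the system library, a quick sanity check from Python looks like the following. It assumes the `openslide-python` binding is installed (`pip install openslide-python`); the helper name is ours.

```python
def openslide_status():
    """Return a short status string describing the OpenSlide installation."""
    try:
        import openslide
        return f"OpenSlide library {openslide.__library_version__} available"
    except (ImportError, OSError) as exc:  # missing binding or missing C library
        return f"OpenSlide not usable: {exc}"

print(openslide_status())
```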
Running Snuffy
- ASAP: The ASAP package is required for calculating the FROC metric. Install ASAP and its `multiresolutionimageinterface` Python package as follows:

  ```shell
  # Download and install ASAP
  wget https://github.com/computationalpathologygroup/ASAP/releases/download/ASAP-2.1/ASAP-2.1-py38-Ubuntu2004.deb
  apt-get install -f "./ASAP-2.1-py38-Ubuntu2004.deb"
  ```
- Required Python packages can be installed with:

  ```shell
  # Install Python packages from requirements.txt
  pip install -r requirements.txt
  ```
Note: The requirements.txt file pins the specific package versions used and verified in our experiments; newer versions available in your environment may also be compatible.
Additional Components
- MAE with Adapter: Refer to the MAE repository for installation instructions. Important: if you use PyTorch 1.8 or later, follow the instructions in the MAE repository to fix the compatibility issue with the `timm` module. Alternatively, run the following script to fix the issue:

  ```shell
  chmod +x requirements_timm_patch.sh
  ./requirements_timm_patch.sh
  ```

  Note that we have also included a modified version of `timm` to support adapter functionality.
Download Data
CAMELYON16
- List and Download Dataset: Run the following commands to list and download the CAMELYON16 dataset:

  ```shell
  aws s3 ls --no-sign-request s3://camelyon-dataset/CAMELYON16/ --recursive
  aws s3 cp --no-sign-request s3://camelyon-dataset/CAMELYON16/ raw_data/camelyon16 --recursive
  ```
- Directory Structure: After downloading, your `raw_data/camelyon16` directory should look like this:

  ```
  camelyon16
  |-- README.md
  |-- annotations
  |-- background_tissue
  |-- checksums.md5
  |-- evaluation
  |-- images
  |-- license.txt
  |-- masks
  `-- pathology-tissue-background-segmentation.json
  ```
- Organize Files: Use the provided script to copy the necessary files into the `datasets/camelyon16` directory. If space is limited, modify the script to move files instead of copying them.

  ```shell
  python move_camelyon16_tifs.py
  ```
- Final Directory Structure:

  ```
  datasets/camelyon16
  |-- annotations
  |   |-- test_001.xml
  |   |-- tumor_001.xml
  |   |-- ...
  |-- masks
  |   |-- normal_001_mask.tif
  |   |-- test_001_mask.tif
  |   |-- tumor_001_mask.tif
  |   |-- ...
  |-- 0_normal
  |   |-- normal_004.tif
  |   |-- test_018.tif
  |   |-- ...
  |-- 1_tumor
  |   |-- test_046.tif
  |   |-- tumor_075.tif
  |   |-- ...
  |-- reference.csv
  |-- n_shot_dataset_maker.py
  |-- train_validation_test_reverse_camelyon.py
  `-- train_validation_test_splitter_camelyon.py
  ```
TCGA Lung Cancer
To download the TCGA Lung Cancer dataset, run the following script. It downloads the slides listed in the LUAD manifest and the LUSC manifest to the datasets/tcga/{luad, lusc} directory. Each slide is stored in its own directory, named after its ID in the manifest.

```shell
chmod +x download_tcga_lung.sh
./download_tcga_lung.sh
```
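Once the download finishes, you can sanity-check the layout described above with a small helper. The root path and `.svs` extension follow the description in this README; the helper itself is ours, not part of the repository.

```python
from pathlib import Path

def list_slides(root="datasets/tcga"):
    """Return all .svs slides under <root>/{luad,lusc}/<manifest-id>/."""
    return sorted(Path(root).glob("*/*/*.svs"))

# Example: report how many slides were downloaded.
# print(len(list_slides()), "slides found")
```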
MIL datasets
Download the MIL datasets (sourced from the DSMIL project) and unzip them into the datasets/ directory.
wget https://uwmadison.box.com/shared/static/arvv7f1k8c2m8e2hugqltxgt9zbbpbh2.zip
unzip mil-dataset.zip -d datasets/
Slide Preparation: Patching
CAMELYON16
This script processes TIFF slides located in datasets/camelyon16/{0_normal, 1_tumor}/. For each slide, it creates a directory at datasets/camelyon16/single/{0_normal, 1_tumor}/{slide_name} and saves the extracted patches as JPEG images.

```shell
python deepzoom_tiler_camelyon16.py
```
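A quick way to verify the patching output is to count extracted patches per slide directory. The layout follows the description above; the `.jpeg` extension is an assumption (adjust the glob if the tiler writes `.jpg`), and the helper name is ours.

```python
from collections import Counter
from pathlib import Path

def patches_per_slide(root="datasets/camelyon16/single"):
    """Count extracted patch images per {class}/{slide_name} directory."""
    counts = Counter()
    for img in Path(root).glob("*/*/*.jpeg"):
        counts[img.parent.name] += 1
    return counts

# Example: print the five slides with the fewest patches.
# print(patches_per_slide().most_common()[:-6:-1])
```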
TCGA Lung Cancer
This script processes SVS slides in datasets/tcga/{lusc, luad}/ and saves the extracted patches in datasets/tcga/single/{lusc, luad}/{slide_name} as JPEG images.

```shell
python deepzoom_tiler_tcga_lung_cancer.py
```
Both scripts accept command-line arguments; refer to each script's argument definitions for details on the available options.
Train/Val/Test Split and N-Shot Dataset Creation
CAMELYON16
To split the CAMELYON16 dataset:
```shell
cd datasets/camelyon16
python train_validation_test_splitter_camelyon.py
```
This script reorganizes the directory structure from:

```
datasets/camelyon16/single/{0_normal, 1_tumor}
```

to:

```
datasets/camelyon16/single/fold1/{train, validation,
```
