DeepCluster
Generating training samples for computer vision tasks
DeepCluster++
STARC-9: A Large-scale Dataset for Multi-Class Tissue Classification for CRC Histopathology, NeurIPS 2025. <br>
Barathi Subramanian, Rathinaraja Jeyaraj, Mitchell Nevin Peterson, Terry Guo, Nigam Shah, Curtis Langlotz, Andrew Y Ng, Jeanne Shen
<a href="https://arxiv.org/abs/2511.00383" target="_blank" rel="noopener"> Paper </a> | Cite
Modern computer vision projects, across research and industry, often rely on supervised learning, which in turn demands well-curated, diverse training data. To efficiently gather representative samples from large image collections, we introduce DeepCluster++, a semi-automated dataset curation framework with four stages:
- extract feature embeddings for all images using a domain-specific encoder (e.g., an autoencoder trained on your data) or a suitable pre-trained backbone;
- cluster the embeddings (e.g., with k-means) to group similar samples, then apply equal-frequency binning within each cluster to capture diverse patterns for each class;
- have subject-matter experts review the selected samples to confirm label quality; and
- train a classifier and validate its performance.
By tuning a small set of parameters, DeepCluster++ lets us balance the number of samples and the level of diversity, substantially reducing manual effort while yielding high-quality training data for robust models.
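The equal-frequency binning step in stage two can be sketched as follows. This is a minimal illustration, assuming per-tile distances to the cluster centroid are already computed; the function name `sample_cluster` is ours, not from the codebase, and the defaults mirror the `--distance_groups` and `--sample_percentage` arguments described later.

```python
import numpy as np

def sample_cluster(distances, n_groups=5, sample_percentage=0.2, rng=None):
    """Equal-frequency binning: split a cluster's members into n_groups
    bins by distance to the centroid, then sample uniformly from each bin
    so that near-centroid and boundary patterns are both represented."""
    rng = np.random.default_rng(rng)
    order = np.argsort(distances)            # member indices sorted by distance
    bins = np.array_split(order, n_groups)   # equal-frequency groups
    selected = []
    for b in bins:
        k = max(1, int(round(len(b) * sample_percentage)))
        selected.extend(rng.choice(b, size=min(k, len(b)), replace=False))
    return np.sort(np.array(selected))

# toy example: 100 tiles in one cluster, 5 bins, 20% sampled per bin
dists = np.random.default_rng(0).random(100)
picked = sample_cluster(dists, n_groups=5, sample_percentage=0.2, rng=0)
```

Because each bin contributes the same fraction, the subset preserves the within-cluster diversity rather than favoring only the most typical (near-centroid) tiles.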
<div align="center"> <img src="https://github.com/rathinaraja/DeepCluster/blob/main/DeepCluster++.jpg" alt="Example" width="950"/> <p><em>Figure: DeepCluster++ framework overview</em></p> </div>

Typical Workflow at a Glance
This example demonstrates how to use DeepCluster++ to curate a diverse training set from tiles extracted from whole-slide images (WSIs) in digital pathology.
- Select WSIs that are representative of your cohort (e.g., cases spanning different tissue types).
- Extract tiles (e.g., 256×256 pixels) from each WSI.
- Preprocess the tiles to retain only high-quality ones (e.g., discarding background or artifact-heavy tiles).
- Arrange tiles on disk in any folder structure (a single folder, multiple folders, or nested subfolders). DeepCluster++ is designed to work with any of these layouts.
- Feature extraction: use a domain-specific pre-trained autoencoder, or any pathology or natural-image foundation model, to encode all image tiles in the input directory. Adding a new encoder to this codebase is straightforward.
- Clustering: Run k-means on embeddings to group morphologically similar tiles.
- Diverse sampling: Apply equal-frequency binning (per cluster) to select a balanced, diverse subset for each class.
- Data collection: review the samples for each WSI and assign them to the appropriate class.
- Expert review (optional but recommended): Have a subject-matter expert validate the sampled tiles before finalizing the training set.
- To train and validate a variety of classifier models, please visit the STARC-9 Evaluation repository.
Although the workflow is demonstrated using WSIs, it is flexible and can be applied to any domain with a collection of images organized in a folder.
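The clustering step above can be sketched with scikit-learn. This is illustrative only: the embeddings here are random stand-ins for encoder output, and the square-root heuristic for choosing k follows the convention described in the command-line notes below.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# stand-in for encoder output: 500 tiles, 256-dim embeddings (e.g., after PCA)
embeddings = rng.normal(size=(500, 256))

# number of clusters: square root of the number of samples in the folder
k = max(2, int(round(np.sqrt(len(embeddings)))))

km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(embeddings)
labels = km.labels_  # cluster assignment per tile

# distance of each tile to its cluster centroid, used later for binning
dists = np.linalg.norm(embeddings - km.cluster_centers_[labels], axis=1)
```

The per-tile distances feed directly into the equal-frequency binning stage, which selects a balanced subset from each cluster.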
Note:
- Paper link is given here: <a href="https://openreview.net/forum?id=rGWjTlK6Ev" target="_blank" rel="noopener"> Openreview </a> or <a href="https://arxiv.org/abs/2511.00383" target="_blank" rel="noopener"> Arxiv </a>.
- If you find our work useful in your research or use parts of this code or dataset, please consider citing our paper.
- Both the collected dataset and the trained model have been made publicly available for research use. Visit <a href="https://huggingface.co/datasets/Path2AI/STARC-9/tree/main" target="_blank" rel="noopener"> here </a>.
DeepCluster++ Usage Guide
We assume representative WSIs have been selected, tiles extracted, and the resulting images filtered using appropriate preprocessing methods. The autoencoder (AE) used in this experiment was trained on a set of tiles (images) until the reconstruction quality on held-out test samples was satisfactory.
Important to note:
- RGB required: Ensure that all the images (tiles) are in RGB format.
- Encoder input size: The pre-trained autoencoder used here was trained on images of size 256x256 pixels. If your tiles have a different size, either retrain an autoencoder at that size or use a compatible pre-trained encoder to extract features.
- Flexible input layout: the program works with any folder structure, whether a single folder of images, multiple folders, or folders containing nested subfolders.
- Explore the <a href="https://drive.google.com/drive/folders/1pd41-1wAfwGD7XP27OS3KhHAL4xqMXRc" target="_blank" rel="noopener"> input and output folder structure</a> to understand the following instructions.
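Because the input may be flat or nested, image discovery amounts to a recursive walk over the input path. The sketch below is a stdlib-only illustration of that idea (the function name `collect_images` and the extension list are ours; the actual codebase may differ):

```python
import os

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def collect_images(input_path):
    """Recursively gather image paths from any folder layout:
    a flat folder, multiple folders, or nested subfolders."""
    found = []
    for root, _dirs, files in os.walk(input_path):
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() in IMAGE_EXTS:
                found.append(os.path.join(root, name))
    return found
```

Non-image files are skipped, so stray metadata or notes in the input tree do no harm.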
Input folder structure
Make sure your data follows one of the layouts shown below. The input folder may either contain images directly (flat structure, as in input_folder_2) or contain subfolders with images inside (as in input_folder_1).
Refer to the Test_samples_1 or Test_samples_2 folder to visualize the outcomes of the following executions with various inputs. The command-line arguments can be adjusted based on the folder structure.
<pre>
/input_path/Test_samples_1
├── input_folder_1 (WSI_1)
│   ├── sub_folder_1
│   │   ├── image1.png
│   │   ├── image2.png
│   │   └── ...
│   ├── sub_folder_2
│   │   └── ...
│   └── sub_folder_m
├── input_folder_2 (WSI_2)
│   ├── image1.png
│   ├── image2.png
│   └── ...
└── input_folder_n (WSI_n)
    ├── image1.png
    ├── image2.png
    └── ...
</pre>

Output folder structure
The following files and folders are created in the output folder.
<pre>
Output
├── clusters
├── features
├── plots
├── samples
└── Summary.csv
</pre>

<pre>
clusters (clusters of each WSI before sampling)
├── WSI_1
│   ├── Cluster_0
│   ├── Cluster_1
│   └── ...
├── WSI_2
│   └── ...
└── ...
</pre>

<pre>
features (two CSV files per WSI: image features and their cluster assignments)
├── WSI_1
│   ├── cluster_assignments.csv
│   └── features.csv
├── WSI_2
│   └── ...
└── ...
</pre>

<pre>
plots (two images per WSI: t-SNE visualization of the k-means clusters, with and without cluster numbers)
├── WSI_1
│   ├── tsne_with_legend.png
│   └── tsne_with_numbers.png
├── WSI_2
│   └── ...
└── ...
</pre>

<pre>
samples (samples drawn from the respective clusters)
├── WSI_1
│   ├── Cluster_0
│   ├── Cluster_1
│   ├── Cluster_2
│   └── ...
├── WSI_2
│   └── ...
└── ...
</pre>

System requirements
Minimum hardware requirements
- RAM: 4 GB
- Processor: Intel i5/i7 (or AMD Ryzen 5/7 equivalent)
- Storage: 512 GB
- GPU: Optional (possible to run on CPU)
Recommended hardware requirements for faster execution
- RAM: 8 GB
- Processor: Intel i7-12th gen or newer (or AMD Ryzen 7 5000 series+)
- Storage: 1 TB SSD (NVMe preferred)
- GPU: NVIDIA RTX 4060 or better (8 GB+ VRAM)
Create a virtual environment and install the required packages
<pre>
conda env create -f environment.yml
conda activate DeepCluster++
</pre>
Command-line arguments
- Each input folder (WSI) should contain at least 256 images to match the 256 PCA components used. A folder with fewer than 256 images is still valid, but DeepCluster++ sampling is not applied to it.
- Consider the following key details about Test_samples to better understand the command-line arguments.
- Input_path - /path/Test_samples_1
- Input folders (WSIs) - WSI_1, WSI_2, WSI_3, WSI_4, and WSI_5
- Sub folders (if available) - sub_folder_1, sub_folder_2, sub_folder_3, sub_folder_4
- The number of clusters for each input folder is determined by taking the square root of the number of samples in that folder.
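The two rules above (the 256-image minimum and the square-root cluster count) can be expressed in a few lines. This is an illustrative sketch; the function name `plan_folder` is ours, and `256` mirrors the default `--dim_reduce` value.

```python
import math

PCA_COMPONENTS = 256  # default --dim_reduce value

def plan_folder(n_images):
    """Decide how a folder is processed: folders with fewer images than
    the PCA component count are kept as-is (no DeepCluster++ sampling);
    otherwise the number of clusters is sqrt(n), rounded."""
    if n_images < PCA_COMPONENTS:
        return {"apply_deepcluster": False, "k": None}
    return {"apply_deepcluster": True, "k": int(round(math.sqrt(n_images)))}
```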
| Argument | Description |
|-----------------------------|-----------------------------------------------------------------------------|
| --input_path /path/Test_samples_1 | The input path containing a set of input folders, each corresponding to a WSI. |
| --selected_input_folders "WSI_1,WSI_2" | Process specific input folders in the path (e.g., WSI names). By default, if not passed, all input folders in the path are considered. |
| --sub_folders "sub_folder_1,sub_folder_2" | If an input folder contains subfolders, specify which ones to process. By default, all subfolders in the input folder are considered. |
| --process_all True | Process all the images in the given input path regardless of input folders and sub_folders. |
| --output_path /path/Test_samples_1_output | Output path to store extracted features, clusters, plots, and samples. |
| --feature_ext encoder_name | Encoder name to extract features. |
| --device cpu | Optional. Default: None. Device type (cpu or all_gpus). |
| --gpu_ids 4,5 | Optional. Default: GPU 0 is assigned. |
| --use_gpu_clustering | Optional. Uses GPU for clustering (requires RAPIDS cuML, default: False) |
| --batch_size 128 | Optional. Default: 128. Recommended: 256. |
| --dim_reduce 256 | Optional. Default: 256. Specify the dimensionality-reduction size. |
| --distance_groups 5 | Default: 5. Divides each cluster into 5 equal-frequency distance groups. |
| --sample_percentage 0.2 | Default: 0.2 (sample 20% in a cluster). Increase this value to collect more samples. |
| --model AE_CRC.pth | Model path. By default, AE_CRC.pth in the current path is used. |
| --seed 42 | Optional. Default: 42. |
| --store_features True | Optional. Store the extracted features (features.csv) in the output folder. |
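Putting the arguments together, a typical invocation might look like the following. The entry-point script name (`main.py`) and the paths are placeholders; only flags documented in the table above are used.

```shell
# process two WSIs with a chosen encoder, sampling 20% per cluster
python main.py \
  --input_path /path/Test_samples_1 \
  --selected_input_folders "WSI_1,WSI_2" \
  --output_path /path/Test_samples_1_output \
  --feature_ext encoder_name \
  --batch_size 256 \
  --distance_groups 5 \
  --sample_percentage 0.2 \
  --seed 42
```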
