DeepCluster
Generating training samples for computer vision tasks
DeepCluster++
STARC-9: A Large-scale Dataset for Multi-Class Tissue Classification for CRC Histopathology, NeurIPS 2025. <br>
Barathi Subramanian, Rathinaraja Jeyaraj, Mitchell Nevin Peterson, Terry Guo, Nigam Shah, Curtis Langlotz, Andrew Y Ng, Jeanne Shen
<a href="https://arxiv.org/abs/2511.00383" target="_blank" rel="noopener"> Paper </a> | Cite
Modern computer vision projects, across research and industry, often rely on supervised learning, which in turn demands well-curated, diverse training data. To efficiently gather representative samples from large image collections, we introduce DeepCluster++, a semi-automated dataset curation framework with four stages:
- extract feature embeddings for all images using a domain-specific encoder (e.g., an autoencoder trained on your data) or a suitable pre-trained backbone;
- cluster the embeddings (e.g., with k-means) to group similar samples, then apply equal-frequency binning within each cluster to capture diverse patterns for each class;
- have subject-matter experts review the selected samples to confirm label quality; and
- train a classifier and validate its performance.
By tuning a small set of parameters, DeepCluster++ lets us balance the number of samples and the level of diversity, substantially reducing manual effort while yielding high-quality training data for robust models.
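The equal-frequency binning step in stage two can be sketched as follows. This is a minimal illustration, assuming per-tile distances to the cluster centroid are already computed; the function name `sample_cluster` is ours, not from the codebase, and the defaults mirror the `--distance_groups` and `--sample_percentage` arguments described later.

```python
import numpy as np

def sample_cluster(distances, n_groups=5, sample_percentage=0.2, rng=None):
    """Equal-frequency binning: split a cluster's members into n_groups
    bins by distance to the centroid, then sample uniformly from each bin
    so that near-centroid and boundary patterns are both represented."""
    rng = np.random.default_rng(rng)
    order = np.argsort(distances)            # member indices sorted by distance
    bins = np.array_split(order, n_groups)   # equal-frequency groups
    selected = []
    for b in bins:
        k = max(1, int(round(len(b) * sample_percentage)))
        selected.extend(rng.choice(b, size=min(k, len(b)), replace=False))
    return np.sort(np.array(selected))

# toy example: 100 tiles in one cluster, 5 bins, 20% sampled per bin
dists = np.random.default_rng(0).random(100)
picked = sample_cluster(dists, n_groups=5, sample_percentage=0.2, rng=0)
```

Because each bin contributes the same fraction, the subset preserves the within-cluster diversity rather than favoring only the most typical (near-centroid) tiles.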
<div align="center"> <img src="https://github.com/rathinaraja/DeepCluster/blob/main/DeepCluster++.jpg" alt="Example" width="950"/> <p><em>Figure: DeepCluster++ framework overview</em></p> </div>

Typical Workflow at a Glance
This example demonstrates how to use DeepCluster++ to curate a diverse training set from tiles extracted from whole-slide images (WSIs) in digital pathology.
- Select WSIs that are representative of your cohort (e.g., cases spanning different tissue types).
- Extract tiles (e.g., 256×256 pixels) from each WSI.
- Preprocess the tiles to retain only high-quality ones (e.g., discarding background or artifact-heavy tiles).
- Arrange tiles on disk in any folder structure (a single folder, multiple folders, or nested subfolders). DeepCluster++ is designed to work with any of these layouts.
- Feature extraction: use a domain-specific pre-trained autoencoder, or any pathology or natural-image foundation model, to encode all image tiles in the input directory. Adding a new encoder to this codebase is straightforward.
- Clustering: Run k-means on embeddings to group morphologically similar tiles.
- Diverse sampling: Apply equal-frequency binning (per cluster) to select a balanced, diverse subset for each class.
- Data collection: review the samples for each WSI and assign them to the appropriate class.
- Expert review (optional but recommended): Have a subject-matter expert validate the sampled tiles before finalizing the training set.
- To train and validate a variety of classifier models, please visit the STARC-9 Evaluation repository.
Although the workflow is demonstrated using WSIs, it is flexible and can be applied to any domain with a collection of images organized in a folder.
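The clustering step above can be sketched with scikit-learn. This is illustrative only: the embeddings here are random stand-ins for encoder output, and the square-root heuristic for choosing k follows the convention described in the command-line notes below.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# stand-in for encoder output: 500 tiles, 256-dim embeddings (e.g., after PCA)
embeddings = rng.normal(size=(500, 256))

# number of clusters: square root of the number of samples in the folder
k = max(2, int(round(np.sqrt(len(embeddings)))))

km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(embeddings)
labels = km.labels_  # cluster assignment per tile

# distance of each tile to its cluster centroid, used later for binning
dists = np.linalg.norm(embeddings - km.cluster_centers_[labels], axis=1)
```

The per-tile distances feed directly into the equal-frequency binning stage, which selects a balanced subset from each cluster.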
Note:
- Paper link is given here: <a href="https://openreview.net/forum?id=rGWjTlK6Ev" target="_blank" rel="noopener"> Openreview </a> or <a href="https://arxiv.org/abs/2511.00383" target="_blank" rel="noopener"> Arxiv </a>.
- If you find our work useful in your research or use parts of this code or dataset, please consider citing our paper.
- Both the collected dataset and the trained model have been made publicly available for research use. Visit <a href="https://huggingface.co/datasets/Path2AI/STARC-9/tree/main" target="_blank" rel="noopener"> here </a>.
DeepCluster++ Usage Guide
We assume representative WSIs have been selected, tiles extracted, and the resulting images filtered using appropriate preprocessing methods. The autoencoder (AE) used in this experiment was trained on a set of tiles (images) until the reconstruction quality on held-out test samples was satisfactory.
Important to note:
- RGB required: Ensure that all the images (tiles) are in RGB format.
- Encoder input size: The pre-trained autoencoder used here was trained on images of size 256x256 pixels. If your tiles have a different size, either retrain an autoencoder at that size or use a compatible pre-trained encoder to extract features.
- Flexible input layout: the program works with any folder structure, whether a single folder of images, multiple folders, or folders containing nested subfolders.
- Explore the <a href="https://drive.google.com/drive/folders/1pd41-1wAfwGD7XP27OS3KhHAL4xqMXRc" target="_blank" rel="noopener"> input and output folder structure</a> to understand the following instructions.
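Because the input may be flat or nested, image discovery amounts to a recursive walk over the input path. The sketch below is a stdlib-only illustration of that idea (the function name `collect_images` and the extension list are ours; the actual codebase may differ):

```python
import os

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def collect_images(input_path):
    """Recursively gather image paths from any folder layout:
    a flat folder, multiple folders, or nested subfolders."""
    found = []
    for root, _dirs, files in os.walk(input_path):
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() in IMAGE_EXTS:
                found.append(os.path.join(root, name))
    return found
```

Non-image files are skipped, so stray metadata or notes in the input tree do no harm.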
Input folder structure
Make sure your data follows one of the layouts shown below. The input folder may either contain images directly (flat structure, as in input_folder_2) or contain subfolders with images inside (as in input_folder_1).
Refer to the Test_samples_1 or Test_samples_2 folder to visualize the outcomes of the following executions with various inputs. The command-line arguments can be adjusted based on the folder structure.
<pre>
/input_path/Test_samples_1
├── input_folder_1 (WSI_1)
│   ├── sub_folder_1
│   │   ├── image1.png
│   │   ├── image2.png
│   │   └── ...
│   ├── sub_folder_2
│   │   └── ...
│   └── sub_folder_m
├── input_folder_2 (WSI_2)
│   ├── image1.png
│   ├── image2.png
│   └── ...
└── input_folder_n (WSI_n)
    ├── image1.png
    ├── image2.png
    └── ...
</pre>

Output folder structure
The following files and folders are created in the output folder.
<pre>
Output
├── clusters
├── features
├── plots
├── samples
└── Summary.csv
</pre>

<pre>
clusters (clusters of each WSI before sampling)
├── WSI_1
│   ├── Cluster_0
│   ├── Cluster_1
│   └── ...
├── WSI_2
│   └── ...
└── ...
</pre>

<pre>
features (two CSV files per WSI: image features and their cluster assignments)
├── WSI_1
│   ├── cluster_assignments.csv
│   └── features.csv
├── WSI_2
│   └── ...
└── ...
</pre>

<pre>
plots (two images per WSI: t-SNE visualization of the k-means clusters, with and without cluster numbers)
├── WSI_1
│   ├── tsne_with_legend.png
│   └── tsne_with_numbers.png
├── WSI_2
│   └── ...
└── ...
</pre>

<pre>
samples (samples drawn from the respective clusters)
├── WSI_1
│   ├── Cluster_0
│   ├── Cluster_1
│   ├── Cluster_2
│   └── ...
├── WSI_2
│   └── ...
└── ...
</pre>

System requirements
Minimum hardware requirements
- RAM: 4 GB
- Processor: Intel i5/i7 (or AMD Ryzen 5/7 equivalent)
- Storage: 512 GB
- GPU: Optional (possible to run on CPU)
Recommended hardware requirements for faster execution
- RAM: 8 GB
- Processor: Intel i7-12th gen or newer (or AMD Ryzen 7 5000 series+)
- Storage: 1 TB SSD (NVMe preferred)
- GPU: NVIDIA RTX 4060 or better (8 GB+ VRAM)
Create a virtual environment and install the required packages
<pre>
conda env create -f environment.yml
conda activate DeepCluster++
</pre>
Command-line arguments
- Each input folder (WSI) should contain at least 256 images to match the 256 PCA components used. A folder with fewer than 256 images is still valid, but DeepCluster++ sampling is not applied to it.
- Consider the following key details about Test_samples to better understand the command-line arguments.
- Input_path - /path/Test_samples_1
- Input folders (WSIs) - WSI_1, WSI_2, WSI_3, WSI_4, and WSI_5
- Sub folders (if available) - sub_folder_1, sub_folder_2, sub_folder_3, sub_folder_4
- The number of clusters for each input folder is determined by taking the square root of the number of samples in that folder.
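The two rules above (the 256-image minimum and the square-root cluster count) can be expressed in a few lines. This is an illustrative sketch; the function name `plan_folder` is ours, and `256` mirrors the default `--dim_reduce` value.

```python
import math

PCA_COMPONENTS = 256  # default --dim_reduce value

def plan_folder(n_images):
    """Decide how a folder is processed: folders with fewer images than
    the PCA component count are kept as-is (no DeepCluster++ sampling);
    otherwise the number of clusters is sqrt(n), rounded."""
    if n_images < PCA_COMPONENTS:
        return {"apply_deepcluster": False, "k": None}
    return {"apply_deepcluster": True, "k": int(round(math.sqrt(n_images)))}
```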
| Argument | Description |
|-----------------------------|-----------------------------------------------------------------------------|
| --input_path /path/Test_samples_1 | The input path containing a set of input folders, each corresponding to a WSI. |
| --selected_input_folders "WSI_1,WSI_2" | Process specific input folders in the path (e.g., WSI names). By default, if not passed, all input folders in the path are considered. |
| --sub_folders "sub_folder_1,sub_folder_2" | If an input folder contains subfolders, specify which ones to process. By default, all subfolders in the input folder are considered. |
| --process_all True | Process all the images in the given input path regardless of input folders and sub_folders. |
| --output_path /path/Test_samples_1_output | Output path to store extracted features, clusters, plots, and samples. |
| --feature_ext encoder_name | Encoder name to extract features. |
| --device cpu | Optional. Default: None. Device type (cpu or all_gpus). |
| --gpu_ids 4,5 | Optional. Default: GPU 0 is assigned. |
| --use_gpu_clustering | Optional. Uses GPU for clustering (requires RAPIDS cuML, default: False) |
| --batch_size 128 | Optional. Default: 128. Recommended: 256. |
| --dim_reduce 256 | Optional. Default: 256. Specify the dimensionality-reduction size. |
| --distance_groups 5 | Default: 5. Divides each cluster into 5 equal-frequency distance groups. |
| --sample_percentage 0.2 | Default: 0.2 (sample 20% in a cluster). Increase this value to collect more samples. |
| --model AE_CRC.pth | Model path. By default, AE_CRC.pth in the current path is used. |
| --seed 42 | Optional. Default: 42. |
| --store_features True | Optional. Store the extracted features (features.csv) in the output folder. |
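Putting the arguments together, a typical invocation might look like the following. The entry-point script name (`main.py`) and the paths are placeholders; only flags documented in the table above are used.

```shell
# process two WSIs with a chosen encoder, sampling 20% per cluster
python main.py \
  --input_path /path/Test_samples_1 \
  --selected_input_folders "WSI_1,WSI_2" \
  --output_path /path/Test_samples_1_output \
  --feature_ext encoder_name \
  --batch_size 256 \
  --distance_groups 5 \
  --sample_percentage 0.2 \
  --seed 42
```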
