Herbdl
Explorations in unsupervised learning of herbaria samples using deep learning models for plant species classification.
Project Overview
This repository contains experiments on herbarium specimen classification using SWIN Transformers, CLIP, and BioCLIP models. The project runs on Boston University's Shared Computing Cluster (SCC).
Repository Structure
datasets/
Dataset management and preprocessing utilities.
- dataset.py - HerbariaClassificationDataset class for loading and preprocessing herbarium images
- constants.py - Path definitions for Kaggle 2021 and 2022 herbarium datasets
- merge_datasets.py / merge_datasets.ipynb - Tools for combining multiple datasets
- Supports flexible label columns (species, family, genus) and integrates with HuggingFace AutoImageProcessor
Key Datasets:
- Kaggle Herbarium 2021 dataset
- Kaggle Herbarium 2022 dataset
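To illustrate what "flexible label columns" can mean in practice, here is a minimal sketch that turns a chosen taxonomic column into integer class ids. The rows, field names, and helper functions below are hypothetical and are not taken from dataset.py; the actual HerbariaClassificationDataset may work differently.

```python
from typing import Dict, List, Tuple

# Hypothetical metadata rows; the real column names and files may differ.
ROWS = [
    {"image": "a.jpg", "species": "Bellis perennis", "genus": "Bellis", "family": "Asteraceae"},
    {"image": "b.jpg", "species": "Aster alpinus", "genus": "Aster", "family": "Asteraceae"},
    {"image": "c.jpg", "species": "Rosa canina", "genus": "Rosa", "family": "Rosaceae"},
]


def build_label_map(rows: List[Dict[str, str]], label_column: str) -> Dict[str, int]:
    """Map each distinct value of label_column to a stable integer id."""
    names = sorted({row[label_column] for row in rows})
    return {name: idx for idx, name in enumerate(names)}


def encode(rows: List[Dict[str, str]], label_column: str) -> List[Tuple[str, int]]:
    """Pair every image with the class id of the selected label column."""
    label_map = build_label_map(rows, label_column)
    return [(row["image"], label_map[row[label_column]]) for row in rows]


# The same rows yield a 2-class problem at family level
# and a 3-class problem at species level.
print(encode(ROWS, "family"))
print(encode(ROWS, "species"))
```

Sorting the label names before assigning ids keeps the mapping reproducible across runs, which matters when checkpoints are evaluated later.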
finetuning/
Model training and evaluation scripts for multiple architectures.
SWIN/
SWIN Transformer model training and evaluation.
- SWIN_finetuning.py - Primary training script using HuggingFace Trainer with WandB logging
- train.py - Custom training loop with layer freezing support
- eval.py - Model evaluation utilities
- Base model: microsoft/swin-base-patch4-window12-384
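Layer freezing, as mentioned for train.py, usually comes down to switching off gradients for selected parameters. The sketch below shows one common way to do that by name prefix. It is not the repository's implementation: the function name, the prefixes, and the parameter names are illustrative assumptions, and SimpleNamespace objects stand in for real tensors so the example runs without PyTorch. The function accepts any iterable of (name, parameter) pairs, which is the shape returned by named_parameters() on a PyTorch module.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Sequence, Tuple


def freeze_by_prefix(
    named_params: Iterable[Tuple[str, Any]],
    frozen_prefixes: Sequence[str],
) -> List[str]:
    """Disable gradients for parameters whose name starts with a frozen prefix.

    Returns the names of the parameters that remain trainable.
    """
    prefixes = tuple(frozen_prefixes)
    trainable = []
    for name, param in named_params:
        if name.startswith(prefixes):
            param.requires_grad = False  # excluded from gradient updates
        else:
            trainable.append(name)
    return trainable


# Stand-in parameters (hypothetical names); with PyTorch installed you
# would pass model.named_parameters() instead.
params = [
    ("swin.embeddings.patch_embeddings.projection.weight", SimpleNamespace(requires_grad=True)),
    ("swin.encoder.layers.0.blocks.0.attention.self.query.weight", SimpleNamespace(requires_grad=True)),
    ("classifier.weight", SimpleNamespace(requires_grad=True)),
]

trainable = freeze_by_prefix(params, ["swin.embeddings", "swin.encoder.layers.0"])
print(trainable)  # ['classifier.weight']
```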
BioCLIP/
BioCLIP zero-shot evaluation for biological domain.
- zero_shot.py - Zero-shot evaluation on herbarium data
- train_evaluation.py - Training set evaluation
SWIN-CLIP/
Hybrid model combining SWIN visual features with CLIP text-image alignment.
- train.py - Main training script
- modular_model.py - Modular architecture implementation
- trainer.py - Custom trainer implementation
- train_baseline.py - Baseline model training
Configuration:
- Model checkpoints saved in output/SWIN/kaggle22/
- WandB integration: project herbdl, entity bu-spark-ml
- Environment variables control freezing, learning rate schedules, and run identification
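As a rough illustration of configuration through environment variables, the sketch below reads three settings with defaults. The variable names (HERBDL_FREEZE, HERBDL_LR_SCHEDULER, HERBDL_RUN_ID), the defaults, and the RunConfig class are all hypothetical; check the training scripts for the names they actually use.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RunConfig:
    freeze_backbone: bool
    lr_scheduler: str
    run_id: str


def load_run_config(env: Optional[Mapping[str, str]] = None) -> RunConfig:
    """Build a RunConfig from environment variables (names are assumptions)."""
    env = os.environ if env is None else env
    return RunConfig(
        # Treat common truthy strings as "freeze the backbone".
        freeze_backbone=env.get("HERBDL_FREEZE", "0").lower() in {"1", "true", "yes"},
        lr_scheduler=env.get("HERBDL_LR_SCHEDULER", "linear"),
        run_id=env.get("HERBDL_RUN_ID", "dev"),
    )


# Passing a plain dict makes the function easy to try without touching
# the real environment.
print(load_run_config({"HERBDL_FREEZE": "true", "HERBDL_RUN_ID": "swin-frozen-01"}))
```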
CLIP/
Zero-shot evaluation experiments using OpenAI CLIP.
- CLIP_0shot.ipynb - Primary evaluation notebook
- Tests species identification with/without visible text labels
- Explores phenology detection (flowers, buds, leaves)
- Documents CLIP's OCR behavior on specimen labels
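For readers new to zero-shot evaluation, the core scoring step can be sketched as follows: embed the image and one text prompt per candidate label, compare them by cosine similarity, and apply a softmax. The vectors, prompts, and the scale value below are toy assumptions made up for illustration; in the notebook the embeddings would come from the CLIP image and text encoders.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_probs(
    image_emb: List[float],
    text_embs: Dict[str, List[float]],
    scale: float = 100.0,  # assumed temperature; CLIP learns its own value
) -> Dict[str, float]:
    """Softmax over scaled image-text similarities, one entry per prompt."""
    logits = {label: scale * cosine(image_emb, emb) for label, emb in text_embs.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Toy 3-dimensional embeddings (made up for the example).
TEXT_EMBS = {
    "a herbarium specimen of Rosa canina": [0.9, 0.1, 0.0],
    "a herbarium specimen of Bellis perennis": [0.1, 0.9, 0.1],
}
IMAGE_EMB = [0.8, 0.2, 0.1]

probs = zero_shot_probs(IMAGE_EMB, TEXT_EMBS)
prediction = max(probs, key=probs.get)
print(prediction)
```

Because the label enters only through the text prompt, the same routine can be reused for phenology prompts (flowers, buds, leaves) by swapping the prompt set.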
clustering_viz/
Interactive visualization of learned representations and outlier detection.
Main Workflows:
- kaggle22_clustering.ipynb - Feature extraction, PCA/t-SNE dimensionality reduction, and visualization generation
- asteraceae_outliers.ipynb - Euclidean distance-based outlier detection
- outlier_detection/asteraceae_outliers.ipynb - Advanced Mahalanobis distance-based outlier detection
- generate_thumbnails.py - Pre-generates optimized thumbnails for fast hover previews
- index.html - Interactive Plotly-based web interface with filtering, search, and image preview
Features:
- Click points to view herbarium specimens (stacks up to 3 images)
- Hover preview with optimized thumbnails (10-20x faster loading)
- Search/filter controls for species clusters
- Axis locking for consistent zoom levels
- Outlier visualization with different markers
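The Euclidean variant of outlier detection can be sketched as: compute the centroid of a group's feature vectors, measure each vector's distance to it, and flag the points whose distance is unusually large. The thresholding rule used here (mean plus k standard deviations) and the toy 2-D points are assumptions for illustration; the notebooks may use a different rule. The Mahalanobis variant follows the same pattern but also accounts for the covariance of the features.

```python
import math
import statistics
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of the vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean_outliers(vectors: Sequence[Sequence[float]], k: float = 2.0) -> List[int]:
    """Indices of vectors farther than mean + k * std from the centroid."""
    center = centroid(vectors)
    dists = [math.dist(v, center) for v in vectors]
    threshold = statistics.mean(dists) + k * statistics.pstdev(dists)
    return [i for i, d in enumerate(dists) if d > threshold]


# Seven points near the origin and one far away (toy data).
POINTS = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0),
    (0.0, -0.1), (0.1, 0.1), (-0.1, -0.1), (5.0, 5.0),
]

print(euclidean_outliers(POINTS))  # [7]
```

With very small groups a single point cannot exceed a large k, so the threshold is best tuned per dataset.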
descriptions/
Text description generation and web scraping utilities.
- scrape_ncsu.py - Scrape plant descriptions from NCSU database
- wikipedia_scrape.py - Extract plant information from Wikipedia
- generate_conv.py - Generate conversational descriptions
- playground.ipynb - Experimentation notebook
kaggle_eval/
Kaggle competition evaluation scripts and results.
- evaluation.py - Evaluation metrics computation
- CLIP_explain.ipynb - CLIP model interpretability analysis
- evaluation_result/ - JSON files with family, genus, and species-level results
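A minimal sketch of scoring predictions at each taxonomic level and serializing the result as JSON is shown below. Plain accuracy is used as an assumed metric, and the record layout and function name are hypothetical; evaluation.py may compute different metrics or use a different output schema.

```python
import json
from typing import Dict, List

LEVELS = ("family", "genus", "species")


def accuracy_by_level(
    predictions: List[Dict[str, str]],
    ground_truth: List[Dict[str, str]],
) -> Dict[str, float]:
    """Fraction of specimens predicted correctly at each taxonomic level."""
    results = {}
    for level in LEVELS:
        correct = sum(p[level] == t[level] for p, t in zip(predictions, ground_truth))
        results[level] = correct / len(ground_truth)
    return results


# Toy records (made up): the second prediction gets the family and genus
# right but the species wrong.
TRUTH = [
    {"family": "Asteraceae", "genus": "Aster", "species": "Aster alpinus"},
    {"family": "Rosaceae", "genus": "Rosa", "species": "Rosa canina"},
]
PREDS = [
    {"family": "Asteraceae", "genus": "Aster", "species": "Aster alpinus"},
    {"family": "Rosaceae", "genus": "Rosa", "species": "Rosa rugosa"},
]

scores = accuracy_by_level(PREDS, TRUTH)
print(json.dumps(scores, indent=2))
```

Reporting all three levels side by side shows whether species-level errors are at least landing in the right genus or family.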
utils/
General utility scripts.
- resize_images.py - Batch image resizing
- image_install_parallel.py - Parallel image downloading
- notifications.py - Job notification system
- compression.sh - Image compression utilities
- labeling.ipynb - Data labeling tools
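Parallel image downloading is typically I/O-bound, so a thread pool is a natural fit. The sketch below uses only the standard library and is not the code in image_install_parallel.py: the function names, the worker count, and the error handling are assumptions. The fetch function is injectable so the example can run without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Union
from urllib.request import urlopen


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download one URL and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(
    urls: Iterable[str],
    fetch_fn: Callable[[str], bytes] = fetch,
    max_workers: int = 8,
) -> Dict[str, Union[bytes, Exception]]:
    """Fetch URLs concurrently; failures are stored instead of raised."""

    def safe(url: str):
        try:
            return url, fetch_fn(url)
        except Exception as exc:  # keep going when a single download fails
            return url, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(safe, urls))


# Offline demonstration with a stand-in fetcher (no network access).
def fake_fetch(url: str) -> bytes:
    if url.endswith("missing.jpg"):
        raise FileNotFoundError(url)
    return url.encode()


results = download_all(["img/1.jpg", "img/missing.jpg"], fetch_fn=fake_fetch)
print({url: type(value).__name__ for url, value in results.items()})
```

Collecting exceptions per URL makes it straightforward to retry only the failed downloads in a second pass.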
Quick Start
Environment Setup
# Create and activate virtual environment
virtualenv venv
source venv/bin/activate
# Install dependencies for finetuning
pip install -r finetuning/requirements.txt
# Install dependencies for clustering visualization
pip install -r clustering_viz/requirements.txt
Running SWIN Finetuning
cd finetuning/SWIN
python SWIN_finetuning.py \
--output_dir ../output/SWIN/kaggle22/ \
--model_name_or_path "microsoft/swin-base-patch4-window12-384" \
--train_file ../datasets/train_22_scientific.json \
--do_train --do_eval \
--per_device_train_batch_size 8 \
--learning_rate 1e-4 \
--num_train_epochs 3
Creating Clustering Visualizations
cd clustering_viz
jupyter notebook kaggle22_clustering.ipynb
# Configure checkpoint path and validation dataset
# Generate thumbnails: python generate_thumbnails.py <plot_json_file>
# Update index.html with JSON filepath
# View via SCC OnDemand
Data Paths (SCC)
- Base project: /projectnb/herbdl/
- Image data: /projectnb/herbdl/data/kaggle-herbaria/
- Model checkpoints: /projectnb/herbdl/workspaces/<username>/herbdl/finetuning/output/
Additional Documentation
- CLAUDE.md - Detailed project guide for Claude Code
- Subdirectory READMEs in datasets/, utils/, CLIP/, and finetuning/output/
License
See LICENSE file for details.
