
SubpopBench

[ICML 2023] Change is Hard: A Closer Look at Subpopulation Shift



<p align="center"> <img src="assets/logo.png" align="center" width="80%"> </p>


Overview

SubpopBench is a benchmark for subpopulation shift: a living PyTorch suite of benchmark datasets and algorithms, introduced in Change is Hard: A Closer Look at Subpopulation Shift (Yang et al., ICML 2023).

Contents

Currently we support 13 datasets and ~20 algorithms that span different learning strategies. Feel free to send us a PR to add your algorithm / dataset for subpopulation shift.

Available Algorithms

The currently available algorithms are:

Send us a PR to add your algorithm! Our implementations use the hyper-parameter grids described here.

Available Datasets

The currently available datasets are:

Send us a PR to add your dataset! You can follow the dataset format described here.

Model Architectures & Pretraining Methods

The supported image architectures are:

The supported text architectures are:

Note that text architectures are only compatible with CivilComments.

Subpopulation Shift Scenarios

We characterize four basic types of subpopulation shift using our framework, and categorize each dataset into its most dominant shift type.

  • Spurious Correlations (SC): a certain attribute $a$ is spuriously correlated with the label $y$ in training but not in testing.
  • Attribute Imbalance (AI): certain attributes are sampled with a much smaller probability than others in $p_{\text{train}}$, but not in $p_{\text{test}}$.
  • Class Imbalance (CI): certain (minority) classes are underrepresented in $p_{\text{train}}$, but not in $p_{\text{test}}$.
  • Attribute Generalization (AG): certain attributes can be totally missing in $p_{\text{train}}$, but present in $p_{\text{test}}$.
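The SC scenario can be made concrete with a toy sampler (not part of SubpopBench; the function and probabilities below are illustrative): in training, the attribute agrees with the label most of the time, while in testing it is independent of the label.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_split(n, p_corr):
    """Sample binary labels y and attributes a, with P(a == y) = p_corr."""
    y = rng.integers(0, 2, size=n)
    a = np.where(rng.random(n) < p_corr, y, 1 - y)
    return y, a

# Spurious correlation (SC): a tracks y in p_train but not in p_test.
y_tr, a_tr = sample_split(10_000, p_corr=0.95)  # train: a == y ~95% of the time
y_te, a_te = sample_split(10_000, p_corr=0.50)  # test: a independent of y

train_agree = (y_tr == a_tr).mean()  # close to 0.95
test_agree = (y_te == a_te).mean()   # close to 0.50
```

A model that shortcuts through $a$ will look accurate on the training distribution but fail on the minority groups where $a \neq y$ at test time.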

Evaluation Metrics

We include a variety of metrics for a thorough evaluation from different angles:

  • Average Accuracy & Worst Accuracy
  • Average Precision & Worst Precision
  • Average F1-score & Worst F1-score
  • Adjusted Accuracy
  • Balanced Accuracy
  • AUROC & AUPRC
  • Expected Calibration Error (ECE)

Model Selection Criteria

We highlight the impact of whether attributes are known in (1) the training set and (2) the validation set: the former is specified by --train_attr in train.py, and the latter by the model selection criterion. A few important selection criteria:

  • OracleWorstAcc: Picks the best test-set worst-group accuracy (oracle)
  • ValWorstAccAttributeYes: Picks the best val-set worst-group accuracy (attributes known in validation)
  • ValWorstAccAttributeNo: Picks the best val-set worst-class accuracy (attributes unknown in validation; group degenerates to class)
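The ValWorstAccAttributeYes criterion amounts to ranking checkpoints by their minimum per-group validation accuracy. A minimal sketch (the checkpoint record format here is hypothetical, not SubpopBench's actual log format):

```python
# Hypothetical checkpoint records: training step plus per-group validation accuracies.
checkpoints = [
    {"step": 1000, "val_group_accs": [0.92, 0.55, 0.88]},
    {"step": 2000, "val_group_accs": [0.90, 0.71, 0.85]},
    {"step": 3000, "val_group_accs": [0.94, 0.62, 0.90]},
]

# Pick the checkpoint that maximizes worst-group validation accuracy.
best = max(checkpoints, key=lambda c: min(c["val_group_accs"]))
# best["step"] == 2000: its worst group (0.71) beats 0.55 and 0.62.
```

When validation attributes are unknown (ValWorstAccAttributeNo), the same selection runs over per-class rather than per-group accuracies, since groups degenerate to classes.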

Getting Started

Installation

Prerequisites

Run the following commands to clone this repo and create the Conda environment:

git clone git@github.com:YyzHarry/SubpopBench.git
cd SubpopBench/
conda env create -f environment.yml
conda activate subpop_bench

Downloading Data

Download the original datasets and generate corresponding metadata in your data_path:

python -m subpopbench.scripts.download --data_path <data_path> --download

For MIMICNoFinding, CheXpertNoFinding, CXRMultisite, and MIMICNotes, see MedicalData.md for instructions for downloading the datasets manually.

Code Overview

Main Files

  • train.py: main training script
  • sweep.py: launch a sweep with all selected algorithms (provided in subpopbench/learning/algorithms.py) on all subpopulation shift datasets
  • collect_results.py: collect sweep results to automatically generate result tables (as in the paper)

Main Arguments

  • train.py:
    • --dataset: name of chosen subpopulation dataset
    • --algorithm: choose algorithm used for running
    • --train_attr: whether attributes are known or not during training (yes or no)
    • --data_dir: data path
    • --output_dir: output path
    • --output_folder_name: output folder name
