# TableShift
TableShift is a benchmarking library for machine learning with tabular data under distribution shift.
You can read more about TableShift at tableshift.org or read the full paper (published in the NeurIPS 2023 Datasets & Benchmarks Track) on arXiv. If you use the benchmark in your research, please cite the paper:
```bibtex
@article{gardner2023tableshift,
  title={Benchmarking Distribution Shift in Tabular Data with TableShift},
  author={Gardner, Josh and Popovic, Zoran and Schmidt, Ludwig},
  journal={Advances in Neural Information Processing Systems},
  year={2023}
}
```
If you find an issue, please file a GitHub issue.
## Quickstart
**Environment setup:** We recommend using Docker with TableShift. Our dataset construction and model pipelines have a diverse set of dependencies, including non-Python files required to make some libraries work. As a result, we recommend the provided Docker image for using the benchmark, and suggest forking it for your own development.
```bash
# fetch the docker image
docker pull ghcr.io/jpgard/tableshift:latest

# run it to test your setup; this automatically launches examples/run_expt.py
docker run ghcr.io/jpgard/tableshift:latest --model xgb

# optionally, use the container interactively
docker run -it --entrypoint=/bin/bash ghcr.io/jpgard/tableshift:latest
```
**Conda:** We recommend Docker when running training or using any of the pretrained modeling code, as the training libraries have a complex and subtle set of dependencies that can be difficult to configure outside Docker. However, Conda can provide a more lightweight environment for basic development and exploration with TableShift, so we describe how to set it up here.
To create a conda environment, simply clone this repo, enter the root directory, and run the following commands to create and test a local execution environment:
```bash
# set up the environment
conda env create -f environment.yml
conda activate tableshift

# test the install by running the training script
python examples/run_expt.py
```
The final command above prints detailed logging output as the script executes. When you see `training completed! test accuracy: 0.6221`, your environment is ready to go! (Accuracy may vary slightly due to randomness.)
**Accessing datasets:** If you simply want to load and use a standard version of one of the public TableShift datasets, it's as simple as:
```python
from tableshift import get_dataset

dataset_name = "diabetes_readmission"
dset = get_dataset(dataset_name)
```
The full list of identifiers for all available datasets is below; simply swap any of these for `dataset_name` to access the relevant data.

If you would like to use a dataset without a domain split, replace `get_dataset()` with `get_iid_dataset()`.
The call to `get_dataset()` returns a `TabularDataset` that you can use to easily load tabular data in several formats, including Pandas DataFrames and PyTorch DataLoaders:
```python
# Fetch a pandas DataFrame of the training set
X_tr, y_tr, _, _ = dset.get_pandas("train")

# Fetch and use a PyTorch DataLoader
train_loader = dset.get_dataloader("train", batch_size=1024)
for X, y, _, _ in train_loader:
    ...
```
For all TableShift datasets, the following splits are available: `train`, `validation`, `id_test`, `ood_validation`, `ood_test`.

For IID datasets (those without a domain split), these splits are available: `train`, `validation`, `test`.
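To make the split semantics concrete, here is a minimal, self-contained sketch of how a domain split yields the five splits above: rows from held-out domains form the `ood_*` splits, while in-domain rows are partitioned into `train`, `validation`, and `id_test`. This is plain Python for illustration only — the row format, domain labels, and proportions are our own assumptions, not TableShift internals.

```python
import random

random.seed(0)

# Toy rows: each has a feature x and a domain label. Domain "B" plays the
# role of the held-out (out-of-distribution) domain.
rows = [{"x": i, "domain": "A" if i % 3 else "B"} for i in range(12)]

# Out-of-domain rows become ood_validation / ood_test; in-domain rows are
# shuffled and partitioned into train / validation / id_test.
ood = [r for r in rows if r["domain"] == "B"]
iid = [r for r in rows if r["domain"] == "A"]
random.shuffle(iid)

n = len(iid)
splits = {
    "train": iid[: int(0.7 * n)],
    "validation": iid[int(0.7 * n): int(0.85 * n)],
    "id_test": iid[int(0.85 * n):],
    "ood_validation": ood[: len(ood) // 2],
    "ood_test": ood[len(ood) // 2:],
}

print({k: len(v) for k, v in splits.items()})
# → {'train': 5, 'validation': 1, 'id_test': 2, 'ood_validation': 2, 'ood_test': 2}
```

The key property is that no row from the held-out domain ever appears in `train`, `validation`, or `id_test`, which is what makes the `ood_*` splits a genuine distribution-shift evaluation.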
There is a complete example of a training script in `examples/run_expt.py`.
## Benchmark Dataset Availability
**tl;dr:** if you want to get started exploring ASAP, use the datasets marked "Public" below.
All of the datasets used in the TableShift benchmark are either publicly available or provide open credentialized access. The datasets with open credentialized access require signing a data use agreement; as a result, some datasets must be manually fetched and stored locally. TableShift makes this process as simple as possible.
A list of datasets, their string identifiers in TableShift, and the corresponding access levels is below. The string identifier is the value that should be passed as the `experiment` parameter to `get_dataset()` or the `--experiment` flag of `run_expt.py` and other training scripts.
| Dataset                 | String Identifier         | Availability                          | Source                                                    |
|-------------------------|---------------------------|---------------------------------------|-----------------------------------------------------------|
| Voting                  | `anes`                    | Public Credentialized Access (source) | American National Election Studies (ANES)                 |
| ASSISTments             | `assistments`             | Public                                | Kaggle                                                    |
| Childhood Lead          | `nhanes_lead`             | Public                                | National Health and Nutrition Examination Survey (NHANES) |
| College Scorecard       | `college_scorecard`       | Public                                | College Scorecard                                         |
| Diabetes                | `brfss_diabetes`          | Public                                | Behavioral Risk Factor Surveillance System (BRFSS)        |
| Food Stamps             | `acsfoodstamps`           | Public                                | American Community Survey (via folktables)                |
| HELOC                   | `heloc`                   | Public Credentialized Access (source) | FICO                                                      |
| Hospital Readmission    | `diabetes_readmission`    | Public                                | UCI                                                       |
| Hypertension            | `brfss_blood_pressure`    | Public                                | Behavioral Risk Factor Surveillance System (BRFSS)        |
| ICU Length of Stay      | `mimic_extract_los_3`     | Public Credentialized Access (source) | MIMIC-III via MIMIC-Extract                               |
| ICU Mortality           | `mimic_extract_mort_hosp` | Public Credentialized Access (source) | MIMIC-III via MIMIC-Extract                               |
| Income                  | `acsincome`               | Public                                | American Community Survey (via folktables)                |
| Public Health Insurance | `acspubcov`               | Public                                | American Community Survey (via folktables)                |
| Sepsis                  | `physionet`               | Public                                | PhysioNet                                                 |
| Unemployment            | `acsunemployment`         | Public                                | American Community Survey (via folktables)                |
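For scripted sweeps over the benchmark, it can be handy to keep the public identifiers in code. The list below simply transcribes the "Public" rows of the table above; the grouping into a Python list is our own convenience, not a TableShift API. Each entry can be passed to `get_dataset()` as shown in the Quickstart.

```python
# String identifiers for the publicly available (non-credentialized)
# TableShift benchmark datasets, transcribed from the table above.
PUBLIC_DATASETS = [
    "assistments",            # ASSISTments
    "nhanes_lead",            # Childhood Lead
    "college_scorecard",      # College Scorecard
    "brfss_diabetes",         # Diabetes
    "acsfoodstamps",          # Food Stamps
    "diabetes_readmission",   # Hospital Readmission
    "brfss_blood_pressure",   # Hypertension
    "acsincome",              # Income
    "acspubcov",              # Public Health Insurance
    "physionet",              # Sepsis
    "acsunemployment",        # Unemployment
]

# e.g. sweep over them all (requires tableshift installed):
#   for name in PUBLIC_DATASETS:
#       dset = get_dataset(name)
print(len(PUBLIC_DATASETS))  # → 11
```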
Note that details on the data source, which files to load, and the feature codings are provided in the TableShift source code for each dataset and data source (see `data_sources.py` and the `tableshift.datasets` module).
For additional, non-benchmark datasets (possibly with only IID splits, not a
