# TableShift
TableShift is a benchmarking library for machine learning with tabular data under distribution shift.
You can read more about TableShift at tableshift.org or read the full paper (published in the NeurIPS 2023 Datasets & Benchmarks Track) on arXiv. If you use the benchmark in your research, please cite the paper:
```bibtex
@article{gardner2023tableshift,
  title={Benchmarking Distribution Shift in Tabular Data with TableShift},
  author={Gardner, Josh and Popovic, Zoran and Schmidt, Ludwig},
  journal={Advances in Neural Information Processing Systems},
  year={2023}
}
```
If you find an issue, please file a GitHub issue.
## Quickstart
**Environment setup:** We recommend using Docker with TableShift. Our dataset construction and model pipelines have a diverse set of dependencies, including non-Python files required to make some libraries work. As a result, we recommend the provided Docker image for using the benchmark, and suggest forking it for your own development.
```bash
# fetch the docker image
docker pull ghcr.io/jpgard/tableshift:latest

# run it to test your setup; this automatically launches examples/run_expt.py
docker run ghcr.io/jpgard/tableshift:latest --model xgb

# optionally, use the container interactively
docker run -it --entrypoint=/bin/bash ghcr.io/jpgard/tableshift:latest
```
**Conda:** We recommend Docker when running training or using any of the pretrained modeling code, as the training libraries have a complex and subtle set of dependencies that can be difficult to configure outside Docker. However, Conda can provide a more lightweight environment for basic development and exploration with TableShift, so we describe how to set it up here.
To create a conda environment, simply clone this repo, enter the root directory, and run the following commands to create and test a local execution environment:
```bash
# set up the environment
conda env create -f environment.yml
conda activate tableshift

# test the install by running the training script
python examples/run_expt.py
```
The final command above prints detailed logging output as the script executes. When you see `training completed! test accuracy: 0.6221`, your environment is ready to go! (Accuracy may vary slightly due to randomness.)
**Accessing datasets:** If you simply want to load and use a standard version of one of the public TableShift datasets, it's as simple as:
```python
from tableshift import get_dataset

dataset_name = "diabetes_readmission"
dset = get_dataset(dataset_name)
```
The full list of identifiers for all available datasets is below; simply swap any of these for `dataset_name` to access the relevant data.

If you would like to use a dataset without a domain split, replace `get_dataset()` with `get_iid_dataset()`.
The call to `get_dataset()` returns a `TabularDataset` that you can use to easily load tabular data in several formats, including Pandas DataFrames and PyTorch DataLoaders:
```python
# Fetch a pandas DataFrame of the training set
X_tr, y_tr, _, _ = dset.get_pandas("train")

# Fetch and use a PyTorch DataLoader
train_loader = dset.get_dataloader("train", batch_size=1024)
for X, y, _, _ in train_loader:
    ...
```
For all TableShift datasets, the following splits are available: `train`, `validation`, `id_test`, `ood_validation`, `ood_test`.

For IID datasets (those without a domain split), these splits are available: `train`, `validation`, `test`.
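To make the split semantics concrete, here is a minimal, self-contained sketch of how a domain split yields the five splits above: rows from held-out domains form the `ood_*` splits, while in-domain rows are partitioned into `train`, `validation`, and `id_test`. This is plain Python for illustration only — the row format, domain labels, and proportions are our own assumptions, not TableShift internals.

```python
import random

random.seed(0)

# Toy rows: each has a feature x and a domain label. Domain "B" plays the
# role of the held-out (out-of-distribution) domain.
rows = [{"x": i, "domain": "A" if i % 3 else "B"} for i in range(12)]

# Out-of-domain rows become ood_validation / ood_test; in-domain rows are
# shuffled and partitioned into train / validation / id_test.
ood = [r for r in rows if r["domain"] == "B"]
iid = [r for r in rows if r["domain"] == "A"]
random.shuffle(iid)

n = len(iid)
splits = {
    "train": iid[: int(0.7 * n)],
    "validation": iid[int(0.7 * n): int(0.85 * n)],
    "id_test": iid[int(0.85 * n):],
    "ood_validation": ood[: len(ood) // 2],
    "ood_test": ood[len(ood) // 2:],
}

print({k: len(v) for k, v in splits.items()})
# → {'train': 5, 'validation': 1, 'id_test': 2, 'ood_validation': 2, 'ood_test': 2}
```

The key property is that no row from the held-out domain ever appears in `train`, `validation`, or `id_test`, which is what makes the `ood_*` splits a genuine distribution-shift evaluation.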
There is a complete example of a training script in `examples/run_expt.py`.
## Benchmark Dataset Availability
**tl;dr:** if you want to get started exploring ASAP, use the datasets marked "Public" below.
All of the datasets used in the TableShift benchmark are either publicly available or provide open credentialized access. The datasets with open credentialized access require signing a data use agreement; as a result, some datasets must be manually fetched and stored locally. TableShift makes this process as simple as possible.
A list of datasets, their string identifiers in TableShift, and the corresponding access levels is below. The string identifier is the value that should be passed as the `experiment` parameter to `get_dataset()` or the `--experiment` flag of `run_expt.py` and other training scripts.
| Dataset                 | String Identifier         | Availability                          | Source                                                    |
|-------------------------|---------------------------|---------------------------------------|-----------------------------------------------------------|
| Voting                  | `anes`                    | Public Credentialized Access (source) | American National Election Studies (ANES)                 |
| ASSISTments             | `assistments`             | Public                                | Kaggle                                                    |
| Childhood Lead          | `nhanes_lead`             | Public                                | National Health and Nutrition Examination Survey (NHANES) |
| College Scorecard       | `college_scorecard`       | Public                                | College Scorecard                                         |
| Diabetes                | `brfss_diabetes`          | Public                                | Behavioral Risk Factor Surveillance System (BRFSS)        |
| Food Stamps             | `acsfoodstamps`           | Public                                | American Community Survey (via folktables)                |
| HELOC                   | `heloc`                   | Public Credentialized Access (source) | FICO                                                      |
| Hospital Readmission    | `diabetes_readmission`    | Public                                | UCI                                                       |
| Hypertension            | `brfss_blood_pressure`    | Public                                | Behavioral Risk Factor Surveillance System (BRFSS)        |
| ICU Length of Stay      | `mimic_extract_los_3`     | Public Credentialized Access (source) | MIMIC-III via MIMIC-Extract                               |
| ICU Mortality           | `mimic_extract_mort_hosp` | Public Credentialized Access (source) | MIMIC-III via MIMIC-Extract                               |
| Income                  | `acsincome`               | Public                                | American Community Survey (via folktables)                |
| Public Health Insurance | `acspubcov`               | Public                                | American Community Survey (via folktables)                |
| Sepsis                  | `physionet`               | Public                                | PhysioNet                                                 |
| Unemployment            | `acsunemployment`         | Public                                | American Community Survey (via folktables)                |
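For scripted sweeps over the benchmark, it can be handy to keep the public identifiers in code. The list below simply transcribes the "Public" rows of the table above; the grouping into a Python list is our own convenience, not a TableShift API. Each entry can be passed to `get_dataset()` as shown in the Quickstart.

```python
# String identifiers for the publicly available (non-credentialized)
# TableShift benchmark datasets, transcribed from the table above.
PUBLIC_DATASETS = [
    "assistments",            # ASSISTments
    "nhanes_lead",            # Childhood Lead
    "college_scorecard",      # College Scorecard
    "brfss_diabetes",         # Diabetes
    "acsfoodstamps",          # Food Stamps
    "diabetes_readmission",   # Hospital Readmission
    "brfss_blood_pressure",   # Hypertension
    "acsincome",              # Income
    "acspubcov",              # Public Health Insurance
    "physionet",              # Sepsis
    "acsunemployment",        # Unemployment
]

# e.g. sweep over them all (requires tableshift installed):
#   for name in PUBLIC_DATASETS:
#       dset = get_dataset(name)
print(len(PUBLIC_DATASETS))  # → 11
```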
Note that details on the data source, which files to load, and the feature codings are provided in the TableShift source code for each dataset and data source (see `data_sources.py` and the `tableshift.datasets` module).
For additional, non-benchmark datasets (possibly with only IID splits, not a
