
Folktables

Datasets derived from US census data


License: MIT

<p align="center"> <img src="assets/logo.jpg" height="400" width="600"> </p>

Folktables is a Python package that provides access to datasets derived from the US Census, facilitating the benchmarking of machine learning algorithms. The package includes a suite of pre-defined prediction tasks in domains including income, employment, health, transportation, and housing, and also includes tools for creating new prediction tasks of interest in the US Census data ecosystem. The package additionally enables systematic studies of the effect of distribution shift, as each prediction task can be instantiated on datasets spanning multiple years and all states within the US.

Why the name? Folktables is a neologism describing tabular data about individuals. It emphasizes that data has the power to create and shape narratives about populations and challenges us to think carefully about the data we collect and use.

For more information about these datasets, including the motivations behind their curation and some examples of empirical findings, please see our paper and/or this video presentation.

Table of Contents

  1. Basic installation instructions
  2. Quick start examples
  3. Prediction tasks in folktables
  4. Scope and limitations
  5. Citing folktables
  6. References

Folktables is still under active development! If you find bugs or have feature requests, please file a Github issue. We welcome all kinds of issues, especially those related to correctness, documentation, performance, and new features.

Following Datasheets for Datasets (Gebru et al.), we provide a datasheet for Folktables; see [Datasheet].

Basic installation instructions

  1. (Optionally) create and activate a virtual environment
python3 -m venv folkenv
source folkenv/bin/activate
  2. Install via pip
pip install folktables

You can also install folktables directly from source.

git clone https://github.com/zykls/folktables.git
cd folktables
pip install -r requirements.txt

Quick start examples

Folktables contains a suite of prediction tasks derived from US Census data that can be easily downloaded and used for a variety of benchmarking tasks. For information about the features, response, or group membership coding for any of the datasets, please refer to the ACS PUMS documentation.

Evaluating algorithms for fair machine learning

We first construct a data source for the 2018 yearly American Community Survey, download the corresponding data for Alabama, and use this data to instantiate a prediction task of interest, for example, the ACSEmployment task.

from folktables import ACSDataSource, ACSEmployment

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["AL"], download=True)
features, label, group = ACSEmployment.df_to_numpy(acs_data)

Next we train a simple model on this dataset and use the group labels to evaluate the model's violation of equality of opportunity, a common fairness metric.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

##### Your favorite learning algorithm here #####
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

yhat = model.predict(X_test)

white_tpr = np.mean(yhat[(y_test == 1) & (group_test == 1)])
black_tpr = np.mean(yhat[(y_test == 1) & (group_test == 2)])

# Equality of opportunity violation: 0.0871
white_tpr - black_tpr
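The quantity above is simply the difference of per-group true positive rates. As a self-contained sketch (with made-up labels, predictions, and group codes rather than ACS data, mirroring the 1/2 coding of RAC1P used above), the same computation looks like:

```python
import numpy as np

# Hypothetical test labels, predictions, and group codes; not real ACS data
y_test = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 1])
yhat = np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 1])
group_test = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])

# True positive rate per group: P(yhat = 1 | y = 1, group = g)
white_tpr = np.mean(yhat[(y_test == 1) & (group_test == 1)])  # 3/4
black_tpr = np.mean(yhat[(y_test == 1) & (group_test == 2)])  # 2/4

print(white_tpr - black_tpr)  # 0.25
```

Equality of opportunity is satisfied exactly when this difference is zero; reporting the signed gap, as above, also shows which group the model favors.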

The ACS data source contains data for all fifty states, each with a slightly different distribution of features and response. This increases the diversity of environments in which we can evaluate our methods. For instance, we can generate another ACSEmployment task using data from Texas and repeat the experiment:

acs_tx = data_source.get_data(states=["TX"], download=True)
tx_features, tx_label, tx_group = ACSEmployment.df_to_numpy(acs_tx)

X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    tx_features, tx_label, tx_group, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

yhat = model.predict(X_test)
white_tpr = np.mean(yhat[(y_test == 1) & (group_test == 1)])
black_tpr = np.mean(yhat[(y_test == 1) & (group_test == 2)])

# Equality of opportunity violation: 0.0397
white_tpr - black_tpr

Distribution shift across states

Each prediction problem in Folktables can be instantiated on data from every US state. This allows us not just to construct a diverse set of test environments, but also to use Folktables to study questions around distribution shift. For example, we can train a classifier using data from California and then evaluate it on data from Michigan.

from folktables import ACSDataSource, ACSIncome
from sklearn.linear_model import LogisticRegression

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
ca_data = data_source.get_data(states=["CA"], download=True)
mi_data = data_source.get_data(states=["MI"], download=True)
ca_features, ca_labels, _ = ACSIncome.df_to_numpy(ca_data)
mi_features, mi_labels, _ = ACSIncome.df_to_numpy(mi_data)

# Plug-in your method for tabular datasets
model = LogisticRegression()

# Train on CA data
model.fit(ca_features, ca_labels)

# Test on MI data
model.score(mi_features, mi_labels)

Distribution shift across time

In addition to the variation in distributions across states discussed previously, Folktables also contains ACS data for several different years, which itself constitutes a form of temporal distribution shift. For example, we can train a classifier using data from California in 2014 and evaluate how its equality of opportunity violation or accuracy varies over time.

from folktables import ACSDataSource, ACSPublicCoverage
from sklearn.linear_model import LogisticRegression

# Download 2014 data
data_source = ACSDataSource(survey_year='2014', horizon='1-Year', survey='person')
acs_data14 = data_source.get_data(states=["CA"], download=True)
features14, labels14, _ = ACSPublicCoverage.df_to_numpy(acs_data14)

# Train model on 2014 data
# Plug-in your method for tabular datasets
model = LogisticRegression()
model.fit(features14, labels14)

# Evaluate model on 2015-2018 data
accuracies = []
for year in ['2015', '2016', '2017', '2018']:
    data_source = ACSDataSource(survey_year=year, horizon='1-Year', survey='person')
    acs_data = data_source.get_data(states=["CA"], download=True)
    features, labels, _ = ACSPublicCoverage.df_to_numpy(acs_data)
    accuracies.append(model.score(features, labels))
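The train-once, score-across-years pattern above can also be sketched offline. The snippet below is a self-contained toy, not ACS data: the synthetic Gaussian classes, the `make_year` helper, and the shift values are all invented for illustration. It shows how a frozen model's accuracy degrades as the test distribution drifts:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_year(shift, n=2000):
    # Two Gaussian classes whose means drift together by `shift` over "time"
    X0 = rng.normal(loc=-1.0 + shift, scale=1.0, size=(n // 2, 2))
    X1 = rng.normal(loc=1.0 + shift, scale=1.0, size=(n // 2, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * (n // 2) + [1] * (n // 2))
    return X, y

# "2014": fit once on unshifted data
X14, y14 = make_year(shift=0.0)
model = LogisticRegression().fit(X14, y14)

# "2015-2018": score the frozen model on increasingly shifted draws
accuracies = [model.score(*make_year(shift=s)) for s in (0.0, 0.5, 1.0, 1.5)]
print(accuracies)
```

Accuracy falls as the shift grows, because the boundary fit at shift 0 cuts through the displaced class-0 cluster; with real ACS years, the drift is of course less tidy than a mean translation.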

Extract data in CSV with pandas

The data can easily be exported to CSV via pandas dataframes.

from folktables import ACSDataSource, ACSIncome

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
ca_data = data_source.get_data(states=["CA"], download=True)

ca_features, ca_labels, _ = ACSIncome.df_to_pandas(ca_data)

ca_features.to_csv('ca_features.csv', index=False)
ca_labels.to_csv('ca_labels.csv', index=False)

See the repository's examples for encoding categorical features with the df_to_pandas method.
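In spirit, that categorical encoding resembles one-hot encoding with pandas' get_dummies. The following self-contained sketch uses a toy column of made-up codes (SCHL is the ACS education-attainment variable, but these rows are not real data):

```python
import pandas as pd

# Toy frame with one ACS-style categorical column (invented codes)
df = pd.DataFrame({"SCHL": [16, 21, 16, 19]})

# One-hot encode: one indicator column per distinct code
dummies = pd.get_dummies(df["SCHL"], prefix="SCHL")
print(dummies.columns.tolist())  # ['SCHL_16', 'SCHL_19', 'SCHL_21']
```

Encoding codes as indicators rather than leaving them as integers matters for linear models, which would otherwise treat the arbitrary code values as ordered magnitudes.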

Prediction tasks in folktables

Folktables provides the following pre-defined prediction tasks:

  • ACSIncome: predict whether an individual's income is above $50,000, after filtering the ACS PUMS data sample to only include individuals above the age of 16, who reported usual working hours of at least 1 hour per week in the past year, and an income of at least $100. The threshold of $50,000 was chosen so that this dataset can serve as a comparable replacement to the UCI Adult dataset, but the income threshold can be changed easily to define new prediction tasks.

  • ACSPublicCoverage: predict whether an individual is covered by public health insurance, after filtering the ACS PUMS data sample to only include individuals under the age of 65, and those with an income of less than $30,000. This filtering focuses the prediction problem on low-income individuals who are not eligible for Medicare.

  • ACSMobility: predict whether an individual had the same residential address one year ago, after filtering the ACS PUMS data sample to only include individuals between the ages of 18 and 35. This filtering increases the difficulty of the prediction task, as the base rate of staying at the same address is above 90% for the general population.

  • ACSEmployment: predict whether an individual is employed, after filtering the ACS PUMS data sample to only include individuals between the ages of 16 and 90.

  • ACSTravelTime: predict whether an individual has a commute to work that is longer than 20 minutes, after filtering the ACS PUMS data sample to only include employed individuals above the age of 16.
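Beyond the pre-defined tasks, the package also supports defining new prediction tasks on the raw ACS columns (see the overview above). Conceptually, a task is a row filter plus a feature selection plus a target transform. The sketch below illustrates that recipe with plain pandas on toy rows, not the folktables task API; the values are invented, and the column names AGEP (age), WKHP (usual weekly hours), and PINCP (income) follow ACS PUMS conventions:

```python
import pandas as pd

# Toy rows standing in for ACS PUMS person records (values are made up)
acs = pd.DataFrame({
    "AGEP":  [15, 25, 40, 70],
    "WKHP":  [0, 40, 50, 20],
    "PINCP": [0, 30000, 80000, 60000],
})

# Row filter: adults who worked at least 1 hour per week
subset = acs[(acs["AGEP"] > 16) & (acs["WKHP"] >= 1)]

# Feature selection and target transform: threshold income at $50,000
features = subset[["AGEP", "WKHP"]].to_numpy()
label = (subset["PINCP"] > 50000).to_numpy()

print(label)  # [False  True  True]
```

Changing the filter, the feature list, or the threshold yields a new task, which is exactly the flexibility the ACSIncome description above alludes to with its adjustable income threshold.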
