Zest Race Predictor
Zest Race Predictor (ZRP) is an open-source machine learning algorithm that estimates the race/ethnicity of an individual using only their full name and home address as inputs. ZRP improves upon the most widely used racial and ethnic data estimation method, Bayesian Improved Surname Geocoding (BISG), developed by RAND Corporation in 2009.
ZRP was built using ML techniques such as gradient boosting and trained on voter data from the southeastern U.S. It was then validated on a national sample using adjusted tract-level American Community Survey (ACS) data. (Model training procedures are provided.)
Compared to BISG, ZRP:
- correctly identified 25% more African-Americans as African-American
- misidentified 35% fewer African-Americans as non-African-American
- misidentified 60% fewer Whites as non-White
ZRP can be used to analyze racial equity and outcomes in critical spheres such as health care, financial services, criminal justice, or anywhere there's a need to impute the race or ethnicity of a population dataset. (Usage examples are included.) The financial services industry, for example, has struggled for years to achieve more equitable outcomes amid charges of discrimination in lending practices.
Zest AI began developing ZRP in 2020 to improve the accuracy of our clients' fair lending analyses by using more data and better math. We believe ZRP can greatly improve our understanding of the disparate impact and disparate treatment of protected-status borrowers. Armed with a better understanding of the disparities that exist in our financial system, we can highlight inequities and create a roadmap to improve equity in access to finance.
Notes
This is a preliminary version and implementation of the ZRP tool. We are committed to improving both the algorithm and the documentation, and we hope that government agencies, lenders, citizen data scientists, and other interested parties will help us improve the model. Details of the model development process can be found in the model development documentation.
Install
Installation requires an internet connection. The package has been tested on Python 3.7.7 and should likely work with other 3.7.x releases.
Note: Due to the size and number of lookup tables necessary for the zrp package, a full installation requires 3 GB of available space.
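Before installing, it can help to confirm that the interpreter and disk meet the requirements above. The following is a minimal sketch, not part of the zrp package: the helper names `is_tested_python` and `has_enough_space` are hypothetical, and the 3 GB figure is interpreted here as 3 * 1024**3 bytes, which is an assumption.

```python
import shutil
import sys

# Assumption: "3 GB" is read as 3 GiB. Adjust if you prefer 3 * 10**9.
REQUIRED_BYTES = 3 * 1024**3


def is_tested_python(version_info=sys.version_info):
    """Return True if the interpreter belongs to the Python 3.7 series."""
    return tuple(version_info[:2]) == (3, 7)


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    print("Python 3.7.x interpreter:", is_tested_python())
    print("At least 3 GB free here:", has_enough_space())
```

Run the script from the directory where the virtual environment will live, since free space is measured on the filesystem that contains the given path.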
Setting up your virtual environment
We recommend installing zrp inside a Python virtual environment.
Run the following to build your virtual environment:
python3 -m venv /path/to/new/virtual/environment
Activate your virtual environment:
source /path/to/new/virtual/environment/bin/activate
Example:
python -m venv /Users/joejones/Documents/ZestAI/zrpvenv
source /Users/joejones/Documents/ZestAI/zrpvenv/bin/activate
General Installation
pip install zrp
After installing via pip, you need to download the lookup tables and pipelines using the following command:
python -m zrp download
If you're experiencing issues with installation, please consult our troubleshooting help page.
Advanced Installation
Required only if you are processing the data from scratch instead of using the existing ZRP data.
Unix-like systems
pip install fiona
pip install zrp
After installing via pip, you need to download the lookup tables and pipelines using the following command:
python -m zrp download
Windows
pip install pipwin
pipwin install gdal
pipwin install fiona
pip install zrp
After installing via pip, you need to download the lookup tables and pipelines using the following command:
python -m zrp download
If you're experiencing issues with installation, please consult our troubleshooting help page.
Data
Training Data
The models available in this package were trained on voter registration data from the states of Florida, Georgia, and North Carolina. Summary statistics on these datasets and additional datasets used as validation can be found here.
Consult the following to download state voter registration data:
American Community Survey (ACS) Data:
The US Census Bureau states that "the American Community Survey (ACS) is an ongoing survey that provides data every year -- giving communities the current information they need to plan investments and services. The ACS covers a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population. The 5-year estimates from the ACS are 'period' estimates that represent data collected over a period of time. The primary advantage of using multiyear estimates is the increased statistical reliability of the data for less populated areas and small population subgroups. The 5-year estimates are available for all geographies down to the block group level." (US Census Bureau. "American Community Survey 5-Year Data (2009-2019)." Census.gov, 8 Dec. 2021, https://www.census.gov/data/developers/data-sets/acs-5year.html.)
ACS data is available in 1-year or 5-year spans. The 5-year ACS data is the most comprehensive and is available at more granular geographic levels than the 1-year data, so it is the data used in this work.
Model Development and Feature Documentation
Details of the model development process can be found in the model development documentation. Human-readable feature definitions and feature importances can be found here.
Usage and Examples
To get started with ZRP, first ensure that the download is complete (as described above) and that xgboost == 1.0.2 is installed.
Check out the guides in the examples folder. Clone the repo to obtain the example notebooks and data; these are not provided in the pip-installable package. If you're experiencing issues, first consult our troubleshooting help guide.
We also provide an interactive virtual environment, via Binder, with ZRP installed. Once you open this link and are taken to the JupyterLab environment, open a terminal and run the following:
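One way to confirm the xgboost requirement is a short check like the one below. This is a sketch under assumptions: the helper names `installed_xgboost_version` and `matches_pin` are hypothetical and not part of zrp, and the check relies only on the `__version__` attribute that the xgboost Python package exposes.

```python
EXPECTED_XGBOOST = "1.0.2"  # version named in this README


def installed_xgboost_version():
    """Return the installed xgboost version string, or None if unusable."""
    try:
        import xgboost
    except Exception:
        # Any import-time failure (missing package, missing native library)
        # is treated as "not installed" for the purpose of this check.
        return None
    return xgboost.__version__


def matches_pin(version, expected=EXPECTED_XGBOOST):
    """Return True only when `version` equals the expected version exactly."""
    return version == expected


if __name__ == "__main__":
    found = installed_xgboost_version()
    if matches_pin(found):
        print("xgboost", found, "matches the expected version")
    else:
        print("expected xgboost", EXPECTED_XGBOOST, "but found", found)
```

If the versions differ, reinstalling with `pip install xgboost==1.0.2` inside the virtual environment is the usual fix.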
python -m zrp download
Next, we present the primary ways you'll use ZRP.
ZRP Predictions
Summary of commands:
>>> from zrp import ZRP
>>> zest_race_predictor = ZRP()
>>> zest_race_predictor.fit()
>>> zrp_output = zest_race_predictor.transform(input_dataframe)
Breaking down key commands:
>>> zest_race_predictor = ZRP()
- ZRP(pipe_path=None, support_files_path="data/processed", key="ZEST_KEY", first_name="first_name", middle_name="middle_name", last_name="last_name", house_number="house_number", street_address="street_address", city="city", state="state", zip_code="zip_code", race='race', proxy="probs", census_tract=None, street_address_2=None, name_prefix=None, name_suffix=None, na_values=None, file_path=None, geocode=True, bisg=True, readout=True, n_jobs=49, year="2019", span="5", runname="test")
- What it does:
  - Prepares data to generate race & ethnicity proxies
You can find parameter descriptions in the ZRP class and its parent class.
>>> zrp_output = zest_race_predictor.transform(input_dataframe)
- zest_race_predictor.transform(df)
- What it does:
  - Processes input data and generates ZRP proxy predictions.
  - Attempts to predict at the block group level, then the census tract level, then the zip code level, depending on the most granular level for which ACS data is found. If no geographic-level data can be obtained, the BISG proxy is computed instead. If BISG cannot be computed either, no prediction is returned.
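The fallback order described for transform can be pictured with a small standalone sketch. This is not ZRP's internal code: the function `choose_proxy_source`, the level labels, and the boolean `bisg_available` flag are all hypothetical names used only to illustrate the order block group, census tract, zip code, then BISG.

```python
# Hypothetical labels, ordered from most to least granular.
GEO_LEVELS = ("block_group", "census_tract", "zip_code")


def choose_proxy_source(acs_levels_found, bisg_available):
    """Pick the most granular level with ACS data, else BISG, else None."""
    for level in GEO_LEVELS:
        if level in acs_levels_found:
            return level
    # No geographic-level ACS data: fall back to BISG when it can be computed.
    return "bisg" if bisg_available else None


print(choose_proxy_source({"census_tract", "zip_code"}, True))  # census_tract
print(choose_proxy_source(set(), True))                         # bisg
print(choose_proxy_source(set(), False))                        # None
```

A return value of None in this sketch corresponds to the case where no prediction is returned for a record.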
Parameters
df : {DataFrame} Pandas dataframe containing input data (see below for necessary columns)
Input data, df, into the prediction/modeling pipeline MUST contain the following columns: first name, middle name, last name, house number, street address (street name), city, state, zip code, and zest key. Consult ou
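A quick pre-check for the required columns might look like the sketch below. It assumes the input uses the default column names shown in the ZRP signature above (for example `ZEST_KEY` for the key column); the helper `missing_columns` is hypothetical and not part of zrp.

```python
# Assumption: default column names taken from the ZRP(...) signature.
REQUIRED_COLUMNS = (
    "first_name",
    "middle_name",
    "last_name",
    "house_number",
    "street_address",
    "city",
    "state",
    "zip_code",
    "ZEST_KEY",
)


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Example with a plain list of column names.
print(missing_columns(["first_name", "last_name", "ZEST_KEY"]))
```

With a pandas dataframe, the same check can be run as `missing_columns(input_dataframe.columns)` before calling transform. If you passed custom column names to ZRP, supply them through the `required` argument instead.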
