Pyvtreat
vtreat is a data frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. Distributed under a BSD-3-Clause license.
This is the Python version of the vtreat data preparation system
(also available as an R package).
vtreat is a DataFrame processor/conditioner that prepares
real-world data for supervised machine learning or predictive modeling
in a statistically sound manner.
Installing
Install vtreat with either of:
pip install vtreat
pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.4.6.tar.gz
Video Introduction
Our PyData LA 2019 talk on vtreat is a good video introduction
to what problems vtreat can be used to solve. The slides can be found here.
Details
vtreat takes an input DataFrame
that has a specified column called "the outcome variable" (or "y")
that is the quantity to be predicted (and must not have missing
values). Other input columns are possible explanatory variables
(typically numeric or categorical/string-valued; these columns may
have missing values) that the user later wants to use to predict "y".
In practice such an input DataFrame may not be immediately suitable
for machine learning procedures that often expect only numeric
explanatory variables, and may not tolerate missing values.
To solve this, vtreat builds a transformed DataFrame where all
explanatory variable columns have been transformed into a number of
numeric explanatory variable columns, without missing values. The
vtreat implementation produces derived numeric columns that capture
most of the information relating the explanatory columns to the
specified "y" or dependent/outcome column through a number of numeric
transforms (indicator variables, impact codes, prevalence codes, and
more). This transformed DataFrame is suitable for a wide range of
supervised learning methods, from linear regression through gradient
boosted machines.
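As a rough illustration of what such a transformed frame looks like, here is a minimal pandas sketch. It is not vtreat's implementation: the toy data, the "_NA_" placeholder level, and the derived column names are assumptions chosen for the example. It shows a missing-value indicator with mean replacement, a prevalence code, and indicator columns.

```python
import numpy as np
import pandas as pd

# Toy input: one numeric and one categorical explanatory column,
# both with missing values (hypothetical data).
d = pd.DataFrame({
    "x_num": [1.0, np.nan, 3.0, 5.0],
    "x_cat": ["a", "b", None, "a"],
})

treated = pd.DataFrame(index=d.index)

# Numeric column: record where values were missing,
# then fill the gaps with the observed mean.
treated["x_num_is_bad"] = d["x_num"].isna().astype(float)
treated["x_num"] = d["x_num"].fillna(d["x_num"].mean())

# Categorical column: treat "missing" as its own level.
cat = d["x_cat"].fillna("_NA_")

# Prevalence code: how often each level occurs in this data.
prevalence = cat.value_counts(normalize=True)
treated["x_cat_prevalence_code"] = cat.map(prevalence)

# Indicator columns: one 0/1 column per observed level.
for level in sorted(cat.unique()):
    treated["x_cat_lev_" + level] = (cat == level).astype(float)

print(treated)
```

Every derived column is numeric and free of missing values, which is the property downstream learners typically need.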
The idea is: you can take a DataFrame of messy real-world data and
easily, faithfully, reliably, and repeatably prepare it for machine
learning using vtreat's documented methods. Incorporating
vtreat into your machine learning workflow lets you quickly work
with very diverse structured data.
To get started with vtreat please check out our documentation:
- Getting started using vtreat for classification.
- Getting started using vtreat for regression.
- Getting started using vtreat for multi-category classification.
- Getting started using vtreat for unsupervised tasks.
- The vtreat Score Frame (a table mapping new derived variables to original columns).
- The original vtreat paper; this note describes the methodology and theory. (The article describes the R version, however all of the examples can be found worked in Python here).
Some vtreat common capabilities are documented here:
- Score Frame: score_frame_, using the score_frame_ information.
- Cross Validation: Customized Cross Plans, controlling the cross validation plan.
vtreat is available as a Python/Pandas package, and also as an R package.

(logo: Julie Mount, source: “The Harvest” by Boris Kustodiev 1914)
vtreat is used by instantiating one of the classes
vtreat.NumericOutcomeTreatment, vtreat.BinomialOutcomeTreatment, vtreat.MultinomialOutcomeTreatment, or vtreat.UnsupervisedTreatment.
Each of these implements the sklearn.pipeline.Pipeline interfaces
expecting a Pandas DataFrame as input. The vtreat steps are intended to
be a "one step fix" that works well with sklearn.preprocessing stages.
The vtreat Pipeline.fit_transform()
method implements the powerful cross-frame ideas (allowing the same data to be used for vtreat fitting and for later model construction, while
mitigating nested model bias issues).
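The cross-frame idea can be sketched in a few lines of pandas. This is a simplified illustration under stated assumptions, not vtreat's code: the helper names impact_code and cross_frame_impact_code, the modulo fold assignment, and the toy data are all hypothetical. Each row is coded by a mapping fit only on the other folds, so a variable cannot memorize the outcome of the row it is coding.

```python
import numpy as np
import pandas as pd


def impact_code(train_x, train_y, apply_x):
    """Code each level as (mean of y for that level) - (overall mean of y).

    Levels not seen in training are coded as 0.0 (no evidence either way).
    """
    grand_mean = train_y.mean()
    level_effect = train_y.groupby(train_x).mean() - grand_mean
    return apply_x.map(level_effect).fillna(0.0)


def cross_frame_impact_code(x, y, n_folds=3):
    """Out-of-fold impact code: each row is coded by a map fit on other folds."""
    fold = np.arange(len(x)) % n_folds  # simple deterministic fold assignment
    coded = pd.Series(0.0, index=x.index)
    for k in range(n_folds):
        held_out = fold == k
        coded.loc[held_out] = impact_code(x[~held_out], y[~held_out], x[held_out])
    return coded


# A pure-noise identifier column: every level is unique to its row.
x = pd.Series(["id_" + str(i) for i in range(6)])
y = pd.Series([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])

naive = impact_code(x, y, x)            # fit and applied on the same rows
cross = cross_frame_impact_code(x, y)   # fit and applied on disjoint rows

print(pd.DataFrame({"y": y, "naive": naive, "cross": cross}))
```

In this toy case the naive code simply reproduces y (shifted by its mean), which a downstream model would mistake for signal, while the out-of-fold code carries no information about y, as it should for a noise column.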
Background
Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, and so on) or standard statistical methods (regression, generalized regression, generalized additive models) there are common data issues that can cause modeling to fail. vtreat deals with a number of these in a principled and automated fashion.
In particular vtreat emphasizes a concept called “y-aware
pre-processing” and implements:
- Treatment of missing values through safe replacement plus an indicator column (a simple but very powerful method when combined with downstream machine learning algorithms).
- Treatment of novel levels (new values of a categorical variable seen during test or application, but not seen during training) through sub-models (or impact/effects coding of pooled rare events).
- Explicit coding of categorical variable levels as new indicator variables (with optional suppression of non-significant indicators).
- Treatment of categorical variables with very large numbers of levels through sub-models (again impact/effects coding).
- Correct treatment of nested models or sub-models through data split / cross-frame methods (please see here) or through the generation of “cross validated” data frames (see here); these are issues similar to what is required to build statistically efficient stacked models or super-learners.
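To make the novel-level and indicator-suppression items above concrete, here is a small sketch. It is a deliberate simplification, not vtreat's method: vtreat handles novel and rare levels through sub-models, whereas this example just fixes the set of indicator columns at fit time and drops rare levels by frequency instead of by significance. The class name IndicatorCoder, the min_fraction parameter, and the "_NA_" placeholder are hypothetical.

```python
import pandas as pd


class IndicatorCoder:
    """Hypothetical helper: remember training levels, emit one 0/1 column each."""

    def __init__(self, min_fraction=0.0):
        self.min_fraction = min_fraction
        self.levels_ = []

    def fit(self, x):
        x = x.fillna("_NA_")
        freq = x.value_counts(normalize=True)
        # Keep only levels common enough to deserve their own column.
        self.levels_ = sorted(freq[freq >= self.min_fraction].index)
        return self

    def transform(self, x):
        x = x.fillna("_NA_")
        # Column set is fixed by fit(), so new data never changes the schema.
        return pd.DataFrame(
            {"x_lev_" + lev: (x == lev).astype(float) for lev in self.levels_},
            index=x.index,
        )


train = pd.Series(["a", "a", "b", None, "a", "b", "c", "a"])
coder = IndicatorCoder(min_fraction=0.2).fit(train)

# "zebra" never appeared in training; None was too rare to keep.
new_data = pd.Series(["a", "zebra", None])
coded = coder.transform(new_data)
print(coded)
```

Because the columns are decided during fitting, a level first seen at application time produces an all-zero row rather than a new column or an error.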
The idea is: even with a sophisticated machine learning algorithm there are many ways messy real-world data can defeat the modeling process, and vtreat helps with at least ten of them. We emphasize: these problems are already in your data; you simply build better and more reliable models if you attempt to mitigate them. Automated processing is no substitute for actually looking at the data, but vtreat supplies efficient, reliable, documented, and tested implementations of many of the commonly needed transforms.
To help explain the methods we have prepared some documentation:
- The vtreat package overall.
- Preparing data for analysis using R white-paper
- The types of new variables introduced by vtreat processing (including how to limit down to domain appropriate variable types).
- Statistically sound treatment of the nested modeling issue introduced by any sort of pre-processing (such as vtreat itself): nested over-fit issues and a general cross-frame solution.
- Principled ways to pick significance based pruning levels.
Example
This is a supervised classification example taken from the KDD 2009 cup. A copy of the data and details can be found here: https://github.com/WinVector/PDSwR2/tree/master/KDD2009. The problem was to predict account cancellation ("churn") from very messy data (column names not given, numeric and categorical variables, many missing values, some categorical variables with a large number of possible levels). In this example we show how to quickly use vtreat to prepare the data for modeling. vtreat takes in Pandas DataFrames and returns both a treatment plan and a clean Pandas DataFrame ready for modeling.
To install:
!pip install vtreat
!pip install wvpy
Load our packages/modules.
import pandas
import xgboost
import vtreat
import vtreat.cross_plan
import numpy.random
import wvpy.util
import scipy.sparse
Read in explanatory variables.
# data from https://github.com/WinVector/PDSwR2/tree/master/KDD2009
dir = "../../../PracticalDataScienceWithR2nd/PDSwR2/KDD2009/"
d = pandas.read_csv(dir + 'orange_small_train.data.gz', sep='\t', header=0)
vars = [c for c in d.columns]
d.shape
(50000, 230)
Read in dependent variable we are trying to predict.
churn = pandas.read_csv(dir + 'orange_small_train_churn.labels.txt', header=None)
churn.columns = ["churn"]
churn.shape
(50000, 1)
churn["churn"].value_counts()
-1 46328
1 3672
Name: churn, dtype: int64
Arrange test/train split.
numpy.random.seed(855885)
n = d.shape[0]
# https://github
