ppscore - a Python implementation of the Predictive Power Score (PPS)
From the makers of bamboolib - a GUI for pandas DataFrames
If you don't know yet what the Predictive Power Score is, please read the following blog post:
RIP correlation. Introducing the Predictive Power Score
The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation (matrix).
Installation
You need Python 3.8 or above.
From the terminal (or Anaconda prompt in Windows), enter:
pip install -U ppscore
Getting started
The examples refer to the newest version (1.2.0) of ppscore. See the changelog for a list of changes.
First, let's create some data:
```python
import pandas as pd
import numpy as np
import ppscore as pps

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = df["x"] * df["x"] + df["error"]
```
Based on the dataframe we can calculate the PPS of x predicting y:
```python
pps.score(df, "x", "y")
```
We can calculate the PPS of all the predictors in the dataframe against a target y:
```python
pps.predictors(df, "y")
```
Here is how we can calculate the PPS matrix between all columns:
```python
pps.matrix(df)
```
Visualization of the results
For the visualization of the results you can use seaborn or your favorite viz library.
Plotting the PPS predictors:
```python
import seaborn as sns

predictors_df = pps.predictors(df, y="y")
sns.barplot(data=predictors_df, x="x", y="ppscore")
```
Plotting the PPS matrix:
(This needs some minor preprocessing because seaborn.heatmap unfortunately does not accept tidy data)
```python
import seaborn as sns

matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)
```
API
```python
ppscore.score(df, x, y, sample=5_000, cross_validation=4, random_seed=123, invalid_score=0, catch_errors=True)
```
Calculate the Predictive Power Score (PPS) for "x predicts y"
- The score always ranges from 0 to 1 and is data-type agnostic.
- A score of 0 means that the column x cannot predict the column y better than a naive baseline model.
- A score of 1 means that the column x can perfectly predict the column y given the model.
- A score between 0 and 1 states the ratio of how much potential predictive power the model achieved compared to the baseline model.
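This "ratio over the baseline" can be sketched as a simple normalization. The helper below is a hypothetical illustration for error metrics where lower is better (e.g. MAE), not the library's internal API:

```python
def normalized_score(model_error, baseline_error):
    """Sketch of a PPS-style normalization for error metrics (lower is better):
    1 means the model makes no error, 0 means it is no better than the
    naive baseline. Hypothetical helper, not ppscore's internal code."""
    if baseline_error == 0:
        return 0.0  # baseline is already perfect; no room for improvement
    return max(0.0, 1.0 - model_error / baseline_error)

# A model that halves the baseline error achieves a score of 0.5
print(normalized_score(model_error=0.5, baseline_error=1.0))  # 0.5
```

Note the clamp to 0: a model that performs worse than the baseline is reported as having no predictive power rather than a negative score.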
Parameters
- df : pandas.DataFrame
- Dataframe that contains the columns x and y
- x : str
- Name of the column x which acts as the feature
- y : str
- Name of the column y which acts as the target
- sample : int or `None`
- Number of rows used for sampling. The sampling decreases the calculation time of the PPS. If `None`, there will be no sampling.
- cross_validation : int
- Number of iterations during cross-validation. For example, if the number is 4, a pattern can only be detected if the same observation occurs at least 4 times. Increasing this number also increases the minimum number of required observations; below that minimum, sklearn will throw an error and the PPS cannot be calculated.
- random_seed : int or `None`
- Random seed for the parts of the calculation that require random numbers, e.g. shuffling or sampling. If the value is set, the results will be reproducible. If the value is `None`, a new random number is drawn at the start of each calculation.
- invalid_score : any
- The score that is returned when a calculation is not valid, e.g. because the data type was not supported.
- catch_errors : bool
- If `True`, all errors will be caught and reported as `unknown_error`, which ensures convenience. If `False`, errors will be raised. This is helpful for inspecting and debugging errors.
Returns
- Dict:
- A dict that contains multiple fields about the resulting PPS. The dict enables introspection into the calculations that have been performed under the hood
```python
ppscore.predictors(df, y, output="df", sorted=True, **kwargs)
```
Calculate the Predictive Power Score (PPS) for all columns in the dataframe against a target (y) column
Parameters
- df : pandas.DataFrame
- The dataframe that contains the data
- y : str
- Name of the column y which acts as the target
- output : str - potential values: "df", "list"
- Control the type of the output. Either return a df or a list with all the PPS score dicts
- sorted : bool
- Whether or not to sort the output dataframe/list by the ppscore
- kwargs :
- Other keyword arguments that are forwarded to pps.score, e.g. sample, cross_validation, random_seed, invalid_score, catch_errors
Returns
- pandas.DataFrame or list of PPS dicts:
- Either returns a df or a list of all the PPS dicts. This can be influenced by the output argument
```python
ppscore.matrix(df, output="df", sorted=False, **kwargs)
```
Calculate the Predictive Power Score (PPS) matrix for all columns in the dataframe
Parameters
- df : pandas.DataFrame
- The dataframe that contains the data
- output : str - potential values: "df", "list"
- Control the type of the output. Either return a df or a list with all the PPS score dicts
- sorted : bool
- Whether or not to sort the output dataframe/list by the ppscore
- kwargs :
- Other keyword arguments that are forwarded to pps.score, e.g. sample, cross_validation, random_seed, invalid_score, catch_errors
Returns
- pandas.DataFrame or list of PPS dicts:
- Either returns a df or a list of all the PPS dicts. This can be influenced by the output argument
Calculation of the PPS
If you are uncertain about some details, feel free to jump into the code to have a look at the exact implementation
There are multiple ways how you can calculate the PPS. The ppscore package provides a sample implementation that is based on the following calculations:
- The score is calculated using only 1 feature trying to predict the target column. This means there are no interaction effects between the scores of various features. Note that this is in contrast to feature importance
- The score is calculated on the test sets of a 4-fold cross-validation (the number is adjustable via `cross_validation`). For classification, stratified KFold is used; for regression, normal KFold. Please note that this sampling might not be valid for time series data sets
- All rows which have a missing value in the feature or the target column are dropped
- In case that the dataset has more than 5,000 rows the score is only calculated on a random subset of 5,000 rows. You can adjust the number of rows or skip this sampling via `sample`. However, in most scenarios the results will be very similar
- There is no grid search for optimal model parameters
- The result might change between calculations because the calculation contains random elements, e.g. the sampling of the rows or the shuffling of the rows before cross-validation. If you want to make sure that your results are reproducible you can set the random seed (`random_seed`)
- If the score cannot be calculated, the package will not raise an error but return an object where `is_valid_score` is `False`. The reported score will be `invalid_score`. We chose this behavior because we want to give you a quick overview of where significant predictive power exists without you having to handle errors or edge cases. However, when you want to explicitly handle the errors, you can still do so.
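The steps above can be sketched end-to-end for a numeric target. This is an illustrative reimplementation under the stated assumptions (MAE as the regression metric, median as the naive baseline), not the package's actual code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def pps_sketch(df, x, y, sample=5_000, cross_validation=4, random_seed=123):
    """Illustrative sketch of the PPS pipeline for a numeric target;
    the real ppscore implementation may differ in details."""
    # 1. Drop rows with a missing value in the feature or the target
    data = df[[x, y]].dropna()
    # 2. Random subsample to keep the calculation fast
    if sample is not None and len(data) > sample:
        data = data.sample(sample, random_state=random_seed)
    # 3. Cross-validate a single Decision Tree on the one feature
    folds = KFold(n_splits=cross_validation, shuffle=True, random_state=random_seed)
    scores = []
    for train_idx, test_idx in folds.split(data):
        train, test = data.iloc[train_idx], data.iloc[test_idx]
        model = DecisionTreeRegressor(random_state=random_seed)
        model.fit(train[[x]], train[y])
        model_mae = mean_absolute_error(test[y], model.predict(test[[x]]))
        # 4. Naive baseline: always predict the median of the training target
        baseline_mae = mean_absolute_error(
            test[y], np.full(len(test), train[y].median())
        )
        # 5. Normalize against the baseline and clamp at 0
        scores.append(0.0 if baseline_mae == 0 else max(0.0, 1 - model_mae / baseline_mae))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(-2, 2, 10_000)})
df["y"] = df["x"] ** 2 + rng.uniform(-0.5, 0.5, 10_000)
print(pps_sketch(df, "x", "y"))  # a high score for this non-linear relationship
```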
Learning algorithm
As a learning algorithm, we currently use a Decision Tree because the Decision Tree has the following properties:
- can detect any non-linear bivariate relationship
- good predictive power in a wide variety of use cases
- low requirements for feature preprocessing
- robust model which can handle outliers and does not easily overfit
- can be used for classification and regression
- can be calculated quicker than many other algorithms
We differentiate the exact implementation based on the data type of the target column:
- If the target column is numeric, we use the sklearn.DecisionTreeRegressor
- If the target column is categoric, we use the sklearn.DecisionTreeClassifier
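This dtype-based choice can be sketched as follows; the selector function is a hypothetical illustration, not the package's internal logic:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def select_model(target: pd.Series):
    """Pick the tree variant based on the target column's dtype
    (illustrative sketch; ppscore's internal dispatch may differ)."""
    if pd.api.types.is_numeric_dtype(target):
        return DecisionTreeRegressor()  # numeric target -> regression
    return DecisionTreeClassifier()     # categoric target -> classification

print(type(select_model(pd.Series([1.0, 2.5, 3.0]))).__name__)  # DecisionTreeRegressor
print(type(select_model(pd.Series(["a", "b", "a"]))).__name__)  # DecisionTreeClassifier
```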
Please note that we prefer a general good performance on a wide variety of use cases over better performance in some narrow use cases. If you have a proposal for a better/different learning algorithm, please open an issue
However, please note why we actively decided against the following algorithms:
- Correlation or Linear Regression: cannot detect non-linear bivariate relationships without extensive preprocessing
- GAMs: might have problems with very unsmooth functions
- SVM: potentially bad performance if the wrong kernel is selected
- Random Forest/Gradient Boosted Tree: slower than a single Decision Tree
- Neural Networks and Deep Learning: slower calculation than a Decision Tree and also needs more feature preprocessing
Data preprocessing
Even though the Decision Tree is a very flexible learning algorithm, we need to perform the following preprocessing steps if a column represents categoric values - that means it has the pandas dtype `object`:
