Transformations and Interactions for Linear Models

The first automated package for data-driven feature transformation, interaction, and selection to develop fast linear models.

Install

pip install tflm

The package is simple and boils down to one command.

import tflm
target = "target"
Train_lin_X, Train_lin_y, Test_lin_X, Test_lin_y = tflm.runner(Train, Test, target)

## If you run into obstacles (e.g. too many features for multivariate regression, low memory),
## adjust the following default parameters lower (between 0.0 and 1.0); for now this is experimental:
## contribution_portion=0.9, final_contribution=0.95, deflator=0.7
## and a few others, also very experimental: inter_portion=0.8, sqr_portion=0.8, runs=2

Now just feed the output into your linear model:

from sklearn import linear_model
lm = linear_model.LinearRegression()
lm = lm.fit(Train_lin_X, Train_lin_y)

For now this only works with regression problems (continuous targets).

Description

This automated model generates and selects feature transformations and interactions. An MLP neural network drives four stand-alone transformations (power, log, reciprocal, and roots), and a gradient boosting model drives two interaction methods (multiplication and division); both use SHapley Additive exPlanations (SHAP) contribution scores as the selection criterion. The benefit is a set of generated features that imitates the behaviour of neural networks and decision trees, which are known to have synergistic ensembling properties. The final selection, based on a validation set, uses the Least-Angle Regression (LARS) algorithm.

The number of features can inflate greatly depending on the quality of the interaction effects and the benefits obtained from transformations. Although eight hyperparameters are made available, for now the parameters are chosen automatically based on data characteristics. In their current development state these parameters are fragile, and on top of that, hyperparameter selection is extremely expensive with this method: each iteration passes through four selection filters. The data can pass through the tflm method multiple times; for now the iterations are internally capped at two. This is an extremely slow algorithm by design: the purpose is to spend a lot of time upfront creating good features so that you can use a fast linear model in the future, as opposed to a slow non-linear model.

In data analysis, transformation is the replacement of a variable by a function of that variable: for example, replacing a variable x by the square root of x or the logarithm of x. In a stronger sense, a transformation is a replacement that changes the shape of a distribution or relationship. Interactions arise when considering the relationship among three or more variables: an interaction describes a situation in which the effect of one variable on an outcome depends on the state of a second variable. Interaction terms can be created in various ways, such as the product of x and y or the ratio of x and y.
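As a minimal illustration (plain pandas, not part of tflm), transformation and interaction terms can be built by hand:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 4.0, 9.0], "y": [2.0, 5.0, 10.0]})

# Transformations: replace a variable by a function of that variable.
df["sqrt_x"] = np.sqrt(df["x"])
df["log_x"] = np.log(df["x"])

# Interaction terms: the effect of x on the outcome may depend on y.
df["x_X_y"] = df["x"] * df["y"]    # product interaction
df["x_DIV_y"] = df["x"] / df["y"]  # ratio interaction
```

tflm automates the generation and selection of columns like these.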

Use Cases

  1. General Automated Feature Generation for Linear Models and Gradient Boosting Models (LightGBM, CatBoost, XGBoost)
  2. Transformation of a Higher-Dimensional Feature Space to a Lower-Dimensional Feature Space
  3. Features Automatically Generated and Selected to Imitate the Performance of Non-linear models
  4. Linear Models are Needed at Times When Latency Becomes an Important Concern

How

  1. MLP Neural Network Identifies the Most Important Features for Interaction and Selection
  2. All Feature Importance and Feature Interaction Values are Based on SHAP (SHapley Additive exPlanations)
  3. The Most Important Single Standing Features are Transformed: POWER_2 (square), LOG (log plus 1), RECIP (reciprocal), SQRT (square root plus 1)
  4. GBM Gradient Boosting Model uses the MLP Identified Important Features to Select a Subset of Important Interaction Pairs
  5. The Most Important Interaction Pairs are Interacted a_X_b (multiplication) c_DIV_h (division)
  6. All Transformations are Fed as Input into an MLP model and Selected to X% (default 90%) Feature Contribution
  7. The Whole Process is Repeated One More Time So That Higher-Dimensional Interaction Can Take Place (imagine a_POWER_b_X_c_DIV_h)
  8. Finally a Lasso Regression Selects Features from a Validation Set Using the LARS algorithm
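The steps above can be sketched roughly as follows. This is a minimal illustration, not the tflm implementation: permutation importance from scikit-learn stands in for the SHAP contribution scores, a single gradient boosting model stands in for the MLP/GBM pair, and only one pass is shown.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Step 1-2 (stand-in): rank features by importance on a fitted model.
X, y = make_regression(n_samples=200, n_features=6, random_state=0)
gbm = GradientBoostingRegressor(random_state=0).fit(X, y)
imp = permutation_importance(gbm, X, y, n_repeats=5, random_state=0)
top = np.argsort(imp.importances_mean)[::-1][:3]  # three most important features

# Step 3: stand-alone transformations of the top features.
transformed = np.column_stack([
    X[:, top] ** 2,                  # POWER_2
    np.log1p(np.abs(X[:, top])),     # LOG (log plus 1)
    1.0 / (X[:, top] + 1e-9),        # RECIP (small offset guards against zeros)
    np.sqrt(np.abs(X[:, top]) + 1),  # SQRT (square root plus 1)
])

# Steps 4-5: interactions among the most important pair.
a, b = X[:, top[0]], X[:, top[1]]
interactions = np.column_stack([a * b, a / (b + 1e-9)])  # a_X_b, a_DIV_b

# Step 6 would select from this augmented matrix by contribution score.
X_aug = np.hstack([X, transformed, interactions])
print(X_aug.shape)  # 6 original + 12 transformed + 2 interaction columns
```

In tflm, a second pass over the augmented matrix (step 7) lets transformed columns interact with interaction columns, producing the higher-dimensional terms described above.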

To Do

  1. Current parameter selection is based on data characteristics; Bayesian hyperparameter optimisation could help.
  2. The AutoKeras team told me they are working on an automated model for tabular regression problems.
  3. Method for undoing interactions and transformations to identify original feature importance.
  4. Develop a method for classification tasks.
  5. Optimisation for users without access to GPUs (for now, you can use the model="LightGBM" parameter).
  6. Make each generation a little less random.

First Example

We have a dataset of more than 500k songs. The task is to predict the year in which each song was released.

Second Example

Download Dataset and Activate Runner

import pandas as pd
import tflm
import sklearn.datasets

dataset = sklearn.datasets.fetch_california_housing()
X = pd.DataFrame(dataset["data"])
X["target"] = dataset["target"]
first = X.sample(int(len(X) / 2))  # random selection leading to different scores
second = X.drop(first.index)       # the remaining half
target = "target"

X_train, y_train, X_test, y_test = tflm.runner(first, second, target)
# X_train, y_train, X_test, y_test = tflm.runner(first, second, target, contribution_portion=0.7, final_contribution=0.80, deflator=0.6)

Modelling and MSE Score


from sklearn import linear_model
from sklearn.metrics import mean_squared_error

lm = linear_model.LinearRegression()
lm = lm.fit(X_train, y_train)
preds = lm.predict(X_test)

mse = mean_squared_error(y_test, preds)
print(mse)
# Score achieved = 0.43

Compare Performance With Untransformed Features


import pandas as pd
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error

def scaler(df):
    x = df.values  # returns a numpy array
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    return pd.DataFrame(x_scaled)

add_first_y = first[target]
add_first = scaler(first.drop([target],axis=1))

add_second_y = second[target]
add_second = scaler(second.drop([target],axis=1)) 

from sklearn import linear_model
# clf = linear_model.Lasso(alpha=0.4)
clf = linear_model.LinearRegression()
preds = clf.fit(add_first, add_first_y).predict(add_second)
mse = mean_squared_error(add_second_y, preds)
print(mse)
# Score achieved = 0.55

That is a performance improvement of more than 20% using exactly the same data!

That does not mean it always performs better than the standard data format: there is a Google Colab example where this method performs poorly because of a lack of data, and another where it works okay.

Reasons

There are many reasons for transformation. In practice, a transformation often works, serendipitously, to do several of these at once, particularly to reduce skewness, to produce nearly equal spreads and to produce a nearly linear or additive relationship. But this is not guaranteed.

  1. Convenience: A transformed scale may be as natural as the original scale and more convenient for a specific purpose (e.g. percentages rather than original data, sines rather than degrees).

  2. Reducing skewness: A transformation may be used to reduce skewness. A distribution that is symmetric or nearly so is often easier to handle and interpret than a skewed distribution. More specifically, a normal or Gaussian distribution is often regarded as ideal, as it is assumed by many statistical methods. To reduce right skewness, take roots, logarithms, or reciprocals (roots are weakest); this is the commonest problem in practice. To reduce left skewness, take squares, cubes, or higher powers.

  3. Equal spreads: A transformation may be used to produce approximately equal spreads, despite marked variations in level, which again makes data easier to handle and interpret. Each data set or subset having about the same spread or variability is a condition called homoscedasticity: its opposite is called heteroscedasticity.

  4. Linear relationships: When looking at relationships between variables, it is often far easier to think about patterns that are approximately linear than about patterns that are highly curved. This is vitally important when using linear regression, which amounts to fitting such patterns to data.

  5. Additive relationships: Relationships are often easier to analyse when additive rather than (say) multiplicative. Additivity is a vital issue in analysis of variance.
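Point 2 above (reducing skewness) can be demonstrated numerically. The sketch below uses plain NumPy with a hand-rolled skewness function (the third standardized moment) on a lognormal sample, which is strongly right-skewed:

```python
import numpy as np

def skewness(a):
    """Sample skewness: the third standardized moment."""
    m, s = a.mean(), a.std()
    return ((a - m) ** 3).mean() / s ** 3

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # strongly right-skewed

# Roots are the weakest remedy; the log of a lognormal sample is normal,
# so its skewness lands near zero.
print(skewness(x), skewness(np.sqrt(x)), skewness(np.log(x)))
```

Running this shows skewness falling as the transformation strengthens: a large positive value for the raw data, a smaller one after the square root, and roughly zero after the log.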

Transformations Implemented

The most useful transformations in introductory data analysis are the reciprocal, logarithm, cube root, square root, and square. In what follows, even when it is not emphasised, it is supposed that transformations are used only over ranges on which they yield (finite) real numbers as results.
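The domain caveat can be made concrete with a small sketch (plain NumPy, not part of tflm): apply each transformation only where it yields a finite real number, and mark the rest as missing.

```python
import numpy as np

x = np.array([-2.0, 0.0, 0.5, 4.0])

# Guard each transformation by its domain; outside it, emit NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    recip = np.where(x != 0, 1.0 / x, np.nan)    # reciprocal: x != 0
    log = np.where(x > 0, np.log(x), np.nan)     # logarithm: x > 0
    cbrt = np.cbrt(x)                            # cube root: defined everywhere
    sqrt = np.where(x >= 0, np.sqrt(x), np.nan)  # square root: x >= 0
    square = x ** 2                              # square: defined everywhere
```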

Reciprocal


The reciprocal, x to 1/x, with its sibling the negative reciprocal, x to -1/x, is a very strong transformation with a drastic effect on distribution shape. It cannot be applied to zero values. Although it can be applied to negative values, it is not useful unless all values are positive. The reciprocal of a ratio may often be interpreted as easily as the ratio itself, e.g.:

  - population density (people per unit area) becomes area per person;
  - persons per doctor becomes doctors per person;
  - rates of erosion become time to erode a unit depth.

(In practice, we might want to multiply or divide the results of taking the reciprocal by some constant, such as 1000 or 10000, to get numbers that are easier to work with.)
No findings