
AdditiveDecisionTree

A variation on a standard Decision Tree such as that in sklearn, where nodes may be based on an aggregation of multiple splits.

Install / Use

/learn @Brett-Kennedy/AdditiveDecisionTree

README

AdditiveDecisionTree

Summary

This tool provides an implementation of a decision tree, similar to a standard decision tree such as in sklearn, but using an additive approach to fit the data. Both AdditiveDecisionTreeClassifier and AdditiveDecisionTreeRegressor classes are provided.

This tool provides, on the whole, accuracy comparable to standard decision trees, but in many cases greater accuracy and/or improved interpretability. As such, it can be a useful tool for generating interpretable models and may be considered a useful XAI tool. It is not intended to compete with approaches such as boosting or neural networks in terms of accuracy; it is simply a tool to generate interpretable models. It can often produce models comparable in accuracy to deeper standard decision trees, while having lower overall complexity.

For an overview, see Medium article: https://medium.com/towards-data-science/additive-decision-trees-85f2feda2223

Limitations of Decision Trees

This tool addresses some well-known limitations of decision trees, in particular their limited stability, their need to split based on fewer and fewer samples lower in the tree, repeated sub-trees, and their tendency to overfit if not restricted or pruned. These limitations are typically addressed by ensembling decision trees, either through bagging or boosting, which results in highly uninterpretable, though generally more accurate, models. Constructing oblivious trees (as is done, for example, within CatBoost) and oblique decision trees (see: RotationFeatures) are other approaches that mitigate some of these limitations, and they often produce more stable trees.

Decision trees are considered to be among the more interpretable models, but only where it is possible to construct them to a shallow depth, perhaps 4 or 5 levels at most. However, decision trees often need to be fairly deep to achieve higher levels of accuracy, which can greatly undermine their interpretability. As decision trees are likely the most, or among the most, commonly used models where interpretability is required, our comparisons, both in terms of accuracy and interpretability, are made with respect to standard decision trees.

Intuition Behind Additive Decision Trees

The intuition behind AdditiveDecisionTrees is that often the true function f(x), mapping the input x to the target y, is based on logical conditions; in other cases it is simply a probabilistic function where each input feature may be considered somewhat independently (as with the Naive Bayes assumption). For example, the true f(x) may include something to the effect of:

If A > 10 Then: y = class Y 
Else if B < 19 Then: y = class X
Else if C * D > 44 Then: y = class Y
Else y = class Z. 
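A rule list of this kind can be written directly as code. The sketch below simply transcribes the four conditions above (the feature names A through D and the thresholds are from the example, not from the library):

```python
def true_f(a, b, c, d):
    """Hypothetical true function f(x) composed of logical conditions,
    mirroring the rule list above. Rules are checked in order, so each
    later rule fires only when the earlier ones do not."""
    if a > 10:
        return "Y"
    elif b < 19:
        return "X"
    elif c * d > 44:
        return "Y"
    else:
        return "Z"
```

Note that the second rule, nominally about B alone, implicitly assumes A <= 10, which is the sense in which every rule in such a list encodes an interaction.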

In this case, the true f(x) is composed of logical conditions and may be accurately (and simply) represented as a series of rules, such as in a Decision Tree, Rule List, or Rule Set. Note that conditions may be viewed as interactions, where how one feature predicts the target depends on the value of another feature. Here, one rule is based explicitly on the interaction C * D, but all rules entail interactions, as they may fire only if previous rules do not, and therefore the relationships between the features used in these rules are affected by other features.

On the other hand, the true f(x) may be a set of patterns related to probabilities, more of the form:

The higher A is, the more likely y is to be class X, regardless of B, C and D
The higher B is, up to 100.0, the more likely y is class Y, regardless of A, C and D 
The higher B is, where B is 100.0 or more, the more likely y is to be class Z, regardless of A, C and D
The higher C * D is, the more likely y is class X, regardless of A and B.

Some of these patterns may involve two or more features, and some a single feature. In this form of function, for each instance, the feature values (or combinations of feature values) each contribute some probability to the target value (to each class in the case of classification), and these probabilities are summed to determine the overall probability distribution. Here feature interactions may exist within the probabilistic patterns, as in the case of C * D.
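A probabilistic f(x) of this form can be sketched as a sum of independent per-feature contributions to class scores, normalized at the end. The specific weights below are illustrative only, chosen to match the direction of each pattern in the list above:

```python
import math

def additive_probabilities(a, b, c, d):
    """Sum independent per-feature contributions to each class's score,
    then normalize with a softmax to get a probability distribution."""
    scores = {"X": 0.0, "Y": 0.0, "Z": 0.0}
    scores["X"] += 0.05 * a                 # higher A -> more likely X
    if b < 100.0:
        scores["Y"] += 0.02 * b             # higher B (below 100) -> more likely Y
    else:
        scores["Z"] += 0.02 * (b - 100.0)   # B at/above 100 -> more likely Z
    scores["X"] += 0.001 * (c * d)          # higher C * D -> more likely X
    total = sum(math.exp(s) for s in scores.values())
    return {cls: math.exp(s) / total for cls, s in scores.items()}
```

Each feature (or the interaction C * D) contributes to the scores without reference to the others; only the final normalization ties them together.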

While there are other ways to categorize functions, this system is quite useful, and many true functions may be viewed as some combination of these two broad classes of function.

Standard decision trees do not explicitly assume the true function is conditional, and can accurately (often through the use of very large trees) capture non-conditional relationships such as those based on probabilities. They do, however, model functions as conditions, which can limit their expressive power and lower their interpretability.

Though f(x) is ultimately a set of rules for any classification problem, the rules may be largely independent of each other, each simply a probability distribution based on one or a small number of features.

AdditiveDecisionTrees remove the assumption in standard decision trees that f(x) is best modeled as a set of conditions, but do support conditions where the data suggests they exist. The central idea is that the true f(x) may be based on logical conditions, probabilities, or some combination of these.

The case where f(x) is based on the features independently (each feature's relationship to the target is not based on any other feature) may be modelled better by linear or logistic regression, Naive Bayes models, or GAM (Generalized Additive Model), among other models, which simply predict based on a weighted sum of each independent feature. That is, each relevant feature contributes to the final prediction without consideration of the other features (though interaction features may be created). f(x), in this case, is simply a probability distribution associated with each input feature. In these cases, linear regression, logistic regression, Naive Bayes, and GAMs can be quite interpretable and may be suitable choices for XAI.
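For instance, a logistic regression prediction is just a weighted sum of the features passed through a sigmoid, so no feature's contribution depends on any other feature. A minimal sketch (the weights are illustrative, not learned):

```python
import math

def logistic_predict(features, weights, bias=0.0):
    """Each feature contributes weight * value independently; the
    summed score is squashed to a probability with a sigmoid."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# A weighted sum of 0.0 maps to a probability of exactly 0.5
p = logistic_predict([1.0, 2.0], [0.5, -0.25])
```

The explanation for any prediction is then simply the list of per-feature terms w * x, which is what makes these models natural XAI choices when no strong conditions exist.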

Conversely, linear and logistic regression do not capture well the cases where there are strong conditions in the function f(x), while decision trees can, at least potentially, model these quite closely. It is usually not known a priori whether the true f(x) contains strong conditions and, as such, whether it is desirable to model the function as a decision tree does: repeatedly splitting the data into subsets and developing a different prediction for each leaf node based entirely on the datapoints within it.

Splitting Policy

We describe here how Additive Decision Trees are constructed, and particularly their splitting policy. Note, the process is simpler to present for classification problems, and so most examples relate to classification, but the ideas apply equally to regression.

The approach taken by AdditiveDecisionTrees is to split the dataspace where appropriate, and to make an aggregate decision based on numerous potential splits (all standard axis-parallel splits over different input features) where this appears most appropriate. This is done such that splits appear higher in the tree, where there are larger numbers of samples on which to base them and they may be found more reliably, while lower in the tree, where there are fewer samples to rely on, decisions are based on a collection of splits, each using the full set of samples in that subset.

This provides straightforward explanations for each row (known as local explanations) and for the model as a whole (known as global explanations). Though the final trees may be somewhat more complex than a standard decision tree of equal depth, as some nodes may be based on multiple splits, Additive Decision Trees are more accurate than standard decision trees of equal depth, and simpler than standard decision trees of equal accuracy. The explanations for individual rows may be presented simply through the corresponding decision paths, as with standard decision trees, though the final nodes may be based on averaging over multiple splits. The maximum number of splits aggregated together is configurable, but 4 or 5 is typically sufficient. In most cases, as well, all splits agree, and only one needs to be presented to the user. In fact, even where the splits disagree, the majority prediction may be presented as a single split. Therefore, the explanations are usually similar to those for standard decision trees, but with shorter decision paths.

This, then, produces a model where there are a small number of splits, ideally representing the true conditions, if any, in the model, followed by additive nodes, which are leaf nodes that average the predictions of multiple splits, providing more robust predictions. This reduces the need to split the data into progressively smaller subsets, each with less statistical significance.
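An additive node of this kind can be sketched as a vote over several axis-parallel splits, each evaluated on the full row rather than on a progressively smaller subset. The splits below are hypothetical, not taken from the library:

```python
def additive_node_predict(row, splits):
    """Each split is (feature_index, threshold, class_if_over, class_if_under).
    Every split votes on the same row; the majority class is returned."""
    votes = {}
    for feat, thresh, over_cls, under_cls in splits:
        cls = over_cls if row[feat] > thresh else under_cls
        votes[cls] = votes.get(cls, 0) + 1
    return max(votes, key=votes.get)

# Three hypothetical axis-parallel splits aggregated in one leaf node
splits = [
    (0, 10.0, "Y", "X"),   # feature 0 > 10 votes Y, else X
    (1, 19.0, "X", "Y"),   # feature 1 > 19 votes X, else Y
    (2, 44.0, "Y", "X"),   # feature 2 > 44 votes Y, else X
]
```

When all three splits agree, only one needs to be shown to the user; when they disagree, the majority prediction can still be summarized as a single split, which is why the resulting explanations stay short.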

AdditiveDecisionTrees, therefore, provide a simple form of ensembling, but one that still allows a single, interpretable model, easily supporting both global and local explanations. As it still follows a simple tree structure, contrapositive explanations may be easily generated as well.

Installation

The source code is provided in a single .py file which may be included in any project. It uses no non-standard libraries.

Local Interpretability

In standard decision trees, local explanations (explanations of the prediction for a single instance) are presented as the path from the root to the leaf node where the instance ends, with each split point on the path leading to this final decision. However, this can be misleading and confusing, as very often multiple rules would be just as, or more, appropriate. As standard decision trees split each node based on a single feature, they select the split with the greatest information gain; the fact that other splits, using other features, may have been nearly as informative is not reflected in the explanation.
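A decision-path explanation of this kind can be sketched by walking a small tree and recording each condition along the way. The tree structure below (nested dicts) is a toy stand-in, not the library's internal representation:

```python
def explain_prediction(tree, row):
    """Walk a simple tree (nested dicts) from root to leaf, recording
    each split condition on the path as a local explanation."""
    path = []
    node = tree
    while "leaf" not in node:
        feat, thresh = node["feature"], node["threshold"]
        if row[feat] <= thresh:
            path.append(f"{feat} <= {thresh}")
            node = node["left"]
        else:
            path.append(f"{feat} > {thresh}")
            node = node["right"]
    return node["leaf"], path

# Toy tree matching the earlier rule-list example
tree = {
    "feature": "A", "threshold": 10.0,
    "left":  {"feature": "B", "threshold": 19.0,
              "left":  {"leaf": "X"},
              "right": {"leaf": "Z"}},
    "right": {"leaf": "Y"},
}
```

The path returned for a row is its local explanation; the limitation discussed above is that this path shows only the splits actually chosen, not the near-equivalent alternatives.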
