Isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)

Generate Convert Improve

Install / Use

/learn @david-cortes/Isotree

About this skill

Quality Score

0/100

README

IsoTree

Fast and multi-threaded implementation of Isolation Forest (a.k.a. iForest) and variations of it such as Extended Isolation Forest (EIF), Split-Criterion iForest (SCiForest), Fair-Cut Forest (FCF), Robust Random-Cut Forest (RRCF), and other customizable variants, aimed at outlier/anomaly detection plus additions for imputation of missing values, distance/similarity calculation between observations, and handling of categorical data. Written in C++ with interfaces for Python, R, and C. An additional wrapper for Ruby can be found here.

The new concepts in this software are described in:

For a quick introduction to the Isolation Forest concept as used in this library, see:

Short Python example notebooks:

(R examples are available in the internal documentation)

Description

Isolation Forest is an algorithm originally developed for outlier detection that consists in splitting sub-samples of the data according to some attribute/feature/column at random. The idea is that, the rarer the observation, the more likely it is that a random uniform split on some feature would put outliers alone in one branch, and the fewer splits it will take to isolate an outlier observation like this. The concept is extended to splitting hyperplanes in the extended model (i.e. splitting by more than one column at a time), and to guided (not entirely random) splits in the SCiForest model that aim at isolating outliers faster and finding clustered outliers.

Note that this is a black-box model that will not produce explanations or importances - for a different take on explainable outlier detection see OutlierTree.

(Code to produce these plots can be found in the R examples in the documentation)

Comparison against other libraries

The folder timings contains a speed comparison against other Isolation Forest implementations in Python (SciKit-Learn, EIF) and R (IsolationForest, isofor, solitude). From the benchmarks, IsoTree tends to be at least 1 order of magnitude faster than the libraries compared against in both single-threaded and multi-threaded mode.

Example timings for 100 trees and different sample sizes, CovType dataset - see the link above for full benchmark and details:

| Library | Model | Time (s) 256 | Time (s) 1024 | Time (s) 10k | | :---: | :---: | :---: | :---: | :---: | | isotree | orig | 0.00161 | 0.00631 | 0.0848 | | isotree | ext | 0.00326 | 0.0123 | 0.168 | | eif | orig | 0.149 | 0.398 | 4.99 | | eif | ext | 0.16 | 0.428 | 5.06 | | h2o | orig | 9.33 | 11.21 | 14.23 | | h2o | ext | 1.06 | 2.07 | 17.31 | | scikit-learn | orig | 8.3 | 8.01 | 6.89 | | solitude | orig | 32.612 | 34.01 | 41.01 |

Example AUC as outlier detector in typical datasets (notebook to produce results here):

Satellite dataset:

| Library | AUROC defaults | AUROC grid search | | :---: | :---: | :---: | | isotree | 0.70 | 0.84 | | eif | - | 0.714 | | scikit-learn | 0.687 | 0.74 | | h2o | 0.662 | 0.748 |

Annthyroid dataset:

| Library | AUROC defaults | AUROC grid search | | :---: | :---: | :---: | | isotree | 0.80 | 0.982 | | eif | - | 0.808 | | scikit-learn | 0.836 | 0.836 | | h2o | 0.80 | 0.80 |

(Disclaimer: these are rather small datasets and thus these AUC estimates have high variance)

Non-random splits

While the original idea behind isolation forests consisted in deciding splits uniformly at random, it's possible to get better performance at detecting outliers in some datasets (particularly those with multimodal distributions) by determining splits according to an information gain criterion instead. The idea is described in "Revisiting randomized choices in isolation forests" along with some comparisons of different split guiding criteria.

Different outlier scoring criteria

Although the intuition behind the algorithm was to look at the tree depth required for isolation, this package can also produce outlier scores based on density criteria, which provide improved results in some datasets, particularly when splitting on categorical features. The idea is described in "Isolation forests: looking beyond tree depth".

Distance / similarity calculations

General idea was extended to produce distance (alternatively, similarity) between observations according to how many random splits it takes to separate them - idea is described in "Distance approximation using Isolation Forests".

Imputation of missing values

The model can also be used to impute missing values in a similar fashion as kNN, by taking the values from observations in the terminal nodes of each tree in which an observation with missing values falls at prediction time, combining the non-missing values of the other observations as a weighted average according to the depth of the node and the number of observations that fall there. This is not related to how the model handles missing values internally, but is rather meant as a faster way of imputing by similarity. Quality is usually not as good as chained equations, but the method is a lot faster and more scalable. Recommended to use non-random splits when used as an imputer. Details are described in "Imputing missing values with unsupervised random trees".

Highlights

There's already many available implementations of isolation forests for both Python and R (such as the one from the original paper's authors' or the one in SciKit-Learn), but at the time of writing, all of them are lacking some important functionality and/or offer sub-optimal speed. This particular implementation offers the following:

Implements the extended model (with splitting hyperplanes) and split-criterion model (with non-random splits).
Can handle missing values (but performance with them is not so good).
Can handle categorical variables (one-hot/dummy encoding does not produce the same result).
Can use a mixture of random and non-random splits, and can split by weighted/pooled gain (in addition to simple average).
Can produce approximated pairwise distances between observations according to how many steps it takes on average to separate them down the tree.
Can calculate isolation kernels or proximity matrix, which counts the proportion of trees in which two given observations end up in the same terminal node.
Can produce missing value imputations according to observations that fall on each terminal node.
Can work with sparse matrices.
Can use either depth-based metrics or density-based metrics for calculation of outlier scores.
Supports sample/observation weights, either as sampling importance or as distribution density measurement.
Supports user-provided column sample weights.
Can sample columns randomly with weights given by kurtosis.
Uses exact formula (not approximation as others do) for harmonic numbers at lower sample and remainder sizes, and a higher-order approximation for larger sizes.
Can fit trees incrementally to user-provided data samples.
Produces serializable model objects with reasonable file sizes.
Can convert the models to treelite format (Python-only and depending on the parameters that are used) (example here).
Can translate the generated trees into SQL statements.
Fast and multi-threaded C++ code with an ISO C interface, which is architecture-agnostic, multi-platform, and with the only external dependency (Robin-Map) being optional. Can be wrapped in languages other than Python/R/Ruby.

(Note that categoricals, NAs, and density-like sample weights, are

Related Skills

node-connect

339.5k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

83.9k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

339.5k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

commit-push-pr

83.9k

Commit, push, and open a PR