# IsoTree
Fast and multi-threaded implementation of Isolation Forest (a.k.a. iForest) and variations of it such as Extended Isolation Forest (EIF), Split-Criterion iForest (SCiForest), Fair-Cut Forest (FCF), Robust Random-Cut Forest (RRCF), and other customizable variants, aimed at outlier/anomaly detection plus additions for imputation of missing values, distance/similarity calculation between observations, and handling of categorical data. Written in C++ with interfaces for Python, R, and C. An additional wrapper for Ruby can be found here.
The new concepts in this software are described in:
- Revisiting randomized choices in isolation forests
- Isolation forests: looking beyond tree depth
- Distance approximation using Isolation Forests
- Imputing missing values with unsupervised random trees
For a quick introduction to the Isolation Forest concept as used in this library, see:
Short Python example notebooks:
- General library usage.
- Using it as imputer in a scikit-learn pipeline.
- Using it as a kernel for SVMs.
- Converting it to TreeLite format for faster predictions.
(R examples are available in the internal documentation)
## Description
Isolation Forest is an algorithm originally developed for outlier detection. It works by splitting sub-samples of the data according to some attribute/feature/column chosen at random. The idea is that the rarer an observation, the more likely it is that a random uniform split on some feature will leave it alone in one branch, and thus the fewer splits it will take to isolate it. The concept is extended to splitting hyperplanes in the extended model (i.e. splitting by more than one column at a time), and to guided (not entirely random) splits in the SCiForest model, which aim at isolating outliers faster and at finding clustered outliers.
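The isolation principle can be illustrated with a pure-Python sketch (this is an illustration of the concept, not the library's implementation): on one-dimensional data, an outlying point is separated from the rest in far fewer uniform random splits, on average, than a point inside the main cluster.

```python
import random

def isolation_depth(point, data, rng, depth=0):
    # Recursively apply uniform random cuts, keeping only the side that
    # contains `point`, until it is alone; return the number of cuts used.
    if len(data) <= 1:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    cut = rng.uniform(lo, hi)
    same_side = [x for x in data if (x <= cut) == (point <= cut)]
    return isolation_depth(point, same_side, rng, depth + 1)

rng = random.Random(0)
inliers = [rng.gauss(0.0, 1.0) for _ in range(100)]
outlier = 12.0
data = inliers + [outlier]

n_trees = 200
def avg_depth(p):
    return sum(isolation_depth(p, data, rng) for _ in range(n_trees)) / n_trees

d_outlier = avg_depth(outlier)
d_inlier = avg_depth(inliers[0])
# The outlier requires far fewer splits to isolate than the inlier.
```

Averaging the isolation depth over many such random trees, and inverting it (shallower = more anomalous), is what yields the outlier score.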
Note that this is a black-box model that will not produce explanations or importances - for a different take on explainable outlier detection see OutlierTree.

(Code to produce these plots can be found in the R examples in the documentation)
## Comparison against other libraries
The folder timings contains a speed comparison against other Isolation Forest implementations in Python (SciKit-Learn, EIF) and R (IsolationForest, isofor, solitude). From the benchmarks, IsoTree tends to be at least an order of magnitude faster than the libraries compared against, in both single-threaded and multi-threaded modes.
Example timings for 100 trees and different sample sizes, CovType dataset - see the link above for full benchmark and details:
| Library | Model | Time (s), 256 | Time (s), 1024 | Time (s), 10k |
| :---: | :---: | :---: | :---: | :---: |
| isotree | orig | 0.00161 | 0.00631 | 0.0848 |
| isotree | ext | 0.00326 | 0.0123 | 0.168 |
| eif | orig | 0.149 | 0.398 | 4.99 |
| eif | ext | 0.16 | 0.428 | 5.06 |
| h2o | orig | 9.33 | 11.21 | 14.23 |
| h2o | ext | 1.06 | 2.07 | 17.31 |
| scikit-learn | orig | 8.3 | 8.01 | 6.89 |
| solitude | orig | 32.612 | 34.01 | 41.01 |
Example AUC as outlier detector in typical datasets (notebook to produce results here):
- Satellite dataset:
| Library | AUROC defaults | AUROC grid search |
| :---: | :---: | :---: |
| isotree | 0.70 | 0.84 |
| eif | - | 0.714 |
| scikit-learn | 0.687 | 0.74 |
| h2o | 0.662 | 0.748 |
- Annthyroid dataset:
| Library | AUROC defaults | AUROC grid search |
| :---: | :---: | :---: |
| isotree | 0.80 | 0.982 |
| eif | - | 0.808 |
| scikit-learn | 0.836 | 0.836 |
| h2o | 0.80 | 0.80 |
(Disclaimer: these are rather small datasets and thus these AUC estimates have high variance)
## Non-random splits
While the original idea behind isolation forests was to decide splits uniformly at random, it is possible to get better outlier-detection performance on some datasets (particularly those with multimodal distributions) by determining splits according to an information-gain criterion instead. The idea is described in "Revisiting randomized choices in isolation forests", along with comparisons of different split-guiding criteria.
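A minimal sketch of what gain-guided splitting means in one dimension (an illustration only; the library's criteria, such as pooled gain, are more elaborate): among several uniformly drawn candidate cut-points, keep the one that most reduces the average per-side standard deviation. On bimodal data such a cut lands in the gap between the modes.

```python
import random
import statistics

def sd(xs):
    # Population standard deviation (0 for degenerate nodes).
    return statistics.pstdev(xs) if len(xs) > 1 else 0.0

def best_gain_cut(data, n_candidates=64, rng=None):
    # Among uniformly drawn candidate cut-points, keep the one with the
    # largest averaged gain: parent sd minus the mean of the child sds.
    lo, hi = min(data), max(data)
    parent = sd(data)
    best_cut, best_gain = None, float("-inf")
    for _ in range(n_candidates):
        cut = rng.uniform(lo, hi)
        left = [x for x in data if x <= cut]
        right = [x for x in data if x > cut]
        if not left or not right:
            continue
        gain = parent - (sd(left) + sd(right)) / 2.0
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut

rng = random.Random(1)
# Bimodal data: two tight clusters, one around 0 and one around 10.
data = [rng.gauss(0.0, 0.5) for _ in range(50)] + \
       [rng.gauss(10.0, 0.5) for _ in range(50)]
cut = best_gain_cut(data, rng=rng)
# The gain-guided cut separates the two clusters cleanly, whereas a
# single uniform random cut often lands inside one of them.
```

This is why guided splits help with clustered outliers: the cut is pulled toward low-density regions rather than falling anywhere in the range.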
## Different outlier scoring criteria
Although the intuition behind the algorithm was to look at the tree depth required for isolation, this package can also produce outlier scores based on density criteria, which provide improved results in some datasets, particularly when splitting on categorical features. The idea is described in "Isolation forests: looking beyond tree depth".
## Distance / similarity calculations
The general idea was extended to also produce distances (or, alternatively, similarities) between observations, according to how many random splits it takes to separate them. The idea is described in "Distance approximation using Isolation Forests".
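The separation-depth intuition can be sketched in pure Python (1-D, uniform cuts; the library's actual distances are computed from the fitted trees with proper standardization): similar points survive many splits on the same side before being separated, while dissimilar points are split apart almost immediately.

```python
import random

def separation_depth(a, b, data, rng):
    # Count uniform random splits until a and b fall on different sides,
    # restricting the data to the shared side after each split.
    depth = 0
    while True:
        lo, hi = min(data), max(data)
        if lo == hi:
            return depth
        cut = rng.uniform(lo, hi)
        depth += 1
        if (a <= cut) != (b <= cut):
            return depth
        data = [x for x in data if (x <= cut) == (a <= cut)]

rng = random.Random(2)
grid = [i / 100.0 for i in range(101)]  # uniform grid on [0, 1]

n_trees = 300
def avg_sep(a, b):
    return sum(separation_depth(a, b, list(grid), rng)
               for _ in range(n_trees)) / n_trees

close = avg_sep(0.50, 0.51)  # similar pair: takes many splits to separate
far = avg_sep(0.10, 0.90)    # dissimilar pair: separated almost at once
```

Averaged over many trees, a larger separation depth maps to a smaller distance (greater similarity) between the two observations.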
## Imputation of missing values
The model can also be used to impute missing values in a fashion similar to kNN: at prediction time, an observation with missing values is passed down each tree, and the non-missing values of the other observations in the terminal nodes it reaches are combined as a weighted average, with weights given by the depth of the node and the number of observations that fall there. This is not related to how the model handles missing values internally; rather, it is meant as a faster way of imputing by similarity. Quality is usually not as good as chained equations, but the method is much faster and more scalable. Non-random splits are recommended when the model is used as an imputer. Details are described in "Imputing missing values with unsupervised random trees".
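A heavily simplified sketch of the terminal-node idea (unweighted means over random 1-D partitions standing in for trees; the library additionally weights by node depth and size): an observation's missing value is estimated from the observed values of the points that land in the same terminal region.

```python
import random

def leaf_id(x, cuts):
    # Index of the region of a 1-D random partition (sorted cut-points)
    # in which x falls -- a stand-in for a tree's terminal node.
    return sum(1 for c in cuts if x > c)

rng = random.Random(3)
# x is fully observed; y is correlated with x; pretend ys[17] is missing.
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
miss = 17
true_y = ys[miss]

estimates = []
for _ in range(100):  # 100 random partitions standing in for trees
    cuts = sorted(rng.uniform(0.0, 10.0) for _ in range(8))
    lid = leaf_id(xs[miss], cuts)
    mates = [ys[i] for i in range(len(xs))
             if i != miss and leaf_id(xs[i], cuts) == lid]
    if mates:
        estimates.append(sum(mates) / len(mates))

imputed = sum(estimates) / len(estimates)
```

Because the terminal-node mates have similar observed features, averaging their values of the missing column recovers something close to the held-out value, much like a kNN imputation.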
## Highlights
There are already many available implementations of isolation forests for both Python and R (such as the one from the original paper's authors or the one in SciKit-Learn), but at the time of writing, all of them lack some important functionality and/or offer sub-optimal speed. This particular implementation offers the following:
- Implements the extended model (with splitting hyperplanes) and split-criterion model (with non-random splits).
- Can handle missing values (but performance with them is not so good).
- Can handle categorical variables (one-hot/dummy encoding does not produce the same result).
- Can use a mixture of random and non-random splits, and can split by weighted/pooled gain (in addition to simple average).
- Can produce approximated pairwise distances between observations according to how many steps it takes on average to separate them down the tree.
- Can calculate isolation kernels or a proximity matrix, which counts the proportion of trees in which two given observations end up in the same terminal node.
- Can produce missing value imputations according to observations that fall on each terminal node.
- Can work with sparse matrices.
- Can use either depth-based metrics or density-based metrics for calculation of outlier scores.
- Supports sample/observation weights, either as sampling importance or as distribution density measurement.
- Supports user-provided column sample weights.
- Can sample columns randomly with weights given by kurtosis.
- Uses exact formula (not approximation as others do) for harmonic numbers at lower sample and remainder sizes, and a higher-order approximation for larger sizes.
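The harmonic-number point above can be checked numerically. This sketch uses the standard asymptotic expansion H_n ≈ ln n + γ + 1/(2n) − 1/(12n²); the exact cutoff and expansion order the library uses are not specified here.

```python
import math

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def harmonic_exact(n):
    # Exact partial sum H_n = 1 + 1/2 + ... + 1/n
    return sum(1.0 / k for k in range(1, n + 1))

def harmonic_approx(n):
    # Higher-order asymptotic expansion, error O(1/n^4).
    return math.log(n) + GAMMA + 1.0 / (2 * n) - 1.0 / (12 * n ** 2)

# At small n, the crude ln(n) + gamma approximation is off by ~1/(2n),
# while the higher-order expansion is already accurate to ~1e-5;
# at large n both the sum and the expansion agree to high precision.
crude_err = abs(harmonic_exact(4) - (math.log(4) + GAMMA))
high_err = abs(harmonic_exact(4) - harmonic_approx(4))
big_err = abs(harmonic_exact(10_000) - harmonic_approx(10_000))
```

Using the exact sum at small sizes matters because the expected isolation depth used for score normalization is a function of harmonic numbers, and small node sizes are exactly where the crude approximation is worst.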
- Can fit trees incrementally to user-provided data samples.
- Produces serializable model objects with reasonable file sizes.
- Can convert the models to treelite format (Python-only and depending on the parameters that are used) (example here).
- Can translate the generated trees into SQL statements.
- Fast and multi-threaded C++ code with an ISO C interface, which is architecture-agnostic, multi-platform, and with the only external dependency (Robin-Map) being optional. Can be wrapped in languages other than Python/R/Ruby.
(Note that categoricals, NAs, and density-like sample weights, are
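To illustrate the kurtosis-weighted column sampling mentioned above (an illustration only; the library's actual weighting scheme may differ): columns with heavier tails, which are more likely to contain outliers, get proportionally higher probability of being chosen for splits.

```python
import random
import statistics

def kurtosis(xs):
    # Raw (non-excess) sample kurtosis: E[(x - mean)^4] / sd^4.
    m = statistics.fmean(xs)
    v = statistics.pvariance(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * v ** 2)

rng = random.Random(5)
n = 2000
# Column A: uniform, light tails (kurtosis around 1.8).
col_a = [rng.uniform(-1.0, 1.0) for _ in range(n)]
# Column B: mostly tight Gaussian with occasional extremes -> heavy tails.
col_b = [rng.gauss(0.0, 0.1) if rng.random() < 0.98 else rng.gauss(0.0, 5.0)
         for _ in range(n)]

k_a, k_b = kurtosis(col_a), kurtosis(col_b)
total = k_a + k_b
weights = [k_a / total, k_b / total]  # column-sampling probabilities
```

The heavy-tailed column receives by far the larger sampling weight, so splits concentrate on the features where isolating outliers is easiest.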