PyNomaly
Anomaly detection using LoOP: Local Outlier Probabilities, a local density based outlier detection method providing an outlier score in the range of [0,1].
Install / Use
/learn @vc1492a/PyNomalyREADME
PyNomaly
PyNomaly is a Python 3 implementation of LoOP (Local Outlier Probabilities). LoOP is a local density based outlier detection method by Kriegel, Kröger, Schubert, and Zimek which provides outlier scores in the range of [0,1] that are directly interpretable as the probability of a sample being an outlier.
PyNomaly is a core library of deepchecks, OmniDocBench and pysad.
The outlier score of each sample is called the Local Outlier Probability. It measures the local deviation of density of a given sample with respect to its neighbors as Local Outlier Factor (LOF), but provides normalized outlier scores in the range [0,1]. These outlier scores are directly interpretable as a probability of an object being an outlier. Since Local Outlier Probabilities provides scores in the range [0,1], practitioners are free to interpret the results according to the application.
Like LOF, it is local in that the anomaly score depends on how isolated the sample is with respect to the surrounding neighborhood. Locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that lie in regions of lower density compared to their neighbors and thus identify samples that may be outliers according to their Local Outlier Probability.
The authors' 2009 paper detailing LoOP's theory, formulation, and application is provided by Ludwig-Maximilians University Munich - Institute for Informatics; LoOP: Local Outlier Probabilities.
Implementation
This Python 3 implementation uses Numpy and the formulas outlined in LoOP: Local Outlier Probabilities to calculate the Local Outlier Probability of each sample.
Dependencies
- Python 3.8 - 3.13
- numpy >= 1.16.3
- python-utils >= 2.3.0
- (optional) numba >= 0.45.1
Numba just-in-time (JIT) compiles the function with calculates the Euclidean distance between observations, providing a reduction in computation time (significantly when a large number of observations are scored). Numba is not a requirement and PyNomaly may still be used solely with numpy if desired (details below).
Quick Start
First install the package from the Python Package Index:
pip install PyNomaly # or pip3 install ... if you're using both Python 3 and 2.
Alternatively, you can use conda to install the package from conda-forge:
conda install conda-forge::pynomaly
Then you can do something like this:
from PyNomaly import loop
m = loop.LocalOutlierProbability(data).fit()
scores = m.local_outlier_probabilities
print(scores)
where data is a NxM (N rows, M columns; 2-dimensional) set of data as either a Pandas DataFrame or Numpy array.
LocalOutlierProbability sets the extent (in integer in value of 1, 2, or 3) and n_neighbors (must be greater than 0) parameters with the default values of 3 and 10, respectively. You're free to set these parameters on your own as below:
from PyNomaly import loop
m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20).fit()
scores = m.local_outlier_probabilities
print(scores)
This implementation of LoOP also includes an optional cluster_labels parameter. This is useful in cases where regions of varying density occur within the same set of data. When using cluster_labels, the Local Outlier Probability of a sample is calculated with respect to its cluster assignment.
from PyNomaly import loop
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.6, min_samples=50).fit(data)
m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20, cluster_labels=list(db.labels_)).fit()
scores = m.local_outlier_probabilities
print(scores)
NOTE: Unless your data is all the same scale, it may be a good idea to normalize your data with z-scores or another normalization scheme prior to using LoOP, especially when working with multiple dimensions of varying scale. Users must also appropriately handle missing values prior to using LoOP, as LoOP does not support Pandas DataFrames or Numpy arrays with missing values.
Utilizing Numba and Progress Bars
It may be helpful to use just-in-time (JIT) compilation in the cases where a lot of
observations are scored. Numba, a JIT compiler for Python, may be used
with PyNomaly by setting use_numba=True:
from PyNomaly import loop
m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20, use_numba=True, progress_bar=True).fit()
scores = m.local_outlier_probabilities
print(scores)
Numba must be installed if the above to use JIT compilation and improve the
speed of multiple calls to LocalOutlierProbability(), and PyNomaly has been
tested with Numba version 0.45.1. An example of the speed difference that can
be realized with using Numba is avaialble in examples/numba_speed_diff.py.
You may also choose to print progress bars with our without the use of numba
by passing progress_bar=True to the LocalOutlierProbability() method as above.
Choosing Parameters
The extent parameter controls the sensitivity of the scoring in practice. The parameter corresponds to the statistical notion of an outlier defined as an object deviating more than a given lambda (extent) times the standard deviation from the mean. A value of 2 implies outliers deviating more than 2 standard deviations from the mean, and corresponds to 95.0% in the empirical "three-sigma" rule. The appropriate parameter should be selected according to the level of sensitivity needed for the input data and application. The question to ask is whether it is more reasonable to assume outliers in your data are 1, 2, or 3 standard deviations from the mean, and select the value likely most appropriate to your data and application.
The n_neighbors parameter defines the number of neighbors to consider about each sample (neighborhood size) when determining its Local Outlier Probability with respect to the density of the sample's defined neighborhood. The idea number of neighbors to consider is dependent on the input data. However, the notion of an outlier implies it would be considered as such regardless of the number of neighbors considered. One potential approach is to use a number of different neighborhood sizes and average the results for reach observation. Those observations which rank highly with varying neighborhood sizes are more than likely outliers. This is one potential approach of selecting the neighborhood size. Another is to select a value proportional to the number of observations, such an odd-valued integer close to the square root of the number of observations in your data (sqrt(n_observations).
Iris Data Example
We'll be using the well-known Iris dataset to show LoOP's capabilities. There's a few things you'll need for this example beyond the standard prerequisites listed above:
- matplotlib 2.0.0 or greater
- PyDataset 0.2.0 or greater
- scikit-learn 0.18.1 or greater
First, let's import the packages and libraries we will need for this example.
from PyNomaly import loop
import pandas as pd
from pydataset import data
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
Now let's create two sets of Iris data for scoring; one with clustering and the other without.
# import the data and remove any non-numeric columns
iris = pd.DataFrame(data('iris').drop(columns=['Species']))
Next, let's cluster the data using DBSCAN and generate two sets of scores. On both cases, we will use the default values for both extent (0.997) and n_neighbors (10).
db = DBSCAN(eps=0.9, min_samples=10).fit(iris)
m = loop.LocalOutlierProbability(iris).fit()
scores_noclust = m.local_outlier_probabilities
m_clust = loop.LocalOutlierProbability(iris, cluster_labels=list(db.labels_)).fit()
scores_clust = m_clust.local_outlier_probabilities
Organize the data into two separate Pandas DataFrames.
iris_clust = pd.DataFrame(iris.copy())
iris_clust['scores'] = scores_clust
iris_clust['labels'] = db.labels_
iris['scores'] = scores_noclust
And finally, let's visualize the scores provided by LoOP in both cases (with and without clustering).
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris['Sepal.Width'], iris['Petal.Width'], iris['Sepal.Length'],
c=iris['scores'], cmap='seismic', s=50)
ax.set_xlabel('Sepal.Width')
ax.set_ylabel('Petal.Width')
ax.set_zlabel('Sepal.Length')
plt.show()
plt.clf()
plt.cla()
plt.close()
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris_clust['Sepal.Width'], iris_clust['Petal.Width'], iris_clust['Sepal.Length'],
c=iris_clust['scores'], cmap='seismic', s=50)
ax.set_xlabel('Sepal.Width')
ax.set_ylabel('Petal.Width')
ax.set_zlabel('Sepal
