(Py)TOD: GPU-accelerated Outlier Detection via Tensor Operations
================================================================

**Deployment & Documentation & Stats & License**
.. image:: https://img.shields.io/pypi/v/pytod.svg?color=brightgreen
   :target: https://pypi.org/project/pytod/
   :alt: PyPI version

.. image:: https://img.shields.io/github/stars/yzhao062/pytod.svg
   :target: https://github.com/yzhao062/pytod/stargazers
   :alt: GitHub stars

.. image:: https://img.shields.io/github/forks/yzhao062/pytod.svg?color=blue
   :target: https://github.com/yzhao062/pytod/network
   :alt: GitHub forks

.. image:: https://github.com/yzhao062/pytod/actions/workflows/testing.yml/badge.svg
   :target: https://github.com/yzhao062/pytod/actions/workflows/testing.yml
   :alt: testing

.. image:: https://img.shields.io/github/license/yzhao062/pytod.svg
   :target: https://github.com/yzhao062/pytod/blob/master/LICENSE
   :alt: License
**Background**\ : Outlier detection (OD) is a key data mining task for identifying abnormal objects among general samples, with numerous high-stakes applications including fraud detection and intrusion detection.
We propose TOD, a system for efficient and scalable outlier detection (OD) on distributed multi-GPU machines. A key idea behind TOD is decomposing OD applications into basic tensor algebra operations for GPU acceleration.
**Citing TOD**\ : Check out the `design paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-preprint-tod.pdf>`_.
If you use TOD in a scientific publication, we would appreciate
citations to the following paper::

    @article{zhao2021tod,
      title={TOD: GPU-accelerated Outlier Detection via Tensor Operations},
      author={Zhao, Yue and Chen, George H and Jia, Zhihao},
      journal={arXiv preprint arXiv:2110.14007},
      year={2021}
    }

or::

    Zhao, Y., Chen, G.H. and Jia, Z., 2021. TOD: GPU-accelerated Outlier Detection via Tensor Operations. arXiv preprint arXiv:2110.14007.
One Reason to Use It:
^^^^^^^^^^^^^^^^^^^^^
On average, TOD is 11 times faster than PyOD across a diverse group of OD algorithms!

If you need another reason: it can handle much larger datasets: outlier detection on more than one million samples finishes within an hour!
**GPU-accelerated Outlier Detection with 5 Lines of Code**\ :
.. code-block:: python

    # train the kNN detector
    from pytod.models.knn import KNN
    clf = KNN()  # default GPU device is used
    clf.fit(X_train)

    # get outlier scores
    y_train_scores = clf.decision_scores_  # raw outlier scores on the train data
    y_test_scores = clf.decision_function(X_test)  # predict raw outlier scores on test data
**TOD is featured for**\ :

- **Unified APIs, detailed documentation, and examples** for easy use (under construction)
- **More than five OD algorithms**, with more being added
- **Support for multi-GPU acceleration**
- **Advanced techniques**, including provable quantization and automatic batching
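The batching idea above can be sketched in a few lines. This is a simplified NumPy illustration, not PyTOD's actual implementation (the function name and signature here are hypothetical): pairwise distances are computed chunk by chunk so the full n-by-n distance matrix is never materialized, which bounds peak memory the way automatic batching must on a GPU.

```python
import numpy as np

def batched_knn_scores(X, k=5, batch_size=1000):
    """kNN outlier score (distance to the k-th nearest neighbor),
    computed in batches so only a batch_size-by-n slice of the
    distance matrix exists at any time."""
    n = X.shape[0]
    scores = np.empty(n)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        # pairwise distances of this chunk against all samples
        diff = X[start:end, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(-1))
        # k-th nearest neighbor distance; index 0 is the self-distance 0,
        # so index k (after partitioning) excludes the point itself
        scores[start:end] = np.partition(dist, k, axis=1)[:, k]
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X[0] += 10  # plant one obvious outlier
scores = batched_knn_scores(X, k=5, batch_size=64)
print(int(np.argmax(scores)))  # → 0, the planted outlier
```

Smaller `batch_size` lowers peak memory at the cost of more kernel launches; a real system picks it from the available device memory.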
**Table of Contents**\ :

* `Installation <#installation>`_
* `Implemented Algorithms <#implemented-algorithms>`_
* `A Motivating Example PyOD vs. PyTOD <#a-motivating-example-pyod-vs-pytod>`_
* `Paper Reproducibility <#paper-reproducibility>`_
* `Programming Model Interface <#programming-model-interface>`_
* `End-to-end Performance Comparison with PyOD <#end-to-end-performance-comparison-with-pyod>`_
Installation
^^^^^^^^^^^^
It is recommended to use pip for installation. Please make sure the latest version is installed, as PyTOD is updated frequently:
.. code-block:: bash

    pip install pytod            # normal install
    pip install --upgrade pytod  # or update if needed
Alternatively, you could clone the repository and install from source:

.. code-block:: bash

    git clone https://github.com/yzhao062/pytod.git
    cd pytod
    pip install .
**Required Dependencies**\ :

- Python 3.6+
- mpmath
- numpy>=1.13
- torch>=1.7 (it is recommended to install torch yourself to match your CUDA setup)
- scipy>=0.19.1
- scikit_learn>=0.21
- pyod>=1.0.4 (for comparison)
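Since torch should be installed to match the local CUDA setup, it may help to verify the device before running the GPU examples. A minimal sketch (the helper name is ours, not part of PyTOD; the check degrades gracefully when torch is absent):

```python
def pick_device():
    """Return 'cuda' if a GPU-enabled PyTorch is available, else 'cpu'."""
    try:
        import torch
        if torch.cuda.is_available():
            return 'cuda'
    except ImportError:
        pass  # torch not installed; fall back to CPU
    return 'cpu'

device = pick_device()
print(device)  # 'cuda' on a GPU machine, otherwise 'cpu'
```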
Implemented Algorithms
^^^^^^^^^^^^^^^^^^^^^^
The PyTOD toolkit consists of three major functional groups (to be cleaned up):

**(i) Individual Detection Algorithms**\ :
=================== ========== ====================================================================================================== ===== ==========================
Type                Abbr       Algorithm                                                                                              Year  Ref
=================== ========== ====================================================================================================== ===== ==========================
Linear Model        PCA        Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes)  2003  [#Shyu2003A]_
Proximity-Based     LOF        Local Outlier Factor                                                                                   2000  [#Breunig2000LOF]_
Proximity-Based     COF        Connectivity-Based Outlier Factor                                                                      2002  [#Tang2002Enhancing]_
Proximity-Based     HBOS       Histogram-based Outlier Score                                                                          2012  [#Goldstein2012Histogram]_
Proximity-Based     kNN        k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score)                2000  [#Ramaswamy2000Efficient]_
Proximity-Based     AvgKNN     Average kNN (use the average distance to k nearest neighbors as the outlier score)                     2002  [#Angiulli2002Fast]_
Proximity-Based     MedKNN     Median kNN (use the median distance to k nearest neighbors as the outlier score)                       2002  [#Angiulli2002Fast]_
Probabilistic       ABOD       Angle-Based Outlier Detection                                                                          2008  [#Kriegel2008Angle]_
Probabilistic       COPOD      COPOD: Copula-Based Outlier Detection                                                                  2020  [#Li2020COPOD]_
Probabilistic       FastABOD   Fast Angle-Based Outlier Detection using approximation                                                 2008  [#Kriegel2008Angle]_
=================== ========== ====================================================================================================== ===== ==========================
Code is being released. Watch and star for the latest news!
A Motivating Example PyOD vs. PyTOD!
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The `kNN example <https://github.com/yzhao062/pytod/blob/main/examples/knn_example.py>`_
shows how fast and easy PyTOD is. Take the famous kNN outlier detection as an example:
#. Initialize a kNN detector, fit the model, and make the prediction.

   .. code-block:: python

       from pytod.models.knn import KNN  # kNN detector

       # train kNN detector
       clf_name = 'KNN'
       clf = KNN()
       clf.fit(X_train)

   .. code-block:: python

       # if GPU is not available, use CPU instead
       clf = KNN(device='cpu')
       clf.fit(X_train)
#. Get the prediction results.

   .. code-block:: python

       # get the prediction label and outlier scores of the training data
       y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
       y_train_scores = clf.decision_scores_  # raw outlier scores
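In the PyOD family of APIs, ``labels_`` is conventionally derived from the raw scores by thresholding at a contamination fraction. A minimal NumPy sketch of that convention (the exact threshold rule inside PyTOD may differ; the helper name is ours):

```python
import numpy as np

def scores_to_labels(scores, contamination=0.1):
    """Flag the top `contamination` fraction of scores as outliers (1)."""
    scores = np.asarray(scores)
    n_outliers = int(np.ceil(contamination * len(scores)))
    threshold = np.sort(scores)[-n_outliers]
    return (scores >= threshold).astype(int)

scores = np.array([0.1, 0.2, 0.15, 0.9, 0.3, 0.25, 0.2, 0.1, 0.95, 0.3])
labels = scores_to_labels(scores, contamination=0.2)
print(labels.sum())  # → 2: the two highest-scoring samples are flagged
```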
#. On a simple laptop, let us see how fast it is in comparison to PyOD for 30,000 samples with 20 features.

   .. code-block:: text

       KNN-PyOD ROC:1.0, precision @ rank n:1.0
       Execution time 11.26 seconds

       KNN-PyTOD-GPU ROC:1.0, precision @ rank n:1.0
       Execution time 2.82 seconds

       KNN-PyTOD-CPU ROC:1.0, precision @ rank n:1.0
       Execution time 3.36 seconds
It is easy to see that PyTOD is more efficient than PyOD on both GPU and CPU.
Paper Reproducibility
^^^^^^^^^^^^^^^^^^^^^
**Datasets**\ : OD benchmark datasets are available in the `datasets folder <https://github.com/yzhao062/pytod/tree/main/reproducibility/datasets/ODDS>`_.

Scripts for reproducibility are available in the `reproducibility folder <https://github.com/yzhao062/pytod/tree/main/reproducibility>`_.
Cleanup is on the way!
Programming Model Interface
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Complex OD algorithms can be abstracted into common tensor operators.
.. image:: https://raw.githubusercontent.com/yzhao062/pytod/master/figs/abstraction.png
   :target: https://raw.githubusercontent.com/yzhao062/pytod/master/figs/abstraction.png
For instance, ABOD and COPOD can be assembled from the basic tensor operators.

.. image:: https://raw.githubusercontent.com/yzhao062/pytod/master/figs/abstraction_example.png
   :target: https://raw.githubusercontent.com/yzhao062/pytod/master/figs/abstraction_example.png
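As a concrete illustration of this decomposition, the kNN detector reduces to two tensor operators: pairwise distance computation followed by top-k selection. The sketch below uses NumPy for portability; on a GPU the same two operators map onto batched tensor kernels (e.g. ``torch.cdist`` and ``torch.topk``). This is our illustration, not PyTOD's internal code.

```python
import numpy as np

def knn_scores(X, k=5):
    """kNN outlier score = distance to the k-th nearest neighbor,
    expressed as two tensor operators: cdist + top-k."""
    # operator 1: full pairwise Euclidean distance matrix via the
    # expansion ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(1)
    dist2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    dist = np.sqrt(np.maximum(dist2, 0))  # clamp tiny negative round-off
    # operator 2: top-k selection; index 0 of each sorted row is the
    # self-distance 0, so index k is the k-th nearest neighbor
    return np.sort(dist, axis=1)[:, k]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[-1] += 8  # plant an outlier at the last index
print(int(np.argmax(knn_scores(X, k=5))))  # → 199, the planted outlier
```

Because both operators are dense tensor algebra, they inherit GPU acceleration for free once expressed this way, which is the core idea behind TOD.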
End-to-end Performance Comparison with PyOD
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Overall, PyTOD is much faster than PyOD (on average, 11 times) and takes far less run time.
.. image:: https://raw.githubusercontent.com/yzhao062/pytod/master/figs/run_time.png
   :target: https://raw.githubusercontent.com/yzhao062/pytod/master/figs/run_time.png
Reference
^^^^^^^^^
.. [#Aggarwal2015Outlier] Aggarwal, C.C., 2015. Outlier analysis. In Data mining (pp. 237-263). Springer, Cham.
.. [#Aggarwal2015Theoretical] Aggarwal, C.C. and Sathe, S., 2015. Theoretical foundations and algorithms for outlier ensembles. *ACM SIGKDD Explorations Newsletter*, 17(1), pp.24-47.
.. [#Aggarwal2017Outlier] Aggarwal, C.C. and Sathe, S., 2017. Outlier ensembles: An introduction. Springer.
.. [#Almardeny2020A] Almardeny, Y., Boujnah, N. and Cleary,
