Frouros is a Python library for drift detection in machine learning systems that provides a combination of classical and more recent algorithms for both concept and data drift detection.

<p align="center"> <i> "Everything changes and nothing stands still" </i> </p> <p align="center"> <i> "You could not step twice into the same river" </i> </p> <div align="center" style="width: 70%;"> <p align="right"> <i> Heraclitus of Ephesus (535-475 BCE) </i> </p> </div>

⚡️ Quickstart

🔄 Concept drift

As a quick example, we can induce concept drift in the breast cancer dataset and show how a concept drift detector such as DDM (Drift Detection Method) is used. The example also shows how concept drift affects performance in terms of accuracy.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from frouros.detectors.concept_drift import DDM, DDMConfig
from frouros.metrics import PrequentialError

np.random.seed(seed=31)

# Load breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Define and fit model
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),
    ]
)
pipeline.fit(X=X_train, y=y_train)

# Detector configuration and instantiation
config = DDMConfig(
    warning_level=2.0,
    drift_level=3.0,
    min_num_instances=25,  # minimum number of instances before checking for concept drift
)
detector = DDM(config=config)

# Metric to compute the prequential error (accuracy is 1 - error)
metric = PrequentialError(alpha=1.0)  # alpha=1.0 applies no fading, so 1 - error is the plain accuracy

def stream_test(X_test, y_test, y, metric, detector):
    """Simulate a data stream over X_test and y_test.

    The y argument is unused; the true label of each sample comes from y_test.
    """
    drift_flag = False
    for i, (x_i, y_i) in enumerate(zip(X_test, y_test)):
        y_pred = pipeline.predict(x_i.reshape(1, -1))
        error = 1 - (y_pred.item() == y_i.item())
        metric_error = metric(error_value=error)
        _ = detector.update(value=error)
        status = detector.status
        if status["drift"] and not drift_flag:
            drift_flag = True
            print(f"Concept drift detected at step {i}. Accuracy: {1 - metric_error:.4f}")
    if not drift_flag:
        print("No concept drift detected")
    print(f"Final accuracy: {1 - metric_error:.4f}\n")

# Simulate data stream (assuming test label available after each prediction)
# No concept drift is expected to occur
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> No concept drift detected
# >> Final accuracy: 0.9766

# IMPORTANT: Induce/simulate concept drift in the last part (20%)
# of y_test by modifying some labels (approx. 50%), thereby changing P(y|X)
drift_size = int(y_test.shape[0] * 0.2)
y_test_drift = y_test[-drift_size:]
modify_idx = np.random.rand(*y_test_drift.shape) <= 0.5
y_test_drift[modify_idx] = (y_test_drift[modify_idx] + 1) % len(np.unique(y_test))
y_test[-drift_size:] = y_test_drift

# Reset detector and metric
detector.reset()
metric.reset()

# Simulate data stream (assuming test label available after each prediction)
# Concept drift is expected to occur because of the label modification
stream_test(
    X_test=X_test,
    y_test=y_test,
    y=y,
    metric=metric,
    detector=detector,
)
# >> Concept drift detected at step 142. Accuracy: 0.9510
# >> Final accuracy: 0.8480
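
The `alpha` argument of the metric used above can be read as a fading factor. As a rough illustration, the sketch below implements one common formulation of a faded running error; `FadingPrequentialError` is a hypothetical helper written for this README, and it is an assumption that Frouros's `PrequentialError` behaves the same way.

```python
class FadingPrequentialError:
    """Hypothetical helper: running error with a fading factor alpha in (0, 1]."""

    def __init__(self, alpha: float = 1.0) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.weighted_sum = 0.0  # faded sum of observed errors
        self.weight = 0.0  # faded count of observations

    def update(self, error: float) -> float:
        """Add one 0/1 error and return the current faded error rate."""
        self.weighted_sum = error + self.alpha * self.weighted_sum
        self.weight = 1.0 + self.alpha * self.weight
        return self.weighted_sum / self.weight


# Made-up error stream: mostly correct at first, then mostly wrong
errors = [0, 1, 0, 0, 1, 1, 1, 1]
plain = FadingPrequentialError(alpha=1.0)  # no fading: plain mean error
faded = FadingPrequentialError(alpha=0.5)  # recent errors weigh more
for e in errors:
    plain_value = plain.update(e)
    faded_value = faded.update(e)

print(f"plain error: {plain_value:.3f}, faded error: {faded_value:.3f}")
```

With `alpha=1.0` the value is the ordinary mean error, so `1 - error` is the usual accuracy. Smaller values of `alpha` make the metric react faster to recent mistakes.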

More concept drift examples can be found here.
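
For intuition about what `warning_level`, `drift_level` and `min_num_instances` control, here is a minimal sketch of the DDM rule as it is commonly described: track the running error rate `p` and its standard deviation `s`, remember the lowest `p + s` seen so far, and flag a warning or drift when the current `p + s` rises a given number of standard deviations above that minimum. `SimpleDDM` and the synthetic error stream are hypothetical and written only for illustration; this is not Frouros's implementation, and details such as strict versus non-strict comparisons are assumptions.

```python
import math


class SimpleDDM:
    """Hypothetical, simplified DDM for a stream of 0/1 errors."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_num_instances=25):
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_num_instances = min_num_instances
        self.reset()

    def reset(self):
        self.n = 0  # number of samples seen
        self.errors = 0  # number of misclassified samples
        self.p_min = math.inf  # error rate at the best point so far
        self.s_min = math.inf  # standard deviation at the best point so far
        self.warning = False
        self.drift = False

    def update(self, error):
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < self.min_num_instances:
            return  # too few samples to make a decision
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        level = p + s
        self.drift = level > self.p_min + self.drift_level * self.s_min
        self.warning = (
            not self.drift
            and level > self.p_min + self.warning_level * self.s_min
        )


detector = SimpleDDM()
warning_step = None
drift_step = None
for i in range(400):
    # Synthetic stream: 5% errors for 200 steps, then 50% errors
    error = int(i % 20 == 0) if i < 200 else int(i % 2 == 0)
    detector.update(error)
    if detector.warning and warning_step is None:
        warning_step = i
    if detector.drift:
        drift_step = i
        break

print(f"Warning at step {warning_step}, drift at step {drift_step}")
```

In this sketch the warning fires a few steps before the drift flag, which is the window in which a system could start collecting data for retraining.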

📊 Data drift

As a quick example, we can induce data drift in one feature of the iris dataset and show how a data drift detector such as the Kolmogorov-Smirnov test is used.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from frouros.detectors.data_drift import KSTest

np.random.seed(seed=31)

# Load iris dataset
X, y = load_iris(return_X_y=True)

# Split train (70%) and test (30%)
(
    X_train,
    X_test,
    y_train,
    y_test,
) = train_test_split(X, y, train_size=0.7, random_state=31)

# Set the feature index to which detector is applied
feature_idx = 0

# IMPORTANT: Induce/simulate data drift in the selected feature of X_test by
# applying some Gaussian noise, thereby changing P(X)
X_test[:, feature_idx] += np.random.normal(
    loc=0.0,
    scale=3.0,
    size=X_test.shape[0],
)

# Define and fit model
model = DecisionTreeClassifier(random_state=31)
model.fit(X=X_train, y=y_train)

# Set significance level for hypothesis testing
alpha = 0.001
# Define and fit detector
detector = KSTest()
_ = detector.fit(X=X_train[:, feature_idx])

# Apply detector to the selected feature of X_test
result, _ = detector.compare(X=X_test[:, feature_idx])

# Check if drift is taking place
if result.p_value <= alpha:
    print(f"Data drift detected at feature {feature_idx}")
else:
    print(f"No data drift detected at feature {feature_idx}")
# >> Data drift detected at feature 0
# Therefore, we can reject H0 (both samples come from the same distribution).

More data drift examples can be found here.
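
To make the hypothesis test above more concrete, the following sketch computes a two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distribution functions) and an asymptotic p-value in plain Python. It assumes that `KSTest` performs a two-sample test of this kind. The helper names `ks_statistic` and `ks_p_value`, the synthetic samples, and the series approximation for the p-value are illustrative choices, not Frouros's implementation.

```python
import math
import random


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_p_value(d, n, m, terms=100):
    """Asymptotic two-sided p-value (Kolmogorov series approximation)."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.3:
        return 1.0  # the series converges poorly here and the value is ~1
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


random.seed(31)
reference = [random.gauss(0.0, 1.0) for _ in range(105)]
same = [random.gauss(0.0, 1.0) for _ in range(45)]
drifted = [random.gauss(3.0, 1.0) for _ in range(45)]  # shifted mean

d_same = ks_statistic(reference, same)
d_drift = ks_statistic(reference, drifted)
p_same = ks_p_value(d_same, len(reference), len(same))
p_drift = ks_p_value(d_drift, len(reference), len(drifted))

alpha = 0.001
print(f"same distribution: D={d_same:.3f}, p={p_same:.4f}")
print(f"shifted sample:    D={d_drift:.3f}, drift={p_drift <= alpha}")
```

A small p-value means the gap between the two empirical distributions is unlikely under the hypothesis that both samples come from the same distribution.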

🛠 Installation

Frouros can be installed via pip:

pip install frouros
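
After installing, one way to confirm that the package is visible to the current interpreter is to query its metadata with the standard library. This is a small sketch; `installed_version` is a hypothetical helper name, and it assumes the distribution is registered under the name `frouros`, as in the pip command above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("frouros")
print(f"frouros {version}" if version else "frouros is not installed")
```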

🕵🏻‍♂️️ Drift detection methods

The currently implemented detectors are listed in the following table.

<table style="width: 100%; text-align: center; border-collapse: collapse; border: 1px solid grey;"> <thead> <tr> <th style="text-align: center; border: 1px solid grey; padding: 4px;">Drift detector</th> <th style="text-align: center; border: 1px solid grey; padding: 4px;">Type</th> <th style="text-align: center; border: 1px solid grey; padding: 4px;">Family</th> <th style="text-align: center; border: 1px solid grey; padding: 4px;">Univariate (U) / Multivariate (M)</th> <th style="text-align: center; border: 1px solid grey; padding: 4px;">Numerical (N) / Categorical (C)</th> <th style="text-align: center; border: 1px solid grey; padding: 4px;">Method</th> <th style="text-align: center; border: 1px solid grey; padding: 4px;">Reference</th> </tr> </thead> <tbody> <tr> <td rowspan="13" style="text-align: center; border: 1px solid grey; padding: 8px;">Concept drift</td> <td rowspan="13" style="text-align: center; border: 1px solid grey; padding: 8px;">Streaming</td> <td rowspan="4" style="text-align: center; border: 1px solid grey; padding: 8px;">Change detection</td> <td style="text-align: center; border: 1px solid grey; padding: 8px;">U</td> <td style="text-align: center; border: 1px solid grey; padding: 8px;">N</td> <td style="text-align: center; border: 1px solid grey; padding: 8px;">BOCD</td> <td style="text-align: center; border: 1px solid grey; padding: 8px;"><a href="https://doi.org/10.48550/arXiv.0710.3742">Adams and MacKay (2007)</a></td> </tr> <tr> <td style="text-align: center; border: 1px solid grey; padding: 8px;">U</td> <td style="text-align: center; border: 1px solid grey; padding: 8px;">N</td> <td style="text-align: center; border: 1px solid grey; padding: 8px;">CUSUM</td> <td style="text-align: center; border: 1px solid grey; padding: 8px;"><a href="https://doi.org/10.2307/2333009">Page (1954)</a></td> </tr> <tr> <td style="text-align: center; border: 1px solid grey; padding: 8px;">U</td> <td style="text-align: center; border: 1px solid grey; padding: 
8px;">N</td> <td style="text-align: center; border: 1px solid grey; padding: 8px;">Geometric moving average</td> <td style="text-align: center; border: 1px solid grey; padding: 8px;"><a href="https://doi.org/10.2307/1266443">Roberts (195</a></td> </tr> </tbody> </table>