MLstatkit


MLstatkit is a Python library that integrates established statistical methods into modern machine learning workflows.
It provides a set of core functions widely used for model evaluation and statistical inference:

  • DeLong's test (Delong_test) for comparing the AUCs of two correlated ROC curves.

  • Bootstrapping (Bootstrapping) for estimating confidence intervals of metrics such as ROC-AUC, F1-score, accuracy, precision, recall, and PR-AUC.

  • Permutation test (Permutation_test) for evaluating whether performance differences between two models are statistically significant.

  • AUC to Odds Ratio conversion (AUC2OR) for interpreting ROC-AUC values in terms of odds ratios and related effect size statistics.
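The AUC-to-odds-ratio conversion rests on the binormal model: an AUC maps to Cohen's d via d = sqrt(2) * inverse-normal-CDF(AUC), and d maps to a log odds ratio via ln(OR) = d * pi / sqrt(3). The standard-library sketch below illustrates that arithmetic; the helper name auc_to_or is ours, and the library's AUC2OR may expose a different signature or return additional effect-size statistics.

```python
import math
from statistics import NormalDist

def auc_to_or(auc: float) -> float:
    """Convert a ROC-AUC into an approximate odds ratio (binormal model).

    AUC = Phi(d / sqrt(2)) for Cohen's d, and the logistic approximation
    gives ln(OR) = d * pi / sqrt(3).
    """
    d = math.sqrt(2) * NormalDist().inv_cdf(auc)      # Cohen's d from AUC
    return math.exp(d * math.pi / math.sqrt(3))       # odds ratio

print(f"AUC 0.70 -> OR ~ {auc_to_or(0.70):.2f}")
print(f"AUC 0.50 -> OR = {auc_to_or(0.50):.2f}")      # no discrimination -> OR 1
```

An AUC of 0.5 maps to an odds ratio of exactly 1, and the mapping grows rapidly as AUC approaches 1, which is why odds ratios are often a more interpretable effect size for clinical audiences.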

Since v0.1.9, the library has been modularized into dedicated files (ci.py, conversions.py, delong.py, metrics.py, permutation.py), while keeping a unified import interface through stats.py. This improves readability, maintainability, and extensibility for future methods.

Installation

Install MLstatkit directly from PyPI using pip:

pip install MLstatkit

Usage

DeLong's Test for ROC Curves

The Delong_test function statistically compares the areas under two correlated Receiver Operating Characteristic (ROC) curves produced by different models, testing whether the difference in AUC is significant.
Since version 0.1.8, the function can also return confidence intervals (CIs) for the AUCs of both models, similar to the functionality of roc.test in R.

Parameters (DeLong's Test)

  • true : array-like of shape (n_samples,)
    Binary ground truth labels in {0, 1}.

  • prob_A, prob_B : array-like of shape (n_samples,)
    Scores or probabilities for the positive class from models A and B.

  • alpha : float, default=0.95
    Confidence level for the AUC confidence intervals (normal approximation, clipped to [0, 1]).

  • return_ci : bool, default=True
    If True, return (ci_A, ci_B) for model A and B AUCs.

  • return_auc : bool, default=True
    If True, include (auc_A, auc_B) in the return tuple in addition to z and p (and optionally (ci_A, ci_B) if return_ci=True).

  • n_boot : int, default=5000
    Number of bootstrap resamples used if the fallback path is triggered.

  • random_state : int or None, default=None
    RNG seed for reproducibility in the bootstrap path (not used for standard DeLong).

  • verbose : {0, 1, 2}, default=0

    • 0: Silent
    • 1: Key steps (sample counts, method, z/p, CIs)
    • 2: Detailed (includes var_diff, raw differences, and optional bootstrap progress)

  • progress_every : int, default=0
    When verbose >= 2, print bootstrap progress every N iterations; 0 disables progress output.

Returns (DeLong's Test)

Depending on return_ci and return_auc, the function returns different tuples:

  • return_ci=False, return_auc=False → (z, p_value)
  • return_ci=True, return_auc=False → (z, p_value, ci_A, ci_B)
  • return_ci=False, return_auc=True → (z, p_value, auc_A, auc_B)
  • return_ci=True, return_auc=True → (z, p_value, ci_A, ci_B, auc_A, auc_B, info)

Where info (present only when both return_ci and return_auc are True) is a dict containing:

  • method: "delong" or "bootstrap"
  • var_diff: variance of the AUC difference (if DeLong was used)
  • tie_rate_A, tie_rate_B: tie proportions in model A and B scores
  • n_pos, n_neg: class counts
  • n_boot: number of effective bootstrap samples (if bootstrap)
  • messages: list of (level, message) logs captured during the run

Examples (DeLong's Test)

  • Example 1 --- Minimal usage (z and p only)
from MLstatkit.stats import Delong_test
import numpy as np

true   = np.array([0, 1, 0, 1])
prob_A = np.array([0.10, 0.40, 0.35, 0.80])
prob_B = np.array([0.20, 0.30, 0.40, 0.70])

z, p = Delong_test(true, prob_A, prob_B, return_ci=False, return_auc=False, verbose=0)
print(f"z = {z:.6f}, p = {p:.3e}")
  • Example 2 --- AUCs and 95% CIs (with method info)
z, p, ci_A, ci_B, auc_A, auc_B, info = Delong_test(
    true, prob_A, prob_B,
    alpha=0.95, return_ci=True, return_auc=True, verbose=1
)

print(f"Method   : {info['method']}")
print(f"AUC_A    : {auc_A:.4f}, CI_A = {ci_A}")
print(f"AUC_B    : {auc_B:.4f}, CI_B = {ci_B}")
print(f"z-score  : {z:.4f}, p-value = {p:.3e}")
  • Example 3 --- Degenerate case (forces bootstrap fallback)
# Perfect separation for A, completely reversed scores for B
true   = np.array([0, 1] * 50)
prob_A = true.astype(float)        # Model A: perfect AUC = 1.0
prob_B = 1 - true.astype(float)    # Model B: worst AUC = 0.0

z, p, ci_A, ci_B, auc_A, auc_B, info = Delong_test(
    true, prob_A, prob_B,
    alpha=0.95, return_ci=True, return_auc=True,
    n_boot=2000, random_state=42, verbose=2, progress_every=500
)

print("--- Bootstrap fallback example ---")
print(f"Method   : {info['method']} (auto-fallback expected)")
print(f"AUC_A    : {auc_A:.4f}, CI_A = {ci_A}")
print(f"AUC_B    : {auc_B:.4f}, CI_B = {ci_B}")
print(f"z-score  : {z}, p-value = {p:.3e}")

Bootstrapping for Confidence Intervals

The Bootstrapping function estimates confidence intervals (CIs) for a chosen performance metric by resampling the data with replacement, giving a measure of the estimate's reliability. Supported metrics include AUROC (area under the ROC curve), AUPRC (area under the precision-recall curve), F1-score, accuracy, precision, and recall.

Parameters (Bootstrapping)

  • true : array-like of shape (n_samples,)
    True binary labels, where the labels are either {0, 1}.

  • prob : array-like of shape (n_samples,)
    Predicted probabilities, as returned by a classifier's predict_proba method, or binary predictions based on the specified scoring function and threshold.

  • metric_str : str, default='f1'
    Identifier for the scoring function to use. Supported values include 'f1', 'accuracy', 'recall', 'precision', 'roc_auc', 'pr_auc', and 'average_precision'.

  • n_bootstraps : int, default=1000
    The number of bootstrap iterations to perform. Increasing this number improves the reliability of the confidence interval estimation but also increases computational time.

  • confidence_level : float, default=0.95
    The confidence level for the interval estimation. For instance, 0.95 represents a 95% confidence interval.

  • threshold : float, default=0.5
    A threshold value used for converting probabilities to binary labels for metrics like 'f1', where applicable.

  • average : str, default='macro'
    Specifies the method of averaging to apply to multi-class/multi-label targets. Other options include 'micro', 'samples', 'weighted', and 'binary'.

  • random_state : int, default=0
    Seed for the random number generator. This parameter ensures reproducibility of results.

Returns (Bootstrapping)

  • original_score : float
    Metric score on the original (non-resampled) dataset.

  • confidence_lower : float
    Lower bound of the bootstrap confidence interval.

  • confidence_upper : float
    Upper bound of the bootstrap confidence interval.
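As a mechanism sketch (not the library's implementation), the percentile bootstrap behind these three values can be written in a few lines of NumPy. The helper names auc_score and bootstrap_ci below are illustrative only:

```python
import numpy as np

def auc_score(y_true, y_prob):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = y_prob[y_true == 1][:, None]
    neg = y_prob[y_true == 0][None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def bootstrap_ci(y_true, y_prob, metric, n_boot=1000, level=0.95, seed=0):
    """Percentile-bootstrap CI for a metric (simplified illustration)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    while len(scores) < n_boot:
        idx = rng.integers(0, n, size=n)            # resample with replacement
        if y_true[idx].min() == y_true[idx].max():  # skip single-class resamples
            continue
        scores.append(metric(y_true[idx], y_prob[idx]))
    alpha = (1 - level) / 2
    lower, upper = np.percentile(scores, [100 * alpha, 100 * (1 - alpha)])
    return metric(y_true, y_prob), float(lower), float(upper)

y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.3, 0.4, 0.7, 0.05])
score, lo, hi = bootstrap_ci(y_true, y_prob, auc_score)
print(f"AUROC: {score:.3f}, 95% CI: [{lo:.3f} - {hi:.3f}]")
```

Note that resamples missing one of the two classes are skipped, since AUC is undefined there; the library may handle this case differently.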

Examples (Bootstrapping)

from MLstatkit.stats import Bootstrapping
import numpy as np

# Example data
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.3, 0.4, 0.7, 0.05])

# Calculate confidence intervals for AUROC
original_score, confidence_lower, confidence_upper = Bootstrapping(y_true, y_prob, 'roc_auc')
print(f"AUROC: {original_score:.3f}, Confidence interval: [{confidence_lower:.3f} - {confidence_upper:.3f}]")

# Calculate confidence intervals for AUPRC
original_score, confidence_lower, confidence_upper = Bootstrapping(y_true, y_prob, 'pr_auc')
print(f"AUPRC: {original_score:.3f}, Confidence interval: [{confidence_lower:.3f} - {confidence_upper:.3f}]")

# Calculate confidence intervals for F1 score with a custom threshold
original_score, confidence_lower, confidence_upper = Bootstrapping(y_true, y_prob, 'f1', threshold=0.5)
print(f"F1 Score: {original_score:.3f}, Confidence interval: [{confidence_lower:.3f} - {confidence_upper:.3f}]")

# Loop through multiple metrics
for score in ['roc_auc', 'pr_auc', 'f1']:
    original_score, conf_lower, conf_upper = Bootstrapping(y_true, y_prob, score, threshold=0.5)
    print(f"{score.upper()} original score: {original_score:.3f}, confidence interval: [{conf_lower:.3f} - {conf_upper:.3f}]")

Permutation Test for Statistical Significance

The Permutation_test function evaluates whether the observed difference in performance between two models is statistically significant.
It works by randomly shuffling the predictions between the models and recalculating the chosen metric many times to generate a null distribution of differences.
This approach makes no assumptions about the underlying distribution of the data, making it a robust method for model comparison.

Parameters

  • y_true : array-like of shape (n_samples,)
    True binary labels in {0, 1}.

  • prob_model_A : array-like of shape (n_samples,)
    Predicted probabilities from the first model.

  • prob_model_B : array-like of shape (n_samples,)
    Predicted probabilities from the second model.

  • metric_str : str, default='f1'
    Metric to compare. Supported values include 'f1', 'accuracy', …
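The README is truncated before this section's example, but the shuffling procedure described above is straightforward to sketch. The code below is an illustrative stand-in for Permutation_test, using accuracy (with a 0.5 threshold) to stay dependency-free; the function name permutation_test and its signature are ours, not the library's:

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def permutation_test(y_true, prob_a, prob_b, n_perm=5000, threshold=0.5, seed=0):
    """Paired permutation test on the accuracy difference between two models."""
    rng = np.random.default_rng(seed)
    pred_a = (prob_a >= threshold).astype(int)
    pred_b = (prob_b >= threshold).astype(int)
    observed = accuracy(y_true, pred_a) - accuracy(y_true, pred_b)
    count = 0
    for _ in range(n_perm):
        swap = rng.random(len(y_true)) < 0.5     # per-sample coin flip
        pa = np.where(swap, pred_b, pred_a)      # swap predictions between models
        pb = np.where(swap, pred_a, pred_b)
        diff = accuracy(y_true, pa) - accuracy(y_true, pb)
        if abs(diff) >= abs(observed):
            count += 1
    p = (count + 1) / (n_perm + 1)               # add-one smoothing avoids p = 0
    return observed, p

y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0])
prob_a = np.array([0.2, 0.8, 0.3, 0.1, 0.9, 0.7, 0.4, 0.6, 0.2])
prob_b = np.array([0.6, 0.4, 0.7, 0.2, 0.5, 0.3, 0.8, 0.5, 0.6])
obs, p = permutation_test(y_true, prob_a, prob_b)
print(f"observed diff = {obs:.3f}, p = {p:.4f}")
```

Because the per-sample swaps make no distributional assumptions, the resulting p-value is valid for any metric, which is the robustness property the section describes.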
