
PkBoost

PKBoost: an adaptive GBDT for concept drift, built from scratch in Rust. PKBoost handles shifting data distributions in fraud detection at a 0.2% fraud rate, degrading less than 2% under drift, versus a 31.8% drop for XGBoost and a 42.5% drop for LightGBM.

Install / Use

/learn @PKBoost-AI-Labs/PkBoost

README

<div align="center">

PKBoost

An Adaptive Gradient Boosting Library

Rust Python PyPI Downloads License: GPL-3.0 DOI Kaggle

</div>

Built from scratch in Rust, PKBoost (Performance-Based Knowledge Booster) handles shifting data distributions in fraud detection at a 0.2% fraud rate: it degrades less than 2% under drift, while XGBoost drops 31.8% and LightGBM 42.5%. With no drift applied, PKBoost outperforms XGBoost by 10-18% on standard datasets. It combines Shannon-entropy information gain with Newton-Raphson optimization to detect shifts in rare events and trigger an adaptive "metamorphosis" for real-time recovery.
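As an illustration of the idea (not PKBoost's actual internals), a split criterion combining the usual Newton gain with a Shannon-entropy information-gain bonus might look like the sketch below; all function names, signatures, and the `mi_weight` knob are assumptions for exposition:

```rust
/// Shannon entropy (in bits) of a binary label distribution.
fn shannon_entropy(pos: f64, total: f64) -> f64 {
    if total == 0.0 { return 0.0; }
    let p = pos / total;
    let mut h = 0.0;
    for q in [p, 1.0 - p] {
        if q > 0.0 { h -= q * q.log2(); }
    }
    h
}

/// Newton-style gain for one node: G^2 / (H + lambda).
fn newton_gain(grad_sum: f64, hess_sum: f64, lambda: f64) -> f64 {
    grad_sum * grad_sum / (hess_sum + lambda)
}

/// Combined split score: standard gradient/hessian gain plus a weighted
/// information-gain term that rewards splits isolating the rare class.
/// Each side is (grad_sum, hess_sum, positive_count, sample_count).
fn split_score(
    (gl, hl, pos_l, n_l): (f64, f64, f64, f64),
    (gr, hr, pos_r, n_r): (f64, f64, f64, f64),
    lambda: f64,
    mi_weight: f64,
) -> f64 {
    let parent_gain = newton_gain(gl + gr, hl + hr, lambda);
    let newton = newton_gain(gl, hl, lambda) + newton_gain(gr, hr, lambda) - parent_gain;
    let n = n_l + n_r;
    let parent_h = shannon_entropy(pos_l + pos_r, n);
    let child_h = (n_l / n) * shannon_entropy(pos_l, n_l)
        + (n_r / n) * shannon_entropy(pos_r, n_r);
    newton + mi_weight * (parent_h - child_h) // entropy reduction as a bonus
}
```

A split that cleanly isolates the minority class scores higher than a mixed split even when their gradient statistics are similar, which is what lets entropy-aware splitting stay sensitive to rare positives.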

"Most boosting libraries ignore concept drift. PKBoost detects it and evolves to keep performing."

Perfect for: multi-class fraud detection, real-time medical diagnosis, anomaly detection in changing environments, or any scenario where data evolves over time and minority classes are critical.

What's New in v2.0

  • Multi-Class Classification: One-vs-Rest with softmax (92.36% on Dry Bean, 7 classes)
  • 165x Faster Adaptation: Hierarchical Adaptive Boosting (HAB) with selective retraining
  • 2-17x Better Drift Resilience: vs XGBoost/LightGBM on real-world data
  • 45 Production Features: Complete feature list in FEATURES.md
  • Real-World Validation: Tested on Credit Card, Dry Bean, Iris datasets

See CHANGELOG_V2.md for full details.


Documentation


Quick Start

To use PKBoost from Python, refer to the Python Bindings Guide.

For the Python API reference, see the Python API README.

Or try it live: Kaggle Notebook

Clone the repository and build:

git clone https://github.com/Pushp-Kharat1/pkboost.git
cd pkboost
cargo build --release

Run the benchmark:

  1. Check the included sample data (already in data/):

ls data/  # Should show creditcard_train.csv, creditcard_val.csv, etc.

  2. Run the benchmark:

cargo run --release --bin benchmark

Basic Usage

To train and predict (see src/bin/benchmark.rs for a full example):

use pkboost::*;
use csv;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Load CSV with headers: feature1,feature2,...,Class
    let (x_train, y_train) = load_csv("train.csv")?;
    let (x_val, y_val) = load_csv("val.csv")?;
    let (x_test, y_test) = load_csv("test.csv")?;

    // Auto-configure based on data characteristics
    let mut model = OptimizedPKBoostShannon::auto(&x_train, &y_train);

    // Train with early stopping on validation set
    model.fit(
        &x_train,
        &y_train,
        Some((&x_val, &y_val)),  // Optional validation
        true  // Verbose output
    )?;

    // Predict probabilities (not classes)
    let test_probs = model.predict_proba(&x_test)?;

    // Evaluate
    let pr_auc = calculate_pr_auc(&y_test, &test_probs);
    println!("PR-AUC: {:.4}", pr_auc);

    Ok(())
}

// Helper function (put in your code)
fn load_csv(path: &str) -> Result<(Vec<Vec<f64>>, Vec<f64>), Box<dyn Error>> {
    let mut reader = csv::Reader::from_path(path)?;
    let headers = reader.headers()?.clone();
    let target_col_index = headers.iter().position(|h| h == "Class")
        .ok_or("Class column not found")?;

    let mut features = Vec::new();
    let mut labels = Vec::new();

    for result in reader.records() {
        let record = result?;
        let mut row: Vec<f64> = Vec::new();
        for (i, value) in record.iter().enumerate() {
            if i == target_col_index {
                labels.push(value.parse()?);
            } else {
                let parsed_value = if value.is_empty() {
                    f64::NAN
                } else {
                    value.parse()?
                };
                row.push(parsed_value);
            }
        }
        features.push(row);
    }

    Ok((features, labels))
}

Expected CSV format:

  • Header row required
  • Target column named "Class" with binary values (0.0 or 1.0) for classification
  • For regression, target column can have any continuous values
  • All other columns treated as numerical features
  • Empty values treated as NaN (median-imputed)
  • No categorical support (encode them first)
  • For data-loading examples, see the src/bin/*.rs files (e.g. benchmark.rs); CSV is supported via the csv crate
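The median imputation of NaN values mentioned above can be sketched as follows (an illustrative helper under assumed behavior, not PKBoost's API; it uses the upper median for even counts):

```rust
/// Replace NaNs in each feature column with that column's median,
/// mirroring the README's note that empty CSV cells become NaN and
/// are median-imputed. Illustrative only.
fn median_impute(features: &mut Vec<Vec<f64>>) {
    if features.is_empty() { return; }
    let n_cols = features[0].len();
    for col in 0..n_cols {
        // Collect the non-NaN values of this column and find their median.
        let mut vals: Vec<f64> = features.iter()
            .map(|row| row[col])
            .filter(|v| !v.is_nan())
            .collect();
        if vals.is_empty() { continue; }
        vals.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let median = vals[vals.len() / 2]; // upper median for even counts
        for row in features.iter_mut() {
            if row[col].is_nan() { row[col] = median; }
        }
    }
}
```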

Regression usage:

use pkboost::*;

let mut model = PKBoostRegressor::auto(&x_train, &y_train);
model.fit(&x_train, &y_train, Some((&x_val, &y_val)), true)?;
let predictions = model.predict(&x_test)?;

let rmse = calculate_rmse(&y_test, &predictions);
let r2 = calculate_r2(&y_test, &predictions);
println!("RMSE: {:.4}, R²: {:.4}", rmse, r2);
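For reference, the metrics used in the regression snippet boil down to the standard formulas below. This is a hedged reimplementation to show what is being computed, not pkboost's own calculate_rmse / calculate_r2 source:

```rust
/// Root mean squared error: sqrt(mean((y - y_hat)^2)).
fn rmse(y_true: &[f64], y_pred: &[f64]) -> f64 {
    let mse: f64 = y_true.iter().zip(y_pred)
        .map(|(t, p)| (t - p).powi(2))
        .sum::<f64>() / y_true.len() as f64;
    mse.sqrt()
}

/// Coefficient of determination: 1 - SS_res / SS_tot.
fn r2(y_true: &[f64], y_pred: &[f64]) -> f64 {
    let mean = y_true.iter().sum::<f64>() / y_true.len() as f64;
    let ss_res: f64 = y_true.iter().zip(y_pred).map(|(t, p)| (t - p).powi(2)).sum();
    let ss_tot: f64 = y_true.iter().map(|t| (t - mean).powi(2)).sum();
    1.0 - ss_res / ss_tot
}
```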

Multi-class usage:

use pkboost::MultiClassPKBoost;

// y_train contains class labels: 0.0, 1.0, 2.0, ...
let mut model = MultiClassPKBoost::new(3);  // 3 classes
model.fit(&x_train, &y_train, None, true)?;

let probs = model.predict_proba(&x_test)?;  // [n_samples, n_classes]
let predictions = model.predict(&x_test)?;  // class indices

let accuracy = predictions.iter().zip(y_test.iter())
    .filter(|(&pred, &true_y)| pred == true_y as usize)
    .count() as f64 / y_test.len() as f64;
println!("Accuracy: {:.2}%", accuracy * 100.0);
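The One-vs-Rest + softmax scheme from the v2.0 notes can be sketched as follows: each class's binary booster emits a raw score, and softmax turns the K scores into a probability distribution. This is an illustration of the scheme, not PKBoost's internal code:

```rust
/// Numerically stable softmax over per-class raw scores.
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// One-vs-Rest prediction: softmax the per-class scores, return the
/// argmax class index together with the full probability vector.
fn predict_ovr(per_class_scores: &[f64]) -> (usize, Vec<f64>) {
    let probs = softmax(per_class_scores);
    let best = probs.iter().enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap();
    (best, probs)
}
```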

Key Features

  • Extreme Imbalance Handling: Automatic class weighting and MI regularization boost recall on rare positives without reducing precision (binary classification only).

  • Adaptive Hyperparameters: auto_tune_principled profiles your dataset and picks suitable parameters, so no manual tuning is needed.

  • Histogram-Based Trees: Optimized binning with medians for missing values; supports up to 32 bins per feature for fast splits.

  • Parallelism & Efficiency: Rayon-based adaptive parallelism detects hardware and scales thresholds dynamically. Efficient batching is used for large datasets.

  • Adaptation Mechanisms: AdversarialLivingBooster monitors vulnerability scores to detect drift and trigger retraining, and prunes unused features via "metabolism" tracking.

  • Metrics Built-In: PR-AUC, ROC-AUC, F1@0.5, and threshold optimization are available out-of-the-box.

  • For full mathematical derivations, refer to Math.pdf
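The built-in threshold optimization amounts to sweeping candidate thresholds over predicted probabilities and keeping the F1-maximizing one. A hedged sketch of that idea (function names are illustrative, not PKBoost's API):

```rust
/// F1 score of the binary predictions obtained by thresholding `probs` at `t`.
fn f1_at(y_true: &[f64], probs: &[f64], t: f64) -> f64 {
    let (mut tp, mut fp, mut fn_) = (0.0, 0.0, 0.0);
    for (&y, &p) in y_true.iter().zip(probs) {
        match (p >= t, y >= 0.5) {
            (true, true) => tp += 1.0,   // true positive
            (true, false) => fp += 1.0,  // false positive
            (false, true) => fn_ += 1.0, // false negative
            _ => {}
        }
    }
    if tp == 0.0 { return 0.0; }
    2.0 * tp / (2.0 * tp + fp + fn_)
}

/// Sweep thresholds 0.01..0.99 and return (best_threshold, best_f1).
fn best_threshold(y_true: &[f64], probs: &[f64]) -> (f64, f64) {
    (1..100)
        .map(|i| i as f64 / 100.0)
        .map(|t| (t, f1_at(y_true, probs, t)))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .unwrap()
}
```

On heavily imbalanced data the F1-optimal threshold is usually far from 0.5, which is why exposing this sweep matters for fraud-style workloads.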


Benchmarks

Testing methodology: All models use default settings with no hyperparameter tuning. This reflects real-world usage where most practitioners cannot dedicate time to extensive tuning.

PKBoost's auto-tuning provides an edge: it automatically detects imbalance and adjusts parameters. LightGBM and XGBoost can match these results with tuning, but that requires expert knowledge.
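The imbalance detection behind that auto-tuning can be illustrated with a minimal sketch: measure the positive-class rate and derive a class weight from it. This is assumed behavior for exposition; the actual auto_tune_principled profiles far more than this one statistic:

```rust
/// Ratio of negatives to positives, usable as a positive-class weight.
/// Returns 1.0 when there are no positives to avoid dividing by zero.
fn pos_weight(y: &[f64]) -> f64 {
    let pos = y.iter().filter(|&&v| v >= 0.5).count() as f64;
    let neg = y.len() as f64 - pos;
    if pos == 0.0 { 1.0 } else { neg / pos }
}
```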

Reproducibility: All benchmark code is in src/bin/benchmark.rs. Data splits: 60% train, 20% val, 20% test. LGBM/XGB used default params from their Rust crates. Full benchmarks (10+ datasets): See BENCHMARKS.md.
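The 60/20/20 split used in the benchmarks can be sketched as a simple index partition (an illustrative helper, not PKBoost's benchmark code; in practice shuffle or stratify the rows before slicing):

```rust
use std::ops::Range;

/// Partition n row indices into train/val/test ranges of 60%, 20%, 20%.
/// Assumes the rows were already shuffled (or stratified) upstream.
fn split_indices(n: usize) -> (Range<usize>, Range<usize>, Range<usize>) {
    let train_end = n * 60 / 100;
    let val_end = n * 80 / 100;
    (0..train_end, train_end..val_end, val_end..n)
}
```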

Standard Datasets

| Dataset | Samples | Imbalance | Model | PR-AUC | F1-AUC | ROC-AUC |
|------------------|---------|------------------|------------|--------|--------|---------|
| Credit Card | 170,884 | 0.2% (extreme) | PKBoost | 87.8% | 87.4% | 97.5% |
| | | | LightGBM | 79.3% | 71.3% | 92.1% |
| | | | XGBoost | 74.5% | 79.8% | 91.7% |
| Improvements | | | vs LGBM | +10.4% | +22.7% | +5.7% |
| | | | vs XGBoost | +17.9% | +9.7% | +6.1% |
| Pima Diabetes | 460 | 35.0% (balanced) | PKBoost | 98.0% | 93.7% | 98.6% |
| | | | LightGBM | 62.9% | 48.8% | 82.4% |
| | | | XGBoost | 68.0% | 60.0% | 82.0% |
| Improvements | | | vs LGBM | +55.7% | +92.0% | +19.6% |
| | | | vs XGBoost | +44.0% | +56.1% | +20.1% |
| Breast Cancer | 341 | 37.2% (balanced) | PKBoost | 97.9% | 93.2% | 98.6% |
| | | | LightGBM | 99.1% | 96.3% | 99.2% |
| | | | XGBoost | 99.2% | 95.1% | 99.4% |
