SkillAgentSearch skills...

Rustystats

No description available

Install / Use

/learn @PricingFrontier/Rustystats
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

RustyStats 🦀📊

CI PyPI Rust License: AGPL-3.0

High-performance Generalized Linear Models with a Rust backend and Python API

Codebase Documentation: pricingfrontier.github.io/rustystats/

Features

  • Dict-First API - Programmatic model building ideal for automated workflows and agents
  • Fast - Parallel Rust backend for high-throughput fitting
  • Memory Efficient - Low memory footprint at scale
  • Stable - Step-halving IRLS, warm starts for robust convergence
  • Splines - B-splines and natural splines with auto-tuned smoothing and monotonicity
  • Target Encoding - Ordered target encoding for high-cardinality categoricals
  • Regularisation - Ridge, Lasso, and Elastic Net via coordinate descent
  • Lasso Credibility - Shrink toward a prior model instead of zero (CAS Monograph 13)
  • Validation - Design matrix checks with fix suggestions before fitting
  • Complete - 8 families, robust SEs, full diagnostics, VIF, partial dependence
  • Minimal - Only numpy and polars required

Installation

uv add rustystats

Quick Start

import rustystats as rs
import polars as pl

# Load data
data = pl.read_parquet("insurance.parquet")

# Fit a Poisson GLM for claim frequency
result = rs.glm_dict(
    response="ClaimCount",
    terms={
        "VehAge": {"type": "linear"},
        "VehPower": {"type": "linear"},
        "Area": {"type": "categorical"},
        "Region": {"type": "categorical"},
    },
    data=data,
    family="poisson",
    offset="Exposure",
).fit()

# View results
print(result.summary())

Families & Links

| Family | Default Link | Use Case | |--------|--------------|----------| | gaussian | identity | Linear regression | | poisson | log | Claim frequency | | binomial | logit | Binary outcomes | | gamma | log | Claim severity | | tweedie | log | Pure premium (var_power=1.5) | | quasipoisson | log | Overdispersed counts | | quasibinomial | logit | Overdispersed binary | | negbinomial | log | Overdispersed counts (proper distribution) |


Dict-Based API

API built for programmatic model building.

result = rs.glm_dict(
    response="ClaimCount",
    terms={
        "VehAge": {"type": "bs", "monotonicity": "increasing"},  # Monotonic (auto-tuned)
        "DrivAge": {"type": "bs"},                               # Penalized smooth (default)
        "Income": {"type": "bs", "df": 5},                       # Fixed 5 df
        "BonusMalus": {"type": "linear", "monotonicity": "increasing"},  # Constrained coefficient
        "Region": {"type": "categorical"},
        "Brand": {"type": "target_encoding"},
        "Age2": {"type": "expression", "expr": "DrivAge**2"},
    },
    interactions=[
        {
            "VehAge": {"type": "linear"}, 
            "Region": {"type": "categorical"}, 
            "include_main": True
        },
    ],
    data=data,
    family="poisson",
    offset="Exposure",
    seed=42,
).fit(regularization="elastic_net")

Term Types

| Type | Parameters | Description | |------|------------|-------------| | linear | monotonicity (optional) | Raw continuous variable | | categorical | levels (optional) | Dummy encoding | | bs | df or k, knots, boundary_knots, degree=3, monotonicity | B-spline (default: penalized smooth, k=10) | | ns | df or k, knots, boundary_knots | Natural spline (default: penalized smooth, k=10) | | target_encoding | prior_weight=1 | Regularized target encoding | | expression | expr, monotonicity (optional) | Arbitrary expression (like I()) |

Interactions

Each interaction is a dict with variable specs. Use include_main to also add main effects.

interactions=[
    # Standard interaction: product terms (main effects + interaction)
    {
        "DrivAge": {"type": "bs", "df": 5}, 
        "Brand": {"type": "target_encoding"},
        "include_main": True
    },
    # Categorical × continuous (interaction only)
    {
        "VehAge": {"type": "linear"}, 
        "Region": {"type": "categorical"}, 
        "include_main": False
    },
    # TE interaction: combined target encoding TE(Brand:Region)
    {
        "Brand": {"type": "categorical"},
        "Region": {"type": "categorical"},
        "target_encoding": True,
        "prior_weight": 1.0,  # optional
    },
    # FE interaction: combined frequency encoding FE(Brand:Region)
    {
        "Brand": {"type": "categorical"},
        "Region": {"type": "categorical"},
        "frequency_encoding": True,
    },
]

| Flag | Effect | |------|--------| | (none) | Standard product terms (cat×cat, cat×cont, etc.) | | target_encoding: True | Combined TE encoding: TE(var1:var2) | | frequency_encoding: True | Combined FE encoding: FE(var1:var2) |


Splines

# Default: penalized smooth with automatic tuning via GCV
result = rs.glm_dict(
    response="ClaimNb",
    terms={
        "Age": {"type": "bs"},           # B-spline (auto-tuned)
        "VehPower": {"type": "ns"},      # Natural spline (auto-tuned)
        "Region": {"type": "categorical"},
    },
    data=data, family="poisson", offset="Exposure",
).fit()

# Fixed degrees of freedom (no penalty)
result = rs.glm_dict(
    response="ClaimNb",
    terms={
        "Age": {"type": "bs", "df": 5},       # Fixed 5 df
        "VehPower": {"type": "ns", "df": 4},  # Fixed 4 df
        "Region": {"type": "categorical"},
    },
    data=data, family="poisson", offset="Exposure",
).fit()

Spline parameters:

  • No parameters → penalized smooth with automatic tuning (k=10)
  • df=5 → fixed 5 degrees of freedom
  • k=15 → penalized smooth with 15 basis functions
  • knots=[2.0, 5.0, 8.0] → explicit interior knot positions (mutually exclusive with df/k)
  • boundary_knots=(0.0, 10.0) → custom boundary knots (optional, defaults to data range)
  • monotonicity="increasing" or "decreasing" → constrained effect (bs only)

When to use each type:

  • B-splines (bs): Standard choice, more flexible at boundaries, supports monotonicity
  • Natural splines (ns): Better extrapolation, linear beyond boundaries

Monotonic Splines

Constrain the fitted curve to be monotonically increasing or decreasing. Essential when business logic dictates a monotonic relationship.

# Monotonically increasing effect (e.g., age → risk)
result = rs.glm_dict(
    response="ClaimNb",
    terms={
        "Age": {"type": "bs", "monotonicity": "increasing"},
        "Region": {"type": "categorical"},
    },
    data=data, family="poisson", offset="Exposure",
).fit()

# Monotonically decreasing effect (e.g., vehicle value with age)
result = rs.glm_dict(
    response="ClaimAmt",
    terms={"VehAge": {"type": "bs", "df": 4, "monotonicity": "decreasing"}},
    data=data, family="gamma",
).fit()

Coefficient Constraints

Constrain coefficient signs using monotonicity on linear and expression terms.

result = rs.glm_dict(
    response="y",
    terms={
        "age": {"type": "linear", "monotonicity": "increasing"},  # β ≥ 0
        "age2": {"type": "expression", "expr": "age ** 2", "monotonicity": "decreasing"},  # β ≤ 0
        "income": {"type": "linear"},
    },
    data=data, family="poisson",
).fit()

| Constraint | Term Spec | Effect | |------------|-----------|--------| | β ≥ 0 | "monotonicity": "increasing" | Positive effect | | β ≤ 0 | "monotonicity": "decreasing" | Negative effect |


Target Encoding

Ordered target encoding for high-cardinality categoricals.

# Dict API
result = rs.glm_dict(
    response="ClaimNb",
    terms={
        "Brand": {"type": "target_encoding"},
        "Model": {"type": "target_encoding", "prior_weight": 2.0},
        "Age": {"type": "linear"},
        "Region": {"type": "categorical"},
    },
    data=data, family="poisson", offset="Exposure",
).fit()

# Sklearn-style API
encoder = rs.TargetEncoder(prior_weight=1.0, n_permutations=4)
train_encoded = encoder.fit_transform(train_categories, train_target)
test_encoded = encoder.transform(test_categories)

Key benefits:

  • No target leakage: Ordered target statistics
  • Regularization: Prior weight controls shrinkage toward global mean
  • High-cardinality: Single column instead of thousands of dummies
  • Exposure-aware: For frequency models with offset="Exposure", automatically uses claim rate (ClaimCount/Exposure) instead of raw counts
  • Interactions: Use target_encoding: True in interactions to encode variable combinations

Expression Terms

result = rs.glm_dict(
    response="y",
    terms={
        "age": {"type": "linear"},
        "age2": {"type": "expression", "expr": "age ** 2"},
        "age3": {"type": "expression", "expr": "age ** 3"},
        "income_k": {"type": "expression", "expr": "income / 1000"},
        "bmi": {"type": "expression", "expr": "weight / (height ** 2)"},
    },
    data=data, family="gaussian",
).fit()

Supported operations: +, -, *, /, ** (power)


Regularization

CV-Based Regularization

# Just specify regularization type - cv=5 is automatic
result = rs.glm_dict(
    response="y",
    terms={"x1": {"type": "linear"}, "x2": {"type": "linear"}, "cat": {"type": "categorical"}},
    data=data,
    family="poisson",
).fit(regularization="ridge")  # "ridge", "lasso", or "elastic_net"

print(f"Selected alpha: {result.alpha}")
print(f"CV deviance: {result.cv_deviance}")

**Options:

Related Skills

View on GitHub
GitHub Stars6
CategoryDevelopment
Updated9d ago
Forks1

Languages

HTML

Security Score

70/100

Audited on Mar 22, 2026

No findings