# RustyStats 🦀📊

High-performance Generalized Linear Models with a Rust backend and Python API.

Codebase documentation: pricingfrontier.github.io/rustystats/
## Features
- Dict-First API - Programmatic model building ideal for automated workflows and agents
- Fast - Parallel Rust backend for high-throughput fitting
- Memory Efficient - Low memory footprint at scale
- Stable - Step-halving IRLS, warm starts for robust convergence
- Splines - B-splines and natural splines with auto-tuned smoothing and monotonicity
- Target Encoding - Ordered target encoding for high-cardinality categoricals
- Regularisation - Ridge, Lasso, and Elastic Net via coordinate descent
- Lasso Credibility - Shrink toward a prior model instead of zero (CAS Monograph 13)
- Validation - Design matrix checks with fix suggestions before fitting
- Complete - 8 families, robust SEs, full diagnostics, VIF, partial dependence
- Minimal - Only `numpy` and `polars` required
## Installation

```bash
uv add rustystats
```
## Quick Start

```python
import rustystats as rs
import polars as pl

# Load data
data = pl.read_parquet("insurance.parquet")

# Fit a Poisson GLM for claim frequency
result = rs.glm_dict(
    response="ClaimCount",
    terms={
        "VehAge": {"type": "linear"},
        "VehPower": {"type": "linear"},
        "Area": {"type": "categorical"},
        "Region": {"type": "categorical"},
    },
    data=data,
    family="poisson",
    offset="Exposure",
).fit()

# View results
print(result.summary())
```
## Families & Links
| Family | Default Link | Use Case |
|--------|--------------|----------|
| gaussian | identity | Linear regression |
| poisson | log | Claim frequency |
| binomial | logit | Binary outcomes |
| gamma | log | Claim severity |
| tweedie | log | Pure premium (var_power=1.5) |
| quasipoisson | log | Overdispersed counts |
| quasibinomial | logit | Overdispersed binary |
| negbinomial | log | Overdispersed counts (proper distribution) |
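For intuition on why Tweedie is the pure-premium choice: its variance function Var(Y) = φ·μ^p interpolates between the Poisson relationship (p=1, variance proportional to the mean) and the gamma relationship (p=2, variance proportional to the mean squared). A minimal illustrative sketch — this helper is not part of the rustystats API:

```python
# Illustrative sketch: the Tweedie variance function Var(Y) = phi * mu**p.
# p=1 recovers the Poisson mean-variance relationship, p=2 the gamma one;
# 1 < p < 2 gives the compound Poisson-gamma used for pure premium.
def tweedie_variance(mu: float, p: float, phi: float = 1.0) -> float:
    return phi * mu ** p

mu = 4.0
poisson_like = tweedie_variance(mu, p=1.0)  # 4.0
gamma_like = tweedie_variance(mu, p=2.0)    # 16.0
compound = tweedie_variance(mu, p=1.5)      # 8.0, between the two
```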
## Dict-Based API

An API built for programmatic model building.
```python
result = rs.glm_dict(
    response="ClaimCount",
    terms={
        "VehAge": {"type": "bs", "monotonicity": "increasing"},  # Monotonic (auto-tuned)
        "DrivAge": {"type": "bs"},                               # Penalized smooth (default)
        "Income": {"type": "bs", "df": 5},                       # Fixed 5 df
        "BonusMalus": {"type": "linear", "monotonicity": "increasing"},  # Constrained coefficient
        "Region": {"type": "categorical"},
        "Brand": {"type": "target_encoding"},
        "Age2": {"type": "expression", "expr": "DrivAge**2"},
    },
    interactions=[
        {
            "VehAge": {"type": "linear"},
            "Region": {"type": "categorical"},
            "include_main": True,
        },
    ],
    data=data,
    family="poisson",
    offset="Exposure",
    seed=42,
).fit(regularization="elastic_net")
```
### Term Types

| Type | Parameters | Description |
|------|------------|-------------|
| `linear` | `monotonicity` (optional) | Raw continuous variable |
| `categorical` | `levels` (optional) | Dummy encoding |
| `bs` | `df` or `k`, `knots`, `boundary_knots`, `degree=3`, `monotonicity` | B-spline (default: penalized smooth, `k=10`) |
| `ns` | `df` or `k`, `knots`, `boundary_knots` | Natural spline (default: penalized smooth, `k=10`) |
| `target_encoding` | `prior_weight=1` | Regularized target encoding |
| `expression` | `expr`, `monotonicity` (optional) | Arbitrary expression (like `I()`) |
### Interactions

Each interaction is a dict of variable specs. Set `include_main` to also add the main effects.
```python
interactions=[
    # Standard interaction: product terms (main effects + interaction)
    {
        "DrivAge": {"type": "bs", "df": 5},
        "Brand": {"type": "target_encoding"},
        "include_main": True,
    },
    # Categorical × continuous (interaction only)
    {
        "VehAge": {"type": "linear"},
        "Region": {"type": "categorical"},
        "include_main": False,
    },
    # TE interaction: combined target encoding TE(Brand:Region)
    {
        "Brand": {"type": "categorical"},
        "Region": {"type": "categorical"},
        "target_encoding": True,
        "prior_weight": 1.0,  # optional
    },
    # FE interaction: combined frequency encoding FE(Brand:Region)
    {
        "Brand": {"type": "categorical"},
        "Region": {"type": "categorical"},
        "frequency_encoding": True,
    },
]
```
| Flag | Effect |
|------|--------|
| (none) | Standard product terms (cat×cat, cat×cont, etc.) |
| `target_encoding: True` | Combined TE encoding: `TE(var1:var2)` |
| `frequency_encoding: True` | Combined FE encoding: `FE(var1:var2)` |
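For intuition, a standard categorical×continuous interaction expands into one product column per non-reference level of the categorical. A hand-rolled sketch in plain Python — illustrative only, not the rustystats internals:

```python
# Sketch: expand a cat x cont interaction into product columns,
# one per non-reference level of the categorical variable.
veh_age = [1.0, 3.0, 2.0, 5.0]
region = ["A", "B", "B", "C"]

levels = sorted(set(region))[1:]  # drop the first level as reference -> ["B", "C"]
columns = {
    f"VehAge:Region[{lvl}]": [x if r == lvl else 0.0 for x, r in zip(veh_age, region)]
    for lvl in levels
}
# columns["VehAge:Region[B]"] == [0.0, 3.0, 2.0, 0.0]
# columns["VehAge:Region[C]"] == [0.0, 0.0, 0.0, 5.0]
```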
## Splines

```python
# Default: penalized smooth with automatic tuning via GCV
result = rs.glm_dict(
    response="ClaimNb",
    terms={
        "Age": {"type": "bs"},       # B-spline (auto-tuned)
        "VehPower": {"type": "ns"},  # Natural spline (auto-tuned)
        "Region": {"type": "categorical"},
    },
    data=data, family="poisson", offset="Exposure",
).fit()

# Fixed degrees of freedom (no penalty)
result = rs.glm_dict(
    response="ClaimNb",
    terms={
        "Age": {"type": "bs", "df": 5},       # Fixed 5 df
        "VehPower": {"type": "ns", "df": 4},  # Fixed 4 df
        "Region": {"type": "categorical"},
    },
    data=data, family="poisson", offset="Exposure",
).fit()
```
Spline parameters:

- No parameters → penalized smooth with automatic tuning (`k=10`)
- `df=5` → fixed 5 degrees of freedom
- `k=15` → penalized smooth with 15 basis functions
- `knots=[2.0, 5.0, 8.0]` → explicit interior knot positions (mutually exclusive with `df`/`k`)
- `boundary_knots=(0.0, 10.0)` → custom boundary knots (optional, defaults to data range)
- `monotonicity="increasing"` or `"decreasing"` → constrained effect (`bs` only)
When to use each type:

- B-splines (`bs`): standard choice, more flexible at boundaries, supports monotonicity
- Natural splines (`ns`): better extrapolation, linear beyond the boundaries
### Monotonic Splines

Constrain the fitted curve to be monotonically increasing or decreasing. Essential when business logic dictates a monotonic relationship.

```python
# Monotonically increasing effect (e.g., age → risk)
result = rs.glm_dict(
    response="ClaimNb",
    terms={
        "Age": {"type": "bs", "monotonicity": "increasing"},
        "Region": {"type": "categorical"},
    },
    data=data, family="poisson", offset="Exposure",
).fit()

# Monotonically decreasing effect (e.g., vehicle value with age)
result = rs.glm_dict(
    response="ClaimAmt",
    terms={"VehAge": {"type": "bs", "df": 4, "monotonicity": "decreasing"}},
    data=data, family="gamma",
).fit()
```
## Coefficient Constraints

Constrain coefficient signs by setting `monotonicity` on `linear` and `expression` terms.

```python
result = rs.glm_dict(
    response="y",
    terms={
        "age": {"type": "linear", "monotonicity": "increasing"},  # β ≥ 0
        "age2": {"type": "expression", "expr": "age ** 2", "monotonicity": "decreasing"},  # β ≤ 0
        "income": {"type": "linear"},
    },
    data=data, family="poisson",
).fit()
```
| Constraint | Term Spec | Effect |
|------------|-----------|--------|
| β ≥ 0 | `"monotonicity": "increasing"` | Positive effect |
| β ≤ 0 | `"monotonicity": "decreasing"` | Negative effect |
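One common way to enforce such a sign constraint (not necessarily what rustystats does internally) is to project the coefficient back onto the feasible set after each optimizer step. A minimal least-squares sketch:

```python
# Sketch: projected gradient descent for least squares with beta >= 0.
# After each gradient step, clip the coefficient to the feasible region.
x = [1.0, 2.0, 3.0, 4.0]
y = [-1.0, -2.5, -2.9, -4.2]  # negatively related to x

beta, lr = 0.0, 0.01
for _ in range(200):
    grad = sum(2 * (beta * xi - yi) * xi for xi, yi in zip(x, y)) / len(x)
    beta = max(0.0, beta - lr * grad)  # projection: enforce beta >= 0

# Unconstrained least squares would give beta < 0 for this data;
# the projection pins the constrained fit at the boundary, beta == 0.0.
```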
## Target Encoding

Ordered target encoding for high-cardinality categoricals.

```python
# Dict API
result = rs.glm_dict(
    response="ClaimNb",
    terms={
        "Brand": {"type": "target_encoding"},
        "Model": {"type": "target_encoding", "prior_weight": 2.0},
        "Age": {"type": "linear"},
        "Region": {"type": "categorical"},
    },
    data=data, family="poisson", offset="Exposure",
).fit()

# Sklearn-style API
encoder = rs.TargetEncoder(prior_weight=1.0, n_permutations=4)
train_encoded = encoder.fit_transform(train_categories, train_target)
test_encoded = encoder.transform(test_categories)
```
Key benefits:

- No target leakage: ordered target statistics
- Regularization: prior weight controls shrinkage toward the global mean
- High cardinality: a single column instead of thousands of dummies
- Exposure-aware: for frequency models with `offset="Exposure"`, automatically uses the claim rate (ClaimCount/Exposure) instead of raw counts
- Interactions: use `target_encoding: True` in interactions to encode variable combinations
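The shrinkage behind `prior_weight` is the usual smoothed category mean. A minimal sketch of the formula — the real encoder additionally uses ordered statistics to avoid leakage, so this helper is illustrative, not the rustystats implementation:

```python
# Sketch: prior_weight shrinks each category's mean toward the global mean.
# encoded = (sum_of_target + prior_weight * global_mean) / (count + prior_weight)
def encode_category(targets: list, global_mean: float, prior_weight: float) -> float:
    return (sum(targets) + prior_weight * global_mean) / (len(targets) + prior_weight)

global_mean = 0.10
rare = encode_category([1.0], global_mean, prior_weight=1.0)         # 0.55: pulled hard toward the prior
common = encode_category([1.0] * 50, global_mean, prior_weight=1.0)  # ~0.98: the data dominates
```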
## Expression Terms

```python
result = rs.glm_dict(
    response="y",
    terms={
        "age": {"type": "linear"},
        "age2": {"type": "expression", "expr": "age ** 2"},
        "age3": {"type": "expression", "expr": "age ** 3"},
        "income_k": {"type": "expression", "expr": "income / 1000"},
        "bmi": {"type": "expression", "expr": "weight / (height ** 2)"},
    },
    data=data, family="gaussian",
).fit()
```

Supported operations: `+`, `-`, `*`, `/`, `**` (power)
## Regularization

### CV-Based Regularization

```python
# Just specify the regularization type - cv=5 is automatic
result = rs.glm_dict(
    response="y",
    terms={"x1": {"type": "linear"}, "x2": {"type": "linear"}, "cat": {"type": "categorical"}},
    data=data,
    family="poisson",
).fit(regularization="ridge")  # "ridge", "lasso", or "elastic_net"

print(f"Selected alpha: {result.alpha}")
print(f"CV deviance: {result.cv_deviance}")
```
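For context on the coordinate descent mentioned under Features: each lasso coordinate update applies the standard soft-thresholding operator S(z, λ) = sign(z)·max(|z| − λ, 0), which is what produces exact zeros. A sketch of the operator itself — illustrative, not the rustystats internals:

```python
# Sketch: soft-thresholding, the elementwise operator behind lasso
# coordinate descent. Small values snap exactly to zero; larger ones
# are shrunk toward zero by lam.
def soft_threshold(z: float, lam: float) -> float:
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

soft_threshold(3.0, 1.0)   # 2.0
soft_threshold(-3.0, 1.0)  # -2.0
soft_threshold(0.5, 1.0)   # 0.0  <- exact sparsity
```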