Smallml

SmallML is a three-layer Bayesian framework that enables small and medium-sized enterprises (SMEs) to build production-grade machine learning models despite having limited customer data.


SmallML: Bayesian Transfer Learning for Small Data

Build production-grade machine learning models with just 50-200 observations per business entity.

SmallML combines transfer learning, hierarchical Bayesian inference, and conformal prediction to enable SMEs to achieve reliable predictive analytics despite limited data.

🎯 Key Features

Works with tiny datasets: 50-200 observations per entity, 3-10 entities total
Transfer learning: Extracts knowledge from 100K+ public observations (pre-trained priors included)
Hierarchical pooling: Shares statistical strength across multiple business entities
Uncertainty guarantees: Bayesian credible intervals + distribution-free prediction sets
Production-ready: <30 minutes training, <100ms inference, automatic convergence validation
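
To make the hierarchical pooling idea concrete, here is a minimal sketch. It is not SmallML's model, which fits a hierarchical Bayesian model with PyMC. It only shows the shrinkage effect: each entity's raw churn rate is pulled toward the pooled rate through a Beta prior. The function name `shrunken_rates`, the `prior_strength` parameter, and the counts are assumptions made for illustration.

```python
def shrunken_rates(counts, prior_strength=20.0):
    """Shrink per-entity churn rates toward the pooled rate.

    counts: dict of {entity_name: (n_churned, n_customers)}
    prior_strength: hypothetical pseudo-count; larger values pool more strongly.
    """
    total_churned = sum(k for k, _ in counts.values())
    total_customers = sum(n for _, n in counts.values())
    pooled = total_churned / total_customers

    # Beta(a, b) prior centred on the pooled rate.
    # The posterior mean for an entity is (k + a) / (n + a + b).
    a = prior_strength * pooled
    b = prior_strength * (1.0 - pooled)
    return {name: (k + a) / (n + a + b) for name, (k, n) in counts.items()}


# Illustrative counts only: (churned, customers) per store.
counts = {'store_1': (8, 80), 'store_2': (36, 120), 'store_3': (19, 95)}
rates = shrunken_rates(counts)
for name, (k, n) in counts.items():
    print(f"{name}: raw={k / n:.3f}  pooled-estimate={rates[name]:.3f}")
```

Each pooled estimate lands between the store's raw rate and the overall rate, and smaller stores move further toward the overall rate.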

🚀 Quick Start

Installation

pip install smallml

Development Version

To install the latest development version from GitHub:

pip install git+https://github.com/seemyon/smallml@main

Basic Usage (5 lines of code!)

from smallml import Pipeline
import pandas as pd

# Your data: dict of {entity_name: dataframe}
sme_data = {
    'store_1': pd.read_csv('store_1.csv'),  # 80 customers
    'store_2': pd.read_csv('store_2.csv'),  # 120 customers
    'store_3': pd.read_csv('store_3.csv'),  # 95 customers
    # ... 3-10 stores total
}

# Create and fit pipeline (automatically validates convergence)
pipeline = Pipeline()
pipeline.fit(sme_data, target_col='churned')

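# Make predictions with uncertainty
# new_customers: a DataFrame with the same feature columns as the training data
# (the file name below is a placeholder)
new_customers = pd.read_csv('new_customers.csv')
predictions = pipeline.predict(new_customers, sme_id='store_1')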
print(predictions)
#    prediction  bayesian_std  bayesian_lower_90  bayesian_upper_90  conformal_set  conformal_set_size
# 0       0.23          0.12                0.04               0.42            {0}                   1
# 1       0.78          0.15                0.51               0.95            {1}                   1
# 2       0.51          0.21                0.18               0.84          {0,1}                   2  # Uncertain!

See the full tutorial in examples/quickstart.py.

📚 How It Works

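SmallML is a three-layer framework. Layer 1 (transfer learning) ships as pre-trained priors, so the package itself runs the remaining two layers: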

Layer 2 (Hierarchical Bayesian): Pools information across J entities using PyMC NUTS sampler
- Uses pre-trained priors from 100K+ public observations
- Returns full posterior distributions, not just point estimates
- Automatic convergence validation (R̂ < 1.01, ESS > 400)
Layer 3 (Conformal Prediction): Provides distribution-free uncertainty
- Split-conformal calibration for coverage guarantees
- Returns prediction sets: {0} (certain), {1} (certain), or {0,1} (uncertain)
- Empirical coverage typically 87-93% for 90% target
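
The split-conformal step in Layer 3 can be sketched as follows. This is a minimal illustration, not SmallML's implementation. The README does not specify the nonconformity score, so the sketch assumes a common choice: one minus the probability assigned to the true label. The names `conformal_threshold` and `prediction_set` and the calibration numbers are made up for the example.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.10):
    """Calibrate a score threshold on held-out data.

    cal_probs: predicted P(y=1) for each calibration point
    cal_labels: true labels (0 or 1)
    alpha: target miscoverage (0.10 for 90% coverage)
    """
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = sorted(
        1.0 - (p if y == 1 else 1.0 - p) for p, y in zip(cal_probs, cal_labels)
    )
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: every label is kept
    return scores[rank - 1]


def prediction_set(prob, threshold):
    """Keep every label whose nonconformity score is within the threshold."""
    # Score of label 1 is 1 - prob; score of label 0 is 1 - (1 - prob) = prob.
    label_scores = {0: prob, 1: 1.0 - prob}
    return {label for label, score in label_scores.items() if score <= threshold}


# Toy calibration data (assumed, not from SmallML).
cal_probs = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.6, 0.85, 0.15]
cal_labels = [1, 1, 0, 0, 1, 0, 0, 1, 0]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.10)

for prob in (0.23, 0.78, 0.51):
    print(prob, prediction_set(prob, threshold))
```

A probability near 0.5 keeps both labels, which corresponds to the {0,1} sets described above. Probabilities far from 0.5 yield a single label.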

📊 Performance Expectations

Discrimination: AUC of 0.75-0.85 on churn with 100 customers per entity
Conformal Coverage: 87-93% empirical coverage for a 90% target (prediction sets)
Training Time: 15-30 minutes for J=5 entities with 100 customers each
Inference: <100ms per prediction
Convergence: R̂ < 1.01, ESS > 400 (automatically validated)
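
For intuition about the R̂ threshold, here is a sketch of the classic Gelman-Rubin statistic, which compares between-chain and within-chain variance. It is a simplification. ArviZ, listed among the dependencies, defaults to a rank-normalized split R̂, which this sketch does not reproduce. The function name `gelman_rubin` and the toy chains are assumptions.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic (non-split, non-rank-normalized) R-hat for one parameter.

    chains: list of equal-length lists of posterior draws, one list per chain.
    """
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean(variance(chain) for chain in chains)  # W: mean within-chain variance
    between = n * variance(chain_means)                 # B: between-chain variance
    pooled_var = (n - 1) / n * within + between / n     # estimate of posterior variance
    return (pooled_var / within) ** 0.5


# Two chains exploring the same region vs. two chains stuck in different regions.
mixed = [[0.1, -0.3, 0.2, 0.0, -0.1, 0.3], [0.0, 0.2, -0.2, 0.1, -0.3, 0.2]]
stuck = [[0.1, -0.3, 0.2, 0.0, -0.1, 0.3], [2.1, 1.8, 2.2, 1.9, 2.0, 2.3]]

print(f"mixed chains: R-hat = {gelman_rubin(mixed):.3f}")
print(f"stuck chains: R-hat = {gelman_rubin(stuck):.3f}")
```

Chains that disagree about where the posterior mass lies push the statistic well above 1, which is the situation the R̂ < 1.01 check is meant to catch.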

🧪 Requirements

Data Requirements

Minimum: 3 entities, 30 observations per entity
Recommended: 5+ entities, 50+ observations per entity
Use Case: Binary classification (churn, conversion, etc.)
Features: Numerical + categorical (automatically handled)

Input Format

sme_data = {
    'entity_1': pd.DataFrame({
        'feature_1': [...],      # Numerical or categorical
        'feature_2': [...],
        'feature_3': [...],
        'churned': [0, 1, 0, ...]  # Binary target (0/1)
    }),
    'entity_2': pd.DataFrame({...}),
    # ... 3-10 entities
}
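
Before fitting, it can help to check the data requirements above. The sketch below is a hypothetical helper, not part of the smallml package. It takes each entity's target column as a plain sequence (for example `df['churned'].tolist()`) and uses the stated minimums as defaults.

```python
def check_sme_targets(targets, min_entities=3, min_obs=30):
    """Return a list of problems found; an empty list means the checks passed.

    targets: dict of {entity_name: sequence of 0/1 target values}
    """
    problems = []
    if len(targets) < min_entities:
        problems.append(f"need at least {min_entities} entities, got {len(targets)}")
    for name, values in targets.items():
        if len(values) < min_obs:
            problems.append(f"{name}: {len(values)} observations, need at least {min_obs}")
        if not set(values) <= {0, 1}:
            problems.append(f"{name}: target must be binary (0/1)")
    return problems


# Toy input (assumed): two entities, one of them too small.
targets = {
    'entity_1': [0, 1] * 20,     # 40 observations
    'entity_2': [0, 0, 1] * 5,   # 15 observations: too few
}
for problem in check_sme_targets(targets):
    print(problem)
```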

Python Requirements

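Python: 3.9 or higher is the stated minimum. NumPy ≥2.3 and SciPy ≥1.16 themselves require Python 3.11 or newer, so 3.11 is the practical floor with the dependency versions below.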
Dependencies: PyMC ≥5.0, ArviZ ≥0.22.0, pandas ≥2.3, numpy ≥2.3, scikit-learn ≥1.7, scipy ≥1.16

📖 Documentation

Installation Guide: See above for basic installation
Quickstart Tutorial: examples/quickstart.py
API Reference: Check docstrings in smallml.pipeline.Pipeline
Research Paper: See docs/ for technical details

🔬 Research & Reproducibility

This package is the production-ready version of the SmallML research framework. For research code, paper reproduction, and detailed technical documentation, see:

Research Code: src/ directory
Reproduction Scripts: scripts/ directory
Technical Docs: docs/ directory
Original README: See existing README.md for research details

🎓 Citation

If you use SmallML in your research, please cite:

@software{smallml2025,
  title = {SmallML: Bayesian Transfer Learning for Small-Data Predictive Analytics},
  author = {Leontev, Semen},
  year = {2025},
  url = {https://github.com/seemyon/smallml},
}

🤝 Contributing

Contributions welcome! Please open an issue or pull request.

📝 License

This project is licensed under the MIT License - see LICENSE file.

🔗 Links

GitHub: https://github.com/seemyon/smallml
Issues: https://github.com/seemyon/smallml/issues
Paper: https://arxiv.org/abs/2511.14049

SmallML: Empowering small businesses with reliable ML despite limited data.

⚙️ Advanced Usage

Evaluating Model Performance

# Evaluate on test data
# load_test_data() is a placeholder: replace it with your own loading code
X_test, y_test = load_test_data()
metrics = pipeline.evaluate(X_test, y_test, sme_id='store_1')

print(f"AUC: {metrics['auc']:.3f}")
print(f"Accuracy: {metrics['accuracy']:.3f}")
print(f"F1 Score: {metrics['f1_score']:.3f}")
print(f"Conformal Coverage: {metrics['conformal_coverage']:.3f}")  # Should be ~0.90
print(f"Mean Set Size: {metrics['mean_set_size']:.2f}")  # 1.0 = certain, 2.0 = uncertain
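
The two conformal metrics can be computed by hand from prediction sets and true labels. The sketch below is an illustration of the definitions, not the code behind `pipeline.evaluate`. The function name `conformal_metrics` and the toy data are assumptions.

```python
def conformal_metrics(prediction_sets, y_true):
    """Empirical coverage and mean set size for a list of prediction sets."""
    # Coverage: fraction of points whose true label is inside the predicted set.
    covered = sum(1 for s, y in zip(prediction_sets, y_true) if y in s)
    coverage = covered / len(y_true)
    # Mean set size: 1.0 means every prediction is a single label.
    mean_set_size = sum(len(s) for s in prediction_sets) / len(prediction_sets)
    return coverage, mean_set_size


sets = [{0}, {1}, {0, 1}, {0}, {1}]
labels = [0, 1, 1, 1, 1]
coverage, mean_size = conformal_metrics(sets, labels)
print(f"coverage={coverage:.2f}, mean set size={mean_size:.2f}")
# prints: coverage=0.80, mean set size=1.20
```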

Checking MCMC Convergence

# Get convergence diagnostics
diagnostics = pipeline.get_convergence_diagnostics()
print(diagnostics)
#        parameter  r_hat    ess_bulk  ess_tail
# 0       mu[0]     1.003    1845      2103
# 1       mu[1]     1.002    1923      2247
# ...

# All R̂ should be < 1.01, ESS should be > 400
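
The thresholds can also be checked programmatically. This sketch works on plain dict rows with the column names shown in the printed output above. If `get_convergence_diagnostics()` returns a pandas DataFrame, as that output suggests, `diagnostics.to_dict('records')` would produce rows in this shape. The helper `failing_parameters` and the third row are made up for illustration.

```python
def failing_parameters(rows, max_r_hat=1.01, min_ess=400):
    """Return names of parameters that miss the convergence thresholds.

    rows: list of dicts with keys 'parameter', 'r_hat', 'ess_bulk', 'ess_tail'
    """
    return [
        row['parameter']
        for row in rows
        if row['r_hat'] >= max_r_hat
        or min(row['ess_bulk'], row['ess_tail']) <= min_ess
    ]


rows = [
    {'parameter': 'mu[0]', 'r_hat': 1.003, 'ess_bulk': 1845, 'ess_tail': 2103},
    {'parameter': 'mu[1]', 'r_hat': 1.002, 'ess_bulk': 1923, 'ess_tail': 2247},
    # Invented failing row, included only to show the check firing.
    {'parameter': 'mu[2]', 'r_hat': 1.040, 'ess_bulk': 310, 'ess_tail': 295},
]
print(failing_parameters(rows))  # prints ['mu[2]']
```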

Saving and Loading Pipelines

# Save fitted pipeline
pipeline.save('models/my_pipeline.pkl')

# Load later
from smallml import Pipeline
pipeline = Pipeline.load('models/my_pipeline.pkl')
# new_data: a DataFrame of new customers for an entity seen during fitting
predictions = pipeline.predict(new_data, sme_id='store_1')

Quick Mode for Prototyping

# Faster MCMC (fewer iterations) for testing
pipeline = Pipeline(quick_mode=True)
pipeline.fit(sme_data, target_col='churned')  # Takes ~5-10 min instead of 15-30

# For production, use default settings:
pipeline = Pipeline(quick_mode=False)  # More reliable convergence

❓ FAQ

Q: What if I don't have pre-trained priors? A: Save your own priors to smallml/data/priors_churn.pkl. The file should contain a dict of the form {'beta_0': np.ndarray, 'Sigma_0': np.ndarray}.

Q: Can I use this for regression instead of classification? A: Currently SmallML focuses on binary classification. Regression support is planned for future versions.

Q: What if MCMC doesn't converge? A: The pipeline automatically validates convergence. If it fails, try:

Use quick_mode=False for more MCMC iterations
Ensure you have at least 50 observations per entity
Check that features are properly normalized

Q: How do I interpret conformal sets? A:

{0} = Certain prediction: will NOT churn
{1} = Certain prediction: WILL churn
{0,1} = Uncertain prediction: could go either way

Q: Can I use this with just 2 entities? A: The package will warn but still work. However, hierarchical pooling works best with 3+ entities (5+ recommended).
