Smallml
SmallML is a three-layer Bayesian framework that enables small and medium-sized enterprises (SMEs) to build production-grade machine learning models despite having limited customer data.
SmallML: Bayesian Transfer Learning for Small Data
Build production-grade machine learning models with just 50-200 observations per business entity.
SmallML combines transfer learning, hierarchical Bayesian inference, and conformal prediction to enable SMEs to achieve reliable predictive analytics despite limited data.
🎯 Key Features
- Works with tiny datasets: 50-200 observations per entity, 3-10 entities total
- Transfer learning: Extracts knowledge from 100K+ public observations (pre-trained priors included)
- Hierarchical pooling: Shares statistical strength across multiple business entities
- Uncertainty guarantees: Bayesian credible intervals + distribution-free prediction sets
- Production-ready: <30 minutes training, <100ms inference, automatic convergence validation
🚀 Quick Start
Installation
pip install smallml
Development Version
To install the latest development version from GitHub:
pip install git+https://github.com/seemyon/smallml@main
Basic Usage (5 lines of code!)
from smallml import Pipeline
import pandas as pd
# Your data: dict of {entity_name: dataframe}
sme_data = {
    'store_1': pd.read_csv('store_1.csv'),  # 80 customers
    'store_2': pd.read_csv('store_2.csv'),  # 120 customers
    'store_3': pd.read_csv('store_3.csv'),  # 95 customers
    # ... 3-10 stores total
}
# Create and fit pipeline (automatically validates convergence)
pipeline = Pipeline()
pipeline.fit(sme_data, target_col='churned')
# Make predictions with uncertainty
# (new_customers is assumed to be a DataFrame of unseen rows with the same feature columns)
predictions = pipeline.predict(new_customers, sme_id='store_1')
print(predictions)
#    prediction  bayesian_std  bayesian_lower_90  bayesian_upper_90  conformal_set  conformal_set_size
# 0        0.23          0.12               0.04               0.42            {0}                   1
# 1        0.78          0.15               0.51               0.95            {1}                   1
# 2        0.51          0.21               0.18               0.84          {0,1}                   2  # Uncertain!
📚 How It Works
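SmallML is a three-layer framework. Layer 1 (transfer learning) supplies the pre-trained priors; the package then runs the remaining two layers: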
- Layer 2 (Hierarchical Bayesian): Pools information across J entities using the PyMC NUTS sampler
  - Uses pre-trained priors from 100K+ public observations
  - Returns full posterior distributions, not just point estimates
  - Automatic convergence validation (R̂ < 1.01, ESS > 400)
- Layer 3 (Conformal Prediction): Provides distribution-free uncertainty
  - Split-conformal calibration for coverage guarantees
  - Returns prediction sets: {0} (certain), {1} (certain), or {0,1} (uncertain)
  - Empirical coverage typically 87-93% for 90% target
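To make the split-conformal step concrete, here is a minimal, self-contained sketch. It is not SmallML's internal code: the function names (`conformal_quantile`, `prediction_set`), the nonconformity score (one minus the probability assigned to the true class), and the calibration numbers are all assumptions chosen for illustration.

```python
import math

def conformal_quantile(cal_probs, cal_labels, alpha=0.10):
    """Threshold on nonconformity scores from a held-out calibration split.

    cal_probs: predicted P(y=1) for each calibration example
    cal_labels: true labels (0 or 1)
    """
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = sorted(
        1.0 - (p if y == 1 else 1.0 - p)
        for p, y in zip(cal_probs, cal_labels)
    )
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf  # too few calibration points: every label is admitted
    return scores[rank - 1]

def prediction_set(prob_1, q_hat):
    """Return every label whose nonconformity score is within the threshold."""
    labels = set()
    if prob_1 <= q_hat:        # score for label 0 is P(y=1)
        labels.add(0)
    if 1.0 - prob_1 <= q_hat:  # score for label 1 is 1 - P(y=1)
        labels.add(1)
    return labels

# Made-up calibration split: 10 non-churners followed by 9 churners.
cal_probs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.70,
             0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.40]
cal_labels = [0] * 10 + [1] * 9

q_hat = conformal_quantile(cal_probs, cal_labels, alpha=0.10)
for p in (0.23, 0.78, 0.51):
    print(p, prediction_set(p, q_hat))
```

With this toy calibration split the threshold lands near 0.6, so a probability close to 0.5 admits both labels while confident probabilities admit only one. The coverage guarantee of split-conformal prediction holds on average over exchangeable data, which is consistent with the 87-93% empirical range quoted above for a 90% target.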
📊 Performance Expectations
- Prediction Accuracy: 75-85% AUC on churn with 100 customers per entity
- Conformal Coverage: 87-93% empirical for 90% target intervals
- Training Time: 15-30 minutes for J=5 entities with 100 customers each
- Inference: <100ms per prediction
- Convergence: R̂ < 1.01, ESS > 400 (automatically validated)
🧪 Requirements
Data Requirements
- Minimum: 3 entities, 30 observations per entity
- Recommended: 5+ entities, 50+ observations per entity
- Use Case: Binary classification (churn, conversion, etc.)
- Features: Numerical + categorical (automatically handled)
Input Format
sme_data = {
    'entity_1': pd.DataFrame({
        'feature_1': [...],  # Numerical or categorical
        'feature_2': [...],
        'feature_3': [...],
        'churned': [0, 1, 0, ...]  # Binary target (0/1)
    }),
    'entity_2': pd.DataFrame({...}),
    # ... 3-10 entities
}
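Before fitting, it can help to check a dataset against the minimums listed under Data Requirements. The helper below is a hypothetical sketch, not part of the SmallML API: `check_sme_data` and the two constants are names invented here, and the checks cover only the entity count, the per-entity row count, and the 0/1 target.

```python
import pandas as pd

MIN_ENTITIES = 3       # README minimum
MIN_OBSERVATIONS = 30  # README minimum per entity

def check_sme_data(sme_data, target_col):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(sme_data) < MIN_ENTITIES:
        problems.append(f"only {len(sme_data)} entities (minimum {MIN_ENTITIES})")
    for name, df in sme_data.items():
        if target_col not in df.columns:
            problems.append(f"{name}: missing target column '{target_col}'")
            continue
        if len(df) < MIN_OBSERVATIONS:
            problems.append(f"{name}: {len(df)} rows (minimum {MIN_OBSERVATIONS})")
        # Collect the distinct non-missing target values as plain Python numbers.
        values = set(df[target_col].dropna().tolist())
        if not values <= {0, 1}:
            problems.append(f"{name}: target must be binary 0/1, found {sorted(values)}")
    return problems

# Made-up data: too few entities, and store_2 is both too small and non-binary.
sme_data = {
    "store_1": pd.DataFrame({"tenure": list(range(40)), "churned": [0, 1] * 20}),
    "store_2": pd.DataFrame({"tenure": list(range(10)), "churned": [0, 2] * 5}),
}
for problem in check_sme_data(sme_data, target_col="churned"):
    print(problem)
```

Collecting every problem in a list, rather than raising on the first one, lets all entities be fixed in a single pass.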
Python Requirements
- Python: 3.9 or higher
- Dependencies: PyMC ≥5.0, ArviZ ≥0.22.0, pandas ≥2.3, numpy ≥2.3, scikit-learn ≥1.7, scipy ≥1.16
📖 Documentation
- Installation Guide: See above for basic installation
- Quickstart Tutorial: examples/quickstart.py
- API Reference: Check docstrings in smallml.pipeline.Pipeline
- Research Paper: See docs/ for technical details
🔬 Research & Reproducibility
This package is the production-ready version of the SmallML research framework. For research code, paper reproduction, and detailed technical documentation, see:
- Research Code: src/ directory
- Reproduction Scripts: scripts/ directory
- Technical Docs: docs/ directory
- Original README: See existing README.md for research details
🎓 Citation
If you use SmallML in your research, please cite:
@software{smallml2025,
title = {SmallML: Bayesian Transfer Learning for Small-Data Predictive Analytics},
author = {Leontev, Semen},
year = {2025},
url = {https://github.com/seemyon/smallml},
}
🤝 Contributing
Contributions welcome! Please open an issue or pull request.
📝 License
This project is licensed under the MIT License - see LICENSE file.
🔗 Links
- GitHub: https://github.com/seemyon/smallml
- Issues: https://github.com/seemyon/smallml/issues
- Paper: https://arxiv.org/abs/2511.14049
SmallML: Empowering small businesses with reliable ML despite limited data.
⚙️ Advanced Usage
Evaluating Model Performance
# Evaluate on test data
X_test, y_test = load_test_data()
metrics = pipeline.evaluate(X_test, y_test, sme_id='store_1')
print(f"AUC: {metrics['auc']:.3f}")
print(f"Accuracy: {metrics['accuracy']:.3f}")
print(f"F1 Score: {metrics['f1_score']:.3f}")
print(f"Conformal Coverage: {metrics['conformal_coverage']:.3f}") # Should be ~0.90
print(f"Mean Set Size: {metrics['mean_set_size']:.2f}") # 1.0 = certain, 2.0 = uncertain
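The two conformal metrics have simple definitions, sketched below in plain Python. This is an illustration of what the numbers mean, not the internals of `pipeline.evaluate`; `conformal_metrics` is a hypothetical helper and the sets and labels are made up.

```python
def conformal_metrics(prediction_sets, y_true):
    """Empirical coverage and mean set size for a list of prediction sets."""
    if len(prediction_sets) != len(y_true) or len(y_true) == 0:
        raise ValueError("need equally sized, non-empty inputs")
    n = len(y_true)
    # Coverage: fraction of examples whose true label is inside its set.
    covered = sum(y in s for s, y in zip(prediction_sets, y_true))
    # Mean set size: 1.0 means every prediction is a single label.
    total_size = sum(len(s) for s in prediction_sets)
    return {"conformal_coverage": covered / n, "mean_set_size": total_size / n}

sets = [{0}, {1}, {0, 1}, {0}, {1}]
labels = [0, 1, 1, 1, 1]
metrics = conformal_metrics(sets, labels)
print(metrics)
```

Here four of the five true labels fall inside their sets, and one set contains both labels, so coverage is 0.8 and the mean set size is 1.2. Coverage and set size trade off against each other: always returning {0,1} gives perfect coverage but no information.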
Checking MCMC Convergence
# Get convergence diagnostics
diagnostics = pipeline.get_convergence_diagnostics()
print(diagnostics)
#    parameter  r_hat  ess_bulk  ess_tail
# 0      mu[0]  1.003      1845      2103
# 1      mu[1]  1.002      1923      2247
# ...
# All R̂ should be < 1.01, ESS should be > 400
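A small helper can flag the parameters that miss these thresholds. The sketch below works on plain row dictionaries so that it runs without a fitted pipeline; if the diagnostics come back as a pandas DataFrame, as the printed output suggests, `diagnostics.to_dict("records")` would produce rows in this shape. `failing_parameters` is a hypothetical name, applying the ESS threshold to both bulk and tail is an assumption, and the `tau` row is invented to show a failure.

```python
def failing_parameters(rows, r_hat_max=1.01, ess_min=400):
    """Return names of parameters that miss the R-hat or ESS thresholds.

    rows: iterable of dicts with keys 'parameter', 'r_hat', 'ess_bulk', 'ess_tail'
    """
    failing = []
    for row in rows:
        r_hat_ok = row["r_hat"] < r_hat_max
        # Assumption: both bulk and tail ESS must exceed the minimum.
        ess_ok = min(row["ess_bulk"], row["ess_tail"]) > ess_min
        if not (r_hat_ok and ess_ok):
            failing.append(row["parameter"])
    return failing

rows = [
    {"parameter": "mu[0]", "r_hat": 1.003, "ess_bulk": 1845, "ess_tail": 2103},
    {"parameter": "mu[1]", "r_hat": 1.002, "ess_bulk": 1923, "ess_tail": 2247},
    # Made-up failing row for illustration (not from the output above).
    {"parameter": "tau", "r_hat": 1.04, "ess_bulk": 310, "ess_tail": 520},
]
print(failing_parameters(rows))
```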
Saving and Loading Pipelines
# Save fitted pipeline
pipeline.save('models/my_pipeline.pkl')
# Load later
from smallml import Pipeline
pipeline = Pipeline.load('models/my_pipeline.pkl')
predictions = pipeline.predict(new_data)
Quick Mode for Prototyping
# Faster MCMC (fewer iterations) for testing
pipeline = Pipeline(quick_mode=True)
pipeline.fit(sme_data, target_col='churned') # Takes ~5-10 min instead of 15-30
# For production, use default settings:
pipeline = Pipeline(quick_mode=False) # More reliable convergence
❓ FAQ
Q: What if I don't have pre-trained priors?
A: You'll need to add your own priors to smallml/data/priors_churn.pkl. The package structure is ready, and you can copy your existing priors there. The file should contain {'beta_0': np.ndarray, 'Sigma_0': np.ndarray}.
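As a rough sketch of what such a file could look like, the snippet below writes and reloads a pickle with the two keys named above. The array shapes (a vector for beta_0, a square matrix for Sigma_0), the `n_features` value, and the zero-mean, identity-covariance placeholder are assumptions; a placeholder like this is not a substitute for priors estimated from real public data.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

n_features = 4  # hypothetical: number of model coefficients

priors = {
    "beta_0": np.zeros(n_features),  # assumed shape: (n_features,)
    "Sigma_0": np.eye(n_features),   # assumed shape: (n_features, n_features)
}

# Written to a temporary directory here; the location named in the answer
# above is smallml/data/priors_churn.pkl.
out_path = Path(tempfile.mkdtemp()) / "priors_churn.pkl"
with out_path.open("wb") as fh:
    pickle.dump(priors, fh)

with out_path.open("rb") as fh:
    loaded = pickle.load(fh)

# Basic sanity checks: matching dimensions and a symmetric covariance.
assert loaded["beta_0"].shape == (n_features,)
assert loaded["Sigma_0"].shape == (n_features, n_features)
assert np.allclose(loaded["Sigma_0"], loaded["Sigma_0"].T)
```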
Q: Can I use this for regression instead of classification?
A: Currently SmallML focuses on binary classification. Regression support is planned for future versions.
Q: What if MCMC doesn't converge?
A: The pipeline automatically validates convergence. If it fails, try:
- Use quick_mode=False for more MCMC iterations
- Ensure you have at least 50 observations per entity
- Check that features are properly normalized
Q: How do I interpret conformal sets?
A:
- {0} = Certain prediction: will NOT churn
- {1} = Certain prediction: WILL churn
- {0,1} = Uncertain prediction: could go either way
Q: Can I use this with just 2 entities?
A: The package will warn but still work. However, hierarchical pooling works best with 3+ entities (5+ recommended).
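The idea behind hierarchical pooling can be seen in a toy shrinkage calculation. This is not the PyMC model SmallML fits: it is a Beta-Binomial posterior mean with a hand-picked `prior_strength` (a hypothetical knob) and made-up counts. In a full hierarchical model the amount of pooling is itself estimated from the spread between entities, which is why very few entities give a weak estimate of it.

```python
def partially_pooled_rates(counts, prior_strength=20.0):
    """Shrink each entity's observed rate toward the pooled rate.

    counts: dict of entity -> (positives, observations)
    prior_strength: pseudo-observations given to the pooled rate
    """
    total_pos = sum(k for k, _ in counts.values())
    total_obs = sum(n for _, n in counts.values())
    pooled = total_pos / total_obs
    # Posterior mean under a Beta prior centred on the pooled rate.
    return {
        name: (k + prior_strength * pooled) / (n + prior_strength)
        for name, (k, n) in counts.items()
    }

# Made-up churn counts: (churned customers, total customers).
counts = {"store_1": (8, 80), "store_2": (30, 120), "store_3": (19, 100)}
for name, rate in partially_pooled_rates(counts).items():
    k, n = counts[name]
    print(f"{name}: raw={k / n:.3f} partially_pooled={rate:.3f}")
```

Each estimate moves from the entity's raw rate toward the pooled rate, and entities with fewer observations move further.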
