GPBoost: Tree-Boosting, Gaussian Processes, and Mixed-Effects Models

Introduction
Modeling background
News
Open issues - contribute
References
License

Introduction

GPBoost is a software library for tree-boosting, Gaussian processes, and mixed-effects models (aka latent Gaussian variable models). It combines tree-boosting with Gaussian process and random effects models (the GPBoost algorithm), and it also supports applying Gaussian processes, (generalized) linear mixed effects models (LMMs and GLMMs), and tree-boosting independently. The GPBoost library is written predominantly in C++, has a C interface, and is available as both a Python package and an R package.

For more information, you may want to have a look at:

The Python package and R package including installation instructions
The companion articles Sigrist (2022, JMLR) and Sigrist (2023, TPAMI) for background on the methodology
Detailed Python examples and R examples
Main parameters: the most important parameters / settings for the GPBoost library

The following blog posts:
The CLI installation guide explaining how to install the command line interface (CLI) version
Comments on computational efficiency and large data
The documentation at https://gpboost.readthedocs.io

Modeling background

The GPBoost algorithm combines tree-boosting with latent Gaussian models such as Gaussian process (GP) and grouped random effects models. This makes it possible to leverage the advantages and remedy the drawbacks of both tree-boosting and latent Gaussian models; see below for a list of strengths and weaknesses of these two modeling approaches. The GPBoost algorithm can be seen as a generalization of both traditional (generalized) linear mixed effects and Gaussian process models and classical independent tree-boosting (which often has the highest prediction accuracy for tabular data).

Advantages of the GPBoost algorithm

Compared to (generalized) linear mixed effects and Gaussian process models, the GPBoost algorithm allows for

modeling the fixed effects function in a non-parametric and non-linear manner, which can result in more realistic models and, consequently, higher prediction accuracy

Compared to classical independent boosting, the GPBoost algorithm allows for

more efficient learning of predictor functions which, among other things, can translate into increased prediction accuracy
efficient modeling of high-cardinality categorical variables
modeling spatial or spatio-temporal data when, e.g., spatial predictions should vary continuously, or smoothly, over space
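One way to see why grouped random effects suit high-cardinality categorical variables is the shrinkage they apply to per-group estimates. The following is a minimal sketch, not GPBoost code: it assumes a single grouping variable, a zero fixed effects function, and known variances (`sigma2_b` for the random effects and `sigma2` for the error term are illustrative names and values). Under these assumptions, the posterior mean of a group's random intercept is the group mean multiplied by `n * sigma2_b / (sigma2 + n * sigma2_b)`, so groups with few observations are pulled more strongly towards zero.

```python
from collections import defaultdict

def shrunken_group_effects(groups, y, sigma2_b, sigma2):
    """Posterior mean of one random intercept per group, assuming known variances.

    Assumes a zero prior mean and a zero fixed effects function (illustration only).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, yi in zip(groups, y):
        sums[g] += yi
        counts[g] += 1
    effects = {}
    for g in sums:
        n = counts[g]
        # Shrinkage weight grows towards 1 as the group size n increases.
        weight = n * sigma2_b / (sigma2 + n * sigma2_b)
        effects[g] = weight * sums[g] / n
    return effects

# Group "a" has four observations, group "b" only one; both have raw mean 2.0.
groups = ["a", "a", "a", "a", "b"]
y = [2.0, 2.0, 2.0, 2.0, 2.0]
effects = shrunken_group_effects(groups, y, sigma2_b=1.0, sigma2=1.0)
print(effects)
```

With equal raw means, the estimate for the larger group stays closer to its raw mean than the estimate for the single-observation group, which is the behavior that keeps rare categories from being overfitted.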

Modeling details

For Gaussian likelihoods (GPBoost algorithm), it is assumed that the response variable (aka label) y is the sum of a potentially non-linear mean function F(X) and random effects Zb:

y = F(X) + Zb + xi

where F(X) is a sum (="ensemble") of trees, xi is an independent error term, and X are predictor variables (aka covariates or features). The random effects Zb can currently consist of:

Gaussian processes (including random coefficient processes)
Grouped random effects (including nested, crossed, and random coefficient effects)
Combinations of the above
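To make the roles of F(X), Zb, and xi concrete, the following minimal sketch simulates data from the Gaussian model above with a single grouped random effect. It does not use the GPBoost library: the sine function is only a stand-in for the tree ensemble F(X), and all names (`simulate_gaussian_model`, `sigma_b`, `sigma`) are illustrative assumptions.

```python
import math
import random

def simulate_gaussian_model(n, num_groups, sigma_b, sigma, seed=0):
    """Draw n samples from y = F(X) + Zb + xi with one grouped random effect."""
    rng = random.Random(seed)
    # b: one random intercept per group, drawn from N(0, sigma_b^2).
    b = [rng.gauss(0.0, sigma_b) for _ in range(num_groups)]
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        group = rng.randrange(num_groups)
        f = math.sin(3.0 * x)        # stand-in for the non-linear mean function F(X)
        zb = b[group]                # the row of Z selects this sample's group effect
        xi = rng.gauss(0.0, sigma)   # independent error term
        data.append((x, group, f + zb + xi))
    return b, data

b, data = simulate_gaussian_model(n=100, num_groups=5, sigma_b=1.0, sigma=0.5)
print(data[:3])
```

Samples that share a group share the same entry of b, which is what makes the responses dependent and violates the independence assumption of classical tree-boosting.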

For non-Gaussian likelihoods (LaGaBoost algorithm), it is assumed that the response variable y follows a distribution p(y|m) and that a (potentially multivariate) parameter m of this distribution is related to a non-linear function F(X) and random effects Zb:

y ~ p(y|m)
m = G(F(X) + Zb)

where G() is a so-called link function. See the documentation for a list of currently supported likelihoods p(y|m).
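As a concrete instance of the non-Gaussian setup, the sketch below simulates binary data where p(y|m) is a Bernoulli distribution and G() is taken to be the logistic function. Both choices are assumptions made for illustration, as are the function names and the sine stand-in for F(X); the code does not use the GPBoost library.

```python
import math
import random

def logistic(eta):
    """Assumed choice of G(): maps the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def simulate_binary_model(n, num_groups, sigma_b, seed=0):
    """Draw n samples with m = G(F(X) + Zb) and y ~ Bernoulli(m)."""
    rng = random.Random(seed)
    b = [rng.gauss(0.0, sigma_b) for _ in range(num_groups)]
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        group = rng.randrange(num_groups)
        m = logistic(math.sin(3.0 * x) + b[group])  # success probability
        y = 1 if rng.random() < m else 0            # Bernoulli draw
        data.append((x, group, m, y))
    return data

binary_data = simulate_binary_model(n=100, num_groups=5, sigma_b=1.0)
print(binary_data[:3])
```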

Estimating or training the above-mentioned models means learning both the covariance parameters (aka hyperparameters) of the random effects and the predictor function F(X). Both the GPBoost and the LaGaBoost algorithms iteratively learn the covariance parameters and add a tree to the ensemble of trees F(X) using a functional gradient and/or a Newton boosting step. See Sigrist (2022, JMLR) and Sigrist (2023, TPAMI) for more details.
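The alternating structure described above can be sketched in a deliberately simplified toy. This is not the library's implementation: it assumes a Gaussian likelihood, a single grouping variable with random intercepts, regression stumps on one feature in place of trees, a grid search in place of a proper optimizer for the covariance parameters, and a plain functional gradient step (no Newton step). All function names are hypothetical. With Psi = sigma2 * I + sigma2_b * Z Z^T, the negative functional gradient of the Gaussian negative log-likelihood with respect to F is Psi^{-1}(y - F).

```python
import math
import random
from collections import defaultdict

def group_indices(groups):
    """Map each group label to the list of row indices belonging to it."""
    idx = defaultdict(list)
    for i, g in enumerate(groups):
        idx[g].append(i)
    return idx

def apply_psi_inverse(r, groups, sigma2, sigma2_b):
    """Return Psi^{-1} r for Psi = sigma2 * I + sigma2_b * Z Z^T (one grouping)."""
    out = [0.0] * len(r)
    for members in group_indices(groups).values():
        total = sum(r[i] for i in members)
        shrink = sigma2_b / (sigma2 + len(members) * sigma2_b)
        for i in members:
            out[i] = (r[i] - shrink * total) / sigma2
    return out

def neg_log_likelihood(r, groups, sigma2, sigma2_b):
    """Gaussian negative log marginal likelihood of r, up to an additive constant."""
    log_det = 0.0
    for members in group_indices(groups).values():
        n = len(members)
        log_det += (n - 1) * math.log(sigma2) + math.log(sigma2 + n * sigma2_b)
    psi_inv_r = apply_psi_inverse(r, groups, sigma2, sigma2_b)
    return 0.5 * (log_det + sum(ri * vi for ri, vi in zip(r, psi_inv_r)))

def fit_stump(x, target):
    """Least-squares regression stump (one split) on a single feature."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    overall = sum(target) / len(target)
    # (sse, threshold, left value, right value); falls back to a constant fit.
    best = (float("inf"), float("inf"), overall, overall)
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # cannot split between identical feature values
        left = [target[i] for i in order[:k]]
        right = [target[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if sse < best[0]:
            best = (sse, (lo + hi) / 2.0, lm, rm)
    _, threshold, lm, rm = best
    return lambda v: lm if v <= threshold else rm

def toy_boosting(x, groups, y, grid, num_rounds=30, learning_rate=0.1):
    """Alternate covariance-parameter selection and one gradient boosting step."""
    F = [0.0] * len(y)
    params = grid[0]
    for _ in range(num_rounds):
        r = [yi - fi for yi, fi in zip(y, F)]
        # Step 1: pick (sigma2, sigma2_b) minimizing the negative log-likelihood.
        params = min(grid, key=lambda p: neg_log_likelihood(r, groups, p[0], p[1]))
        # Step 2: fit a stump to the negative functional gradient Psi^{-1}(y - F).
        neg_grad = apply_psi_inverse(r, groups, params[0], params[1])
        stump = fit_stump(x, neg_grad)
        F = [fi + learning_rate * stump(xi) for fi, xi in zip(F, x)]
    return F, params

# Synthetic data: step-shaped F, 20 groups with random intercepts, Gaussian noise.
rng = random.Random(0)
true_effects = [rng.gauss(0.0, 1.0) for _ in range(20)]
x = [rng.uniform(-1.0, 1.0) for _ in range(200)]
groups = [rng.randrange(20) for _ in range(200)]
y = [(1.0 if xi > 0 else -1.0) + true_effects[g] + rng.gauss(0.0, 0.5)
     for xi, g in zip(x, groups)]
grid = [(s2, s2b) for s2 in (0.25, 0.5, 1.0) for s2b in (0.25, 0.5, 1.0, 2.0)]
F, (sigma2, sigma2_b) = toy_boosting(x, groups, y, grid)
print("selected variances:", sigma2, sigma2_b)
```

The grid keeps the error variance away from zero because the gradient scales with 1 / sigma2; with a very small error variance and a fixed learning rate, this naive gradient step could overshoot.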

Strengths and weaknesses of tree-boosting and linear mixed effects and GP models

Classical independent tree-boosting

| Strengths | Weaknesses |
|:--- |:--- |
| - State-of-the-art prediction accuracy | - Assumes conditional independence of samples |
| - Automatic modeling of non-linearities, discontinuities, and complex high-order interactions | - Produces discontinuous predictions for, e.g., spatial data |
| - Robust to outliers in and multicollinearity among predictor variables | - Can have difficulty with high-cardinality categorical variables |
| - Scale-invariant to monotone transformations of predictor variables | |
| - Automatic handling of missing values in predictor variables | |

Linear mixed effects and Gaussian process (GP) models (aka latent Gaussian models)

| Strengths | Weaknesses |
|:--- |:--- |
| - Probabilistic predictions, which allow for uncertainty quantification | - Zero or a linear prior mean (predictor, fixed effects) function |
| - Incorporation of reasonable prior knowledge. E.g., for spatial data: "close samples are more similar to each other than distant samples" and a function should vary continuously / smoothly over space | |
| - Modeling of dependency which, among other things, can allow for more efficient learning of the fixed effects (predictor) function | |
| - Grouped random effects can be used for modeling high-cardinality categorical variables | |
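The prior knowledge that "close samples are more similar to each other than distant samples" is typically encoded through a covariance function that decays with distance. The following minimal sketch uses an exponential covariance function as one common example; the parameter names `variance` and `range_param` are illustrative assumptions and are not taken from the GPBoost API.

```python
import math

def exponential_covariance(s1, s2, variance=1.0, range_param=1.0):
    """Covariance between the process values at locations s1 and s2.

    Decays from `variance` at distance zero towards 0 as the distance grows.
    """
    dist = math.dist(s1, s2)  # Euclidean distance between the two locations
    return variance * math.exp(-dist / range_param)

near = exponential_covariance((0.0, 0.0), (0.1, 0.0))
far = exponential_covariance((0.0, 0.0), (3.0, 4.0))
print(near, far)
```

Because the covariance is a continuous function of distance, predictions based on it vary continuously over space, in contrast to the piecewise-constant predictions of a tree ensemble.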

News

See the GitHub releases page
October 2022: Glad to announce that the two companion articles are published in the Journal of Machine Learning Research (JMLR) and IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
04/06/2020: First release of GPBoost

Open issues - contribute

See the open issues on GitHub with an enhancement label

Software issues

Add Python tests (see corresponding R tests)
Set up a CI environment
Support conversion of GPBoost models to ONNX model format

Methodological issues

Support multivariate models, e.g., using coregionalization
Support areal models for spatial data such as CAR and SAR models
Support multiclass classification, i.e., multinomial likelihoods
Support sample weights for Gaussian likelihoods
Support oth
