Nonprobsvy
An R package for modern methods for non-probability samples
nonprobsvy: an R package for modern statistical inference methods based on non-probability samples <img src="man/figures/logo.png" align="right" width="150"/>
<!-- badges: start -->
<!-- badges: end -->
Basic information
The goal of this package is to give R users access to modern methods for non-probability samples when auxiliary information from the population or a probability sample is available:
- inverse probability weighting estimators with possible calibration constraints (Y. Chen, Li, and Wu 2020),
- mass imputation estimators based on nearest neighbours (Yang, Kim, and Hwang 2021), predictive mean matching (Chlebicki, Chrostowski, and Beręsewicz 2025), non-parametric (S. Chen, Yang, and Kim 2022) and regression imputation (Kim et al. 2021),
- doubly robust estimators (Y. Chen, Li, and Wu 2020) with bias minimization (Yang, Kim, and Song 2020).
The package allows for:
- variable selection in high-dimensional space using the SCAD (Yang, Kim, and Song 2020), Lasso and MCP penalties (via the `ncvreg`, `Rcpp` and `RcppArmadillo` packages),
- estimation of variance using analytical and bootstrap approaches (see Wu (2023)),
- integration with the `survey` and `srvyr` packages when a probability sample is available (Lumley 2004, 2023; Freedman Ellis and Schneider 2024),
- different links for the selection (`logit`, `probit` and `cloglog`) and outcome (`gaussian`, `binomial` and `poisson`) variables.
Details on the use of the package can be found:
- in the working paper Chrostowski, Ł., Chlebicki, P., & Beręsewicz, M. (2025). nonprobsvy–An R package for modern methods for non-probability surveys. arXiv preprint arXiv:2504.04255,
- in the draft (and not proofread) version of the book Modern inference methods for non-probability samples with R,
- in example code that reproduces published papers, available on GitHub in the repository software tutorials.
Installation
You can install the latest version of the `nonprobsvy` package from the `main` branch on GitHub with:
remotes::install_github("ncn-foreigners/nonprobsvy")
or install the stable version from CRAN with:
install.packages("nonprobsvy")
or install the development version from the `dev` branch with:
remotes::install_github("ncn-foreigners/nonprobsvy@dev")
Basic idea
Consider the following setting, in which two samples are available: a non-probability sample (denoted $S_A$) and a probability sample (denoted $S_B$). A set of auxiliary variables (denoted $\boldsymbol{X}$) is available in both sources, while $Y$ is observed only in the non-probability sample and $\boldsymbol{d}$ (or $\boldsymbol{w}$) is present only in the probability sample.
| Sample | | Auxiliary variables $\boldsymbol{X}$ | Target variable $Y$ | Design ($\boldsymbol{d}$) or calibrated ($\boldsymbol{w}$) weights |
|----|---:|:--:|:--:|:--:|
| $S_A$ (non-probability) | 1 | $\checkmark$ | $\checkmark$ | ? |
| | … | $\checkmark$ | $\checkmark$ | ? |
| | $n_A$ | $\checkmark$ | $\checkmark$ | ? |
| $S_B$ (probability) | $n_A+1$ | $\checkmark$ | ? | $\checkmark$ |
| | … | $\checkmark$ | ? | $\checkmark$ |
| | $n_A+n_B$ | $\checkmark$ | ? | $\checkmark$ |
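
To make this setting concrete, the base R sketch below simulates it. Everything in it is an assumption made for illustration: the population model, the logistic self-selection mechanism, the sample sizes and the object names (`pop`, `S_A`, `S_B`) are hypothetical and are not part of the package.

```r
set.seed(123)

# Hypothetical finite population: x drives both the outcome and self-selection
N <- 20000
pop <- data.frame(x = rnorm(N))
pop$y <- 1 + 2 * pop$x + rnorm(N)

# S_A: non-probability sample; the inclusion mechanism is unknown to the analyst
p_A  <- plogis(-2 + pop$x)
in_A <- rbinom(N, size = 1, prob = p_A) == 1
S_A  <- pop[in_A, c("x", "y")]          # X and Y observed, no weights

# S_B: probability sample (simple random sampling without replacement),
# so the design weights d = N / n_B are known
n_B   <- 1000
idx_B <- sample(N, n_B)
S_B   <- data.frame(x = pop$x[idx_B], d = N / n_B)   # X and d observed, no Y

naive_mean <- mean(S_A$y)   # ignores the selection mechanism
true_mean  <- mean(pop$y)   # unknown in practice, available here by construction
```

Because units with large `x` are more likely to enter `S_A` and `y` increases with `x`, `naive_mean` overstates `true_mean` in this simulation. The estimators listed above use $\boldsymbol{X}$ together with the weights from `S_B` (or known population totals) to correct for this.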
Basic functionalities
Suppose $Y$ is the target variable, $\boldsymbol{X}$ is a matrix of auxiliary variables and $R$ is the inclusion indicator. If we are interested in estimating the mean $\bar{\tau}_Y$ or the sum $\tau_Y$ of the target variable given the observed data set $(y_k, \boldsymbol{x}_k, R_k)$, we can approach the problem under one of two scenarios:
- unit-level data is available for the non-probability sample $S_{A}$, i.e. $(y_k,\boldsymbol{x}_k)$ is available for all units $k \in S_{A}$, and population-level data is available for $\boldsymbol{x}_1,\ldots,\boldsymbol{x}_p$, denoted as $\tau_{x_{1}},\tau_{x_{2}},\ldots,\tau_{x_{p}}$, and the population size $N$ is known. We can also consider situations where the population data are estimated (e.g. on the basis of a survey to which we do not have access),
- unit-level data is available for the non-probability sample $S_A$ and the probability sample $S_B$, i.e. $(y_k,\boldsymbol{x}_k,R_k)$ is determined by the data: $R_k=1$ if $k \in S_A$ and $R_k=0$ otherwise, $y_k$ is observed only for sample $S_A$, and $\boldsymbol{x}_k$ is observed in both $S_A$ and $S_B$.
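
The two scenarios differ mainly in what is passed alongside the non-probability data. The base R sketch below builds both kinds of input from toy values. All numbers and object names are hypothetical; the only convention taken from the example code in this README is that the totals vector is named after the model-matrix columns, with `(Intercept)` holding the population size $N$.

```r
# Scenario 1: only population totals of the auxiliary variables are known.
# The "total" of the intercept column is the population size N.
N      <- 1000
tau_x1 <- 520
tau_x2 <- 310
pop_totals <- c("(Intercept)" = N, x1 = tau_x1, x2 = tau_x2)

# If the totals are not known, they can be estimated from a probability
# sample with design weights d (toy data, hypothetical values).
S_B <- data.frame(
  x1 = c(1, 0, 1, 1),
  x2 = c(0, 0, 1, 0),
  d  = c(250, 250, 250, 250)
)
X_B        <- model.matrix(~ x1 + x2, data = S_B)
est_totals <- colSums(S_B$d * X_B)   # weighted column sums, same names as pop_totals

# Scenario 2: unit-level data from both samples, stacked with the indicator R.
# y is observed only where R = 1.
S_A <- data.frame(x1 = c(1, 1, 0), x2 = c(1, 0, 1), y = c(3.2, 2.1, 4.0))
stacked <- rbind(
  data.frame(S_A[, c("x1", "x2")], y = S_A$y,    R = 1),
  data.frame(S_B[, c("x1", "x2")], y = NA_real_, R = 0)
)
```

Building `est_totals` with `model.matrix()` keeps its names aligned with the formula terms, which matters once factors or interactions appear among the auxiliary variables.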
When unit-level data is available for non-probability survey only
<table class='table'> <tr> <th>Estimator
</th> <th>Example code
</th> </tr> <tr> <td>Mass imputation based on regression imputation
</td> <td>nonprob(
outcome = y ~ x1 + x2 + ... + xk,
data = nonprob,
pop_totals = c(`(Intercept)`= N,
x1 = tau_x1,
x2 = tau_x2,
...,
xk = tau_xk),
method_outcome = "glm",
family_outcome = "gaussian"
)
</td>
</tr>
<tr>
<td>
Inverse probability weighting
</td> <td>nonprob(
selection = ~ x1 + x2 + ... + xk,
target = ~ y,
data = nonprob,
pop_totals = c(`(Intercept)` = N,
x1 = tau_x1,
x2 = tau_x2,
...,
xk = tau_xk),
method_selection = "logit"
)
</td>
</tr>
<tr>
<td>
Inverse probability weighting with calibration constraint
</td> <td>nonprob(
selection = ~ x1 + x2 + ... + xk,
target = ~ y,
data = nonprob,
pop_totals = c(`(Intercept)`= N,
x1 = tau_x1,
x2 = tau_x2,
...,
xk = tau_xk),
method_selection = "logit",
control_selection = control_sel(est_method = "gee", gee_h_fun = 1)
)
</td>
</tr>
<tr>
<td>
Doubly robust estimator
</td> <td>nonprob(
selection = ~ x1 + x2 + ... + xk,
outcome = y ~ x1 + x2 + ... + xk,
pop_totals = c(`(Intercept)` = N,
x1 = tau_x1,
x2 = tau_x2,
...,
xk = tau_xk),
data = nonprob,
method_outcome = "glm",
family_outcome = "gaussian"
)
</td>
</tr>
</table>
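
To show why population totals alone can be enough, here is a minimal base R sketch of regression mass imputation with a linear (Gaussian, identity-link) outcome model. It is a simplified illustration under stated assumptions, not the package's implementation: the simulated population, the selection mechanism and the names `nonprob_df`, `mi_mean` and `naive_mean` are hypothetical, and variance estimation is left out. With a linear model, the sum of the predictions over the population equals $\sum_j \tau_{x_j}\hat{\beta}_j$, so no unit-level population data is required.

```r
set.seed(2024)

# Hypothetical population; in practice only N and the totals of x1, x2 are known
N  <- 50000
x1 <- rnorm(N, mean = 1)
x2 <- rbinom(N, size = 1, prob = 0.4)
y  <- 2 + 1.5 * x1 - x2 + rnorm(N)
pop_totals <- c("(Intercept)" = N, x1 = sum(x1), x2 = sum(x2))

# Non-probability sample: inclusion depends on x1 and x2 only
in_A <- rbinom(N, size = 1, prob = plogis(-3 + 0.8 * x1 + 0.5 * x2)) == 1
nonprob_df <- data.frame(y = y[in_A], x1 = x1[in_A], x2 = x2[in_A])

# Step 1: fit the outcome model on the non-probability sample
fit  <- lm(y ~ x1 + x2, data = nonprob_df)
beta <- coef(fit)

# Step 2: predicted population total is tau_x' beta (valid for the identity link);
# dividing by N gives the estimated mean
mi_mean <- sum(pop_totals[names(beta)] * beta) / N

naive_mean <- mean(nonprob_df$y)   # biased, because selection depends on x1 and x2
true_mean  <- mean(y)              # known here only because the data are simulated
```

In this simulation `mi_mean` lands close to `true_mean` while `naive_mean` does not, because selection depends only on the covariates included in the outcome model. For non-linear links the shortcut through totals no longer holds exactly.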
When unit-level data are available for both surveys
<table class='table'> <tr> <th>Estimator
</th> <th>Example code
</th> </tr> <tr> <td>Mass imputation based on regression imputation
</td> <td>nonprob(
outcome = y ~ x1 + x2 + ... + xk,
data = nonprob,
svydesign = prob,
method_outcome = "glm",
family_outcome = "gaussian"
)
</td>
</tr>
<tr>
<td>
Mass imputation based on nearest neighbour imputation
</td> <td>nonprob(
outcome = y ~ x1 + x2 + ... + xk,
data = nonprob,
svydesign = prob,
method_outcome = "nn",
family_outcome = "gaussian",
control_outcome = control_out(k = 2)
)
</td>
</tr>
<tr>
<td>
Mass imputation based on predictive mean matching
</td> <td>nonprob(
outcome = y ~ x1 + x2 + ... + xk,
data = nonprob,
svydesign = prob,
method_outcome = "pmm",
family_outcome = "gaussian"
)
</td>
</tr>
<tr>
<td>
Mass imputation based on regression imputation with variable selection (LASSO)
</td> <td>nonprob(
outcome = y ~ x1 + x2 + ... + xk,
data = nonprob,
svydesign = prob,
method_outcome = "glm",
family_outcome = "gaussian",
control_outcome = control_out(penalty = "lasso"),
control_inference = control_inf(vars_selection = TRUE)
)
</td>
</tr>
<tr>
<td>
Inverse probability weighting
</td> <td>nonprob(
selection = ~ x1 + x2 + ... + xk,
target = ~ y,
data = nonprob,
svydesign = prob,
method_selection = "logit"
)
</td>
</tr>
<tr>
<td>
Inverse probability weighting with calibration constraint
</td> <td>nonprob(
selection = ~ x1 + x2 + ... + xk,
target = ~ y,
data = nonprob,
svydesign = prob,
method_selection = "logit",
control_selection = control_sel(est_method = "gee", gee_h_fun = 1)
)
</td>
</tr>
<tr>
<td>
Inverse probability weighting with calibration constraint with variable selection (SCAD)
</td> <td>nonprob(
selection = ~ x1 + x2 + ... + xk,
target = ~ y,
data = nonprob,
svydesign = prob,
method_selection = "logit",
control_selection = control_sel(est_method = "gee", gee_h_fun = 1),
control_inference = control_inf(vars_selection = TRUE)
)
</td>
</tr>
<tr>
<td>
Doubly robust estimator
</td> <td>nonprob(
selection = ~ x1 + x2 + ... + xk,
outcome = y ~ x1 + x2 + ... + xk,
data = nonprob,
svydesign = prob,
method_outcome = "glm",
family_outcome = "gaussian"
)
</td>
</tr>
<tr>
<td>
Doubly robust estimator with variable selection (SCA
