`nonprobsvy`: an R package for modern statistical inference methods based on non-probability samples <img src="man/figures/logo.png" align="right" width="150"/>

Basic information

The goal of this package is to provide R users access to modern methods for non-probability samples when auxiliary information from the population or probability sample is available:

inverse probability weighting estimators with possible calibration constraints (Y. Chen, Li, and Wu 2020),
mass imputation estimators based on nearest neighbours (Yang, Kim, and Hwang 2021), predictive mean matching (Chlebicki, Chrostowski, and Beręsewicz 2025), non-parametric (S. Chen, Yang, and Kim 2022) and regression imputation (Kim et al. 2021),
doubly robust estimators (Y. Chen, Li, and Wu 2020) with bias minimization (Yang, Kim, and Song 2020).
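For orientation, the three estimator families listed above are often written in the following textbook forms for a population mean. This is a sketch under assumed notation, not necessarily the exact expressions implemented in the package: $S_A$ is the non-probability sample, $S_B$ is the probability sample with design weights $d_k$, $\hat{\pi}_k$ is an estimated propensity of inclusion in $S_A$, and $\hat{y}_k$ is a prediction from an outcome model fitted on $S_A$.

$$
\begin{aligned}
\hat{\mu}_{IPW} &= \frac{1}{\hat{N}_A} \sum_{k \in S_A} \frac{y_k}{\hat{\pi}_k}, \qquad \hat{N}_A = \sum_{k \in S_A} \frac{1}{\hat{\pi}_k}, \\
\hat{\mu}_{MI} &= \frac{1}{\hat{N}_B} \sum_{k \in S_B} d_k \hat{y}_k, \qquad \hat{N}_B = \sum_{k \in S_B} d_k, \\
\hat{\mu}_{DR} &= \frac{1}{\hat{N}_A} \sum_{k \in S_A} \frac{y_k - \hat{y}_k}{\hat{\pi}_k} + \frac{1}{\hat{N}_B} \sum_{k \in S_B} d_k \hat{y}_k.
\end{aligned}
$$

The doubly robust form combines the other two: it stays consistent if either the propensity model or the outcome model is correctly specified.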

The package allows for:

variable selection in high-dimensional space using SCAD (Yang, Kim, and Song 2020), Lasso and MCP penalty (via the ncvreg, Rcpp, RcppArmadillo packages),
variance estimation using analytical and bootstrap approaches (see Wu (2023)),
integration with the survey and srvyr packages when a probability sample is available (Lumley 2004, 2023; Freedman Ellis and Schneider 2024),
different links for selection (logit, probit and cloglog) and outcome (gaussian, binomial and poisson) variables.
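To make the three selection links concrete, the base R sketch below maps a linear predictor to an inclusion probability under each of them. The helper names (`inv_logit`, `inv_probit`, `inv_cloglog`) are illustrative only and are not part of the package.

```r
# Inverse link functions: linear predictor eta -> probability in (0, 1)
inv_logit   <- function(eta) plogis(eta)          # logistic CDF
inv_probit  <- function(eta) pnorm(eta)           # standard normal CDF
inv_cloglog <- function(eta) 1 - exp(-exp(eta))   # complementary log-log

eta <- c(-2, 0, 2)
probs <- cbind(
  logit   = inv_logit(eta),
  probit  = inv_probit(eta),
  cloglog = inv_cloglog(eta)
)
print(round(probs, 3))
```

Logit and probit are symmetric around zero, while cloglog is not, which can matter when inclusion probabilities are very small or very large.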

Details on the use of the package can be found:

in the working paper Chrostowski, Ł., Chlebicki, P., & Beręsewicz, M. (2025). nonprobsvy–An R package for modern methods for non-probability surveys. arXiv preprint arXiv:2504.04255,
in the draft (and not proofread) version of the book Modern inference methods for non-probability samples with R,
in example code that reproduces published papers, available on GitHub in the software tutorials repository.

Installation

You can install the latest version of the nonprobsvy package from the main branch on GitHub with:

remotes::install_github("ncn-foreigners/nonprobsvy")

or install the stable version from CRAN:

install.packages("nonprobsvy")

or the development version from the dev branch:

remotes::install_github("ncn-foreigners/nonprobsvy@dev")

Basic idea

Consider the following setting where two samples are available: a non-probability sample (denoted as $S_A$) and a probability sample (denoted as $S_B$). A set of auxiliary variables (denoted as $\boldsymbol{X}$) is available for both sources, while $Y$ is observed only in the non-probability sample and $\boldsymbol{d}$ (or $\boldsymbol{w}$) is present only in the probability sample.

| Sample | | Auxiliary variables $\boldsymbol{X}$ | Target variable $Y$ | Design ($\boldsymbol{d}$) or calibrated ($\boldsymbol{w}$) weights |
|----|---:|:--:|:--:|:--:|
| $S_A$ (non-probability) | 1 | $\checkmark$ | $\checkmark$ | ? |
| | … | $\checkmark$ | $\checkmark$ | ? |
| | $n_A$ | $\checkmark$ | $\checkmark$ | ? |
| $S_B$ (probability) | $n_A+1$ | $\checkmark$ | ? | $\checkmark$ |
| | … | $\checkmark$ | ? | $\checkmark$ |
| | $n_A+n_B$ | $\checkmark$ | ? | $\checkmark$ |
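The setting in the table can be mimicked with simulated data. The following is a minimal base R sketch under assumed, made-up parameters (population size, selection mechanism and outcome model are all hypothetical); it only shows which columns each sample carries and why the naive mean of $S_A$ is misleading.

```r
set.seed(123)
N <- 10000
pop <- data.frame(x1 = rnorm(N), x2 = rbinom(N, 1, 0.4))
pop$y <- 1 + 2 * pop$x1 - pop$x2 + rnorm(N)

# S_A: non-probability sample, inclusion depends on x1 (unknown to the analyst)
p_a  <- plogis(-2 + 0.8 * pop$x1)
in_a <- rbinom(N, 1, p_a) == 1
s_a  <- pop[in_a, c("x1", "x2", "y")]   # X and Y observed, no weights

# S_B: simple random sample, Y not observed, design weight d = N / n_B
n_b <- 500
s_b <- pop[sample(N, n_b), c("x1", "x2")]
s_b$d <- N / n_b                         # X and d observed, no Y

# The unweighted mean of y in S_A ignores the selection mechanism
c(population = mean(pop$y), naive_S_A = mean(s_a$y))
```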

Basic functionalities

Suppose $Y$ is the target variable, $\boldsymbol{X}$ is a matrix of auxiliary variables, and $R$ is the inclusion indicator. If we are interested in estimating the mean $\bar{\tau}_Y$ or the sum $\tau_Y$ of the target variable given the observed data set $(y_k, \boldsymbol{x}_k, R_k)$, we can consider the following scenarios:

unit-level data is available for the non-probability sample $S_{A}$, i.e. $(y_k,\boldsymbol{x}_k)$ is available for all units $k \in S_{A}$, and population-level data is available for $\boldsymbol{x}_1,\ldots,\boldsymbol{x}_p$, denoted as $\tau_{x_{1}},\tau_{x_{2}},\ldots,\tau_{x_{p}}$, and the population size $N$ is known. We can also consider situations where population data are estimated (e.g. on the basis of a survey to which we do not have access),
unit-level data is available for the non-probability sample $S_A$ and the probability sample $S_B$, i.e. $(y_k,\boldsymbol{x}_k,R_k)$ is determined by the data: $R_k=1$ if $k \in S_A$ and $R_k=0$ otherwise, $y_k$ is observed only for sample $S_A$, and $\boldsymbol{x}_k$ is observed in both $S_A$ and $S_B$.
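In the first scenario the population information enters as a named vector of totals. A hedged sketch of how such a vector could be assembled in base R, assuming (hypothetically) that a full population frame `pop` is at hand: the column sums of the model matrix give $N$ under the name `(Intercept)` and the totals $\tau_{x_1}, \tau_{x_2}$ under the variable names.

```r
set.seed(1)
N <- 5000
pop <- data.frame(x1 = rnorm(N, 10, 2), x2 = rbinom(N, 1, 0.3))

# Column sums of the design matrix: N for the intercept, totals for x1 and x2
X_pop <- model.matrix(~ x1 + x2, data = pop)
pop_totals <- colSums(X_pop)
pop_totals
```

In practice these totals usually come from a register or from published survey estimates rather than from a unit-level frame; the names just need to line up with the terms of the model formula.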

When unit-level data is available for non-probability survey only

<table> <tr> <th>

Estimator

</th> <th>

Example code

</th> <tr> <tr> <td>

Mass imputation based on regression imputation

</td> <td>

nonprob(
  outcome = y ~ x1 + x2 + ... + xk,
  data = nonprob,
  pop_totals = c(`(Intercept)`= N,
                 x1 = tau_x1,
                 x2 = tau_x2,
                 ...,
                 xk = tau_xk),
  method_outcome = "glm",
  family_outcome = "gaussian"
)

</td> <tr> <tr> <td>

Inverse probability weighting

</td> <td>

nonprob(
  selection =  ~ x1 + x2 + ... + xk, 
  target = ~ y, 
  data = nonprob, 
  pop_totals = c(`(Intercept)` = N, 
                 x1 = tau_x1, 
                 x2 = tau_x2, 
                 ..., 
                 xk = tau_xk), 
  method_selection = "logit"
)

</td> <tr> <tr> <td>

Inverse probability weighting with calibration constraint

</td> <td>

nonprob(
  selection =  ~ x1 + x2 + ... + xk, 
  target = ~ y, 
  data = nonprob, 
  pop_totals = c(`(Intercept)`= N, 
                 x1 = tau_x1, 
                 x2 = tau_x2, 
                 ..., 
                 xk = tau_xk), 
  method_selection = "logit", 
  control_selection = control_sel(est_method = "gee", gee_h_fun = 1)
)

</td> <tr> <tr> <td>

Doubly robust estimator

</td> <td>

nonprob(
  selection = ~ x1 + x2 + ... + xk, 
  outcome = y ~ x1 + x2 + ... + xk, 
  pop_totals = c(`(Intercept)` = N, 
                 x1 = tau_x1, 
                 x2 = tau_x2, 
                 ..., 
                 xk = tau_xk), 
  data = nonprob, 
  method_outcome = "glm", 
  family_outcome = "gaussian"
)

</td> <tr> </table>
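To show what regression-based mass imputation amounts to when only population totals are known, here is a hand-rolled base R sketch on simulated data. It is an illustration under a linear outcome model with made-up parameters, not the package's internal implementation: the model is fitted on $S_A$ and the predicted population total is the inner product of the totals with the fitted coefficients.

```r
set.seed(42)
N <- 10000
pop <- data.frame(x1 = rnorm(N), x2 = rnorm(N))
pop$y <- 1 + 2 * pop$x1 - pop$x2 + rnorm(N)

# Known population totals, named like the model-matrix columns
pop_totals <- colSums(model.matrix(~ x1 + x2, data = pop))

# Non-probability sample: inclusion depends on x1
s_a <- pop[rbinom(N, 1, plogis(-2 + pop$x1)) == 1, ]

# Outcome model fitted on S_A only
beta <- coef(lm(y ~ x1 + x2, data = s_a))

# Predicted total is tau_x' beta; dividing by N gives the mean
mu_mi <- sum(pop_totals[names(beta)] * beta) / pop_totals[["(Intercept)"]]

c(population = mean(pop$y), naive = mean(s_a$y), mass_imputation = mu_mi)
```

The correction works here because the simulated outcome model is linear and correctly specified; with a misspecified model the imputed mean can remain biased.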

When unit-level data are available for both surveys
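As a rough illustration of what an inverse probability weighting fit involves when both samples are available, the base R sketch below maximises a pseudo log-likelihood for a logit propensity model, in the spirit of Chen, Li, and Wu (2020). Everything here is an assumption made for the example: the simulated data, the use of `optim()`, and the object names. The package's own estimation routine may differ.

```r
set.seed(2024)
N <- 10000
pop <- data.frame(x1 = rnorm(N), x2 = rnorm(N))
pop$y <- 1 + 2 * pop$x1 - pop$x2 + rnorm(N)

s_a <- pop[rbinom(N, 1, plogis(-2 + pop$x1)) == 1, ]  # non-probability sample
n_b <- 1000
s_b <- pop[sample(N, n_b), c("x1", "x2")]              # probability sample (SRS)
d_b <- rep(N / n_b, n_b)                               # design weights

X_a <- model.matrix(~ x1 + x2, data = s_a)
X_b <- model.matrix(~ x1 + x2, data = s_b)

# Negative pseudo log-likelihood: sum over S_A of x'theta minus the
# design-weighted sum over S_B of log(1 + exp(x'theta))
neg_pll <- function(theta) {
  eta_b <- drop(X_b %*% theta)
  -(sum(X_a %*% theta) - sum(d_b * log1p(exp(eta_b))))
}
# Its gradient: totals of x in S_A minus weighted totals of pi * x in S_B
neg_grad <- function(theta) {
  pi_b <- plogis(drop(X_b %*% theta))
  -(colSums(X_a) - colSums(X_b * (d_b * pi_b)))
}

start <- c(log(nrow(s_a) / N), 0, 0)
fit <- optim(start, neg_pll, neg_grad, method = "BFGS")

pi_a   <- plogis(drop(X_a %*% fit$par))       # estimated propensities in S_A
mu_ipw <- sum(s_a$y / pi_a) / sum(1 / pi_a)   # Hajek-type IPW mean

c(population = mean(pop$y), naive = mean(s_a$y), ipw = mu_ipw)
```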

Estimator

</th> <th>

Example code

</th> <tr> <tr> <td>

Mass imputation based on regression imputation

</td> <td>

nonprob(
  outcome = y ~ x1 + x2 + ... + xk, 
  data = nonprob, 
  svydesign = prob, 
  method_outcome = "glm", 
  family_outcome = "gaussian"
)

</td> <tr> <tr> <td>

Mass imputation based on nearest neighbour imputation

</td> <td>

nonprob(
  outcome = y ~ x1 + x2 + ... + xk, 
  data = nonprob, 
  svydesign = prob, 
  method_outcome = "nn", 
  family_outcome = "gaussian", 
  control_outcome = control_out(k = 2)
)

</td> <tr> <tr> <td>

Mass imputation based on predictive mean matching

</td> <td>

nonprob(
  outcome = y ~ x1 + x2 + ... + xk, 
  data = nonprob, 
  svydesign = prob, 
  method_outcome = "pmm", 
  family_outcome = "gaussian"
)

</td> <tr> <tr> <td>

Mass imputation based on regression imputation with variable selection (LASSO)

</td> <td>

nonprob(
  outcome = y ~ x1 + x2 + ... + xk, 
  data = nonprob, 
  svydesign = prob, 
  method_outcome = "glm", 
  family_outcome = "gaussian", 
  control_outcome = control_out(penalty = "lasso"), 
  control_inference = control_inf(vars_selection = TRUE)
)

</td> <tr> <tr> <td>

Inverse probability weighting

</td> <td>

nonprob(
  selection =  ~ x1 + x2 + ... + xk, 
  target = ~ y, 
  data = nonprob, 
  svydesign = prob, 
  method_selection = "logit"
)

</td> <tr> <tr> <td>

Inverse probability weighting with calibration constraint

</td> <td>

nonprob(
  selection =  ~ x1 + x2 + ... + xk, 
  target = ~ y, 
  data = nonprob, 
  svydesign = prob, 
  method_selection = "logit", 
  control_selection = control_sel(est_method = "gee", gee_h_fun = 1)
)

</td> <tr> <tr> <td>

Inverse probability weighting with calibration constraint with variable selection (SCAD)

</td> <td>

nonprob(
  selection =  ~ x1 + x2 + ... + xk, 
  target = ~ y, 
  data = nonprob, 
  svydesign = prob, 
  method_selection = "logit", 
  control_selection = control_sel(est_method = "gee", gee_h_fun = 1), 
  control_inference = control_inf(vars_selection = TRUE)
)

</td> <tr> <tr> <td>

Doubly robust estimator

</td> <td>

nonprob(
  selection = ~ x1 + x2 + ... + xk, 
  outcome = y ~ x1 + x2 + ... + xk, 
  data = nonprob, 
  svydesign = prob, 
  method_outcome = "glm", 
  family_outcome = "gaussian"
)

</td> <tr> <tr> <td>

Doubly robust estimator with variable selection (SCA
