<!-- README.md is generated from README.Rmd. Please edit that file -->

nonprobsvy: an R package for modern statistical inference methods based on non-probability samples <img src="man/figures/logo.png" align="right" width="150"/>

<!-- badges: start -->


<!-- badges: end -->

Basic information

The goal of this package is to provide R users access to modern methods for non-probability samples when auxiliary information from the population or a probability sample is available, namely mass imputation, inverse probability weighting, and doubly robust estimators.

The package allows for:

  • variable selection in high-dimensional space using the SCAD (Yang, Kim, and Song 2020), Lasso and MCP penalties (via the ncvreg, Rcpp, RcppArmadillo packages),
  • estimation of variance using analytical and bootstrap approaches (see Wu (2023)),
  • integration with the survey and srvyr packages when a probability sample is available (Lumley 2004, 2023; Freedman Ellis and Schneider 2024),
  • different links for selection (logit, probit and cloglog) and outcome (gaussian, binomial and poisson) variables.

Details on the use of the package can be found:

Installation

You can install the latest version of the nonprobsvy package from the main branch on GitHub with:

remotes::install_github("ncn-foreigners/nonprobsvy")

or install the stable version from CRAN:

install.packages("nonprobsvy")

or the development version from the dev branch:

remotes::install_github("ncn-foreigners/nonprobsvy@dev")
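Note that the `remotes::install_github()` calls require the remotes package to be installed first. After installation, a quick check such as the sketch below can confirm that the package is visible to the current R session. The variable names `pkg` and `installed` are illustrative only.

```r
# Check whether nonprobsvy is available before loading it
pkg <- "nonprobsvy"
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  # Report the installed version
  message(pkg, " version: ", as.character(utils::packageVersion(pkg)))
} else {
  message(pkg, " is not installed; use one of the commands above")
}
```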

Basic idea

Consider the following setting, where two samples are available: a non-probability sample (denoted $S_A$) and a probability sample (denoted $S_B$). A set of auxiliary variables (denoted $\boldsymbol{X}$) is available in both sources, while the target variable $Y$ is observed only in the non-probability sample and the design weights $\boldsymbol{d}$ (or calibrated weights $\boldsymbol{w}$) are present only in the probability sample.

| Sample | | Auxiliary variables $\boldsymbol{X}$ | Target variable $Y$ | Design ($\boldsymbol{d}$) or calibrated ($\boldsymbol{w}$) weights |
|----|---:|:--:|:--:|:--:|
| $S_A$ (non-probability) | 1 | $\checkmark$ | $\checkmark$ | ? |
| | … | $\checkmark$ | $\checkmark$ | ? |
| | $n_A$ | $\checkmark$ | $\checkmark$ | ? |
| $S_B$ (probability) | $n_A+1$ | $\checkmark$ | ? | $\checkmark$ |
| | … | $\checkmark$ | ? | $\checkmark$ |
| | $n_A+n_B$ | $\checkmark$ | ? | $\checkmark$ |
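
The setting in the table can be mimicked with a small simulation. The sketch below uses base R only. The population size, the selection mechanism for $S_A$, and the simple random sampling design for $S_B$ are illustrative assumptions and are not taken from the package.

```r
set.seed(123)

# Hypothetical finite population of size N with one auxiliary variable x
N <- 10000
x <- rnorm(N, mean = 2, sd = 1)
y <- 1 + 2 * x + rnorm(N)

# Non-probability sample S_A: inclusion depends on x through a mechanism
# that is unknown to the analyst, so no design weights are available
p_A <- plogis(-3 + 0.5 * x)
in_A <- rbinom(N, size = 1, prob = p_A) == 1
S_A <- data.frame(x = x[in_A], y = y[in_A])

# Probability sample S_B: simple random sample without replacement,
# so every unit has design weight d = N / n_B; y is not observed here
n_B <- 500
idx_B <- sample.int(N, n_B)
S_B <- data.frame(x = x[idx_B], d = rep(N / n_B, n_B))

# The naive mean of y in S_A ignores the selection mechanism
naive_mean <- mean(S_A$y)
true_mean <- mean(y)
c(naive = naive_mean, truth = true_mean)
```

Because units with larger `x` are more likely to enter `S_A` in this setup, the naive mean overstates the population mean, which is the problem the estimators in this package aim to correct.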

Basic functionalities

Suppose $Y$ is the target variable, $\boldsymbol{X}$ is a matrix of auxiliary variables, and $R$ is the inclusion indicator. If we are interested in estimating the mean $\bar{\tau}_Y$ or the sum $\tau_Y$ of the target variable given the observed data set $(y_k, \boldsymbol{x}_k, R_k)$, we can distinguish the following scenarios:

  • unit-level data is available for the non-probability sample $S_{A}$, i.e. $(y_k,\boldsymbol{x}_k)$ is available for all units $k \in S_{A}$, and population-level data is available for $\boldsymbol{x}_1,\ldots,\boldsymbol{x}_p$, denoted as $\tau_{x_{1}},\tau_{x_{2}},\ldots,\tau_{x_{p}}$, and the population size $N$ is known. We can also consider situations where population data are estimated (e.g. on the basis of a survey to which we do not have access),
  • unit-level data is available for the non-probability sample $S_A$ and the probability sample $S_B$, i.e. $(y_k,\boldsymbol{x}_k,R_k)$ is determined by the data: $R_k=1$ if $k \in S_A$, otherwise $R_k=0$; $y_k$ is observed only for sample $S_A$, and $\boldsymbol{x}_k$ is observed in both $S_A$ and $S_B$.
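
In the first scenario, the population totals are supplied as a named vector whose `(Intercept)` element holds the population size $N$. The base R sketch below shows one way such a vector could be built when a population frame is at hand. The data frame `pop` and its columns are hypothetical; in practice the totals would usually come from a register or from published estimates.

```r
set.seed(1)

# Hypothetical population frame with two auxiliary variables
pop <- data.frame(x1 = rnorm(1000), x2 = rbinom(1000, 1, 0.4))

# model.matrix() adds an intercept column of ones, so its column sum is N
X_pop <- model.matrix(~ x1 + x2, data = pop)
pop_totals <- colSums(X_pop)

# Names follow the model formula: "(Intercept)", "x1", "x2"
names(pop_totals)
```

Building the vector from `model.matrix()` keeps the names aligned with the terms of the model formula, including any dummy variables generated for factors.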

When unit-level data is available for non-probability survey only

<table class='table'> <tr> <th>

Estimator

</th> <th>

Example code

</th> </tr> <tr> <td>

Mass imputation based on regression imputation

</td> <td>
nonprob(
  outcome = y ~ x1 + x2 + ... + xk,
  data = nonprob,
  pop_totals = c(`(Intercept)`= N,
                 x1 = tau_x1,
                 x2 = tau_x2,
                 ...,
                 xk = tau_xk),
  method_outcome = "glm",
  family_outcome = "gaussian"
)
</td> </tr> <tr> <td>

Inverse probability weighting

</td> <td>
nonprob(
  selection =  ~ x1 + x2 + ... + xk, 
  target = ~ y, 
  data = nonprob, 
  pop_totals = c(`(Intercept)` = N, 
                 x1 = tau_x1, 
                 x2 = tau_x2, 
                 ..., 
                 xk = tau_xk), 
  method_selection = "logit"
)
</td> </tr> <tr> <td>

Inverse probability weighting with calibration constraint

</td> <td>
nonprob(
  selection =  ~ x1 + x2 + ... + xk, 
  target = ~ y, 
  data = nonprob, 
  pop_totals = c(`(Intercept)`= N, 
                 x1 = tau_x1, 
                 x2 = tau_x2, 
                 ..., 
                 xk = tau_xk), 
  method_selection = "logit", 
  control_selection = control_sel(est_method = "gee", gee_h_fun = 1)
)
</td> </tr> <tr> <td>

Doubly robust estimator

</td> <td>
nonprob(
  selection = ~ x1 + x2 + ... + xk, 
  outcome = y ~ x1 + x2 + ... + xk, 
  data = nonprob, 
  pop_totals = c(`(Intercept)` = N, 
                 x1 = tau_x1, 
                 x2 = tau_x2, 
                 ..., 
                 xk = tau_xk), 
  method_outcome = "glm", 
  family_outcome = "gaussian"
)
</td> </tr> </table>
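
To illustrate the idea behind mass imputation with known population totals, the sketch below computes a regression-based estimate in base R. It is a simplified illustration under a correctly specified linear model, not the package's implementation, and it does not estimate the variance. The simulated population, the selection mechanism, and the name `nonprob_df` are assumptions made for the example.

```r
set.seed(2024)

# Hypothetical population; in practice only its totals would be known
N <- 5000
x1 <- rnorm(N)
x2 <- runif(N)
y <- 2 + x1 - 3 * x2 + rnorm(N, sd = 0.5)
pop_totals <- c("(Intercept)" = N, x1 = sum(x1), x2 = sum(x2))

# Non-probability sample whose inclusion depends on x1
in_A <- rbinom(N, 1, plogis(-2 + x1)) == 1
nonprob_df <- data.frame(y = y[in_A], x1 = x1[in_A], x2 = x2[in_A])

# Fit the outcome model on the non-probability sample only
fit <- lm(y ~ x1 + x2, data = nonprob_df)

# The predicted population total is the inner product of the known totals
# and the estimated coefficients; dividing by N gives the mean
tau_hat <- sum(pop_totals[names(coef(fit))] * coef(fit))
mean_mi <- tau_hat / N

c(naive = mean(nonprob_df$y), mass_imputation = mean_mi, truth = mean(y))
```

In this setup the naive sample mean is pulled upwards by the selection on `x1`, while the model-based estimate stays close to the population mean because the outcome model holds in both the sample and the population.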

When unit-level data are available for both surveys

<table class='table'> <tr> <th>

Estimator

</th> <th>

Example code

</th> </tr> <tr> <td>

Mass imputation based on regression imputation

</td> <td>
nonprob(
  outcome = y ~ x1 + x2 + ... + xk, 
  data = nonprob, 
  svydesign = prob, 
  method_outcome = "glm", 
  family_outcome = "gaussian"
)
</td> </tr> <tr> <td>

Mass imputation based on nearest neighbour imputation

</td> <td>
nonprob(
  outcome = y ~ x1 + x2 + ... + xk, 
  data = nonprob, 
  svydesign = prob, 
  method_outcome = "nn", 
  family_outcome = "gaussian", 
  control_outcome = control_out(k = 2)
)
</td> </tr> <tr> <td>

Mass imputation based on predictive mean matching

</td> <td>
nonprob(
  outcome = y ~ x1 + x2 + ... + xk, 
  data = nonprob, 
  svydesign = prob, 
  method_outcome = "pmm", 
  family_outcome = "gaussian"
)
</td> </tr> <tr> <td>

Mass imputation based on regression imputation with variable selection (LASSO)

</td> <td>
nonprob(
  outcome = y ~ x1 + x2 + ... + xk, 
  data = nonprob, 
  svydesign = prob, 
  method_outcome = "glm", 
  family_outcome = "gaussian", 
  control_outcome = control_out(penalty = "lasso"), 
  control_inference = control_inf(vars_selection = TRUE)
)
</td> </tr> <tr> <td>

Inverse probability weighting

</td> <td>
nonprob(
  selection =  ~ x1 + x2 + ... + xk, 
  target = ~ y, 
  data = nonprob, 
  svydesign = prob, 
  method_selection = "logit"
)
</td> </tr> <tr> <td>

Inverse probability weighting with calibration constraint

</td> <td>
nonprob(
  selection =  ~ x1 + x2 + ... + xk, 
  target = ~ y, 
  data = nonprob, 
  svydesign = prob, 
  method_selection = "logit", 
  control_selection = control_sel(est_method = "gee", gee_h_fun = 1)
)
</td> </tr> <tr> <td>

Inverse probability weighting with calibration constraint with variable selection (SCAD)

</td> <td>
nonprob(
  selection =  ~ x1 + x2 + ... + xk, 
  target = ~ y, 
  data = nonprob, 
  svydesign = prob, 
  method_selection = "logit", 
  control_selection = control_sel(est_method = "gee", gee_h_fun = 1), 
  control_inference = control_inf(vars_selection = TRUE)
)
</td> </tr> <tr> <td>

Doubly robust estimator

</td> <td>
nonprob(
  selection = ~ x1 + x2 + ... + xk, 
  outcome = y ~ x1 + x2 + ... + xk, 
  data = nonprob, 
  svydesign = prob, 
  method_outcome = "glm", 
  family_outcome = "gaussian"
)
</td> </tr> <tr> <td>

Doubly robust estimator with variable selection (SCA
