--- title: "nonprobsvy -- An R Package for Modern Methods for Non-Probability Surveys" author: - Łukasz Chrostowski (Analytic Partners) - Piotr Chlebicki (Stockholm University) - Maciej Beręsewicz (Poznań University of Economics and Business; Statistical Office in Poznań) date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 bibliography: nonprobsvy.bib link-citations: true vignette: > %\VignetteIndexEntry{nonprobsvy -- An R Package for Modern Methods for Non-Probability Surveys} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} options(prompt = "R> ", continue = "+ ", width = 76) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5, fig.align = "center" ) ``` > **Abstract.** The paper presents **nonprobsvy** -- an R package for inference > based on non-probability samples. The package implements various approaches > that can be categorized into three groups: prediction-based approach, inverse > probability weighting and doubly robust approach. In the package, we assume > the existence of either population-level data or probability-based population > information and leverage the **survey** package for inference. The package > implements both analytical and bootstrap variance estimation for the proposed > estimators. In the paper we present the theory behind the package, its > functionalities and a case study that showcases the usage of the package. The > package is aimed at scientists and researchers who would like to use > non-probability samples (e.g., big data, opt-in web panels, social media) to > accurately estimate population characteristics. > > *Keywords*: data integration, doubly robust estimation, propensity score > estimation, mass imputation, **survey**. # Introduction In official statistics, information about the target population and its characteristics is mainly collected through probability surveys, censuses or is obtained from administrative registers and covers all (or nearly all) units of the population. However, owing to increasing non-response rates, particularly unit non-response and non-contact, which result from the growing respondent burden as well as rising costs of surveys conducted by National Statistical Institutes, non-probability data sources are becoming more popular [@berkesewicz2017two; @beaumont2020probability; @biffignandi2021handbook]. Non-probability surveys, such as opt-in web panels, social media, scanner data, mobile phone data or voluntary register data, are currently being explored for use in the production of official statistics [@citro2014multiple; @daas2015big], public opinion studies [@Schonlau2017] or market research [cf. @Grow2022]. Since the selection mechanism underlying these sources is unknown, standard design-based inference methods cannot be directly applied and, in the case of large datasets, can lead to the big data paradox, i.e., the larger the sample, the larger the bias, as described by @meng2018statistical. Table 1 compares basic characteristics of probability and non-probability samples. In particular, it shows the advantages and disadvantages of each type with respect to the selection mechanism, the population coverage, bias, variance, costs and timeliness. In general, the quality of non-probability samples suffers from an unknown selection mechanism (i.e., unknown probabilities of inclusion) and under-coverage of certain groups from the population (e.g., older people). As a result, direct estimates based on non-probability samples are biased and, in most cases, are characterized by small variance owing to their size. Certainly, the costs and timeliness of these surveys are significantly smaller than those of probability samples. Table: **Table 1.** A comparison of probability and non-probability samples and their characteristics. | *Factor* | *Probability sample* | *Non-probability sample* | |:----------------------|:------------------------|:----------------------------------------| | Selection | Known probabilities | Unknown probabilities (self-selection) | | Coverage | Complete | May be incomplete | | Estimation bias | Unbiased under design | Potential systematic bias | | Variance of estimates | Typically high | Typically low | | Cost | High | Low | | Timeliness | Long delay | Very short delay | To address this problem, several approaches have been proposed, which rely on the estimation of propensity scores (PS; i.e., inclusion probabilities) for deriving inverse probability weights (IPW; also known as propensity score weighting/adjustment, cf. @lee2006propensity [@lee2009estimation]), on model-based prediction (in particular, mass imputation estimators; MI) and on the doubly robust (DR) approach involving IPW and MI estimators. Two main scenarios are usually considered: 1. only population-level means or totals are available (e.g., from census, registers or sample surveys), 2. unit-level data are available either in the form of registers covering the whole population or in the form of probability surveys [cf. @elliott_inference_2017]. @wu2022statistical classified these approaches into three groups that require a joint randomization framework involving a probability sampling design (denoted as $p$) and an outcome regression model (denoted as $\xi$) or a PS model (denoted as $q$). According to this classification, IPW estimators represent the $qp$ framework, MI estimators represent the $\xi p$ framework, and DR estimators can represent either the $qp$ or the $\xi p$ framework. Most approaches assume that population data is available in order to reduce the bias of non-probability sampling by reweighting to reproduce known population totals/means (i.e., IPW estimators); by modeling the target variable using various techniques (i.e., MI estimators); or by combining both approaches (e.g., DR estimators, cf. @chen2020doubly; see also multilevel regression and post-stratification, MRP; Mister-P, cf. @gelman1997poststratification). This topic has become very popular and a number of new methods have been proposed; for instance non-parametric approaches based on nearest neighbors [@yang2021integration], kernel density estimation [@chen_nonparametric_2022], empirical likelihood [@kim2023empirical], model-calibration with LASSO [@chen2018] or quantile balanced IPW [@beresewicz2025] to name a few. It should be highlighted that, in contrast to probability samples, there is no single method that can be used for all non-probability samples. Based on the methods available in the literature several statistical software solutions have been developed. These pieces of software will be presented and compared with **nonprobsvy** [@nonprobsvy] in the next section. ## Software for non-probability samples Table 2 presents a comparison of selected packages in terms of the availability of different inference methods. The reason for this is the availability of software designed for inference based on non-probability samples, not a general way to estimate propensity scores or model $\mathbb{E}(Y \mid X)$. For such reviews, one can refer to @valliant2018survey [Chapter 6] who discussed weighting and matching in this context in Stata [@Stata2025] using `logit`, `svy` and `teffects nnmatch` commands or @lohr2021sas [e.g., Chapter 8] who covers SAS software [@sas94; e.g., `PROC LOGISTIC` procedure]. We focus on packages available as open source or via CRAN or PyPI (for non-CRAN or non-PyPI software see @cobo2024software). Our comparison includes five packages that focus specifically on non-probability sampling: in R -- **NonProbEst** [@NonProbEst] and our **nonprobsvy**, and in Python -- **balance** [@sarig2023balancepythonpackage], **inps** [@castro2024inps], and in Stan [@carpenter2017stan] e.g., through the R package **rstanarm**, which allows you to easily apply the **rstanarm**'s MRP method [@rstanarm]. Table: **Table 2.** A comparison of packages and implemented methods. | | Stan | balance (Python) | inps (Python) | NonProbEst (R) | nonprobsvy (R) | |:--------------------------------------|:----:|:----------------:|:-------------:|:--------------:|:--------------:| | IPW | -- | ✓ | ✓ | ✓ | ✓ | | Calibrated IPW | -- | -- | -- | -- | ✓ | | MI | -- | -- | -- | ✓ | ✓ | | DR | -- | -- | ✓ | -- | ✓ | | MRP | ✓ | -- | -- | -- | -- | | Variable selection | ✓ | ✓ | ✓ | ✓ | ✓ | | Analytical variance | -- | -- | -- | -- | ✓ | | Bootstrap variance | -- | -- | -- | ✓ | ✓ | | Integration with **survey**/**samplics** | -- | -- | -- | -- | ✓ | The **NonProbEst** package, while more limited than **nonprobsvy**, offers various techniques, such as PS or prediction approaches (e.g., model-calibrated). Users can choose several different settings for PS weights, variable selection and can estimate variance of the mean using the leave-one-out Jackknife procedure. Unfortunately, the package is no longer actively developed (as of August 2025) and some of the techniques are either outdated or have been shown to be inappropriate for non-probability samples. While the package contains functions designed for specific methods, it does not allow users to leverage the **survey** package [@survey-pkg] for estimation. The **balance** package is solely dedicated to the PS approach (as of version 0.10.0). It assumes that the reference probability sample is available and the authors have implemented the variance estimator of the weighted mean as a measure of uncertainty for the IPW estimator. The weights for the IPW estimator are constructed using the approach proposed by @Schonlau2017. The **inps** package supports the use of unit-level data from a probability sample or the population, implements IPW, MI and DR estimators, and offers users the possibility of selecting variables but is still at a very early stage of development. It also implements kernel weighting and a simple bootstrap approach via the **scipy.stats** package [@scipy2020]. Neither **balance** nor **inps** supports the use of the **samplics** package [@Diallo2021]. Finally, we note that the MRP approach is implemented solely in Stan with variable selection specified by an appropriate prior, but it can be easily implemented in other probabilistic programming languages, e.g., using the **Turing.jl** package [@turing] in the Julia language [@julia]. The **nonprobsvy** package has several advantages over those presented above. Firstly, it implements state-of-the-art methods recently proposed in the literature, along with valid statistical inference procedures. Secondly, it offers other approaches, such as calibrated IPW (where PS weights match either the population or estimated totals), NN and PMM matching, various IPW and DR estimators using generalized linear models (GLM) with different link functions for probability estimation. Thirdly, it supports the functions included in the **survey** package to account for the design of the probability sample. Finally, we provide a user-friendly API that mimics `glm()`, `survey::calibrate()` and other functions known in R, together with the main function to specify the approach and estimators. As far as we know, the **nonprobsvy** is the only software (open-access or commercial) that offers such functionalities. The remaining part of the paper is structured as follows. The next section is dedicated to the theory of statistical inference based on non-probability samples. We provide the basic set-up and introduce specific methods in separate sections. We follow the notation used by @wu2022statistical throughout the paper. We then describe the main function and the package functionalities, and present an empirical study showcasing the process of integrating data from the Polish Job Vacancy Survey with voluntary administrative data from the Central Job Offers Database in order to estimate the number of companies with at least one vacancy offered on a single shift. We then present classes and S3 methods implemented in the package. The paper ends with a summary and plans for future development. The Appendix presents algorithms for selected MI estimators. The package **nonprobsvy** [@nonprobsvy] is available from the Comprehensive R Archive Network (CRAN) at . # Methods for non-probability samples ## Basic setup Let $U = \{1, \ldots, N\}$ denote the target population consisting of $N$ labeled units. Each unit $i$ has an associated $L$-dimensional vector of auxiliary variables $\boldsymbol{x}_{i}$ and the study (target) variable $y_{i}$. Let $\{ (y_i, \boldsymbol{x}_i), i \in \mathcal{I}_{\text{NP}}\}$ be a non-probability sample $S_{\text{NP}}$ of size $n_{\text{NP}}$, where $\mathcal{I}_{\text{NP}}$ is an index set for $S_{\text{NP}}$. Let $\left\{\left(\boldsymbol{x}_i, \pi_{\text{P},i}\right), i \in \mathcal{I}_{\text{P}}\right\}$ be a probability sample $S_{\text{P}}$ of size $n_{\text{P}}$, where $\mathcal{I}_{\text{P}}$ is an index set for $S_{\text{P}}$, and the only information known for all units in the population refers to auxiliary variables $\boldsymbol{x}$ and inclusion probabilities $\pi_{\text{P}}$. Each unit in the $S_{\text{P}}$ sample has been assigned a design-based weight given by $d_{\text{P},i} = 1/\pi_{\text{P},i}$. Let $I_{\text{NP},i}=I(i \in S_{\text{NP}})$ and $I_{\text{P},i}=I(i \in S_{\text{P}})$ be indicators of inclusion in the non-probability sample $S_{\text{NP}}$ and the probability sample $S_{\text{P}}$, respectively, which are defined for all units in the target population. Let $\pi_{\text{NP},i}=\mathbb{P}(I_{\text{NP},i}=1 \mid \boldsymbol{x}_i, y_i)=\mathbb{P}(I_{\text{NP},i}=1 \mid \boldsymbol{x}_i)$ be propensity scores, which characterize the $S_{\text{NP}}$ sample's inclusion and participation mechanisms. Unlike $\pi_{\text{P},i}$, the $\pi_{\text{NP},i}$ and $d_{\text{NP},i}=1/\pi_{\text{NP},i}$ are unknown. The description of the data is presented in a more concise form in Table 3. Table: **Table 3.** Two-sample setting with no overlap between samples. | Sample | ID | Inclusion ($I_{\text{NP}}$) | Design weight ($d$) | Covariates ($\boldsymbol{x}$) | Study variable ($y$) | |:----------------------|:------------------|:---------------------------:|:-------------------:|:-----------------------------:|:--------------------:| | Non-probability | 1 | 1 | ? | ✓ | ✓ | | $S_{\text{NP}}$ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | | | $n_{\text{NP}}$ | 1 | ? | ✓ | ✓ | | Probability | $n_{\text{NP}}+1$ | 0 | ✓ | ✓ | ? | | $S_{\text{P}}$ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | | | $n_{\text{NP}}+n_{\text{P}}$ | 0 | ✓ | ✓ | ? | The goal is to estimate the finite population mean $\mu_{y} = N^{-1}\sum_{i=1}^{N} y_{i}$ of the target variable $y$. As values of $y_{i}$ are not observed in the probability sample, they cannot be used to estimate the target quantity. Instead, one could try combining the non-probability and probability samples (or known population totals) to estimate $\mu_{y}$. Given the absence of a universally accepted method for achieving this objective, assumptions vary considerably, as outlined by @wu2022statistical. However, the following are the main assumptions that apply to the majority of methods presented in this section: * **A1** $I_{\text{NP}, i}$ and the study variable $y_{i}$ are independent given the set of covariates $\boldsymbol{x}_{i}$ (the missing at random mechanism). * **A2** All the units in the target population have non-zero PS, i.e., $\pi_{\text{NP}, i}>0$, $i=1, 2, \ldots, N$ (i.e., no coverage error). * **A3** The indicator variables $I_{\text{NP}, 1}, I_{\text{NP}, 2}, \ldots, I_{\text{NP}, N}$ are independent given the set of auxiliary variables $\left(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\right)$ (i.e., no clustering). Currently, we ignore overlap between $S_{\text{NP}}$ and $S_{\text{P}}$, and assume no measurement error in $y_i$ and the fact that values of $\boldsymbol{x}_i$ are known. The setting presented in Table 3 can also be extended to calibrated $d_{\text{P}, i}$ weights (i.e., $d_{\text{P}, i}$ adjusted for under-coverage, non-contact or non-response; cf. @sarndal2005estimation) but this requires additional developments in the theory about the consistency of the MI, IPW and DR estimators. In the next sections we briefly present the methods implemented in the package. ## Prediction-based approach ### Prediction estimators In the prediction approach the following semi-parametric model for the finite population is assumed: $$ \mathbb{E}_{\xi}\left(y_i \mid \boldsymbol{x}_i\right)=m\left(\boldsymbol{x}_i, \boldsymbol{\beta}\right), \text { and } \quad V_{\xi}\left(y_i \mid \boldsymbol{x}_i\right)=v\left(\boldsymbol{x}_i\right) \sigma^2, \quad i=1, 2, \ldots, N, \tag{1} $$ where the mean function $m(\cdot, \cdot)$ and the variance $v(\cdot)$ have known forms, $\boldsymbol{\beta}$ and $\sigma^2$ are model parameters and $y_i$ are also assumed to be conditionally independent given the $\boldsymbol{x}_i$. The model in Equation 1 is assumed to hold for all units in the non-probability sample $S_{\text{NP}}$. The parameters of the model in Equation 1 can be estimated using the quasi maximum likelihood estimation method, which includes linear and non-linear models, such as GLM. Let $\boldsymbol{\beta}_0$ and $\sigma^2_0$ be the true values of the model parameters $\boldsymbol{\beta}$ and $\sigma^2$ under the adopted model and $\hat{\boldsymbol{\beta}}$ be the quasi maximum likelihood estimator of $\boldsymbol{\beta}_0$. Let $m_i=m(\boldsymbol{x}_i, \boldsymbol{\beta}_0)$ and $\hat{m}_i=m(\boldsymbol{x}_i, \hat{\boldsymbol{\beta}})$ be calculated for all units $i=1, \ldots, N$. Under this setting, as @wu2022statistical notes, there are two commonly used prediction estimators: $$ \hat{\mu}_{y, \text{PR1}}=\frac{1}{N} \sum_{i=1}^N \hat{m}_i \quad \text { and } \quad \hat{\mu}_{y, \text{PR2}}=\frac{1}{N}\left\{\sum_{i \in S_{\text{NP}}} y_i-\sum_{i \in S_{\text{NP}}} \hat{m}_i+\sum_{i=1}^N \hat{m}_i\right\}. \tag{2} $$ Under linear models, where $m(\boldsymbol{x}_i, \boldsymbol{\beta})=\boldsymbol{x}_i^{\top}\boldsymbol{\beta}$, the two estimators in Equation 2 reduce to: $$ \hat{\mu}_{y, \text{PR1}}=\mu_{\boldsymbol{x}}^{\top} \hat{\boldsymbol{\beta}} \quad \text { and } \quad \hat{\mu}_{y, \text{PR2}}=\frac{n_{\text{NP}}}{N}\left(\overline{y}_{\text{NP}}-\overline{\boldsymbol{x}}_{\text{NP}}^{\top} \hat{\boldsymbol{\beta}}\right)+\mu_{\boldsymbol{x}}^{\top} \hat{\boldsymbol{\beta}}, \tag{3} $$ where $\mu_{\boldsymbol{x}} = N^{-1}\sum_{i=1}^N\boldsymbol{x}_i$ is the vector of the population means of the $\boldsymbol{x}$ variables and $\overline{\boldsymbol{x}}_{\text{NP}}=n_{\text{NP}}^{-1}\sum_{i \in S_{\text{NP}}}\boldsymbol{x}_i$ is the vector of the simple means of $\boldsymbol{x}$ from the non-probability $S_{\text{NP}}$ sample. If the linear model contains an intercept and $\hat{\boldsymbol{\beta}}$ is the ordinary least squares estimator, then $\hat{\mu}_{y, \text{PR1}}=\hat{\mu}_{y, \text{PR2}}$. The estimator in Equation 3 is appealing as it only requires the non-probability sample $S_{\text{NP}}$ and reference population means (or totals and population size $N$). If the population means are unknown, they can be replaced by estimates provided by the reference probability sample $S_{\text{P}}$, i.e., $\sum_{i=1}^N \hat{m}_i$ is replaced with $\sum_{i \in S_{\text{P}}} d_{\text{P}, i}\hat{m}_i$ for Equation 2 and $\mu_{\boldsymbol{x}}$ is replaced by $\hat{\mu}_{\boldsymbol{x}}=\hat{N}_{\text{P}}^{-1}\sum_{i \in S_{\text{P}}}d_{\text{P}, i}\boldsymbol{x}_i$ for Equation 3 where $\hat{N}_{\text{P}}=\sum_{i \in S_{\text{P}}}d_{\text{P}, i}$. ### Mass imputation estimators Model-based prediction estimators of $\mu$ can be treated as mass imputation estimators, since the information on $y_i$ is missing entirely in the reference probability sample $S_{\text{P}}$ (but $\boldsymbol{x}_i$ is available) and $y_i$ can be imputed based on the non-probability sample as values of $\{ (y_i, \boldsymbol{x}_i), i \in \mathcal{I}_{\text{NP}}\}$ are known. The general form of the MI estimator is given by: $$ \hat{\mu}_{y, \text{MI}}=\frac{1}{\hat{N}_{\text{P}}} \sum_{i \in S_{\text{P}}} d_{\text{P}, i} y_i^*, $$ where $y_i^*$ is the imputed value of $y_i$ and $\hat{N}_{\text{P}}$ is defined as previously. Under deterministic regression imputation, the $\hat{\mu}_{y, \text{MI}}$ estimator reduces to the estimators in Equation 2. There are several ways of imputing $y_i^*$ and in the **nonprobsvy** package we have implemented the following MI estimators: the semi-parametric approach based on GLM (MI-GLM), kernel smoothing (MI-NPAR), nearest neighbor matching (MI-NN) and predictive mean matching (MI-PMM). The properties of the MI-GLM estimator, where $y_i^*$ are imputed with $\hat{m}_i$ from the semi-parametric model, were studied by @kim_combining_2021. In the **nonprobsvy** package, we account for the following GLM families: `gaussian()`, `binomial()` and `poisson()`. The MI-NPAR estimator was proposed by @chen_nonparametric_2022 who studied its finite population properties. Currently, we implement this approach using local polynomial regression using `stats::loess()` function. In the next releases we will allow users to use the **np** [@nppkg] or **KernSmooth** [@KernSmooth] packages. The MI-NN estimator was initially proposed by @rivers2007sampling under the name sample matching and theoretical properties of the MI-NN estimator for large non-probability samples (big data, i.e., covering a significant part of the target population) were studied by @yang2021integration. The basic idea of NN matching is as follows: 1. for each unit $i$ in the probability sample $S_{\text{P}}$ find a donor $j$ (or multiple donors) in the $S_{\text{NP}}$ sample based on some distance between $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$, 2. use the matched values $y_j$ from $S_{\text{NP}}$ to impute missing $y_i$ values in the probability sample $S_{\text{P}}$. Imputed values of $y_i^*$ depend on the number of selected $k$ neighbors: for $k=1$, the closest unit is selected, and for $k>1$, one can calculate a simple average over a vector of selected $y$ values. A detailed description of the procedure is presented in Algorithm 1 in the Appendix. The MI-NN estimator suffers from the curse of dimensionality, i.e., asymptotic bias of the MI estimator increases as the number of covariates $\boldsymbol{x}$ increases with a fixed $k$ [@abadie2006large; @yang_asymptotic_2020]. The single PMM imputation approach has been proposed to overcome this issue [@chlebicki2025]. In the PMM approach, matching is done using predicted values of $\hat{m}_i=m\left(\boldsymbol{x}_i, \hat{\boldsymbol{\beta}}\right)$ instead of $\boldsymbol{x}_i$, thus the NN algorithm is modified as follows: 1. fit the model $m\left(\boldsymbol{x}_i, \boldsymbol{\beta}\right)$ to non-probability $S_{\text{NP}}$ sample, 2. assign predicted values $\hat{m}_i$ to all units in $S_{\text{NP}}$ and $S_{\text{P}}$, 3. match all units from the $S_{\text{P}}$ sample to donor units from the $S_{\text{NP}}$ sample based on $\hat{m}$ values. The MI-PMM estimator is the same as in the NN approach but differs from the multiple imputation PMM implemented in the **mice** package [@mice]. In **nonprobsvy** we have implemented a single imputation method (i.e., one dataset is created). @chlebicki2025 studied properties of two variants of the MI-PMM estimator for non-probability samples: matching predicted to predicted ($\hat{m}-\hat{m}$ matching; denoted as MI-PMM-A) and matching predicted to observed ($\hat{m}-y$ matching; denoted as MI-PMM-B). Details of the procedure can be found in Algorithm 2 and Algorithm 3 in the Appendix. @chlebicki2025 also prove the consistency of the MI-PMM-A estimator under model misspecification, i.e., the assumed model may differ from the true one. ### Variance estimators for the prediction approach Variance of the MI estimators can be estimated analytically or using the bootstrap approach. The analytical estimator of the variance of the MI-GLM estimator proposed by @kim_combining_2021 [p. 950] contains two components: $\hat{V}_1$ (based on the information from both samples $S_{\text{NP}}$ and $S_{\text{P}}$) and $\hat{V}_2$ (based exclusively on the probability sample $S_{\text{P}}$). For the MI-NN estimator @yang2021integration proposed a variance estimator for large $S_{\text{NP}}$ samples, which reduces to the part for the probability sample $S_{\text{P}}$ (i.e., the design-based variance estimator of the mean, which can easily be obtained from the **survey** package) and a version for smaller samples can be found in @chlebicki2025. With respect to the variance of MI-PMM estimators, @chlebicki2025 propose formulas which are the same as those for the MI-NN estimators. The analytical variance estimator for the MI-NPAR was proposed by @chen_nonparametric_2022. In the bootstrap approach each bootstrap replication $b=1, \ldots, B$ consists of the following steps. 1. Independently: * Draw a simple random sampling with replacement from the non-probability sample $S_{\text{NP}}$ (using `base::sample()` function). * Draw a sample according to the declared sampling design from the probability sample $S_{\text{P}}$ (e.g., one can use the `as.svrepdesign()` function from the **survey** package). 2. Estimate $\mu_{y, \text{MI}}^b$ using an appropriate approach (e.g., MI-GLM, MI-NN or MI-PMM). After obtaining $B$ bootstrap replicates, estimate variance using the following equation: $$ \hat{V}_{\text{boot}} = \frac{1}{B-1}\sum_{b=1}^B\left(\hat{\mu}^b_{y, \text{MI}} - \hat{\mu}_{y, \text{MI}}\right)^2. \tag{4} $$ The above approaches are applied when unit-level data from the probability sample $S_{\text{P}}$ are available. If this is not the case and only population means (or totals and population size) are available, we can estimate the variance of the $\mu_{y, \text{MI-GLM}}$ estimator using the first component $\hat{V}_1$ of the @kim_combining_2021 variance estimator (replaced by the survey-based population quantities, if available). To estimate the variance of the MI-NN and MI-PMM estimators we only allow the bootstrap approach with known population means. Note that the current version of the **nonprobsvy** does not support the use of replicated weights in the probability sample $S_{\text{P}}$ for any of the estimators discussed in this paper. ## Inverse probability weighting Inverse probability weighting, another popular estimation approach, involves estimating PS given by $\pi_{\text{NP}, i}=\mathbb{P}\left(i \in S_{\text{NP}} \right)$. As in the case of the prediction-based approach, there are two variants of the IPW estimator, given by: $$ \hat{\mu}_{y, \text{IPW-HT}}=\frac{1}{N} \sum_{i \in S_{\text{NP}}} \frac{y_i}{\hat{\pi}_{\text{NP}, i}} \quad \text{and} \quad \hat{\mu}_{y, \text{IPW-Hájek}}=\frac{1}{\hat{N}_{\text{NP}}} \sum_{i \in S_{\text{NP}}} \frac{y_i}{\hat{\pi}_{\text{NP}, i}}, \tag{5} $$ where the $\hat{\mu}_{y, \text{IPW-HT}}$ is an adjustment of the Horvitz-Thompson estimator, and the $\hat{\mu}_{y, \text{IPW-Hájek}}$ is an adjustment of the Hájek estimator, where the estimated population size is given by $\hat{N}_{\text{NP}} = \sum_{i \in S_{\text{NP}}} \hat{\pi}_{\text{NP}, i}^{-1}$. The use of this estimator with respect to non-probability samples is discussed by @lee2006propensity and @biffignandi2021handbook [Chapter 13] and there are several approaches to using propensity scores along with alternative versions of the weights [cf. @elliott_inference_2017, Section 3]. In an article by @chen2020doubly dedicated to the properties of the estimators in Equation 5, the authors proved their consistency and derived their closed form versions. @wu2022statistical [Section 4.2] argues that the $\hat{\mu}_{y, \text{IPW-Hájek}}$ estimator performs better than $\hat{\mu}_{y, \text{IPW-HT}}$ even if the population size is known. The construction of the IPW estimator involves two steps: 1. estimating the PS, 2. deriving $d_{\text{NP}, i}$ weights (which, in our case, are equal to $\pi_{\text{NP}, i}^{-1}$) To estimate the propensity scores $\pi_{\text{NP}, i}=\pi(\boldsymbol{x}_i, \boldsymbol{\gamma})$ one can use the likelihood approach assuming that the information about $\boldsymbol{x}_i$ is available for each unit in the population given by $$ \begin{aligned} \ell(\boldsymbol{\gamma}) &= \log\left\{\prod_{i=1}^N \left(\pi_{\text{NP}, i}\right)^{I_{\text{NP}, i}}\left(1-\pi_{\text{NP}, i}\right)^{1-I_{\text{NP}, i}}\right\} \\ &= \sum_{i \in S_{\text{NP}}} \log \left\{\frac{\pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)}{1-\pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)}\right\}+\sum_{i=1}^N \log \left\{1-\pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)\right\}. \end{aligned} \tag{6} $$ In practice, a function of this form cannot be used because not all units from the population are observed. A more realistic approach consists in using a reference probability sample $S_{\text{P}}$, which means that the second component of the equation in Equation 6 is replaced, yielding a pseudo log-likelihood function given by $$ \ell^*(\boldsymbol{\gamma}) = \sum_{i \in S_{\text{NP}}} \log \left\{\frac{\pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)}{1-\pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)}\right\}+ \sum_{i \in S_{\text{P}}} d_{\text{P}, i} \log \left\{1-\pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)\right\}. $$ The maximum pseudo-likelihood estimator $\hat{\boldsymbol{\gamma}}$ can be obtained as the solution to the pseudo score equation, which, under the logit function assumed for $\pi_{\text{NP}, i}$, is given by $$ \boldsymbol{U}(\boldsymbol{\gamma}) = \sum_{i \in S_{\text{NP}}} \boldsymbol{x}_i - \sum_{i \in S_{\text{P}}} d_{\text{P}, i} \pi(\boldsymbol{x}_i, \boldsymbol{\gamma}) \boldsymbol{x}_i. \tag{7} $$ In general, pseudo score functions $\boldsymbol{U}(\boldsymbol{\gamma})$ for true values of model parameters $\boldsymbol{\gamma}_0$ are unbiased under the joint $q p$ randomization in the sense that $\mathbb{E}_{q p}\left\{\boldsymbol{U}\left(\boldsymbol{\gamma}_0\right)\right\}=\boldsymbol{0}$, which implies that the estimator $\hat{\boldsymbol{\gamma}}$ is consistent for $\boldsymbol{\gamma}_0$ [@wu2022statistical]. The terms in Equation 7 can be replaced by general estimation equations. Let $\boldsymbol{h}(\boldsymbol{x}, \boldsymbol{\gamma})$ be a user-specified vector of functions with the same dimension of $\boldsymbol{\gamma}$ and $\boldsymbol{G}(\boldsymbol{\gamma})$ is defined as $$ \boldsymbol{G}(\boldsymbol{\gamma})=\sum_{i \in S_{\text{NP}}} \boldsymbol{h}\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)-\sum_{i \in S_{\text{P}}} d_{\text{P}, i} \pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right) \boldsymbol{h}\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right), $$ then solving for $\boldsymbol{G}(\boldsymbol{\gamma})=\boldsymbol{0}$ with the chosen parametric form of $\pi_{\text{NP}, i}$ and the chosen $\boldsymbol{h}(\boldsymbol{x}, \boldsymbol{\gamma})$ produces the consistent estimator of $\hat{\boldsymbol{\gamma}}$. In the literature, the most commonly considered functions are $\boldsymbol{h}\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right) = \boldsymbol{x}_i$ and $\boldsymbol{h}\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right) = \boldsymbol{x}_i \pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)^{-1}$. Note that if the function $\boldsymbol{h}\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)=\boldsymbol{x}_i$, then $\boldsymbol{G}$ reduces to $\boldsymbol{U}$ and for the second variant of the $\boldsymbol{h}$ function we get $$ \boldsymbol{G}(\boldsymbol{\gamma}) = \sum_{i \in S_{\text{NP}}} \frac{\boldsymbol{x}_i}{\pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right) }-\sum_{i \in S_{\text{P}}} d_{\text{P}, i} \boldsymbol{x}_{i}, \tag{8} $$ which can be viewed as calibrated IPW. Equation 8 only requires the knowledge of population totals for auxiliary variables $\boldsymbol{x}$. Moreover, the use of Equation 8 yields a doubly robust estimator under the assumption that the outcome model is linear [cf. @kim_theory_2012]. ### Variance estimators for the inverse probability weighting approach @chen2020doubly [Section 3.2] derived asymptotic variance estimators for both IPW estimators in Equation 5 and presented the plug-in variance estimator for the $\hat{\mu}_{y, \text{IPW-Hájek}}$ estimator assuming logistic regression. In the package we have implemented this approach for `"logit"`, `"probit"` and `"cloglog"` link functions of the `binomial()` family. We refer the reader to @wu2022statistical [Section 6.2] and @chrostowski2024statistical [Chapter 3] for more details on how these estimators are derived based on the general estimating equations approach. Another approach is to use a bootstrap, which is essentially the same as the one presented in Equation 4, where $\hat{\mu}_{y}$ is replaced by one of the estimators in Equation 5. ## Doubly robust approach The IPW and MI estimators are suited to correctly specified models for PS and outcome regression models, respectively. The DR approach was proposed to improve robustness and efficiency [cf. @robins1994estimation]. It incorporates a prediction model for the response variable $y_i$ and a PS model for participation $I_{\text{NP}, i}$. This approach is doubly robust in the sense that the DR estimator remains consistent even if one of the models is misspecified. We need to consider a joint randomization approach involving a non-probability sample $S_{\text{NP}}$ and a probability sample $S_{\text{P}}$ and DR inference is conducted within the $qp$ or $\xi p$ framework without specifying which one is correct. The general formula for the DR estimator is given by $$ \tilde{\mu}_{y, \text{DR}} = \frac{1}{N}\sum_{i \in S_{\text{NP}}}\frac{y_i - m_i}{\pi_{\text{NP}, i}} + \frac{1}{N}\sum_{i =1}^N m_i, $$ where $\pi_{\text{NP}, i}$ and $m_i$ are defined as previously. In the next sections we discuss two approaches to the DR estimation. ### Parameters estimated separately @chen2020doubly proposed two DR estimators given in Equation 9 and Equation 10 assuming that the population size is either known or estimated: $$ \hat{\mu}_{y, \text{DR-HT}}=\frac{1}{N} \sum_{i \in S_{\text{NP}}} d_{\text{NP}, i}\left\{y_i-m\left(\boldsymbol{x}_i, \hat{\boldsymbol{\beta}}\right)\right\}+\frac{1}{N} \sum_{i \in S_{\text{P}}} d_{\text{P}, i} m\left(\boldsymbol{x}_i, \hat{\boldsymbol{\beta}}\right), \tag{9} $$ and $$ \hat{\mu}_{y, \text{DR-Hájek}}=\frac{1}{\hat{N}_{\text{NP}}} \sum_{i \in S_{\text{NP}}} d_{\text{NP}, i}\left\{y_i-m\left(\boldsymbol{x}_i, \hat{\boldsymbol{\beta}}\right)\right\}+\frac{1}{\hat{N}_{\text{P}}} \sum_{i \in S_{\text{P}}} d_{\text{P}, i} m\left(\boldsymbol{x}_i, \hat{\boldsymbol{\beta}}\right), \tag{10} $$ where $d_{\text{NP}, i}=\pi\left(\boldsymbol{x}_i, \hat{\boldsymbol{\gamma}}\right)^{-1}$, $\hat{N}_{\text{NP}}=\sum_{i \in S_{\text{NP}}} d_{\text{NP}, i}$ and $\hat{N}_{\text{P}}=\sum_{i \in S_{\text{P}}} d_{\text{P}, i}$. The estimator in Equation 10, including the estimated population size, has a better performance in terms of bias and the mean squared error and should be used in practice. However, the main limitation is associated with variance estimation, which is discussed at the end of this section. @chen2020doubly suggested constructing the estimators in Equation 9 or 10 based on the two models estimated separately, which is an attractive option, since one can specify a different number of variables for the propensity and outcome model. An alternative approach, proposed by @yang_doubly_2020, and similar to the one described by @kim2014doubly, is discussed in the next section. ### Minimization of the bias for doubly robust methods @yang_doubly_2020 discussed variable selection for a high-dimensional setting and noted that the asymptotic bias of the estimator, which can increase, cannot be controlled. Therefore, according to @yang_doubly_2020, the idea is to develop equations that can be used to estimate the $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ parameters based on the bias of the population mean estimator. In this way the parameters can be estimated in a single step, rather than in two separate steps. First, the authors derived the asymptotic bias of the $\hat{\mu}_{\text{DR}}$, assuming $\boldsymbol{h}(\boldsymbol{x}_i, \boldsymbol{\gamma})=\boldsymbol{x}_i\pi(\boldsymbol{x}_i, \boldsymbol{\gamma})^{-1}$ for the IPW estimator, which is given by $$ \begin{aligned} \operatorname{bias}\left(\hat{\mu}_{\text{DR}}\right) = \mathbb{E}\lvert\hat{\mu}_{\text{DR}}-\mu\rvert &= \mathbb{E}\left\{\frac{1}{N} \sum_{i=1}^N\left\{\frac{I_{\text{NP}, i}}{\pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)} - 1\right\}\left\{y_i-m\left(\boldsymbol{x}_i, \boldsymbol{\beta}\right)\right\} \right\}\\ & + \mathbb{E}\left\{\frac{1}{N} \sum_{i=1}^N\left(I_{\text{P}, i} d_{\text{P}, i}-1\right) m\left(\boldsymbol{x}_i, \boldsymbol{\beta}\right)\right\}. \end{aligned} $$ The goal of this approach is to minimize $\operatorname{bias}\left(\hat{\mu}_{\text{DR}}\right)^2$, which consists in solving the following system of empirical equations: $$ \left(\begin{array}{c} \sum_{i=1}^N I_{\text{NP}, i}\left\{\frac{1}{\pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)}-1\right\}\left\{y_i-m\left(\boldsymbol{x}_i, \boldsymbol{\beta}\right)\right\} \boldsymbol{x}_i \\ \sum_{i=1}^N \frac{I_{\text{NP}, i}}{\pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)} \dot{m}\left(\boldsymbol{x}_i, \boldsymbol{\beta}\right) -\sum_{i \in S_{\text{P}}} d_{\text{P}, i} \dot{m}\left(\boldsymbol{x}_i, \boldsymbol{\beta}\right) \end{array}\right) = \boldsymbol{0}, \tag{11} $$ where $\dot{m}\left(\boldsymbol{x}_i, \boldsymbol{\beta}\right)=\frac{\partial m\left(\boldsymbol{x}_i, \boldsymbol{\beta}\right)}{\partial \boldsymbol{\beta}}$. The system in Equation 11 can be solved using the Newton--Raphson method. This approach, without variable selection, is equivalent to that proposed by @kim2014doubly and was extensively discussed by @chen2020doubly and @wu2022statistical in the context of estimating parameters and the variance of the DR estimator. The main limitation of this approach is the possibility that a solution to Equation 11 may not exist unless the two sets of covariates used in the outcome regression model and the PS model have the same dimensions. That is why @yang_doubly_2020 suggested using this approach on the union of variables from both models (e.g., after variable selection). In the **nonprobsvy** package we have implemented these approaches not only for $\boldsymbol{h}(\boldsymbol{x}_i, \boldsymbol{\gamma})=\boldsymbol{x}_i \pi\left(\boldsymbol{x}_i, \boldsymbol{\gamma}\right)^{-1}$ but also for $\boldsymbol{h}(\boldsymbol{x}_i, \boldsymbol{\gamma})=\boldsymbol{x}_i$ and various link functions for the propensity score model. As noted in the beginning, the choice of either Equation 9 or 10 results in a different approach to estimating variance, which is discussed in the next section. ### Variance estimators for the doubly robust approach @yang_doubly_2020 derived a closed form estimator for Equation 9 but this requires the knowledge of the population and bias correction to obtain a valid estimator for $V_{\xi p}\left(\hat{\mu}_{y, \text{DR}}-\mu_y\right)$ under the outcome regression model $\xi$. A doubly robust variance estimator for $\hat{\mu}_{y, \text{DR-Hájek}}$ given by Equation 10 is not yet available in the literature. In the package, to offer the analytical variance estimator of $\hat{\mu}_{y, \text{DR-Hájek}}$ we simply replace $N$ with estimated $\hat{N}_{\text{NP}}$ and $\hat{N}_{\text{P}}$ but we urge caution when using this approach. Alternatively, one can use the bootstrap approach. @chen2020doubly demonstrated that the bootstrap approach presented in the prediction approach section performs well in terms of the coverage rate when one of the working models is correctly specified. This is why this approach is recommended for all users. ## Variable selection algorithms @yang_asymptotic_2020 point out that it is crucial to use variable selection techniques during estimation, especially when dealing with high-dimensional non-probability samples. Variable selection not only improves model stability and computational feasibility, but also reduces variance, which can increase when irrelevant auxiliary variables are included. The most popular approaches described in the literature are penalization methods, such as least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD) or minimax concave penalty (MCP), which, thanks to the appropriate loss functions, shrink the coefficients in variables that have no significant effect on the dependent variable [cf. @tibshirani1996regression; @ncvreg]. The selection procedure for non-probability methods works in a similar way, with loss functions modified to account for external data sources, such as sample or population totals or averages. In particular, the technique consists of two steps: 1. select the relevant variables using an appropriately constructed loss function (and possibly using the approach shown in Equation 11 to obtain the final estimates of the model parameters), 2. construct the selected estimator using variables selected from the first step. For instance, @yang_doubly_2020 used Equation 12 as a loss function for estimating outcome equation parameters: $$ \operatorname{Loss}\left(\lambda_{\boldsymbol{\beta}}\right)=\sum_{i=1}^N I_{\text{NP}, i}\left[y_i-m\left\{\boldsymbol{x}_i, \boldsymbol{\beta}(\lambda_{\boldsymbol{\beta}})\right\}\right]^2, \tag{12} $$ where $m\left\{\boldsymbol{x}_i, \boldsymbol{\beta}(\lambda_{\boldsymbol{\beta}})\right\}$ is the penalized function for the $\boldsymbol{\beta}$ parameters with a tuning parameter $\lambda_{\boldsymbol{\beta}}$ and loss functions for the PS function presented in Table 4, where $\lambda_{\boldsymbol{\gamma}}$ is the tuning parameter and $\boldsymbol{x}_{i, l}$ denotes $l$-th component of $\boldsymbol{x}_{i}$. Table: **Table 4.** Loss functions for the PS function depending on the $\boldsymbol{h}(\cdot, \cdot)$ function. | $\boldsymbol{h}(\boldsymbol{x}_i, \boldsymbol{\gamma})$ function | Loss function | |:---|:---| | $\boldsymbol{x}_i$ | $\sum_{l=1}^L\left(\sum_{i=1}^N\left[I_{\text{NP}, i} - \frac{I_{\text{P}, i}\pi\left\{\boldsymbol{x}_i, \boldsymbol{\gamma}(\lambda_{\boldsymbol{\gamma}})\right\}}{\pi_{\text{P}, i}}\right] \boldsymbol{x}_{i, l}\right)^2$ | | $\boldsymbol{x}_i \pi(\boldsymbol{x}_i, \boldsymbol{\gamma})^{-1}$ | $\sum_{l=1}^L\left(\sum_{i=1}^N\left[\frac{I_{\text{NP}, i}}{\pi\left\{\boldsymbol{x}_i, \boldsymbol{\gamma}(\lambda_{\boldsymbol{\gamma}})\right\}}-\frac{I_{\text{P}, i}}{\pi_{\text{P}, i}}\right] \boldsymbol{x}_{i, l}\right)^2$ | @yang_doubly_2020 only discussed the SCAD penalty and the $\boldsymbol{h}(\boldsymbol{x}_i, \boldsymbol{\gamma})=\boldsymbol{x}_i$ function for the DR estimator. In the **nonprobsvy** package we have extended this approach to the first variant of $\boldsymbol{h}(\boldsymbol{x}_i, \boldsymbol{\gamma})$, shown in the first row of Table 4, and have allowed the user to select other link functions for the $\pi_{\text{NP}, i}$, implemented other penalty functions and extended the possibility of selecting variables to MI and IPW estimators. In the next section we discuss how to define the approaches presented above. # The main function and the package functionalities ## The `nonprob` function The **nonprobsvy** package is built around the `nonprob()` function. The main design objective was to make the use of `nonprob` as similar as possible to standard R functions for fitting statistical models, such as `stats::glm()`, while incorporating survey design features from the **survey** package. The most important arguments are given in Table 5 and the obligatory ones include `data` as well as one of the following three -- `selection`, `outcome`, or `target` -- depending on which method has been selected. In the case of `outcome` and `target` multiple $y$ variables can be specified. Table: **Table 5.** A description of the `nonprob()` function arguments. | Argument | Description | |:---|:---| | `data` | A `data.frame` with data from the non-probability sample of size $n_{\text{NP}}$. The data frame must contain all variables referenced in the `selection`, `outcome`, and `target` formulas. | | `selection` | A `formula` for the selection equation (e.g., `~ x1 + x2`). This formula defines the PS model (for the IPW estimator). | | `outcome` | A `formula` for the outcome equation (e.g., `y1 + y2 ~ x1 + x2`). We allow for multiple target variables under the same prediction model (for the MI and DR estimator). | | `target` | A `formula` with target variables (e.g., `~ y1 + y2 + y3`). | | `svydesign` | An optional `svydesign2` object containing the probability sample of size $n_{\text{P}}$. Must include all variables specified in `selection`, `outcome`, and `target` formulas. Used for calibration and validation of the non-probability sample estimates. | | `pop_totals`, `pop_means`, `pop_size` | An optional named `vector` with population totals or means of the covariates and population size. For `pop_totals` and `pop_means`, length must equal the number of covariates in the selection model, with names matching variable names in the `data` frame. | | `method_selection` | A link function for the PS (`c("logit", "probit", "cloglog")`). | | `method_outcome` | Specification of the MI approach (`c("glm", "nn", "pmm", "npar")`). | | `family_outcome` | A GLM family for the MI approach (`c("gaussian", "binomial", "poisson")`). | | `subset` | An optional `vector` specifying a subset of observations to be used in the fitting process. | | `strata` | An optional parameter for estimation for sub-populations (currently not supported). | | `case_weights` | An optional `vector` of prior case (frequency) weights to be used in the fitting process. | | `na_action` | A `function` indicating what to do with `NA`'s (default `na.omit`). | | `control_selection`, `control_outcome`, `control_inference` | Control functions with parameters for PS, the outcome model and variance estimation, respectively. | | `start_selection`, `start_outcome` | An optional `vector` with starting values for the parameters of the PS and the outcome equation. | | `verbose` | A `logical` value indicating if information should be printed. | | `se` | A `logical` value indicating whether to calculate and return the standard error of the estimated mean. | | `...` | Additional optional argument for further development (currently not supported). | The package allows the user to provide either reference population data (via the `pop_totals`, or `pop_means` and `pop_size`) or a probability sample declared by the `svydesign` argument (`svydesign2` class from the **survey** package). The `nonprob()` function is used to specify inference methods through the `selection` and `outcome` arguments. If a `svydesign2` object is provided and the `selection` argument is specified, then the IPW estimators are used (by default parameters of the PS model employ Equation 7), if the `outcome` argument is specified, then the MI approach is used (the default option is the MI-GLM with the `gaussian` family) and if both are specified, then the DR approach is applied (parameters $(\boldsymbol{\beta}, \boldsymbol{\gamma})$ are estimated separately and the Equation 10 is used). A particular inference method is selected through the `method_selection`, `method_outcome`, `family_outcome`, `control_selection` and `control_outcome` arguments. The variance estimation method is selected through the `control_inference` argument. Resulting object of class `nonprob` is a list that contains the following (most important) elements: * `data` -- a `data.frame` containing the non-probability sample. * `X` -- a `matrix` containing both samples. * `y` -- a `list` containing all variables declared in either the `target` or `outcome` arguments. * `R` -- a numeric `vector` informing about inclusion in the non-probability sample. * `ps_scores` -- propensity scores or `NULL` (for the MI estimators). * `ipw_weights` -- inverse probability weights or `NULL` (for the MI estimators). * `output` -- a `data.frame` containing point and standard error estimates. * `confidence_interval` -- a `data.frame` containing confidence intervals for the mean. * `outcome` -- a `list` of results for each `outcome` model. * `selection` -- a `list` of results for the `selection` model. * `svydesign` -- a `svydesign2` object passed by the `svydesign` argument. * `ys_rand_pred, ys_nons_pred` -- a `list` of predicted values from the MI model. ## Controlling the type of estimators The `control_out()` function can be used to specify various aspects of the estimation process, including the variable selection methods (through different `penalty` options like SCAD, LASSO, and MCP with their respective tuning parameters defined in the same way as in the `control_sel()` function; cf. @yang_doubly_2020), and detailed configuration for NN, PMM and NPAR approaches (using parameters like `pmm_match_type`, `pmm_weights`, `pmm_k_choice` or `npar_loess`). For NN and PMM approaches we use the **RANN** package [@rann-pkg] with the kd-tree algorithm and the Euclidean distance as default. We currently do not support other distances (e.g., Gower). Table 6 presents example usage of the `control_out()` function for three types of MI estimators. Table: **Table 6.** Example declarations of the MI estimators. | Estimator | Declaration with `control_out()` | |:---|:---| | MI-GLM with the LASSO penalty and 5 folds | `nonprob(outcome = y1 ~ x1 + x2, data = df, svydesign = prob, control_outcome = control_out(penalty = "lasso", nfolds = 5))` | | MI-NN with the `bd` algorithm | `nonprob(outcome = y1 ~ x1 + x2, data = df, svydesign = prob, control_outcome = control_out(treetype = "bd"))` | | MI-PMM-B with $k=3$ | `nonprob(outcome = y1 ~ x1 + x2, data = df, svydesign = prob, control_outcome = control_out(k = 3, pmm_match_type = 2))` | The `control_sel()` function provides essential control parameters for fitting the selection model in the `nonprob()` function. It allows users to select between MLE given by Equation 7 or GEE defined in Equation 8 through the `est_method` argument, specify the $\boldsymbol{h}(\cdot, \cdot)$ function through the `gee_h_fun` argument, specify the optimizer (`optimizer` argument) and which variable selection method should be applied (using different penalty functions like SCAD, lasso, and MCP by specifying the `penalty` argument) along with parameters (e.g., the number of folds through the `nfolds` argument). The parameters of the PS for the calibrated IPW is estimated by using the **nleqslv** package and fitting parameters (arguments starting with the `nleqslv_*`). Table 7 presents example usage of the `control_sel()` function for two types of IPW estimators. Table: **Table 7.** Example declarations of the IPW estimators. | Estimator | Declaration with `control_sel()` | |:---|:---| | Calibrated IPW | `nonprob(selection = ~ x1 + x2, target = ~ y1, data = df, svydesign = prob, control_selection = control_sel(est_method = "gee"))` | | IPW with the MCP penalty and 5 folds | `nonprob(selection = ~ x1 + x2, target = ~ y1, data = df, svydesign = prob, control_selection = control_sel(penalty = "MCP", nfolds = 5))` | ## Controlling variance estimation Finally, the `control_inf()` function configures the parameters for variance estimation in the `nonprob()` function. It allows users to specify whether the analytical or bootstrap approach should be used (the `var_method` argument), whether the variable selection method should be applied (the `vars_selection` argument) and what type of bootstrap should be applied for the probability sample (the `rep_type` argument). This function is also used to specify the inference procedure for the DR approach: if a union of variables should be applied (the `vars_combine` argument) and if the bias correction (the `bias_correction` argument) should be applied after variable selection (the `vars_selection`) and variable combination. Table 8 presents example usage of the `control_inf()` function for the IPW and DR estimators. Table: **Table 8.** Example declarations of the IPW and DR estimators. | Estimator | Declaration with `control_inf()` | |:---|:---| | Calibrated IPW with variable selection, bootstrap and $B=50$ | `nonprob(selection = ~ x1 + x2, target = ~ y1, data = df, svydesign = prob, control_selection = control_sel(est_method = "gee"), control_inference = control_inf(vars_selection = TRUE, var_method = "bootstrap", rep_type = "subbootstrap", B = 50))` | | The DR with the SCAD penalty, 5 folds and bias correction | `nonprob(selection = ~ x1 + x2, outcome = y1 ~ x1 + x2, data = df, svydesign = prob, control_selection = control_sel(penalty = "SCAD", nfolds = 5), control_inference = control_inf(vars_selection = TRUE, bias_correction = TRUE, vars_combine = TRUE))` | In the next sections we present a case study illustrating the process of integrating a non-probability sample with a reference probability sample. We present various estimators and compare them. Finally, we describe more advanced options available in the package. # Data analysis example ## Description of the data The package can be installed in the standard manner using: ```{r install, eval = FALSE} install.packages("nonprobsvy") ``` Before we explain the case study let us first load the necessary packages for this paper. ```{r libs, message = FALSE, warning = FALSE} library("nonprobsvy") library("ggplot2") ``` The goal of the case study was to integrate survey (`jvs`) and administrative (`admin`) data about job vacancies in Poland. The first source, the job vacancy survey (JVS), contains 6,523 units. The JVS provides a probability sample drawn according to a stratified sampling design. More details can be found in @jvs2022. The dataset contains information about the industry code (14 levels, the `nace` column), `region` (16 levels), sector (2 levels, the `private` column), company size (3 levels: Small up to 9, Medium 10-49 and Large 50+) and the final weight (i.e., the design weight corrected for non-contact and non-response), which is treated as the $d$ weight. ```{r jvs-data} data("jvs", package = "nonprobsvy") head(jvs) ``` Since the **nonprobsvy** package relies on the functionalities of the **survey** package, we need to define the `svydesign2` object via the `svydesign` function, as shown below. The dataset does not contain the true stratification variable, so we use a simplified version by specifying `~ size + nace + region`; similarly, since we do not have information regarding non-response and its correction, we simply assume that the `weight` sums up to the population size. ```{r jvs-design} jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ``` Our second source (`admin`), the Central Job Offers Database, is a register containing all vacancies submitted to Public Employment Offices (see ). We treat this register as a non-probability sample since it contains administrative data provided on a voluntary basis, so the inclusion mechanism is unknown. This dataset was prepared in such a way that records deemed to be out of scope (either in terms of the definition of vacancy or the population of entities) were excluded. In addition to the same variables found in the JVS, the dataset contains one called `single_shift`, which is our target variable, defined as: whether a company seeks at least one employee for a single-shift job. The goal of this case study is to estimate the share of companies that seek employees for a single-shift job in Poland in a given quarter. ```{r admin-data} data("admin", package = "nonprobsvy") head(admin) ``` One should keep in mind that this paper does not aim to provide a complete tutorial on how to use non-probability samples for statistical inference. We therefore do not include the stage of aligning variables to meet the same definitions, assessing the strength of the relation between auxiliary variables and the target variable, the selection mechanism and the distribution mis-matches between both samples. In the examples below we assume that there is no overlap between both sources and the naive, reference estimate, given by the mean of the `single_shift` column of `admin`, which is equal to 66.1%. ## Estimation ### Propensity score approach First, we start with the IPW approach, which offers the choice between two estimation methods: MLE (default) and GEE (calibrated to the estimated survey totals). We start by calling the `nonprob` function, where we define the `selection` argument indicating which variables are to be included, the `target` argument, which specifies the variable of interest, i.e., `single_shift`. The remaining arguments specify the `svydesign` object, the dataset and the link function (`method_selection`; `"logit"` is the default value). As the `pop_size` argument was not specified, the $\hat{\mu}_{y, \text{IPW-Hájek}}$ is returned. ```{r ipw1} ipw_est1 <- nonprob( selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit" ) ``` In order to get the basic information about the estimated target quantity we can use the `print` method to display the object. It provides information about the methods, source of auxiliary variable, variable selection, the naive (uncorrected estimator) and the corrected along with the standard error and confidence interval. The last line, i.e., `selected estimator`, is limited to the maximum of 5 variables. ```{r ipw1-print} ipw_est1 ``` If we want to see detailed information about the approach, we can use the `summary` method. This function provides information on the sample sizes, sum of the IPW weights along with the distribution of IPW weights (for non-probability sample) and propensity scores (denoted as IPW probabilities) for both samples. A more advanced user may be interested in inspecting the details of the fit by accessing the `selection` element (i.e., `ipw_est1$selection`). ```{r ipw1-summary} summary(ipw_est1) ``` To extract the point estimate along with its standard error and 95% confidence interval in a form of `data.frame` we can use the `extract` method as shown below. ```{r ipw1-extract} extract(ipw_est1) ``` If we want to use the calibrated IPW approach, it is necessary to define the `control_sel()` function in the `control_selection` argument by setting the `est_method` argument equal to `"gee"` (the default is `"mle"`) and the value of `gee_h_fun`. ```{r ipw2} ipw_est2 <- nonprob( selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit", control_selection = control_sel(gee_h_fun = 1, est_method = "gee") ) ``` Results are comparable to the standard IPW point estimate (70.4 vs 72.2) while the standard error is slightly higher. ```{r ipw2-print} ipw_est2 ``` The calibrated IPW significantly improves the balance, which can be accessed via the `check_balance()` function: ```{r ipw-balance} data.frame(ipw_mle = check_balance(~ size - 1, ipw_est1, 1)$balance, ipw_gee = check_balance(~ size - 1, ipw_est2, 1)$balance) ``` Notice that neither in the package nor in this paper do we provide a detailed description of the post-hoc results, such as the covariate balance. This can be done using existing CRAN packages, e.g., through the `bal.tab()` function from the **cobalt** package [@cobalt]. ### Prediction-based approach If the user is interested in the prediction-based approach, in particular involving MI estimators, then, they should specify the argument `outcome` as a formula (as in the case of the `glm()` function). We allow a single outcome (specified as `y ~ x1 + x2 + ... + xk`) and multiple outcomes (as `y1 + y2 + y3 ~ x1 + x2 + ... + xk`). Note that if the `outcome` argument is specified, then there is no need to specify the `target` argument. By default, the GLM type of an MI estimator is used (i.e., `method_outcome = "glm"`). In the code below we show how this type of an MI estimator can be declared. ```{r mi1} mi_est1 <- nonprob( outcome = single_shift ~ region + private + nace + size, svydesign = jvs_svy, data = admin, method_outcome = "glm", family_outcome = "binomial" ) mi_est1 ``` In order to employ an MI estimator based on NN matching, one can specify `method_outcome = "nn"` for the nearest neighbors search using all variables specified in the `outcome` argument, `method_outcome = "pmm"` to use PMM or `method_outcome = "npar"` to use non-parametric approach. For the NN and PMM estimators, we employ $k=5$ nearest neighbors (i.e., `control_out(k=5)`) but the variance of the PMM estimator should be estimated using bootstrap approach suggested by @chlebicki2025. However, in this example we have decided not to use bootstrap as all variables are categorical and this results in a large number of NNs having the same distance and the numerical approximation defined in `control_out(eps)` may give slightly different results between platforms. Therefore the reported variance is based on the probability sample $S_{\text{P}}$ only. In the case of the MI-NN estimator there is no need to specify the `family_outcome` argument as no model is estimated underneath. ```{r mi23} mi_est2 <- nonprob( outcome = single_shift ~ region + private + nace + size, svydesign = jvs_svy, data = admin, method_outcome = "nn", control_outcome = control_out(k = 5) ) mi_est3 <- nonprob( outcome = single_shift ~ region + private + nace + size, svydesign = jvs_svy, data = admin, method_outcome = "pmm", family_outcome = "binomial", control_outcome = control_out(k = 5) ) ``` Results of both estimators are more or less similar, but it should be noted that the NN version suffers from the curse of dimensionality, so the PMM version seems to be more reliable. ```{r mi23-extract} rbind("NN" = extract(mi_est2)[, 2:3], "PMM" = extract(mi_est3)[, 2:3]) ``` As discussed in the methods section, IPW and MI estimators are asymptotically unbiased only when the model and auxiliary variables are correctly specified. To overcome this problem, the user can turn to doubly robust estimators. ### The doubly robust approach In order to choose doubly robust estimation the user needs to specify both the `selection` and `outcome` arguments. These formulas can be specified with the same or varying number of auxiliary variables. As in the MI approach, we also allow multiple outcomes. In the following example code we have specified the non-calibrated IPW and the MI-GLM estimator. ```{r dr1} dr_est1 <- nonprob( selection = ~ region + private + nace + size, outcome = single_shift ~ region + private + nace + size, svydesign = jvs_svy, data = admin, method_selection = "logit", method_outcome = "glm", family_outcome = "binomial" ) dr_est1 ``` Detailed results can be displayed by using the `summary` method, which prints both information about the distribution of the IPW weights as well as the outcome residuals and predictions. For more details one can access the `"outcome"` and `"selection"` fields in the `nonprob` object. ```{r dr1-summary} summary(dr_est1) ``` Finally, we can use the bias minimization approach, as proposed by @yang_doubly_2020, by specifying the `control_inference = control_inf(bias_correction = TRUE)` argument together with the `vars_combine = TRUE` and `vars_selection = TRUE` as this approach requires variable selection followed by variable union from both equations. ```{r dr2} set.seed(2024) dr_est2 <- nonprob( selection = ~ region + private + nace + size, outcome = single_shift ~ region + private + nace + size, svydesign = jvs_svy, data = admin, method_selection = "logit", method_outcome = "glm", family_outcome = "binomial", control_inference = control_inf(bias_correction = TRUE, vars_combine = TRUE, vars_selection = TRUE) ) dr_est2 ``` ## Comparison of estimates Finally, as there is no single method for non-probability samples, we suggest comparing results in a single table or a plot. Figure 1 presents point estimates along with 95% confidence intervals. The various estimators show interesting patterns compared to the naive estimate (red dashed line). Both IPW estimators are characterized with large standard errors and the point estimates over 70%. The MI estimators demonstrate notably different behaviours: while MI-PMM produces the highest point estimate with the widest confidence interval, MI-NN yields the lowest estimate, close to the naive value. Results for the other estimators -- MI-GLM and DR (with and without bias minimization) -- are clustered together, with similar point estimates and confidence interval widths, suggesting some consensus in their bias correction. All these methods indicate a population parameter higher than the naive estimate, but their relative consistency, provides a certain degree of confidence in their bias correction capabilities. ```{r comparison-of-est, fig.cap = "Figure 1: Comparison of estimates of the share of job vacancies offered on a single-shift."} df_s <- rbind(extract(ipw_est1), extract(ipw_est2), extract(mi_est1), extract(mi_est2), extract(mi_est3), extract(dr_est1), extract(dr_est2)) df_s$est <- c("IPW (MLE)", "IPW (GEE)", "MI (GLM)", "MI (NN)", "MI (PMM)", "DR", "DR (BM)") ggplot(data = df_s, aes(y = est, x = mean, xmin = lower_bound, xmax = upper_bound)) + geom_point() + geom_vline(xintercept = mean(admin$single_shift), linetype = "dotted", color = "red") + geom_errorbar() + labs(x = "Point estimator and confidence interval", y = "Estimators") + theme_bw() ``` ## Advanced usage ### Bootstrap approach for variance estimation In the package we allow the user to estimate the variance of the mean analytically (by default) or using the bootstrap approach, as described in the prediction approach section. We use analytical variance estimators proposed in the papers referenced in the methods section. The calculation of the standard error can be disabled using `nonprob(se = FALSE)`. The bootstrap approach implemented in the package refers to: * the non-probability sample -- currently only simple random sampling with replacement is available via the `base::sample()`, * the probability sample -- all the approaches implemented in the `as.svrepdesign()` function of the **survey** package are supported and we refer the reader to the relevant help file. The bootstrap approach is specified via the `control_inf()` function with `var_method = "bootstrap"`. The bootstrap method for the probability sample is controlled via the `rep_type` argument, which passes the method to the `as.svrepdesign()` function. The number of iterations is set in the `num_boot` argument (100 by default). If the samples are large or the estimation method is complicated (e.g., involves variable selection) one can set `verbose = TRUE` to track progress. By default bootstrap results are stored in the `boot_sample` element of the resulting list (to disable this option, `keep_boot` should be set to `FALSE`). The following code is an example of applying the IPW approach with the bootstrap approach specified by the argument `control_inference` of the `nonprob()` function. ```{r ipw-boot} set.seed(2024) ipw_est1_boot <- nonprob( selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit", control_inference = control_inf(var_method = "bootstrap", num_boot = 50), verbose = FALSE ) ``` Next, we compare the estimated standard error of variance estimation for the analytical and the bootstrap approach. ```{r ipw-boot-compare} rbind("IPW analytic variance" = extract(ipw_est1)[, 2:3], "IPW bootstrap variance" = extract(ipw_est1_boot)[, 2:3]) ``` The boot samples can be accessed via the `boot_sample` element of the output list of the `nonprob()` function. Note that the output is returned as a `matrix` because we allow multiple `target` variables. ```{r ipw-boot-sample} head(ipw_est1_boot$boot_sample, n = 3) ``` ### Variable selection algorithms In this section we briefly show how to use variable selection algorithms. In order to indicate that a variable selection algorithm should be used one should specify the `control_inference = control_inf(vars_selection = TRUE)` argument. Then, the user should either leave the default setting or specify the outcome parameters via the `control_out()` function or the `control_sel()` function. Both functions have the same parameters: * `penalty` -- The penalization function used during variables selection (possible values: `c("SCAD", "lasso", "MCP")`). * `nlambda` -- The number of $\lambda$ values; by default set to 50 (grid search). * `lambda_min` -- The smallest value for $\lambda$, as a fraction of `lambda.max`; 0.001 by default. * `lambda` -- A user specified vector of lambdas (only for the `control_sel()` function). * `nfolds` -- The number of folds for cross validation; by default set to 10. * `a_SCAD, a_MCP` -- The tuning parameter of the SCAD and MCP penalty for the selection model; by default set to 3.7 and 3, respectively. In the case of the MI approach we rely on the **ncvreg** package [@ncvreg], which is the only R package that employs the SCAD method. For the IPW and DR approaches, we have developed our own codes in C++ via the **Rcpp** and **RcppArmadillo** packages. In the code below we apply variable selection for the MI-GLM estimator using only 5 folds, 25 possible values of $\lambda$ parameters and the LASSO penalty. ```{r mi-sel} set.seed(2024) mi_est1_sel <- nonprob( outcome = single_shift ~ region + private + nace + size, svydesign = jvs_svy, data = admin, method_outcome = "glm", family_outcome = "binomial", control_outcome = control_out(nfolds = 5, nlambda = 25, penalty = "lasso"), control_inference = control_inf(vars_selection = TRUE), verbose = TRUE ) ``` In this case study, the MI-GLM estimator with variable selection yields almost the same results as the approach without it. Point estimates and standard errors differ at the fourth and third digit, respectively. ```{r mi-sel-compare} rbind("MI without var sel" = extract(mi_est1)[, 2:3], "MI with var sel" = extract(mi_est1_sel)[, 2:3]) ``` The result object of the `cv.ncvreg` class is stored in the `"outcome"` element of the result, and you can access the selected variables using the `coef` generic method as follows. ```{r mi-sel-coef} round(coef(mi_est1_sel)$coef_out[, 1], 4) ``` If a user is interested in viewing the PS estimation results, then access to the `"selection"` element is required. ```{r ipw-coef} round(coef(ipw_est1)$coef_sel[, 1], 4) ``` # Classes and S3 methods The package contains the main class `nonprob` and the supplementary class `nonprob_method` and the related `nonprob_summary` class. All available S3 methods can be obtained by calling `methods(class = "nonprob")`. For instance, the `check_balance()` function, already mentioned in the case study, is used to view the balance by checking how the PS weights reproduce known or estimated population totals; the `nobs()` function returns the sample size of the probability and non-probability samples. ```{r nobs} nobs(dr_est1) ``` Table 9 presents S3 methods implemented for objects of class `nonprob`. We have intentionally limited the number of implemented methods as the goal of the package is to provide point and interval estimates. If users are interested in assessing the quality of the models or the covariate balance, they should use existing R packages. Table: **Table 9.** S3 methods implemented for objects of class `nonprob` in **nonprobsvy**. | Function | Description | |:---|:---| | `check_balance` | Returns a `list` of totals for IPW and DR estimators. | | `coef` | Returns a list of coefficients for outcome or selection model. | | `confint` | Returns the confidence interval for the target variable(s). | | `extract` | Returns a `data.frame` with estimation results. | | `nobs` | Returns the number of observations for both samples. | | `plot` | Plots a comparison of naive (uncorrected) and estimated mean along with CI. | | `pop_size` | Returns the population size (fixed or not). | | `summary` | Returns a `nonprob_summary` class with additional information. | | `update` | Returns an `updated` object. | | `weights` | Returns IPW weights for the IPW and DR estimator. | The confidence interval for the target variable(s) can be estimated using the generic `confint()` function. ```{r confint} confint(dr_est1, level = 0.99) ``` If a user is interested in assessing the distribution of the IPW weights, we suggest using the `weights()` function. ```{r weights} summary(weights(dr_est1)) ``` In the case of mass imputation estimators it is possible to use special functions `method_glm()` that are of class `nonprob_method`. An example is given below ```{r method-glm} res_glm <- method_glm( y_nons = admin$single_shift, X_nons = model.matrix(~ region + private + nace + size, admin), X_rand = model.matrix(~ region + private + nace + size, jvs), svydesign = jvs_svy) res_glm ``` For the IPW we have created the function `method_ps()`, which returns requested functions for a specific link. This approach is motivated by the following reasons: 1) we allow different estimation techniques, such as maximum likelihood and general estimation equations; 2) the IPW estimator is also used for the DR estimator; and 3) variance estimators depend on whether the IPW or the DR estimator is applied (with combination with the MI estimator). ```{r method-ps} method_ps() ``` Details can be found in help page. ```{r method-ps-help, eval = FALSE} ?method_ps() ``` # Summary and future work The **nonprobsvy** package provides a comprehensive R software solution that addresses inference challenges connected with non-probability samples by integrating them with probability samples or known population totals/means. As non-probability data sources, like administrative registers, voluntary online panels, and social media data become increasingly available, statisticians need robust methods to produce reliable population estimates. The package implements state-of-the-art approaches including mass imputation, inverse probability weighting, and doubly robust methods, each designed to correct selection bias by leveraging auxiliary data. By providing a unified framework and its integration with the **survey** package, the **nonprobsvy** makes complex statistical methods for non-probability samples more accessible, enabling researchers to produce robust estimates even when working with non-representative data. There are several avenues for future development of the **nonprobsvy** package. One key priority is to implement model-based calibration and additional methods for estimating propensity scores and weights. The package currently assumes no overlap between probability and non-probability samples, so accounting for potential overlap (e.g., in big data sources and registers) is another important extension. Additional planned developments include handling non-ignorable sample selection mechanisms, developing a theory for maintaining consistency with calibration weights, and supporting multiple non-probability samples from various sources for the purpose of data integration. Further methodological extensions under consideration include empirical likelihood approaches for doubly/multiply robust estimation, integration of machine learning methods like debiased/double machine learning from causal inference, handling measurement errors in big data variables, and expanding the bootstrap approach beyond simple random sampling with replacement. The package will also be extended to handle the `svyrep.design` class from the **survey** package and the **svrep** package. These developments will enhance its capabilities for handling complex survey data structures and modern estimation challenges. # Acknowledgements The authors' work has been financed by the National Science Centre in Poland, OPUS 20, grant no. 2020/39/B/HS4/00941. Łukasz Chrostowski was the main developer and maintainer of the package up to version 0.1.0. Parts of this paper are based on Łukasz's Master's thesis (available at ). Piotr Chlebicki has contributed to the package and has implemented MI-PMM estimators. Maciej Beręsewicz was responsible for the design of the package and for testing, reviewing and contributing to the source code. He also prepared the manuscript. He was also responsible for a significant restructuring of the package between versions 0.1.0 and 0.2.0. The authors thank editors and reviewers for their constructive suggestions. # Appendix: Algorithms for the MI-NN and MI-PMM estimators **Algorithm 1.** Mass imputation using the $k$ nearest neighbor algorithm. 1. If $k=1$, then for each $i \in S_{\text{P}}$ match $\hat{\nu}(i)$ such that $\hat{\nu}(i)=\operatorname{arg\,min}_{j\in S_{\text{NP}}}d\left(\boldsymbol{x}_i,\boldsymbol{x}_j\right)$. 2. If $k>1$, then $\hat{\nu}(i, z) = \operatorname{arg\,min}_{j\in S_{\text{NP}}\setminus\bigcup_{t=1}^{z-1}\{\hat{\nu}(i, t)\}} d\left(\boldsymbol{x}_i, \boldsymbol{x}_j\right)$, i.e., $\hat{\nu}(i, z)$ is $z$-th nearest neighbor from the sample $S_{\text{NP}}$. 3. For each $i \in S_{\text{P}}$, calculate the imputed value as $y_i^* = \frac{1}{k}\sum_{t=1}^{k}y_{\hat{\nu}(i, t)}$. **Algorithm 2.** Mass imputation using predictive mean matching variant: $\hat{y}-\hat{y}$ matching. 1. Estimate regression model $m(\boldsymbol{x}, \boldsymbol{\beta})$ parameters. 2. Predict $\hat{y}_{i}=m\left(\boldsymbol{x}_{i}, \hat{\boldsymbol{\beta}}\right), \hat{y}_{j}=m\left(\boldsymbol{x}_{j}, \hat{\boldsymbol{\beta}}\right)$ for $i\in S_{\text{P}}, j\in S_{\text{NP}}$ and assign each $i\in S_{\text{P}}$ to $\hat{\nu}(i)$, where $\hat{\nu}(i)=\operatorname{arg\,min}_{j\in S_{\text{NP}}}d\left(\hat{y}_{i}, \hat{y}_{j}\right)$. 3. If $k>1$, then $\hat{\nu}(i, z) = \operatorname{arg\,min}_{j\in S_{\text{NP}}\setminus\bigcup_{t=1}^{z-1}\{\hat{\nu}(i, t)\}} d\left(\hat{y}_{i}, \hat{y}_{j}\right)$, e.g., $\hat{\nu}(i, z)$ is $z$-th nearest neighbor from a sample $S_{\text{NP}}$. 4. For $i \in S_{\text{P}}$, calculate imputation value as $y_i^* = \frac{1}{k}\sum_{t=1}^{k}y_{\hat{\nu}(i, t)}$. **Algorithm 3.** Mass imputation using predictive mean matching variant: $\hat{y}-y$ matching. 1. Estimate regression model $m(\boldsymbol{x}, \boldsymbol{\beta})$ parameters. 2. Predict $\hat{y}_{i}=m\left(\boldsymbol{x}_{i}, \hat{\boldsymbol{\beta}}\right)$ for $i \in S_{\text{P}}$ and assign each $i \in S_{\text{P}}$ to $\hat{\nu}(i)$, where $\hat{\nu}(i)=\operatorname{arg\,min}_{j \in S_{\text{NP}}}d\left(\hat{y}_{i}, y_{j}\right)$. 3. If $k>1$, then $\hat{\nu}(i, z) = \operatorname{arg\,min}_{j \in S_{\text{NP}} \setminus \bigcup_{t=1}^{z-1}\{\hat{\nu}(i, t)\}} d\left(\hat{y}_{i}, y_{j}\right)$. 4. For each $i \in S_{\text{P}}$ calculate imputation value as $y_i^* = \frac{1}{k}\sum_{t=1}^{k}y_{\hat{\nu}(i, t)}$. # References