| Title: | Inference Based on Non-Probability Samples |
|---|---|
| Description: | Statistical inference with non-probability samples when auxiliary information from external sources such as probability samples or population totals or means is available. The package implements various methods such as inverse probability (propensity score) weighting, mass imputation and doubly robust approach. Details can be found in: Chen et al. (2020) <doi:10.1080/01621459.2019.1677241>, Yang et al. (2020) <doi:10.1111/rssb.12354>, Kim et al. (2021) <doi:10.1111/rssa.12696>, Yang et al. (2021) <https://www150.statcan.gc.ca/n1/pub/12-001-x/2021001/article/00004-eng.htm> and Wu (2022) <https://www150.statcan.gc.ca/n1/pub/12-001-x/2022002/article/00002-eng.htm>. For details on the package and its functionalities see <doi:10.48550/arXiv.2504.04255>. |
| Authors: | Łukasz Chrostowski [aut, ctb], Maciej Beręsewicz [aut, cre] (ORCID: <https://orcid.org/0000-0002-8281-4301>), Piotr Chlebicki [aut, ctb] (ORCID: <https://orcid.org/0009-0006-4867-7434>) |
| Maintainer: | Maciej Beręsewicz <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3.0 |
| Built: | 2026-05-23 09:10:28 UTC |
| Source: | https://github.com/ncn-foreigners/nonprobsvy |
This is a subset of the Central Job Offers Database, a voluntary administrative data set (non-probability sample). The data was slightly manipulated to ensure the relationships were preserved, and then aligned. For more information about the CBOP, please refer to: https://oferty.praca.gov.pl/portal.
adminadmin
A single data.frame with 9,344 rows and 6 columns
idIdentifier of an entity (company: legal or local).
privateWhether the company is a private (1) or public (0) entity.
sizeThe size of the entity: S – small (to 9 employees), M – medium (10-49) or L – large (over 49).
naceThe main NACE code for a given entity: C, D.E, F, G, H, I, J, K.L, M, N, O, P, Q or R.S (14 levels, 3 combined: D and E, K and L, and R and S).
regionThe region of Poland (16 levels: 02, 04, ..., 32).
single_shiftWhether an entity seeks employees on a single shift.
data("admin") head(admin)data("admin") head(admin)
Function compares totals for auxiliary variables specified in the x argument for an object that
contains either IPW or DR estimator.
check_balance(x, object, dig)check_balance(x, object, dig)
x |
formula specifying variables to check |
object |
object of |
dig |
number of digits for rounding (default = 2) |
A list containing totals for non-probability and probability samples and their differences
sample_a <- data.frame(y = c(1, 0, 1, 0, 1), x = c(0, 1, 2, 3, 4)) sample_b <- data.frame(x = c(0.5, 1.5, 2.5, 3.5), w = c(4, 4, 4, 4)) sample_b_svy <- svydesign(ids = ~1, weights = ~w, data = sample_b) ipw_est1 <- nonprob( selection = ~x, target = ~y, svydesign = sample_b_svy, data = sample_a, method_selection = "logit", se = FALSE ) ipw_est2 <- nonprob( selection = ~x, target = ~y, svydesign = sample_b_svy, data = sample_a, method_selection = "logit", control_selection = control_sel(est_method = "gee", gee_h_fun = 1), se = FALSE ) ## check the balance for the standard IPW check_balance(~x, ipw_est1) ## check the balance for the calibrated IPW check_balance(~x, ipw_est2)sample_a <- data.frame(y = c(1, 0, 1, 0, 1), x = c(0, 1, 2, 3, 4)) sample_b <- data.frame(x = c(0.5, 1.5, 2.5, 3.5), w = c(4, 4, 4, 4)) sample_b_svy <- svydesign(ids = ~1, weights = ~w, data = sample_b) ipw_est1 <- nonprob( selection = ~x, target = ~y, svydesign = sample_b_svy, data = sample_a, method_selection = "logit", se = FALSE ) ipw_est2 <- nonprob( selection = ~x, target = ~y, svydesign = sample_b_svy, data = sample_a, method_selection = "logit", control_selection = control_sel(est_method = "gee", gee_h_fun = 1), se = FALSE ) ## check the balance for the standard IPW check_balance(~x, ipw_est1) ## check the balance for the calibrated IPW check_balance(~x, ipw_est2)
Returns a list of coefficients for the selection and the outcome models
## S3 method for class 'nonprob' coef(object, ...)## S3 method for class 'nonprob' coef(object, ...)
object |
a |
... |
other arguments passed to methods (currently not supported) |
a list with two entries:
"coef_sel" a matrix of coefficients for the selection equation if possible, else NULL
"coef_dr" a matrix of coefficients for the outcome equation(s) if possible, else NULL
data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit", se = FALSE ) coef(ipw_est1)data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit", se = FALSE ) coef(ipw_est1)
A generic function that returns the confidence interval
for the estimated mean. If standard errors have not been estimated, the function
updates the nonprob object to obtain standard errors.
## S3 method for class 'nonprob' confint(object, parm, level = 0.95, ...)## S3 method for class 'nonprob' confint(object, parm, level = 0.95, ...)
object |
object of |
parm |
names of parameters for which confidence intervals are to be computed, if missing all parameters will be considered. |
level |
confidence level for intervals. |
... |
additional arguments |
returns a data.frame with confidence intervals for the target variables
data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit", se = FALSE ) confint(ipw_est1)data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit", se = FALSE ) confint(ipw_est1)
control_inf constructs a list with all necessary control parameters
for statistical inference.
control_inf( var_method = c("analytic", "bootstrap"), rep_type = c("subbootstrap", "auto", "JK1", "JKn", "BRR", "bootstrap", "mrbbootstrap", "Fay"), vars_selection = FALSE, vars_combine = FALSE, bias_correction = FALSE, num_boot = 100, alpha = 0.05, cores = 1, keep_boot = TRUE, nn_exact_se = FALSE )control_inf( var_method = c("analytic", "bootstrap"), rep_type = c("subbootstrap", "auto", "JK1", "JKn", "BRR", "bootstrap", "mrbbootstrap", "Fay"), vars_selection = FALSE, vars_combine = FALSE, bias_correction = FALSE, num_boot = 100, alpha = 0.05, cores = 1, keep_boot = TRUE, nn_exact_se = FALSE )
var_method |
the variance method (default |
rep_type |
the replication type for weights in the bootstrap method for variance estimation passed to |
vars_selection |
logical scalar (default |
vars_combine |
logical scalar indicating whether variables should be combined after variable selection for doubly robust estimators (default |
bias_correction |
logical scalar (default |
num_boot |
the number of iteration for bootstrap algorithms. |
alpha |
significance level (default 0.05). |
cores |
the number of cores in parallel computing (default 1). |
keep_boot |
a logical value indicating whether statistics from bootstrap should be kept (default |
nn_exact_se |
a logical value indicating whether to compute the exact
standard error estimate for |
A list with selected parameters.
nonprob() – for fitting procedure with non-probability samples.
control_out constructs a list with all necessary control parameters
for outcome model.
control_out( epsilon = 1e-08, maxit = 100, trace = FALSE, k = 5, penalty = c("SCAD", "lasso", "MCP"), a_SCAD = 3.7, a_MCP = 3, lambda = -1, lambda_min = 0.001, nlambda = 50, nfolds = 10, treetype = c("kd", "bd"), searchtype = c("standard", "priority"), pmm_match_type = 1, pmm_weights = c("none", "dist"), pmm_k_choice = c("none", "min_var"), pmm_k_max = NULL, pmm_reg_engine = c("glm", "loess"), npar_loess = stats::loess.control(surface = "direct", trace.hat = "approximate") )control_out( epsilon = 1e-08, maxit = 100, trace = FALSE, k = 5, penalty = c("SCAD", "lasso", "MCP"), a_SCAD = 3.7, a_MCP = 3, lambda = -1, lambda_min = 0.001, nlambda = 50, nfolds = 10, treetype = c("kd", "bd"), searchtype = c("standard", "priority"), pmm_match_type = 1, pmm_weights = c("none", "dist"), pmm_k_choice = c("none", "min_var"), pmm_k_max = NULL, pmm_reg_engine = c("glm", "loess"), npar_loess = stats::loess.control(surface = "direct", trace.hat = "approximate") )
epsilon |
Tolerance for fitting algorithms. Default is |
maxit |
Maximum number of iterations. |
trace |
logical value. If |
k |
The k parameter in the |
penalty |
penalty algorithm for variable selection. Default is |
a_SCAD |
The tuning parameter of the SCAD penalty for outcome model. Default is 3.7. |
a_MCP |
The tuning parameter of the MCP penalty for outcome model. Default is 3. |
lambda |
A user-specified |
lambda_min |
The smallest value for lambda, as a fraction of lambda.max. Default is .001. |
nlambda |
The number of lambda values. Default is 50. |
nfolds |
The number of folds during cross-validation for variables selection model. |
treetype |
Type of tree for nearest neighbour imputation (for the NN and PMM estimator) passed to |
searchtype |
Type of search for nearest neighbour imputation (for the NN and PMM estimator) passed to |
pmm_match_type |
(Only for the PMM Estimator)
Indicates how to select 'closest' unit from non-probability sample for each
unit in probability sample. Either |
pmm_weights |
(Only for the PMM Estimator)
Indicate how to weight |
pmm_k_choice |
(Only for the PMM Estimator) Character value indicating how |
pmm_k_max |
(Only for the PMM Estimator) Positive integer upper bound for
the |
pmm_reg_engine |
(Only for the PMM Estimator) whether to use parametric ( |
npar_loess |
control parameters for the stats::loess via the stats::loess.control function. |
List with selected parameters.
nonprob() – for fitting procedure with non-probability samples.
control_sel constructs a list with all necessary control parameters
for selection model.
control_sel( est_method = c("mle", "gee"), gee_h_fun = 1, optimizer = c("maxLik", "optim"), maxlik_method = c("NR", "BFGS", "NM"), optim_method = c("BFGS", "Nelder-Mead"), epsilon = 1e-04, maxit = 500, trace = FALSE, penalty = c("SCAD", "lasso", "MCP"), a_SCAD = 3.7, a_MCP = 3, lambda = -1, lambda_min = 0.001, nlambda = 50, nfolds = 10, print_level = 0, start_type = c("zero", "mle", "naive"), nleqslv_method = c("Broyden", "Newton"), nleqslv_global = c("dbldog", "pwldog", "cline", "qline", "gline", "hook", "none"), nleqslv_xscalm = c("fixed", "auto"), dependence = FALSE, key = NULL )control_sel( est_method = c("mle", "gee"), gee_h_fun = 1, optimizer = c("maxLik", "optim"), maxlik_method = c("NR", "BFGS", "NM"), optim_method = c("BFGS", "Nelder-Mead"), epsilon = 1e-04, maxit = 500, trace = FALSE, penalty = c("SCAD", "lasso", "MCP"), a_SCAD = 3.7, a_MCP = 3, lambda = -1, lambda_min = 0.001, nlambda = 50, nfolds = 10, print_level = 0, start_type = c("zero", "mle", "naive"), nleqslv_method = c("Broyden", "Newton"), nleqslv_global = c("dbldog", "pwldog", "cline", "qline", "gline", "hook", "none"), nleqslv_xscalm = c("fixed", "auto"), dependence = FALSE, key = NULL )
est_method |
Method of estimation for propensity score model ( |
gee_h_fun |
Smooth function for the generalized estimating equations (GEE) method. |
optimizer |
(for the |
maxlik_method |
(for the |
optim_method |
(for the |
epsilon |
Tolerance for fitting algorithms by default |
maxit |
Maximum number of iterations. |
trace |
logical value. If |
penalty |
The penalization function used during variables selection. |
a_SCAD |
The tuning parameter of the SCAD penalty for selection model. Default is 3.7. |
a_MCP |
The tuning parameter of the MCP penalty for selection model. Default is 3. |
lambda |
A user-specified |
lambda_min |
The smallest value for lambda, as a fraction of |
nlambda |
The number of |
nfolds |
The number of folds for cross validation. Default is 10. |
print_level |
this argument determines the level of printing which is done during the optimization (for propensity score model) process. |
start_type |
|
nleqslv_method |
(for the |
nleqslv_global |
(for the |
nleqslv_xscalm |
(for the |
dependence |
logical value (default |
key |
binary key variable allowing to identify the overlap (NOT YET IMPLEMENTED, FOR FUTURE DEVELOPMENT). |
Smooth function (gee_h_fun) for the generalized estimating equations (GEE) method taking the following values
if 1 then ,
if 2 then
List with selected parameters.
nonprob() – for fitting procedure with non-probability samples.
Returns a data.frame of estimated mean(s) or standard error(s)
extract(object, what)extract(object, what)
object |
object of of the |
what |
what to extract: all estimates (mean(s), SE(s) and CI(s); |
a data.frame with selected information
data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit" ) extract(ipw_est1) extract(ipw_est1, "se")data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit" ) extract(ipw_est1) extract(ipw_est1, "se")
This is a subset of the Job Vacancy Survey from Poland (for one quarter). The data has been subject to slight manipulation, but the relationships in the data have been preserved. For further details on the JVS, please refer to the following link: https://stat.gov.pl/obszary-tematyczne/rynek-pracy/popyt-na-prace/zeszyt-metodologiczny-popyt-na-prace,3,1.html.
jvsjvs
A single data.frame with 6,523 rows and 6 columns
idIdentifier of an entity (company: legal or local).
privateWhether the company is a private (1) or public (0) entity.
sizeThe size of the entity: S – small (to 9 employees), M – medium (10-49) or L – large (over 49).
naceThe main NACE code for a given entity: C, D.E, F, G, H, I, J, K.L, M, N, O, P, Q or R.S (14 levels, 3 combined: D and E, K and L, and R and S).
regionThe region of Poland (16 levels: 02, 04, ..., 32).
weightThe final (calibrated) weight (w-weight). We do not have access to design weights (d-weights).
data("jvs") head(jvs)data("jvs") head(jvs)
Model for the outcome for the mass imputation estimator using generalized linear
models via the stats::glm function. Estimation of the mean is done using
probability sample or known population totals.
method_glm( y_nons, X_nons, X_rand, svydesign, weights = NULL, family_outcome = "gaussian", start_outcome = NULL, vars_selection = FALSE, pop_totals = NULL, pop_size = NULL, control_outcome = control_out(), control_inference = control_inf(), verbose = FALSE, se = TRUE )method_glm( y_nons, X_nons, X_rand, svydesign, weights = NULL, family_outcome = "gaussian", start_outcome = NULL, vars_selection = FALSE, pop_totals = NULL, pop_size = NULL, control_outcome = control_out(), control_inference = control_inf(), verbose = FALSE, se = TRUE )
y_nons |
target variable from non-probability sample |
X_nons |
a |
X_rand |
a |
svydesign |
a svydesign object |
weights |
case / frequency weights from non-probability sample |
family_outcome |
family for the glm model |
start_outcome |
start parameters (default |
vars_selection |
whether variable selection should be conducted |
pop_totals |
population totals from the |
pop_size |
population size from the |
control_outcome |
controls passed by the |
control_inference |
controls passed by the |
verbose |
parameter passed from the main |
se |
whether standard errors should be calculated |
Analytical variance
The variance of the mean is estimated based on the following approach
(a) non-probability part ( with size ; denoted as var_nonprob in the result)
where and
Under the linear regression model and
(b) probability part ( with size ; denoted as var_prob in the result)
This part uses functionalities of the {survey} package and the variance is estimated using the following
equation:
Note that in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
Furthermore, if only population totals/means are known and assumed to be fixed we set .
Information on the case when svydesign is not available:
variance is estimated only for the non-probability part with defined above.
point estimator of for linear regression is estimated using
where is the vector of population means
for non-linear functions such as logistic or Poisson regression we use a simplification, i.e. we report
point estimate as for Poisson and for logistic regression.
an nonprob_method class which is a list with the following entries
fitted model either a glm.fit, cv.ncvreg, or ncvreg object
predicted values for the non-probablity sample
predicted values for the probability sample or population totals
coefficients for the model (if available)
an updated surveydesign2 object (new column y_hat_MI is added)
estimated population mean for the target variable
whether variable selection was performed
variance for the probability sample component (if available)
variance for the non-probability sampl component
total variance, if possible it should be var_prob+var_nonprob if not, just a scalar
model type (character "glm")
family type (character "glm")
Kim, J. K., Park, S., Chen, Y., & Wu, C. (2021). Combining non-probability and probability survey samples through mass imputation. Journal of the Royal Statistical Society Series A: Statistics in Society, 184(3), 941-963.
data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) res_glm <- method_glm(y_nons = admin$single_shift, X_nons = model.matrix(~ region + private + nace + size, admin), X_rand = model.matrix(~ region + private + nace + size, jvs), svydesign = jvs_svy) res_glmdata(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) res_glm <- method_glm(y_nons = admin$single_shift, X_nons = model.matrix(~ region + private + nace + size, admin), X_rand = model.matrix(~ region + private + nace + size, jvs), svydesign = jvs_svy) res_glm
Mass imputation using nearest neighbours approach as described in Yang et al. (2021).
The implementation is currently based on RANN::nn2 function and thus it uses
Euclidean distance for matching units from (non-probability) to (probability).
Matching ties are randomized before donor values are aggregated, so tied nearest neighbours
are not selected only by input row order. Estimation of the mean is done using sample:
when pop_size is supplied this is the known- Horvitz-Thompson mean,
otherwise it reduces to the usual ratio mean with .
The pop_size argument is not converted into a finite population correction;
if an fpc is needed, it should be supplied in svydesign, where it is handled
by the {survey} variance routines.
method_nn( y_nons, X_nons, X_rand, svydesign, weights = NULL, family_outcome = NULL, start_outcome = NULL, vars_selection = FALSE, pop_totals = NULL, pop_size = NULL, control_outcome = control_out(), control_inference = control_inf(), verbose = FALSE, se = TRUE, nn_matches = NULL )method_nn( y_nons, X_nons, X_rand, svydesign, weights = NULL, family_outcome = NULL, start_outcome = NULL, vars_selection = FALSE, pop_totals = NULL, pop_size = NULL, control_outcome = control_out(), control_inference = control_inf(), verbose = FALSE, se = TRUE, nn_matches = NULL )
y_nons |
target variable from non-probability sample |
X_nons |
a |
X_rand |
a |
svydesign |
a svydesign object |
weights |
case / frequency weights from non-probability sample. If |
family_outcome |
a placeholder (not used in |
start_outcome |
a placeholder (not used in |
vars_selection |
whether variable selection should be conducted |
pop_totals |
a placeholder (not used in |
pop_size |
population size from the |
control_outcome |
controls passed by the |
control_inference |
controls passed by the |
verbose |
parameter passed from the main |
se |
whether standard errors should be calculated |
nn_matches |
optional precomputed nearest-neighbour search results for
internal reuse across outcomes. If supplied, it should be a list with
|
Analytical variance
The variance of the mean is estimated based on the following approach
(a) non-probability part ( with size ; denoted as var_nonprob in the result)
This may be estimated using
where is an estimator of propensity scores which
we currently estimate using (constant) and is
estimated using based on the average of . The
values used in this proxy are obtained by leave-one-out matching in ,
so a unit is not used as its own donor.
Chlebicki et al. (2025, Algorithm 2) proposed non-parametric mini-bootstrap estimator
(without assuming that it is consistent) but with good finite population properties.
This bootstrap can be applied using control_inference(nn_exact_se=TRUE) and
can be summarized as follows:
Sample units from with replacement to create .
If non-constant pseudo-weights are supplied through weights, sampling probabilities
are proportional to their inverses; equal weights use uniform resampling.
Match units from to to obtain predictions =.
Estimate .
Repeat steps 1-3 times (we set in our simulations; this is hard-coded).
Estimate obtained from simulations and save it as var_nonprob.
(b) probability part ( with size ; denoted as var_prob in the result)
This part uses functionalities of the {survey} package and the variance is estimated using the following
equation:
where and are values imputed imputed as an average
of -nearest neighbour, i.e. . Note that in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
an nonprob_method class which is a list with the following entries
RANN::nn2 object
predicted values for the non-probablity sample (query to itself)
predicted values for the probability sample
coefficients for the model (if available)
an updated surveydesign2 object (new column y_hat_MI is added)
estimated population mean for the target variable
whether variable selection was performed (not implemented, for further development)
variance for the probability sample component (if available)
variance for the non-probability sample component
total variance, if possible it should be var_prob+var_nonprob if not, just a scalar
model type (character "nn")
placeholder for the NN approach information
Yang, S., Kim, J. K., & Hwang, Y. (2021). Integration of data from probability surveys and big found data for finite population inference using mass imputation. Survey Methodology, June 2021 29 Vol. 47, No. 1, pp. 29-58
Chlebicki, P., Chrostowski, Ł., & Beręsewicz, M. (2025). Data integration of non-probability and probability samples with predictive mean matching. arXiv preprint arXiv:2403.13750.
sample_a <- data.frame(y = c(1, 0, 1, 0, 1), x = c(0, 1, 2, 3, 4)) sample_b <- data.frame(x = c(0.5, 1.5, 2.5, 3.5), w = c(4, 4, 4, 4)) sample_b_svy <- svydesign(ids = ~1, weights = ~w, data = sample_b) res_nn <- method_nn( y_nons = sample_a$y, X_nons = model.matrix(~x, sample_a), X_rand = model.matrix(~x, sample_b), svydesign = sample_b_svy, control_outcome = control_out(k = 1), se = FALSE ) res_nnsample_a <- data.frame(y = c(1, 0, 1, 0, 1), x = c(0, 1, 2, 3, 4)) sample_b <- data.frame(x = c(0.5, 1.5, 2.5, 3.5), w = c(4, 4, 4, 4)) sample_b_svy <- svydesign(ids = ~1, weights = ~w, data = sample_b) res_nn <- method_nn( y_nons = sample_a$y, X_nons = model.matrix(~x, sample_a), X_rand = model.matrix(~x, sample_b), svydesign = sample_b_svy, control_outcome = control_out(k = 1), se = FALSE ) res_nn
Model for the outcome for the mass imputation estimator using loess via stats::loess.
Estimation of the mean is done using the probability sample.
method_npar( y_nons, X_nons, X_rand, svydesign, weights = NULL, family_outcome = "gaussian", start_outcome = NULL, vars_selection = FALSE, pop_totals = NULL, pop_size = NULL, control_outcome = control_out(), control_inference = control_inf(), verbose = FALSE, se = TRUE )method_npar( y_nons, X_nons, X_rand, svydesign, weights = NULL, family_outcome = "gaussian", start_outcome = NULL, vars_selection = FALSE, pop_totals = NULL, pop_size = NULL, control_outcome = control_out(), control_inference = control_inf(), verbose = FALSE, se = TRUE )
y_nons |
target variable from non-probability sample |
X_nons |
a |
X_rand |
a |
svydesign |
a svydesign object |
weights |
case / frequency weights from non-probability sample (default NULL) |
family_outcome |
family for the glm model |
start_outcome |
a place holder (not used in |
vars_selection |
whether variable selection should be conducted |
pop_totals |
a place holder (not used in |
pop_size |
population size from the |
control_outcome |
controls passed by the |
control_inference |
controls passed by the |
verbose |
parameter passed from the main |
se |
whether standard errors should be calculated |
Analytical variance
The variance of the mean is estimated based on the following approach
(a) non-probability part ( with size ; denoted as var_nonprob in the result)
where is the residual and can be estimated
various ways. In the package we estimate using as suggested by Chen et al. (2022, p. 6). In particular,
we currently support this using stats::loesswith"gaussian"' family.
(b) probability part ( with size ; denoted as var_prob in the result)
This part uses functionalities of the {survey} package and the variance is estimated using the following
equation:
Note that in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
an nonprob_method class which is a list with the following entries
fitted model object returned by stats::loess
predicted values for the non-probablity sample
predicted values for the probability sample or population totals
coefficients for the model (if available)
an updated surveydesign2 object (new column y_hat_MI is added)
estimated population mean for the target variable
whether variable selection was performed
variance for the probability sample component (if available)
variance for the non-probability sampl component
model type (character "npar")
Chen, S., Yang, S., & Kim, J. K. (2022). Nonparametric mass imputation for data integration. Journal of Survey Statistics and Methodology, 10(1), 1-24.
set.seed(123123123) N <- 10000 n_a <- 500 n_b <- 1000 n_b1 <- 0.7*n_b n_b2 <- 0.3*n_b x1 <- rnorm(N, 2, 1) x2 <- rnorm(N, 2, 1) y1 <- rnorm(N, 0.3 + 2*x1+ 2*x2, 1) y2 <- rnorm(N, 0.3 + 0.5*x1^2+ 0.5*x2^2, 1) strata <- x1 <= 2 pop <- data.frame(x1, x2, y1, y2, strata) sample_a <- pop[sample(1:N, n_a),] sample_a$w_a <- N/n_a sample_a_svy <- svydesign(ids=~1, weights=~w_a, data=sample_a) pop1 <- subset(pop, strata == TRUE) pop2 <- subset(pop, strata == FALSE) sample_b <- rbind(pop1[sample(1:nrow(pop1), n_b1), ], pop2[sample(1:nrow(pop2), n_b2), ]) res_y_npar <- nonprob(outcome = y1 + y2 ~ x1 + x2, data = sample_b, svydesign = sample_a_svy, method_outcome = "npar") res_y_nparset.seed(123123123) N <- 10000 n_a <- 500 n_b <- 1000 n_b1 <- 0.7*n_b n_b2 <- 0.3*n_b x1 <- rnorm(N, 2, 1) x2 <- rnorm(N, 2, 1) y1 <- rnorm(N, 0.3 + 2*x1+ 2*x2, 1) y2 <- rnorm(N, 0.3 + 0.5*x1^2+ 0.5*x2^2, 1) strata <- x1 <= 2 pop <- data.frame(x1, x2, y1, y2, strata) sample_a <- pop[sample(1:N, n_a),] sample_a$w_a <- N/n_a sample_a_svy <- svydesign(ids=~1, weights=~w_a, data=sample_a) pop1 <- subset(pop, strata == TRUE) pop2 <- subset(pop, strata == FALSE) sample_b <- rbind(pop1[sample(1:nrow(pop1), n_b1), ], pop2[sample(1:nrow(pop2), n_b2), ]) res_y_npar <- nonprob(outcome = y1 + y2 ~ x1 + x2, data = sample_b, svydesign = sample_a_svy, method_outcome = "npar") res_y_npar
Model for the outcome for the mass imputation estimator. The implementation is currently based on RANN::nn2 function and thus it uses Euclidean distance for matching units from (non-probability) to (probability) based on predicted values from model based
either on method_glm or method_npar. Estimation of the mean is done using sample:
when pop_size is supplied this is the known- Horvitz-Thompson mean,
otherwise it reduces to the usual ratio mean with .
The pop_size argument is not converted into a finite population correction;
if an fpc is needed, it should be supplied in svydesign, where it is handled
by the {survey} variance routines.
Matching ties are randomized by the nearest-neighbour step before donor values are aggregated.
This implementation extends Yang et al. (2021) approach as described in Chlebicki et al. (2025), namely:
if k>1 weighted aggregation of the mean for a given unit is used. We use distance
matrix returned by RANN::nn2 function (pmm_weights from the control_out() function)
if the non-probability sample is small we recommend using a mini-bootstrap
approach to estimate variance from the non-probability sample (nn_exact_se from the control_inf() function).
If non-constant pseudo-weights are supplied, bootstrap samples are drawn with probabilities
proportional to inverse weights and the resampled weights are used in each refitted outcome model.
the main nonprob function allows for dynamic selection of k neighbours based on a
full-grid variance minimization procedure over 1:n_NP (pmm_k_choice from the control_out() function)
method_pmm( y_nons, X_nons, X_rand, svydesign, weights = NULL, family_outcome = "gaussian", start_outcome = NULL, vars_selection = FALSE, pop_totals = NULL, pop_size = NULL, control_outcome = control_out(), control_inference = control_inf(), verbose = FALSE, se = TRUE )method_pmm( y_nons, X_nons, X_rand, svydesign, weights = NULL, family_outcome = "gaussian", start_outcome = NULL, vars_selection = FALSE, pop_totals = NULL, pop_size = NULL, control_outcome = control_out(), control_inference = control_inf(), verbose = FALSE, se = TRUE )
y_nons |
target variable from non-probability sample |
X_nons |
a |
X_rand |
a |
svydesign |
a svydesign object |
weights |
case / frequency weights from non-probability sample. If |
family_outcome |
family for the glm model |
start_outcome |
start parameters |
vars_selection |
whether variable selection should be conducted |
pop_totals |
a place holder (not used in |
pop_size |
population size from the |
control_outcome |
controls passed by the |
control_inference |
controls passed by the |
verbose |
parameter passed from the main |
se |
whether standard errors should be calculated |
Matching
In the package we support two types of matching:
matching (default; control_out(pmm_match_type = 1)).
matching (control_out(pmm_match_type = 2)).
Analytical variance
The variance of the mean is estimated based on the following approach
(a) non-probability part ( with size ; denoted as var_nonprob in the result) is currently estimated using the non-parametric mini-bootstrap estimator proposed by
Chlebicki et al. (2025, Algorithm 2). It is not proved to be consistent but with good finite population properties.
This bootstrap can be applied using control_inference(nn_exact_se=TRUE) and
can be summarized as follows:
Sample units from with replacement to create .
If non-constant pseudo-weights are supplied through weights, sampling probabilities
are proportional to their inverses; equal weights use uniform resampling.
Estimate regression model based on from step 1,
using the resampled weights when supplied.
Compute for using estimated and .
Compute using values from .
Repeat steps 1-4 times (we set (hard-coded) in our code).
Estimate obtained from simulations and save it as var_nonprob.
(b) probability part ( with size ; denoted as var_prob in the result)
This part uses functionalities of the {survey} package and the variance is estimated using the following
equation:
Note that in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
an nonprob_method class which is a list with the following entries
fitted model either an glm.fit or cv.ncvreg object
predicted values for the non-probablity sample
predicted values for the probability sample or population totals
coefficients for the model (if available)
an updated surveydesign2 object (new column y_hat_MI is added)
estimated population mean for the target variable
whether variable selection was performed
variance for the probability sample component (if available)
variance for the non-probability sampl component
model type (character "pmm")
depends on the method selected for estimating E(Y|X)
sample_a <- data.frame(y = c(1, 0, 1, 0, 1), x = c(0, 1, 2, 3, 4)) sample_b <- data.frame(x = c(0.5, 1.5, 2.5, 3.5), w = c(4, 4, 4, 4)) sample_b_svy <- svydesign(ids = ~1, weights = ~w, data = sample_b) res_pmm <- method_pmm( y_nons = sample_a$y, X_nons = model.matrix(~x, sample_a), X_rand = model.matrix(~x, sample_b), svydesign = sample_b_svy, control_outcome = control_out(k = 1), se = FALSE ) res_pmmsample_a <- data.frame(y = c(1, 0, 1, 0, 1), x = c(0, 1, 2, 3, 4)) sample_b <- data.frame(x = c(0.5, 1.5, 2.5, 3.5), w = c(4, 4, 4, 4)) sample_b_svy <- svydesign(ids = ~1, weights = ~w, data = sample_b) res_pmm <- method_pmm( y_nons = sample_a$y, X_nons = model.matrix(~x, sample_a), X_rand = model.matrix(~x, sample_b), svydesign = sample_b_svy, control_outcome = control_out(k = 1), se = FALSE ) res_pmm
Function to specify the propensity score (PS) model for the inverse probability weighting estimator.
This function provides basic functions logistic regression with a given link function (currently
we support logit, probit and cloglog) with additional information about the analytic variance estimator of the mean.
This is a function returns a list of functions that refer to specific estimation methods and variance estimators
when whether the IPW alone or the DR estimator is applied. The export of this function is mainly because
the functions are used in the variable selection algorithms.
Functions starting with make_log_like, make_gradient and make_hessian refer to the maximum likelihood estimation
as described in the Chen et al. (2020) paper. These functions take into account different link functions defined through
the link argument.
Functions make_link, make_link_inv, make_link_der, make_link_inv_der, make_link_inv_rev, and make_link_inv_rev_der
refer to specific link functions and are used in the estimation process.
Functions variance_covariance1 and variance_covariance2 refer to the variance estimator of the IPW estimator as
defined by Chen et al. (2020).
Functions b_vec_ipw, b_vec_dr and t_vec are specific functions defined in the Chen et al. (2020) that are used
in the variance estimator of the IPW or the DR.
Finally, var_nonprob is the non-probability component of the DR estimator as defined by Chen et al. (2020).
method_ps(link = c("logit", "probit", "cloglog"), ...)method_ps(link = c("logit", "probit", "cloglog"), ...)
link |
link for the PS model |
... |
Additional, optional arguments. |
A list of functions and elements for a specific link function with the following entries:
log-likelihood function for a specific link function
gradient of the loglik
hessian of the loglik
link function
inverse link function
first derivative of the link function
first derivative of the the inverse link function
defines 1/inv_link
first derivative of 1/inv_link
for the IPW estimator: variance component for the non-probability sample
for the IPW estimator: variance component for the probability sample
for the IPW estimator: the function as defined in the Chen et al. (2020, sec. 3.2, eq. (9)-(10); sec 4.1)
for the DR estimator: the function as defined in the Chen et al. (2020, sec. 3.3., eq. (14); sec 4.1)
for the DR estimator: the function as defined in the Chen et al. (2020, sec. 3.3., eq. (14); sec 4.1)
for the DR estimator: non-probability component of the variance for DR estimator
name of the selected link function for the PS model (character)
model type (character)
Łukasz Chrostowski, Maciej Beręsewicz
# Printing information on the model selected method_ps() # extracting specific field method_ps("cloglog")$make_gradient# Printing information on the model selected method_ps() # extracting specific field method_ps("cloglog")$make_gradient
Returns information on the number of rows of the probability sample (if provided) and non-probability sample.
## S3 method for class 'nonprob' nobs(object, ...)## S3 method for class 'nonprob' nobs(object, ...)
object |
a |
... |
other arguments passed to methods (currently not supported) |
a named vector with row numbers
nonprob function provides an access to the various methods for inference based on non-probability surveys (including big data). The function allows to estimate the population mean based on the access to a reference probability sample (via the survey package), as well as totals or means of covariates.
The package implements state-of-the-art approaches recently proposed in the literature: Chen et al. (2020),
Yang et al. (2020), Wu (2022) and uses the Lumley 2004 survey package for inference (if a reference probability sample is provided).
It provides various inverse probability weighting (e.g. with calibration constraints), mass imputation (e.g. nearest neighbour, predictive mean matching) and doubly robust estimators (e.g. that take into account minimisation of the asymptotic bias of the population mean estimators).
The package uses the survey package functionality when a probability sample is available.
All optional parameters are set to NULL. The obligatory ones include data as well as one of the following three:
selection, outcome, or target – depending on which method has been selected.
In the case of outcome and target multiple variables can be specified.
nonprob( data, selection = NULL, outcome = NULL, target = NULL, svydesign = NULL, pop_totals = NULL, pop_means = NULL, pop_size = NULL, method_selection = c("logit", "cloglog", "probit"), method_outcome = c("glm", "nn", "pmm", "npar"), family_outcome = c("gaussian", "binomial", "poisson"), subset = NULL, strata = NULL, case_weights = NULL, na_action = na.omit, control_selection = control_sel(), control_outcome = control_out(), control_inference = control_inf(), start_selection = NULL, start_outcome = NULL, verbose = FALSE, se = TRUE, ... )nonprob( data, selection = NULL, outcome = NULL, target = NULL, svydesign = NULL, pop_totals = NULL, pop_means = NULL, pop_size = NULL, method_selection = c("logit", "cloglog", "probit"), method_outcome = c("glm", "nn", "pmm", "npar"), family_outcome = c("gaussian", "binomial", "poisson"), subset = NULL, strata = NULL, case_weights = NULL, na_action = na.omit, control_selection = control_sel(), control_outcome = control_out(), control_inference = control_inf(), start_selection = NULL, start_outcome = NULL, verbose = FALSE, se = TRUE, ... )
data |
a |
selection |
a |
outcome |
a |
target |
a |
svydesign |
an optional |
pop_totals |
an optional |
pop_means |
an optional |
pop_size |
an optional |
method_selection |
a |
method_outcome |
a |
family_outcome |
a |
subset |
an optional |
strata |
an optional |
case_weights |
an optional positive, finite |
na_action |
a function which indicates what should happen when the data contain |
control_selection |
a |
control_outcome |
a |
control_inference |
a |
start_selection |
an optional |
start_outcome |
an optional |
verbose |
a logical value (default |
se |
Logical value (default |
... |
Additional, optional arguments (not yet supported) |
Let be the response variable for which we want to estimate the population mean,
given by
For this purpose we consider data integration
with the following structure. Let be the non-probability sample with the design matrix of covariates as
and vector of outcome variable
On the other hand, let be the probability sample with design matrix of covariates be
Instead of a sample of units we can consider a vector of population sums in the form of or means
, where refers to a finite population. Note that we do not assume access to the response variable for .
The implemented estimators assume that outcome values are observed for the non-probability sample ; outcome values observed in the probability sample are not used.
Linked overlap handling for units appearing in both samples is not currently implemented; control_sel() arguments dependence and key are placeholders for future development.
Supported outcome types depend on the estimator family:
IPW: numeric targets whose population mean is meaningful, including continuous, count, and 0/1 binary variables. No outcome model is fitted.
Mass imputation with method_outcome = "glm": continuous, count, or binary variables through family_outcome = "gaussian", "poisson", or "binomial".
Mass imputation with method_outcome = "nn", "pmm", or "npar": numeric targets; categorical, ordinal, survival, and other structured outcomes are not supported.
Doubly robust: GLM outcome models only; use family_outcome = "gaussian", "poisson", or "binomial".
In general we make the following assumptions:
The selection indicator of belonging to non-probability sample and the response variable are independent given the set of covariates .
All units have a non-zero propensity score, i.e., for all i.
The indicator variables and are independent for given and for .
There are three possible approaches to the problem of estimating population mean using non-probability samples:
Inverse probability weighting – the main drawback of non-probability sampling is the unknown selection mechanism for a unit to be included in the sample.
This is why we talk about the so-called "biased sample" problem. The inverse probability approach is based on the assumption that a reference probability sample
is available and therefore we can estimate the propensity score of the selection mechanism.
With inverse probability weights , the package supports two
IPW point-estimator families. The Horvitz-Thompson-type estimator uses an external denominator ,
where are optional case_weights. The Hajek-type estimator uses the estimated IPW total as the denominator,
For IPW-MLE, omitting a fixed population size (pop_size, pop_totals, or pop_means with pop_size) gives
the Hajek-type estimator. Supplying a fixed population size or population totals gives the
Horvitz-Thompson-type estimator. For IPW-GEE with a reference probability sample, the denominator is
sum(weights(svydesign)); gee_h_fun changes the selection-model estimating equation, not the point-estimator
family. For IPW-GEE with population totals or means, the denominator comes from those totals.
For this purpose several estimation methods can be considered. The first approach is maximum likelihood estimation with a corrected
log-likelihood function, which is given by the following formula
In the literature, the main approach to modelling propensity scores is based on the logit link function. However, we extend the propensity score model with the additional link functions such as cloglog and probit. The pseudo-score equations derived from ML methods can be replaced by the idea of generalised estimating equations with calibration constraints defined by equations.
Notice that for We do not need a probability sample and can use a vector of population totals/means.
Mass imputation – This method is based on a framework where imputed values of outcome variables are created for the entire probability sample. In this case, we treat the large sample as a training data set that is used to build an imputation model. Using the imputed values for the probability sample and the (known) design weights, we can build a population mean estimator of the form:
It opens the door to a very flexible method for imputation models. The package uses generalized linear models from stats::glm(),
the nearest neighbour algorithm using RANN::nn2() and predictive mean matching.
Doubly robust estimation – The IPW and MI estimators are sensitive to misspecified models for the propensity score and outcome variable, respectively. To this end, so-called doubly robust methods are presented that take these problems into account. It is a simple idea to combine propensity score and imputation models during inference, leading to the following estimator
In addition, an approach based directly on bias minimisation has been implemented. The following formula
lead us to system of equations
where is a mass imputation (regression) model for the outcome variable and
propensity scores are estimated using a logit function for the model. As with the MLE and GEE approaches we have extended
this method to cloglog and probit links.
As it is not straightforward to calculate the variances of these estimators, asymptotic equivalents of the variances derived using the Taylor approximation have been proposed in the literature. Details can be found here. In addition, the bootstrap approach can be used for variance estimation.
The doubly robust analytic (Taylor-linearisation) variance is derived under the
logistic propensity model (Chen, Li & Wu 2020, Theorem 2). For the probit and
cloglog selection links the probability-sample (design) variance component is a
conservative approximation: it tends to over-estimate the standard error and can be
numerically unstable when fitted propensities approach 1. For doubly robust inference
with those links, control_inf(var_method = "bootstrap") is recommended.
The function also allows variables selection using known methods that have been implemented to handle the integration of probability and non-probability sampling.
In the presence of high-dimensional data, variable selection is important, because it can reduce the variability in the estimate that results from using irrelevant variables to build the model.
Let be the joint estimating function for . We define the
penalized estimating functions as
where and are some smooth functions. We let , where is some penalization function.
Details of penalization functions and techniques for solving this type of equation can be found here.
To use the variable selection model, set the vars_selection parameter in the control_inf() function to TRUE. In addition, in the other control functions such as control_sel() and control_out()
you can set parameters for the selection of the relevant variables, such as the number of folds during cross-validation algorithm or the lambda value for penalizations. Details can be found
in the documentation of the control functions for nonprob.
Returns an object of the nonprob class (it is actually a list) which contains the following elements:
call – the call of the nonprob function
data – a data.frame passed from the nonprob function data argument
X – a model.matrix containing data from probability (first rows) and non-probability samples (next rows) if specified at a function call
y – a list of vector of outcome variables if specified at a function call
R – a numeric vector indicating whether a unit belongs to the probability (0) or non-probability (1) units in the matrix X
ps_scores – a numeric vector of estimated propensity scores for probability and non-probability sample
case_weights – a vector of case weights for non-probability sample based on the call
ipw_weights – a vector of inverse probability weights for non-probability sample (if applicable)
boot_ipw_weights – a matrix of bootstrap inverse probability weights for the non-probability sample when bootstrap variance estimation is used and bootstrap results are kept (if applicable)
control – a list of control functions based on the call
output – a data.frame with the estimated means and standard errors for the variables specified in the target or outcome arguments
SE – a data.frame with standard error of the estimator of the population mean, divided into errors from probability and non-probability samples (if applicable)
confidence_interval – a data.frame with confidence interval of population mean estimator
nonprob_size – a scalar numeric vector denoting the size of non-probability sample
prob_size – a scalar numeric vector denoting the size of probability sample
pop_size – a scalar numeric vector estimated population size derived from estimated weights (non-probability sample) or known design weights (probability sample)
pop_size_fixed – a logical value whether the population size was fixed (known) or estimated (unknown)
ipw_estimator – a character value with the IPW point-estimator family ("ht" or "hajek", if applicable)
ipw_denominator – a scalar numeric vector with the denominator used for the IPW point estimator (if applicable)
ipw_denominator_source – a character value describing the source of the IPW point-estimator denominator (if applicable)
pop_totals – a numeric vector with the total values of the auxiliary variables derived from a probability sample or based on the call
pop_means – a numeric vector with the mean values of the auxiliary variables derived from a probability sample or based on the call
outcome – a list containing information about the fitting of the mass imputation model. Structure of the object is based on the method_outcome and family_outcome arguments which point to specific methods as defined by functions method_* (if specified in the call)
selection – a list containing information about the fitting of the propensity score model. Structure of the object is based on the method_selection argument which point to specific methods as defined by functions method_ps (if specified in the call)
boot_sample – a matrix with bootstrap estimates of the target variable(s) (if specified)
svydesign – a svydesign2 object (if specified)
ys_rand_pred – a list of predicted values for the target variable(s) for the probability sample (for the MI and DR estimator)
ys_nons_pred – a list of predicted values for the target variable(s) for the non-probability sample (for the MI and DR estimator)
ys_resid – a list of residuals for the target variable(s) for the non-probability sample (for the MI and DR estimator)
estimator – a character vector with information what type of estimator was selected (one of c("ipw", "mi", "dr")).
selection_formula – a formula based on the selection argument (if specified)
estimator_method – a character vector with information on the detailed method applied (for the print method)
Łukasz Chrostowski, Maciej Beręsewicz, Piotr Chlebicki
Kim JK, Park S, Chen Y, Wu C. Combining non-probability and probability survey samples through mass imputation. J R Stat Soc Series A. 2021;184:941– 963.
Shu Yang, Jae Kwang Kim, Rui Song. Doubly robust inference when combining probability and non-probability samples with high dimensional data. J. R. Statist. Soc. B (2020)
Yilin Chen , Pengfei Li & Changbao Wu (2020) Doubly Robust Inference With Nonprobability Survey Samples, Journal of the American Statistical Association, 115:532, 2011-2021
Shu Yang, Jae Kwang Kim and Youngdeok Hwang Integration of data from probability surveys and big found data for finite population inference using mass imputation. Survey Methodology, June 2021 29 Vol. 47, No. 1, pp. 29-58
stats::optim() – For more information on the optim function used in the
optim method of propensity score model fitting.
maxLik::maxLik() – For more information on the maxLik function used in
maxLik method of propensity score model fitting.
ncvreg::cv.ncvreg() – For more information on the cv.ncvreg function used in
variable selection for the outcome model.
nleqslv::nleqslv() – For more information on the nleqslv function used in
estimation process of the bias minimization approach.
stats::glm() – For more information about the generalised linear models used during mass imputation process.
RANN::nn2() – For more information about the nearest neighbour algorithm used during mass imputation process.
control_sel() – For the control parameters related to selection model.
control_out() – For the control parameters related to outcome model.
control_inf() – For the control parameters related to statistical inference.
# generate data based on Doubly Robust Inference With Non-probability Survey Samples (2021) # Yilin Chen , Pengfei Li & Changbao Wu set.seed(123) # sizes of population and probability sample N <- 20000 # population n_b <- 1000 # probability # data z1 <- rbinom(N, 1, 0.7) z2 <- runif(N, 0, 2) z3 <- rexp(N, 1) z4 <- rchisq(N, 4) # covariates x1 <- z1 x2 <- z2 + 0.3 * z2 x3 <- z3 + 0.2 * (z1 + z2) x4 <- z4 + 0.1 * (z1 + z2 + z3) epsilon <- rnorm(N) sigma_30 <- 10.4 sigma_50 <- 5.2 sigma_80 <- 2.4 # response variables y30 <- 2 + x1 + x2 + x3 + x4 + sigma_30 * epsilon y50 <- 2 + x1 + x2 + x3 + x4 + sigma_50 * epsilon y80 <- 2 + x1 + x2 + x3 + x4 + sigma_80 * epsilon # population sim_data <- data.frame(y30, y50, y80, x1, x2, x3, x4) ## propensity score model for non-probability sample (sum to 1000) eta <- -4.461 + 0.1 * x1 + 0.2 * x2 + 0.1 * x3 + 0.2 * x4 rho <- plogis(eta) # inclusion probabilities for probability sample z_prob <- x3 + 0.2051 sim_data$p_prob <- n_b* z_prob/sum(z_prob) # data sim_data$flag_nonprob <- as.numeric(runif(N) < rho) ## sampling nonprob sim_data$flag_prob <- as.numeric(runif(n_b) < sim_data$p_prob) ## sampling prob nonprob_df <- subset(sim_data, flag_nonprob == 1) ## non-probability sample svyprob <- svydesign( ids = ~1, probs = ~p_prob, data = subset(sim_data, flag_prob == 1), pps = "brewer" ) ## probability sample ## mass imputation estimator mi_res <- nonprob( outcome = y30 + y50 + y80 ~ x1 + x2 + x3 + x4, data = nonprob_df, svydesign = svyprob ) mi_res ## inverse probability weighted estimator ipw_res <- nonprob( selection = ~ x1 + x2 + x3 + x4, target = ~y30 + y50 + y80, data = nonprob_df, svydesign = svyprob ) ipw_res ## doubly robust estimator dr_res <- nonprob( outcome = y30 + y50 + y80 ~ x1 + x2 + x3 + x4, selection = ~ x1 + x2 + x3 + x4, data = nonprob_df, svydesign = svyprob ) dr_res# generate data based on Doubly Robust Inference With Non-probability Survey Samples (2021) # Yilin Chen , Pengfei Li & Changbao Wu set.seed(123) # sizes of population and probability sample N <- 20000 # population n_b <- 1000 # probability # data z1 <- rbinom(N, 1, 0.7) z2 <- runif(N, 0, 2) z3 <- rexp(N, 1) z4 <- rchisq(N, 4) # covariates x1 <- z1 x2 <- z2 + 0.3 * z2 x3 <- z3 + 0.2 * (z1 + z2) x4 <- z4 + 0.1 * (z1 + z2 + z3) epsilon <- rnorm(N) sigma_30 <- 10.4 sigma_50 <- 5.2 sigma_80 <- 2.4 # response variables y30 <- 2 + x1 + x2 + x3 + x4 + sigma_30 * epsilon y50 <- 2 + x1 + x2 + x3 + x4 + sigma_50 * epsilon y80 <- 2 + x1 + x2 + x3 + x4 + sigma_80 * epsilon # population sim_data <- data.frame(y30, y50, y80, x1, x2, x3, x4) ## propensity score model for non-probability sample (sum to 1000) eta <- -4.461 + 0.1 * x1 + 0.2 * x2 + 0.1 * x3 + 0.2 * x4 rho <- plogis(eta) # inclusion probabilities for probability sample z_prob <- x3 + 0.2051 sim_data$p_prob <- n_b* z_prob/sum(z_prob) # data sim_data$flag_nonprob <- as.numeric(runif(N) < rho) ## sampling nonprob sim_data$flag_prob <- as.numeric(runif(n_b) < sim_data$p_prob) ## sampling prob nonprob_df <- subset(sim_data, flag_nonprob == 1) ## non-probability sample svyprob <- svydesign( ids = ~1, probs = ~p_prob, data = subset(sim_data, flag_prob == 1), pps = "brewer" ) ## probability sample ## mass imputation estimator mi_res <- nonprob( outcome = y30 + y50 + y80 ~ x1 + x2 + x3 + x4, data = nonprob_df, svydesign = svyprob ) mi_res ## inverse probability weighted estimator ipw_res <- nonprob( selection = ~ x1 + x2 + x3 + x4, target = ~y30 + y50 + y80, data = nonprob_df, svydesign = svyprob ) ipw_res ## doubly robust estimator dr_res <- nonprob( outcome = y30 + y50 + y80 ~ x1 + x2 + x3 + x4, selection = ~ x1 + x2 + x3 + x4, data = nonprob_df, svydesign = svyprob ) dr_res
Simple plotting method that compares the estimated mean(s) and CI(s) with the naive (uncorrected) estimates.
## S3 method for class 'nonprob' plot(x, ...)## S3 method for class 'nonprob' plot(x, ...)
x |
the |
... |
other arguments passed to the plot method (currently not supported) |
data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit") plot(ipw_est1)data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit") plot(ipw_est1)
Returns population size that is assumed to be
fixed – if it is based on the pop_size argument or population totals,
estimated – if it is based on the probability survey specified in the svydesign or, for Hajek-type IPW-MLE, on the estimated IPW total for the non-probability sample.
pop_size(object)pop_size(object)
object |
object returned by the |
a scalar returning the value of the population size.
sample_a <- data.frame(y = c(1, 0, 1, 0, 1), x = c(0, 1, 2, 3, 4)) sample_b <- data.frame(x = c(0.5, 1.5, 2.5, 3.5), w = c(4, 4, 4, 4)) sample_b_svy <- svydesign(ids = ~1, weights = ~w, data = sample_b) ipw_est1 <- nonprob( selection = ~x, target = ~y, svydesign = sample_b_svy, data = sample_a, method_selection = "logit", se = FALSE ) ipw_est2 <- nonprob( selection = ~x, target = ~y, svydesign = sample_b_svy, data = sample_a, method_selection = "logit", control_selection = control_sel(est_method = "gee", gee_h_fun = 1), se = FALSE ) ## estimated population size based on the non-calibrated IPW (MLE) pop_size(ipw_est1) ## estimated population size based on the calibrated IPW (GEE) pop_size(ipw_est2)sample_a <- data.frame(y = c(1, 0, 1, 0, 1), x = c(0, 1, 2, 3, 4)) sample_b <- data.frame(x = c(0.5, 1.5, 2.5, 3.5), w = c(4, 4, 4, 4)) sample_b_svy <- svydesign(ids = ~1, weights = ~w, data = sample_b) ipw_est1 <- nonprob( selection = ~x, target = ~y, svydesign = sample_b_svy, data = sample_a, method_selection = "logit", se = FALSE ) ipw_est2 <- nonprob( selection = ~x, target = ~y, svydesign = sample_b_svy, data = sample_a, method_selection = "logit", control_selection = control_sel(est_method = "gee", gee_h_fun = 1), se = FALSE ) ## estimated population size based on the non-calibrated IPW (MLE) pop_size(ipw_est1) ## estimated population size based on the calibrated IPW (GEE) pop_size(ipw_est2)
Print method for the nonprob_summary object which allows for specification
what should be printed or not.
## S3 method for class 'nonprob_summary' print(x, resid = TRUE, pred = TRUE, digits = 4, ...)## S3 method for class 'nonprob_summary' print(x, resid = TRUE, pred = TRUE, digits = 4, ...)
x |
a |
resid |
whether distribution of residuals should be printed (default is |
pred |
whether distribution of predictions should be printed (default is |
digits |
number of digits to be printed (default 4) |
... |
further parameters passed to the print method (currently not supported) |
Summarises the nonprob class object. The summary depends on the type of
the estimator (i.e. IPW, MI, DR)
## S3 method for class 'nonprob' summary(object, ...)## S3 method for class 'nonprob' summary(object, ...)
object |
object of the |
... |
Additional optional arguments |
An object of nonprob_summary class containing:
call call
estimator type of estimator
control list of controls
ipw_weights estimated IPW weights
ipw_weights_total estimated IPW total (sum)
ps_scores_nonprob estimated propensity scores for non-probability sample
ps_scores_prob estimated propensity scores for probability sample
case_weights case weights
output estimated means and standard errors
SE estimated standard errors of V1 and V2
confidence_interval confidence intervals
nonprob_size size of the non-probability sample
prob_size size of the probability sample
pop_size population size
pop_size_fixed whether the population size is treated as fixed
ipw_estimator IPW point-estimator family ("ht" or "hajek")
ipw_denominator denominator used for the IPW point estimator
ipw_denominator_source source of the IPW point-estimator denominator
no_prob whether probability sample was provided
outcome model details
selection selection details
estimator_method estimator method
selection_formula selection formula
outcome_formula outcome formula
vars_selection whether variable selection algorithm was applied
vars_outcome variables of the outcome models
ys_rand_pred predicted values for the random sample (if applies)
ys_nons_pred predicted values for the non-probability sample
ys_resid residuals for the non-probability sample
data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit" ) summary(ipw_est1)data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit" ) summary(ipw_est1)
The update method for the nonprob class object that allows to re-estimate
a given model with changed parameters. This is in particular useful if a user
would like to change method or estimate standard errors if they were not
estimated in the first place.
## S3 method for class 'nonprob' update(object, ..., evaluate = TRUE)## S3 method for class 'nonprob' update(object, ..., evaluate = TRUE)
object |
the |
... |
arguments passed to the |
evaluate |
If true evaluate the new call else return the call |
returns a nonprob object
Maciej Beręsewicz
data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit", se = FALSE ) ipw_est1 update(ipw_est1, se = TRUE)data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit", se = FALSE ) ipw_est1 update(ipw_est1, se = TRUE)
A generic function weights that returns inverse probability weights (if present)
## S3 method for class 'nonprob' weights(object, ...)## S3 method for class 'nonprob' weights(object, ...)
object |
a |
... |
other arguments passed to methods (currently not supported) |
A vector of weights or a NULL extracted from the nonprob object i.e. element "ipw_weights"
data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit", se = FALSE ) summary(weights(ipw_est1))data(admin) data(jvs) jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs) ipw_est1 <- nonprob(selection = ~ region + private + nace + size, target = ~ single_shift, svydesign = jvs_svy, data = admin, method_selection = "logit", se = FALSE ) summary(weights(ipw_est1))