NEWS

nonprobsvy 0.3.0

print() and summary() now describe the IPW point estimator concisely: the estimator-type line reads IPW (Hajek, denominator: <N>) or IPW (HT, denominator: <N>) — HT only when pop_size is user-specified (i.e. pop_size_fixed = TRUE), otherwise Hajek with denominator round(sum(ipw_weights)); the separate IPW point estimator: line is removed for pure IPW objects and kept (corrected) only for doubly robust objects
fixed control_out() treetype choices: corrected from c("kd", "rp", "ball") to c("kd", "bd") to match what RANN::nn2() actually accepts; "rp" and "ball" were not valid RANN tree types and caused a crash, while "bd" (a legitimate RANN option) was wrongly rejected
changed the default of num_boot in control_inf() from 500 to 100 and nlambda in control_out() from 100 to 50 to align with the documentation
vignette based on the JSS paper added
documentation notation adjusted to match the JSS paper
we thank the authors of StatsClaw.ai, the tool that allowed us to identify bugs and improve the code
documented and stabilised the doubly robust analytic variance for non-logit propensity links: the plug-in doubly robust variance (Chen, Li & Wu 2020, Theorem 2, eq. 14) is derived under the logistic model, where the probability-sample (design) variance factor pi is bounded by 1; for probit (inverse-Mills factor phi / (1 - pi)) and cloglog (factor log(1 - pi) == -exp(eta)) it is conservative (over-estimates the standard error) and previously could overflow to a huge or non-finite SE when a fitted propensity approached 0 or 1 (in a small-propensity simulation the probit analytic SE reached ~1e27). The non-logit doubly robust variance terms (t_vec, var_nonprob, and the b-vector weights psd / pi^2 and (1 - pi) / pi^2 * exp(eta)) now floor the fitted propensity away from {0, 1} at 1 / sqrt(N) so they stay finite, a one-time note recommends control_inf(var_method = "bootstrap") for probit/cloglog doubly robust inference, and the documentation states the conservativeness. The logistic path is left untouched (its standard error is unchanged). The previously commented-out probit/cloglog doubly robust tests are re-enabled (point estimates pinned, analytic SE asserted finite), validated by a coverage simulation in which logit stays at nominal coverage (~0.95, SE/SD ~1) and the probit/cloglog SE no longer explodes
added input validation at the boundary: a factor or character outcome/target now raises a clear error (instead of silently returning NA), and the not-yet-implemented overlap controls control_sel(dependence = TRUE) / key = ... now error with a clear message instead of being silently accepted
fixed a crash in IPW/DR variable selection on the population-totals-only path (no svydesign): once cross-validation dropped a covariate, the response was re-extracted by routing the still-full outcome formula through the population-totals model frame, whose name check then failed with "Selection and population totals have different names."; the response is now read directly from the data, independent of the reduced totals. A warning is also emitted flagging this path as experimental, because the penalty (lambda) selection is unreliable here and can collapse to an over-sparse (even intercept-only) model that biases the estimate -- supplying a unit-level svydesign or a fixed control_sel(lambda = ...) is recommended instead
fixed the cross-validation loss for IPW variable selection in the population-totals-only path: the known population totals were normalised by sum(weights) (which equals the non-probability sample size when no probability sample is present) instead of sum(1/pi_hat) (the HT estimate of N), mis-scaling the calibration target by ~N/n and mis-directing the penalty (lambda) selection; the totals term now uses the same N_nons normalisation as the estimating equation
stabilised the nonparametric (npar) mass-imputation analytic variance: the loess inclusion-propensity proxy can return improper fitted values (<= 0 or near zero), which made the 1/pi_hat^2 weight spurious or non-finite; the propensity is now floored at 1/sqrt(N) (so each unit's inverse-propensity weight is capped at N) before inverting
corrected the documented general c-vector formula in the GLM mass-imputation analytic variance (method_glm()): the equation had been transcribed from Kim et al. (2021, eq. 9) in that paper's labelling, where sample B is the non-probability sample and A the probability sample -- the opposite of this package's convention (A = non-probability). The sample roles in the general formula are now swapped to match the package convention, the linear special case, and the (already correct) implementation; this is a documentation-only fix with no change to computed variances
documentation polishing: corrected the documented epsilon defaults in control_out() (1e-8) and control_sel() (1e-4) to match the code, fixed a var_tot -> var_total return-value name in the method_nn() documentation, removed a stray character in the method_npar() family_outcome parameter doc, and dropped a redundant sum() in the printed IPW-weights total
validated that a supplied pop_totals is a named vector whose first element is (Intercept), erroring early with a clear message instead of silently producing an NA population size and NaN standard errors
fixed confint() to honor the requested level regardless of the fit's control_inf(alpha); previously confint(level = 0.95) returned the stored interval built with the fit's alpha, so a model fitted with e.g. alpha = 0.1 had its 90% interval mislabelled as 95%
enabled the Rcpp variable-selection non-convergence warning (previously dead code) and fixed an off-by-one so the solver runs the full maxit iterations; the warning now fires under verbose = TRUE
corrected the displayed mass-imputation GLM V_1 variance formula in the documentation (squared residual and squared bracket); the implementation was already correct
fixed the doubly robust analytic standard error under control_inf(bias_correction = TRUE): the non-probability variance component (Yang-Kim-Song 2020, eq. 25) used the frequency case_weights instead of the inverse-probability weights 1/pi, which made the standard error anticonservative (95% confidence-interval coverage ~0.87 instead of 0.95 in simulation); it now uses the bias-corrected inverse-probability weights and is computed per outcome so an outcome's SE no longer depends on the other outcomes fitted alongside it
fixed the probit propensity-score analytic variance under control_sel(est_method = "gee"): a scalar ifelse() truncated the propensity-score derivative vectors (ps_nons_der, est_ps_rand_der) to length 1, which corrupted the probability-sample variance component and inflated the standard error; replaced it with a scalar if/else and added a probit-GEE analytic-SE regression test
warm-started the single-core IPW pop_totals GEE bootstrap to match the multicore path (closes #120)
warm-started the DR bias_correction bootstrap solver from the original-data fit, speeding it up with identical estimates (closes #119)
added control_out(pmm_k_max = ...) to cap the pmm_k_choice = "min_var" search grid and documented its cost (closes #116)
added a regression test for finite analytic SE with pop_totals and a non-gaussian family_outcome (closes #71)
fixed parallel bootstrap (cores > 1) so set.seed() makes results reproducible by switching the MI, IPW, and DR multicore loops from foreach::%dopar% to doRNG::%dorng%; this also seeds NN/PMM tie-breaking inside workers (closes #115)
fixed an indefinite hang in the MI bootstrap variance path for method_outcome %in% c("nn", "pmm"): boot_mi() previously hand-mutated a subset of the svyrep.design slots, which left $selfrep at the original length and made survey::svytotal() inside method_nn()/method_pmm() fail deterministically with "(subscript) logical subscript too long"; a silent tryCatch around the per-replicate loop swallowed the error without advancing the counter, so the call ran forever -- the subset now goes through [.svyrep.design and a bounded retry surfaces any deterministic failure as a proper error
implemented the Kim & Haziza (2014) low-dimensional joint estimator for control_inf(bias_correction = TRUE) without variable selection, replacing the previous object 'bias_corr_ys_rand_pred' not found crash; the DR analytic and bootstrap variance paths now route through the same joint-estimation helper used by the Yang-Kim-Song (2020) high-dimensional path, and bias_correction = TRUE without svydesign is rejected up front (closes #114) -- thanks to Tommy Nyberg for reporting.
shortened CRAN-facing examples, added spelling wordlist entries, and made R CMD check run a fast tinytest subset by default while keeping the full suite available on GitHub via NONPROBSVY_FULL_TESTS=true
corrected IPW-GEE analytic variance components for h(x)=x/pi, the GEE b-vector sign, and probit/cloglog probability-sample scaling for h(x)=x, and added plain-R/Rcpp GEE formula checks
aligned MLE IPW analytic variance components for logit, probit, and cloglog propensity models with the shared pi_dot formula and nonprobability-sample IPW plug-in scale, including the known-N probit adjustment
allowed outcome variable-selection models to use a user-specified control_out(lambda = ...) value instead of always running cross-validation (closes #66)
returned bootstrap IPW weights in boot_ipw_weights when bootstrap variance estimation is used and bootstrap results are kept (closes #76)
improved the Rcpp variable-selection cross-validation code, fixed SCAD penalty tapering, switched fold sampling to R's RNG for set.seed() reproducibility, and added benchmark evidence for the speedup (closes #103)
added DR regression tests for one-outcome versus multi-outcome analytic uncertainty-component consistency and multi-outcome bootstrap output shapes (addresses part of #101)
added MI regression tests for one-outcome versus multi-outcome output, confidence-interval, and uncertainty-component consistency across GLM, NN, PMM, and NPAR backends (addresses part of #101)
added IPW regression tests for one-outcome versus multi-outcome analytic variance, HT versus Hajek denominator metadata, and probit/cloglog variable-selection smoke coverage (addresses part of #101)
made make_outcomes() return names explicit so callers no longer rely on partial $ matching (closes #100)
stabilized extreme-propensity IPW computations by clamping fitted probabilities, using tail-stable log-likelihood formulas, and guarding Rcpp variable-selection weights (closes #102)
clarified supported data structures, estimator-by-outcome scope, README examples, and verbose argument documentation (closes #95, #97, #98)
tightened validation for pop_means, case_weights, and logical inference-control flags so unsupported inputs fail early with clearer messages (closes #96)
fixed multicore IPW bootstrap for population-totals-only runs and kept bootstrap replicate output shapes consistent (closes #94)
aligned IPW point-estimator denominators with Horvitz-Thompson vs Hajek estimator behavior, added estimator-family metadata and print output, and documented when each estimator is used (closes #89)
fixed incorrect analytic uncertainty for multi-outcome IPW and DR fits by aligning outcome-specific variance and confidence-interval indexing (closes #87, #88)
fixed the NN mass-imputation k = 1 path by avoiding dimension drop errors and using leave-one-out matching for the non-probability variance proxy (closes #92)
fixed nn_exact_se = TRUE so the mini-bootstrap uses bootstrap-specific nearest-neighbor matches instead of reusing the original donor matches (closes #91)
fixed PMM pmm_k_choice = "min_var" so the best k found so far is returned instead of the first non-improving k (closes #93)
fixed nonprob_mi() model-frame construction to pass case_weights instead of an undefined internal weights symbol; this is a plumbing fix and does not otherwise change current case-weight semantics (closes #99)
reused nearest-neighbor search results across outcomes for multi-outcome NN mass-imputation fits (closes #104)
randomized equal-distance nearest-neighbor tie handling before donor aggregation, including hidden cutoff ties beyond the neighbours initially returned by RANN::nn2() (closes #105)
changed PMM pmm_k_choice = "min_var" to perform a full search over the candidate k grid rather than stopping at the first non-improving value (closes #106)
aligned NN and PMM point estimates and probability-side analytic variances with known-N Horvitz-Thompson means when pop_size is supplied (closes #107)
updated NN/PMM function documentation for known-N means, leave-one-out NN variance proxy, weighted mini-bootstrap, random tie handling, and full-grid PMM k selection (closes #108)
added regression tests for multi-outcome analytic variance, NN k handling, bootstrap-based NN exact SE, and PMM k choice

nonprobsvy 0.2.3 (2025-08-20)

changes to the documentation to meet the JSS requirements
documentation polishing

nonprobsvy 0.2.2 (2025-05-24)

new hex sticker by Oliwia Awuku
minor changes to the code, e.g. control_out(eps=1e-8)
fixing a bug in the bootstrap variance estimator the method_nn and method_pmm
fixing bootstrap for doubly robust estimators
more unit-tests for doubly robust estimators and other methods
more informative vignette for method_glm

nonprobsvy 0.2.1 (2025-04-22)

titles corrected
new S3 method extract added which allows to extract results from the nonprob object
new S3 method coef added which allows to obtain the coefficients of underlying models (if possible)
fixed CRAN notes (unit tests for the IPW estimator cloglog)
removed sampling package from suggested package
added simple plot method
improvements in the linear algebra
corrected the check_balance error (closes #75)
code cleaning

nonprobsvy 0.2.0 (2025-03-27)

Breaking changes

functions pop.size, controlSel, controlOut and controlInf were renamed to pop_size, control_sel, control_out and control_inf respectively.
function genSimData removed completely as it is not used anywhere in the package.
argument maxLik_method renamed to maxlik_method in the control_sel function.
control_out function:
- predictive_match renamed to pmm_match_type to align with the PMM (Predictive Mean Matching) estimator naming convention, where all related parameters start with pmm_
control_sel function:
- argument method removed as it was not used
- argument est_method_sel renamed to est_method
- argument h renamed to gee_h_fun to make this more readable to the user
- start_type now accepts only zero and mle (for gee models only).
control_inf function:
- bias_inf renamed to vars_combine and type changed to logical. TRUE if variables (its levels) should be combined after variable selection algorithm for the doubly robust approach.
- pi_ij -- argument removed as it is not used.
nonprobsvy class renamed to nonprob and all related method adjusted to this change
functions logit_model_nonprobsvy, probit_model_nonprobsvy and cloglog_model_nonprobsvy removed in the favour of more readable method_ps function that specifies the propensity score model
new option control_inference=control_inf(vars_combine=TRUE) which allows doubly robust estimator to combine variables prior estimation i.e. if selection=~x1+x2 and y~x1+x3 then the following models are fitted selection=~x1+x2+x3 and y~x1+x2+x3. By default we set control_inference=control_inf(vars_combine=FALSE). Note that this behaviour is assumed independently from variable selection.
argument nonprob(weights=NULL) replaced to nonprob(case_weights=NULL) to stress that this refer to case weights not sampling or other weights in non-probability sample

Features

two additional datasets have been included: jvs (Job Vacancy Survey; a probability sample survey) and admin (Central Job Offers Database; a non-probability sample survey). The units and auxiliary variables have been aligned in a way that allows the data to be integrated using the methods implemented in this package.
a check_balance function was added to check the balance in the totals of the variables based on the weighted weights between the non-probability and probability samples.
citation file added.
argument na_action with default na.omit
new generic methods added:
- weights -- returns IPW weights
- update -- allows to update the nonprob class object
new functions added and exported:
- method_ps -- for modelling propensity score
- method_glm -- for modelling y using glm function
- method_nn -- for the NN method
- method_pmm -- for the PMM method
- method_npar -- for the non-parametric method
new print.nonprob, summary.nonprob and print.nonprob_summary methods

> result_mi
A nonprob object
 - estimator type: mass imputation
 - method: glm (gaussian)
 - auxiliary variables source: survey
 - vars selection: false
 - variance estimator: analytic
 - population size fixed: false
 - naive (uncorrected) estimators:
   - variable y1: 3.1817
   - variable y2: 1.8087
 - selected estimators:
   - variable y1: 2.9498 (se=0.0420, ci=(2.8674, 3.0322))
   - variable y2: 1.5760 (se=0.0326, ci=(1.5122, 1.6399))

number of digits can be changed using print(x, digits) as shown below

> print(result_mi,2)
A nonprob object
 - estimator type: mass imputation
 - method: glm (gaussian)
 - auxiliary variables source: survey
 - vars selection: false
 - variance estimator: analytic
 - population size fixed: false
 - naive (uncorrected) estimators:
   - variable y1: 3.18
   - variable y2: 1.81
 - selected estimators:
   - variable y1: 2.95 (se=0.04, ci=(2.87, 3.03))
   - variable y2: 1.58 (se=0.03, ci=(1.51, 1.64))

> summary(result_mi) |> print(digits=2)
A nonprob_summary object
 - call: nonprob(data = subset(population, flag_bd1 == 1), outcome = y1 + 
    y2 ~ x1 + x2, svydesign = sample_prob)
 - estimator type: mass imputation
 - nonprob sample size: 693011 (69.3%)
 - prob sample size: 1000 (0.1%)
 - population size: 1000000 (fixed: false)
 - detailed information about models are stored in list element(s): "outcome"
----------------------------------------------------------------
 - distribution of outcome residuals:
   - y1: min: -4.79; mean: 0.00; median: 0.00; max: 4.54
   - y2: min: -4.96; mean: -0.00; median: -0.07; max: 12.25
 - distribution of outcome predictions (nonprob sample):
   - y1: min: -2.72; mean: 3.18; median: 3.04; max: 16.28
   - y2: min: -1.55; mean: 1.81; median: 1.58; max: 13.92
 - distribution of outcome predictions (prob sample):
   - y1: min: -0.46; mean: 2.95; median: 2.84; max: 10.31
   - y2: min: -0.58; mean: 1.58; median: 1.39; max: 7.87
----------------------------------------------------------------

Bugfixes

basic methods and functions related to variance estimation, weights and probability linking methods have been rewritten in a more optimal and readable way.

Other

more informative error messages added.
documentation improved.
switching completely to snake_case.
extensive cleaning of the code.
more unit-tests added.
new dependencies: formula.tools

Documentation

annotation has been added that argument strata is not supported for the time being.

Replication materials

to verify the quality of the software please refer to the replication materials available here: https://github.com/ncn-foreigners/software-tutorials

nonprobsvy 0.1.1 (2024-11-14)

Bugfixes

bug Fix occurring when estimation was based on auxiliary variable, which led to compression of the data from the frame to the vector.
bug Fix related to not passing maxit argument from controlSel function to internally used nleqslv function
bug Fix related to storing vector in model_frame when predicting y_hat in mass imputation glm model when X is based in one auxiliary variable only - fix provided converting it to data.frame object.

Features

added information to summary about quality of estimation basing on difference between estimated and known total values of auxiliary variables
added estimation of exact standard error for k-nearest neighbor estimator.
added breaking change to controlOut function by switching values for predictive_match argument. From now on, the predictive_match = 1 means $\hat{y}-\hat{y}$ in predictive mean matching imputation and predictive_match = 2 corresponds to $\hat{y}-y$ matching.
implemented div option when variable selection (more in documentation) for doubly robust estimation.
added more insights to nonprob output such as gradient, hessian and jacobian derived from IPW estimation for mle and gee methods when IPW or DR model executed.
added estimated inclusion probabilities and its derivatives for probability and non-probability samples to nonprob output when IPW or DR model executed.
added model_frame matrix data from probability sample used for mass imputation to nonprob when MI or DR model executed.

Unit tests

added unit tests for variable selection models and mi estimation with vector of population totals available

nonprobsvy 0.1.0 (2024-04-04)

Features

implemented population mean estimation using doubly robust, inverse probability weighting and mass imputation methods
implemented inverse probability weighting models with Maximum Likelihood Estimation and Generalized Estimating Equations methods with logit, complementary log-log and probit link functions.
implemented generalized linear models, nearest neighbours and predictive mean matching methods for Mass Imputation
implemented bias correction estimators for doubly-robust approach
implemented estimation methods when vector of population means/totals is available
implemented variables selection with SCAD, LASSO and MCP penalization equations
implemented analytic and bootstrap (with parallel computation - doSNOW package) variance for described estimators
added control parameters for models
added S3 methods for object of nonprob class such as
- nobs for samples size
- pop.size for population size estimation
- residuals for residuals of the inverse probability weighting model
- cooks.distance for identifying influential observations that have a significant impact on the parameter estimates
- hatvalues for measuring the leverage of individual observations
- logLik for computing the log-likelihood of the model,
- AIC (Akaike Information Criterion) for evaluating the model based on the trade-off between goodness of fit and complexity, helping in model selection
- BIC (Bayesian Information Criterion) for a similar purpose as AIC but with a stronger penalty for model complexity
- confint for calculating confidence intervals around parameter estimates
- vcov for obtaining the variance-covariance matrix of the parameter estimates
- deviance for assessing the goodness of fit of the model

Unit tests

added unit tests for IPW estimators.

Github repository

added automated R-cmd check

Documentation

added documentation for nonprob function.