Changes in version 0.3.0 - print() and summary() now describe the IPW point estimator concisely: the estimator-type line reads IPW (Hajek, denominator: ) or IPW (HT, denominator: ) — HT only when pop_size is user-specified (i.e. pop_size_fixed = TRUE), otherwise Hajek with denominator round(sum(ipw_weights)); the separate IPW point estimator: line is removed for pure IPW objects and kept (corrected) only for doubly robust objects - fixed control_out() treetype choices: corrected from c("kd", "rp", "ball") to c("kd", "bd") to match what RANN::nn2() actually accepts; "rp" and "ball" were not valid RANN tree types and caused a crash, while "bd" (a legitimate RANN option) was wrongly rejected - changed the default of num_boot in control_inf() from 500 to 100 and nlambda in control_out() from 100 to 50 to align with the documentation - vignette based on the JSS paper added - documentation notation adjusted to match the JSS paper - we thank the authors of StatsClaw.ai, the tool that allowed us to identify bugs and improve the code - documented and stabilised the doubly robust analytic variance for non-logit propensity links: the plug-in doubly robust variance (Chen, Li & Wu 2020, Theorem 2, eq. 14) is derived under the logistic model, where the probability-sample (design) variance factor pi is bounded by 1; for probit (inverse-Mills factor phi / (1 - pi)) and cloglog (factor log(1 - pi) == -exp(eta)) it is conservative (over-estimates the standard error) and previously could overflow to a huge or non-finite SE when a fitted propensity approached 0 or 1 (in a small-propensity simulation the probit analytic SE reached ~1e27). The non-logit doubly robust variance terms (t_vec, var_nonprob, and the b-vector weights psd / pi^2 and (1 - pi) / pi^2 * exp(eta)) now floor the fitted propensity away from {0, 1} at 1 / sqrt(N) so they stay finite, a one-time note recommends control_inf(var_method = "bootstrap") for probit/cloglog doubly robust inference, and the documentation states the conservativeness. The logistic path is left untouched (its standard error is unchanged). The previously commented-out probit/cloglog doubly robust tests are re-enabled (point estimates pinned, analytic SE asserted finite), validated by a coverage simulation in which logit stays at nominal coverage (~0.95, SE/SD ~1) and the probit/cloglog SE no longer explodes - added input validation at the boundary: a factor or character outcome/target now raises a clear error (instead of silently returning NA), and the not-yet-implemented overlap controls control_sel(dependence = TRUE) / key = ... now error with a clear message instead of being silently accepted - fixed a crash in IPW/DR variable selection on the population-totals-only path (no svydesign): once cross-validation dropped a covariate, the response was re-extracted by routing the still-full outcome formula through the population-totals model frame, whose name check then failed with "Selection and population totals have different names."; the response is now read directly from the data, independent of the reduced totals. A warning is also emitted flagging this path as experimental, because the penalty (lambda) selection is unreliable here and can collapse to an over-sparse (even intercept-only) model that biases the estimate -- supplying a unit-level svydesign or a fixed control_sel(lambda = ...) is recommended instead - fixed the cross-validation loss for IPW variable selection in the population-totals-only path: the known population totals were normalised by sum(weights) (which equals the non-probability sample size when no probability sample is present) instead of sum(1/pi_hat) (the HT estimate of N), mis-scaling the calibration target by ~N/n and mis-directing the penalty (lambda) selection; the totals term now uses the same N_nons normalisation as the estimating equation - stabilised the nonparametric (npar) mass-imputation analytic variance: the loess inclusion-propensity proxy can return improper fitted values (<= 0 or near zero), which made the 1/pi_hat^2 weight spurious or non-finite; the propensity is now floored at 1/sqrt(N) (so each unit's inverse-propensity weight is capped at N) before inverting - corrected the documented general c-vector formula in the GLM mass-imputation analytic variance (method_glm()): the equation had been transcribed from Kim et al. (2021, eq. 9) in that paper's labelling, where sample B is the non-probability sample and A the probability sample -- the opposite of this package's convention (A = non-probability). The sample roles in the general formula are now swapped to match the package convention, the linear special case, and the (already correct) implementation; this is a documentation-only fix with no change to computed variances - documentation polishing: corrected the documented epsilon defaults in control_out() (1e-8) and control_sel() (1e-4) to match the code, fixed a var_tot -> var_total return-value name in the method_nn() documentation, removed a stray character in the method_npar() family_outcome parameter doc, and dropped a redundant sum() in the printed IPW-weights total - validated that a supplied pop_totals is a named vector whose first element is (Intercept), erroring early with a clear message instead of silently producing an NA population size and NaN standard errors - fixed confint() to honor the requested level regardless of the fit's control_inf(alpha); previously confint(level = 0.95) returned the stored interval built with the fit's alpha, so a model fitted with e.g. alpha = 0.1 had its 90% interval mislabelled as 95% - enabled the Rcpp variable-selection non-convergence warning (previously dead code) and fixed an off-by-one so the solver runs the full maxit iterations; the warning now fires under verbose = TRUE - corrected the displayed mass-imputation GLM V_1 variance formula in the documentation (squared residual and squared bracket); the implementation was already correct - fixed the doubly robust analytic standard error under control_inf(bias_correction = TRUE): the non-probability variance component (Yang-Kim-Song 2020, eq. 25) used the frequency case_weights instead of the inverse-probability weights 1/pi, which made the standard error anticonservative (95% confidence-interval coverage ~0.87 instead of 0.95 in simulation); it now uses the bias-corrected inverse-probability weights and is computed per outcome so an outcome's SE no longer depends on the other outcomes fitted alongside it - fixed the probit propensity-score analytic variance under control_sel(est_method = "gee"): a scalar ifelse() truncated the propensity-score derivative vectors (ps_nons_der, est_ps_rand_der) to length 1, which corrupted the probability-sample variance component and inflated the standard error; replaced it with a scalar if/else and added a probit-GEE analytic-SE regression test - warm-started the single-core IPW pop_totals GEE bootstrap to match the multicore path (closes #120) - warm-started the DR bias_correction bootstrap solver from the original-data fit, speeding it up with identical estimates (closes #119) - added control_out(pmm_k_max = ...) to cap the pmm_k_choice = "min_var" search grid and documented its cost (closes #116) - added a regression test for finite analytic SE with pop_totals and a non-gaussian family_outcome (closes #71) - fixed parallel bootstrap (cores > 1) so set.seed() makes results reproducible by switching the MI, IPW, and DR multicore loops from foreach::%dopar% to doRNG::%dorng%; this also seeds NN/PMM tie-breaking inside workers (closes #115) - fixed an indefinite hang in the MI bootstrap variance path for method_outcome %in% c("nn", "pmm"): boot_mi() previously hand-mutated a subset of the svyrep.design slots, which left $selfrep at the original length and made survey::svytotal() inside method_nn()/method_pmm() fail deterministically with "(subscript) logical subscript too long"; a silent tryCatch around the per-replicate loop swallowed the error without advancing the counter, so the call ran forever -- the subset now goes through [.svyrep.design and a bounded retry surfaces any deterministic failure as a proper error - implemented the Kim & Haziza (2014) low-dimensional joint estimator for control_inf(bias_correction = TRUE) without variable selection, replacing the previous object 'bias_corr_ys_rand_pred' not found crash; the DR analytic and bootstrap variance paths now route through the same joint-estimation helper used by the Yang-Kim-Song (2020) high-dimensional path, and bias_correction = TRUE without svydesign is rejected up front (closes #114) -- thanks to Tommy Nyberg for reporting. - shortened CRAN-facing examples, added spelling wordlist entries, and made R CMD check run a fast tinytest subset by default while keeping the full suite available on GitHub via NONPROBSVY_FULL_TESTS=true - corrected IPW-GEE analytic variance components for h(x)=x/pi, the GEE b-vector sign, and probit/cloglog probability-sample scaling for h(x)=x, and added plain-R/Rcpp GEE formula checks - aligned MLE IPW analytic variance components for logit, probit, and cloglog propensity models with the shared pi_dot formula and nonprobability-sample IPW plug-in scale, including the known-N probit adjustment - allowed outcome variable-selection models to use a user-specified control_out(lambda = ...) value instead of always running cross-validation (closes #66) - returned bootstrap IPW weights in boot_ipw_weights when bootstrap variance estimation is used and bootstrap results are kept (closes #76) - improved the Rcpp variable-selection cross-validation code, fixed SCAD penalty tapering, switched fold sampling to R's RNG for set.seed() reproducibility, and added benchmark evidence for the speedup (closes #103) - added DR regression tests for one-outcome versus multi-outcome analytic uncertainty-component consistency and multi-outcome bootstrap output shapes (addresses part of #101) - added MI regression tests for one-outcome versus multi-outcome output, confidence-interval, and uncertainty-component consistency across GLM, NN, PMM, and NPAR backends (addresses part of #101) - added IPW regression tests for one-outcome versus multi-outcome analytic variance, HT versus Hajek denominator metadata, and probit/cloglog variable-selection smoke coverage (addresses part of #101) - made make_outcomes() return names explicit so callers no longer rely on partial $ matching (closes #100) - stabilized extreme-propensity IPW computations by clamping fitted probabilities, using tail-stable log-likelihood formulas, and guarding Rcpp variable-selection weights (closes #102) - clarified supported data structures, estimator-by-outcome scope, README examples, and verbose argument documentation (closes #95, #97, #98) - tightened validation for pop_means, case_weights, and logical inference-control flags so unsupported inputs fail early with clearer messages (closes #96) - fixed multicore IPW bootstrap for population-totals-only runs and kept bootstrap replicate output shapes consistent (closes #94) - aligned IPW point-estimator denominators with Horvitz-Thompson vs Hajek estimator behavior, added estimator-family metadata and print output, and documented when each estimator is used (closes #89) - fixed incorrect analytic uncertainty for multi-outcome IPW and DR fits by aligning outcome-specific variance and confidence-interval indexing (closes #87, #88) - fixed the NN mass-imputation k = 1 path by avoiding dimension drop errors and using leave-one-out matching for the non-probability variance proxy (closes #92) - fixed nn_exact_se = TRUE so the mini-bootstrap uses bootstrap-specific nearest-neighbor matches instead of reusing the original donor matches (closes #91) - fixed PMM pmm_k_choice = "min_var" so the best k found so far is returned instead of the first non-improving k (closes #93) - fixed nonprob_mi() model-frame construction to pass case_weights instead of an undefined internal weights symbol; this is a plumbing fix and does not otherwise change current case-weight semantics (closes #99) - reused nearest-neighbor search results across outcomes for multi-outcome NN mass-imputation fits (closes #104) - randomized equal-distance nearest-neighbor tie handling before donor aggregation, including hidden cutoff ties beyond the neighbours initially returned by RANN::nn2() (closes #105) - changed PMM pmm_k_choice = "min_var" to perform a full search over the candidate k grid rather than stopping at the first non-improving value (closes #106) - aligned NN and PMM point estimates and probability-side analytic variances with known-N Horvitz-Thompson means when pop_size is supplied (closes #107) - updated NN/PMM function documentation for known-N means, leave-one-out NN variance proxy, weighted mini-bootstrap, random tie handling, and full-grid PMM k selection (closes #108) - added regression tests for multi-outcome analytic variance, NN k handling, bootstrap-based NN exact SE, and PMM k choice Changes in version 0.2.3 (2025-08-20) - changes to the documentation to meet the JSS requirements - documentation polishing Changes in version 0.2.2 (2025-05-24) - new hex sticker by Oliwia Awuku - minor changes to the code, e.g. control_out(eps=1e-8) - fixing a bug in the bootstrap variance estimator the method_nn and method_pmm - fixing bootstrap for doubly robust estimators - more unit-tests for doubly robust estimators and other methods - more informative vignette for method_glm Changes in version 0.2.1 (2025-04-22) - titles corrected - new S3 method extract added which allows to extract results from the nonprob object - new S3 method coef added which allows to obtain the coefficients of underlying models (if possible) - fixed CRAN notes (unit tests for the IPW estimator cloglog) - removed sampling package from suggested package - added simple plot method - improvements in the linear algebra - corrected the check_balance error (closes #75) - code cleaning Changes in version 0.2.0 (2025-03-27) Breaking changes - functions pop.size, controlSel, controlOut and controlInf were renamed to pop_size, control_sel, control_out and control_inf respectively. - function genSimData removed completely as it is not used anywhere in the package. - argument maxLik_method renamed to maxlik_method in the control_sel function. - control_out function: - predictive_match renamed to pmm_match_type to align with the PMM (Predictive Mean Matching) estimator naming convention, where all related parameters start with pmm_ - control_sel function: - argument method removed as it was not used - argument est_method_sel renamed to est_method - argument h renamed to gee_h_fun to make this more readable to the user - start_type now accepts only zero and mle (for gee models only). - control_inf function: - bias_inf renamed to vars_combine and type changed to logical. TRUE if variables (its levels) should be combined after variable selection algorithm for the doubly robust approach. - pi_ij -- argument removed as it is not used. - nonprobsvy class renamed to nonprob and all related method adjusted to this change - functions logit_model_nonprobsvy, probit_model_nonprobsvy and cloglog_model_nonprobsvy removed in the favour of more readable method_ps function that specifies the propensity score model - new option control_inference=control_inf(vars_combine=TRUE) which allows doubly robust estimator to combine variables prior estimation i.e. if selection=~x1+x2 and y~x1+x3 then the following models are fitted selection=~x1+x2+x3 and y~x1+x2+x3. By default we set control_inference=control_inf(vars_combine=FALSE). Note that this behaviour is assumed independently from variable selection. - argument nonprob(weights=NULL) replaced to nonprob(case_weights=NULL) to stress that this refer to case weights not sampling or other weights in non-probability sample Features - two additional datasets have been included: jvs (Job Vacancy Survey; a probability sample survey) and admin (Central Job Offers Database; a non-probability sample survey). The units and auxiliary variables have been aligned in a way that allows the data to be integrated using the methods implemented in this package. - a check_balance function was added to check the balance in the totals of the variables based on the weighted weights between the non-probability and probability samples. - citation file added. - argument na_action with default na.omit - new generic methods added: - weights -- returns IPW weights - update -- allows to update the nonprob class object - new functions added and exported: - method_ps -- for modelling propensity score - method_glm -- for modelling y using glm function - method_nn -- for the NN method - method_pmm -- for the PMM method - method_npar -- for the non-parametric method - new print.nonprob, summary.nonprob and print.nonprob_summary methods > result_mi A nonprob object - estimator type: mass imputation - method: glm (gaussian) - auxiliary variables source: survey - vars selection: false - variance estimator: analytic - population size fixed: false - naive (uncorrected) estimators: - variable y1: 3.1817 - variable y2: 1.8087 - selected estimators: - variable y1: 2.9498 (se=0.0420, ci=(2.8674, 3.0322)) - variable y2: 1.5760 (se=0.0326, ci=(1.5122, 1.6399)) number of digits can be changed using print(x, digits) as shown below > print(result_mi,2) A nonprob object - estimator type: mass imputation - method: glm (gaussian) - auxiliary variables source: survey - vars selection: false - variance estimator: analytic - population size fixed: false - naive (uncorrected) estimators: - variable y1: 3.18 - variable y2: 1.81 - selected estimators: - variable y1: 2.95 (se=0.04, ci=(2.87, 3.03)) - variable y2: 1.58 (se=0.03, ci=(1.51, 1.64)) > summary(result_mi) |> print(digits=2) A nonprob_summary object - call: nonprob(data = subset(population, flag_bd1 == 1), outcome = y1 + y2 ~ x1 + x2, svydesign = sample_prob) - estimator type: mass imputation - nonprob sample size: 693011 (69.3%) - prob sample size: 1000 (0.1%) - population size: 1000000 (fixed: false) - detailed information about models are stored in list element(s): "outcome" ---------------------------------------------------------------- - distribution of outcome residuals: - y1: min: -4.79; mean: 0.00; median: 0.00; max: 4.54 - y2: min: -4.96; mean: -0.00; median: -0.07; max: 12.25 - distribution of outcome predictions (nonprob sample): - y1: min: -2.72; mean: 3.18; median: 3.04; max: 16.28 - y2: min: -1.55; mean: 1.81; median: 1.58; max: 13.92 - distribution of outcome predictions (prob sample): - y1: min: -0.46; mean: 2.95; median: 2.84; max: 10.31 - y2: min: -0.58; mean: 1.58; median: 1.39; max: 7.87 ---------------------------------------------------------------- Bugfixes - basic methods and functions related to variance estimation, weights and probability linking methods have been rewritten in a more optimal and readable way. Other - more informative error messages added. - documentation improved. - switching completely to snake_case. - extensive cleaning of the code. - more unit-tests added. - new dependencies: formula.tools Documentation - annotation has been added that argument strata is not supported for the time being. Replication materials - to verify the quality of the software please refer to the replication materials available here: https://github.com/ncn-foreigners/software-tutorials Changes in version 0.1.1 (2024-11-14) Bugfixes - bug Fix occurring when estimation was based on auxiliary variable, which led to compression of the data from the frame to the vector. - bug Fix related to not passing maxit argument from controlSel function to internally used nleqslv function - bug Fix related to storing vector in model_frame when predicting y_hat in mass imputation glm model when X is based in one auxiliary variable only - fix provided converting it to data.frame object. Features - added information to summary about quality of estimation basing on difference between estimated and known total values of auxiliary variables - added estimation of exact standard error for k-nearest neighbor estimator. - added breaking change to controlOut function by switching values for predictive_match argument. From now on, the predictive_match = 1 means $\hat{y}-\hat{y}$ in predictive mean matching imputation and predictive_match = 2 corresponds to $\hat{y}-y$ matching. - implemented div option when variable selection (more in documentation) for doubly robust estimation. - added more insights to nonprob output such as gradient, hessian and jacobian derived from IPW estimation for mle and gee methods when IPW or DR model executed. - added estimated inclusion probabilities and its derivatives for probability and non-probability samples to nonprob output when IPW or DR model executed. - added model_frame matrix data from probability sample used for mass imputation to nonprob when MI or DR model executed. Unit tests - added unit tests for variable selection models and mi estimation with vector of population totals available Changes in version 0.1.0 (2024-04-04) Features - implemented population mean estimation using doubly robust, inverse probability weighting and mass imputation methods - implemented inverse probability weighting models with Maximum Likelihood Estimation and Generalized Estimating Equations methods with logit, complementary log-log and probit link functions. - implemented generalized linear models, nearest neighbours and predictive mean matching methods for Mass Imputation - implemented bias correction estimators for doubly-robust approach - implemented estimation methods when vector of population means/totals is available - implemented variables selection with SCAD, LASSO and MCP penalization equations - implemented analytic and bootstrap (with parallel computation - doSNOW package) variance for described estimators - added control parameters for models - added S3 methods for object of nonprob class such as - nobs for samples size - pop.size for population size estimation - residuals for residuals of the inverse probability weighting model - cooks.distance for identifying influential observations that have a significant impact on the parameter estimates - hatvalues for measuring the leverage of individual observations - logLik for computing the log-likelihood of the model, - AIC (Akaike Information Criterion) for evaluating the model based on the trade-off between goodness of fit and complexity, helping in model selection - BIC (Bayesian Information Criterion) for a similar purpose as AIC but with a stronger penalty for model complexity - confint for calculating confidence intervals around parameter estimates - vcov for obtaining the variance-covariance matrix of the parameter estimates - deviance for assessing the goodness of fit of the model Unit tests - added unit tests for IPW estimators. Github repository - added automated R-cmd check Documentation - added documentation for nonprob function.