print() and summary() now describe the IPW point estimator concisely: the estimator-type line reads IPW (Hajek, denominator: <N>) or IPW (HT, denominator: <N>) — HT only when pop_size is user-specified (i.e. pop_size_fixed = TRUE), otherwise Hajek with denominator round(sum(ipw_weights)); the separate IPW point estimator: line is removed for pure IPW objects and kept (corrected) only for doubly robust objectscontrol_out() treetype choices: corrected from c("kd", "rp", "ball") to c("kd", "bd") to match what RANN::nn2() actually accepts; "rp" and "ball" were not valid RANN tree types and caused a crash, while "bd" (a legitimate RANN option) was wrongly rejectednum_boot in control_inf() from 500 to 100 and nlambda in control_out() from 100 to 50 to align with the documentationpi is bounded by 1; for probit (inverse-Mills factor phi / (1 - pi)) and cloglog (factor log(1 - pi) == -exp(eta)) it is conservative (over-estimates the standard error) and previously could overflow to a huge or non-finite SE when a fitted propensity approached 0 or 1 (in a small-propensity simulation the probit analytic SE reached ~1e27). The non-logit doubly robust variance terms (t_vec, var_nonprob, and the b-vector weights psd / pi^2 and (1 - pi) / pi^2 * exp(eta)) now floor the fitted propensity away from {0, 1} at 1 / sqrt(N) so they stay finite, a one-time note recommends control_inf(var_method = "bootstrap") for probit/cloglog doubly robust inference, and the documentation states the conservativeness. The logistic path is left untouched (its standard error is unchanged). The previously commented-out probit/cloglog doubly robust tests are re-enabled (point estimates pinned, analytic SE asserted finite), validated by a coverage simulation in which logit stays at nominal coverage (~0.95, SE/SD ~1) and the probit/cloglog SE no longer explodesNA), and the not-yet-implemented overlap controls control_sel(dependence = TRUE) / key = ... now error with a clear message instead of being silently acceptedsvydesign): once cross-validation dropped a covariate, the response was re-extracted by routing the still-full outcome formula through the population-totals model frame, whose name check then failed with "Selection and population totals have different names."; the response is now read directly from the data, independent of the reduced totals. A warning is also emitted flagging this path as experimental, because the penalty (lambda) selection is unreliable here and can collapse to an over-sparse (even intercept-only) model that biases the estimate -- supplying a unit-level svydesign or a fixed control_sel(lambda = ...) is recommended insteadsum(weights) (which equals the non-probability sample size when no probability sample is present) instead of sum(1/pi_hat) (the HT estimate of N), mis-scaling the calibration target by ~N/n and mis-directing the penalty (lambda) selection; the totals term now uses the same N_nons normalisation as the estimating equationnpar) mass-imputation analytic variance: the loess inclusion-propensity proxy can return improper fitted values (<= 0 or near zero), which made the 1/pi_hat^2 weight spurious or non-finite; the propensity is now floored at 1/sqrt(N) (so each unit's inverse-propensity weight is capped at N) before invertingc-vector formula in the GLM mass-imputation analytic variance (method_glm()): the equation had been transcribed from Kim et al. (2021, eq. 9) in that paper's labelling, where sample B is the non-probability sample and A the probability sample -- the opposite of this package's convention (A = non-probability). The sample roles in the general formula are now swapped to match the package convention, the linear special case, and the (already correct) implementation; this is a documentation-only fix with no change to computed variancesepsilon defaults in control_out() (1e-8) and control_sel() (1e-4) to match the code, fixed a var_tot -> var_total return-value name in the method_nn() documentation, removed a stray character in the method_npar() family_outcome parameter doc, and dropped a redundant sum() in the printed IPW-weights totalpop_totals is a named vector whose first element is (Intercept), erroring early with a clear message instead of silently producing an NA population size and NaN standard errorsconfint() to honor the requested level regardless of the fit's control_inf(alpha); previously confint(level = 0.95) returned the stored interval built with the fit's alpha, so a model fitted with e.g. alpha = 0.1 had its 90% interval mislabelled as 95%maxit iterations; the warning now fires under verbose = TRUEV_1 variance formula in the documentation (squared residual and squared bracket); the implementation was already correctcontrol_inf(bias_correction = TRUE): the non-probability variance component (Yang-Kim-Song 2020, eq. 25) used the frequency case_weights instead of the inverse-probability weights 1/pi, which made the standard error anticonservative (95% confidence-interval coverage ~0.87 instead of 0.95 in simulation); it now uses the bias-corrected inverse-probability weights and is computed per outcome so an outcome's SE no longer depends on the other outcomes fitted alongside itcontrol_sel(est_method = "gee"): a scalar ifelse() truncated the propensity-score derivative vectors (ps_nons_der, est_ps_rand_der) to length 1, which corrupted the probability-sample variance component and inflated the standard error; replaced it with a scalar if/else and added a probit-GEE analytic-SE regression testpop_totals GEE bootstrap to match the multicore path (closes #120)bias_correction bootstrap solver from the original-data fit, speeding it up with identical estimates (closes #119)control_out(pmm_k_max = ...) to cap the pmm_k_choice = "min_var" search grid and documented its cost (closes #116)pop_totals and a non-gaussian family_outcome (closes #71)set.seed() makes results reproducible by switching the MI, IPW, and DR multicore loops from foreach::%dopar% to doRNG::%dorng%; this also seeds NN/PMM tie-breaking inside workers (closes #115)method_outcome %in% c("nn", "pmm"): boot_mi() previously hand-mutated a subset of the svyrep.design slots, which left $selfrep at the original length and made survey::svytotal() inside method_nn()/method_pmm() fail deterministically with "(subscript) logical subscript too long"; a silent tryCatch around the per-replicate loop swallowed the error without advancing the counter, so the call ran forever -- the subset now goes through [.svyrep.design and a bounded retry surfaces any deterministic failure as a proper errorcontrol_inf(bias_correction = TRUE) without variable selection, replacing the previous object 'bias_corr_ys_rand_pred' not found crash; the DR analytic and bootstrap variance paths now route through the same joint-estimation helper used by the Yang-Kim-Song (2020) high-dimensional path, and bias_correction = TRUE without svydesign is rejected up front (closes #114) -- thanks to Tommy Nyberg for reporting.R CMD check run a fast tinytest subset by default while keeping the full suite available on GitHub via NONPROBSVY_FULL_TESTS=trueh(x)=x/pi, the GEE b-vector sign, and probit/cloglog probability-sample scaling for h(x)=x, and added plain-R/Rcpp GEE formula checkspi_dot formula and nonprobability-sample IPW plug-in scale, including the known-N probit adjustmentcontrol_out(lambda = ...) value instead of always running cross-validation (closes #66)boot_ipw_weights when bootstrap variance estimation is used and bootstrap results are kept (closes #76)set.seed() reproducibility, and added benchmark evidence for the speedup (closes #103)make_outcomes() return names explicit so callers no longer rely on partial $ matching (closes #100)verbose argument documentation (closes #95, #97, #98)pop_means, case_weights, and logical inference-control flags so unsupported inputs fail early with clearer messages (closes #96)k = 1 path by avoiding dimension drop errors and using leave-one-out matching for the non-probability variance proxy (closes #92)nn_exact_se = TRUE so the mini-bootstrap uses bootstrap-specific nearest-neighbor matches instead of reusing the original donor matches (closes #91)pmm_k_choice = "min_var" so the best k found so far is returned instead of the first non-improving k (closes #93)nonprob_mi() model-frame construction to pass case_weights instead of an undefined internal weights symbol; this is a plumbing fix and does not otherwise change current case-weight semantics (closes #99)RANN::nn2() (closes #105)pmm_k_choice = "min_var" to perform a full search over the candidate k grid rather than stopping at the first non-improving value (closes #106)N Horvitz-Thompson means when pop_size is supplied (closes #107)N means, leave-one-out NN variance proxy, weighted mini-bootstrap, random tie handling, and full-grid PMM k selection (closes #108)k handling, bootstrap-based NN exact SE, and PMM k choicecontrol_out(eps=1e-8)method_nn and method_pmmmethod_glmextract added which allows to extract results from the nonprob objectcoef added which allows to obtain the coefficients of underlying models (if possible)cloglog)sampling package from suggested packageplot methodcheck_balance error (closes #75)pop.size, controlSel, controlOut and controlInf
were renamed to pop_size, control_sel, control_out and
control_inf respectively.genSimData removed completely as it is not used anywhere
in the package.maxLik_method renamed to maxlik_method in the
control_sel function.control_out function:
predictive_match renamed to pmm_match_type to align with the
PMM (Predictive Mean Matching) estimator naming convention,
where all related parameters start with pmm_control_sel function:
method removed as it was not usedest_method_sel renamed to est_methodh renamed to gee_h_fun to make this more readable
to the userstart_type now accepts only zero and mle (for gee models
only).control_inf function:
bias_inf renamed to vars_combine and type changed to
logical. TRUE if variables (its levels) should be combined
after variable selection algorithm for the doubly robust
approach.pi_ij -- argument removed as it is not used.nonprobsvy class renamed to nonprob and all related method
adjusted to this changelogit_model_nonprobsvy, probit_model_nonprobsvy and
cloglog_model_nonprobsvy removed in the favour of more readable
method_ps function that specifies the propensity score modelcontrol_inference=control_inf(vars_combine=TRUE) which
allows doubly robust estimator to combine variables prior estimation
i.e. if selection=~x1+x2 and y~x1+x3 then the following models
are fitted selection=~x1+x2+x3 and y~x1+x2+x3. By default we set
control_inference=control_inf(vars_combine=FALSE). Note that this
behaviour is assumed independently from variable selection.nonprob(weights=NULL) replaced to nonprob(case_weights=NULL)
to stress that this refer to case weights not sampling or other weights
in non-probability samplejvs (Job Vacancy
Survey; a probability sample survey) and admin (Central Job Offers
Database; a non-probability sample survey). The units and auxiliary
variables have been aligned in a way that allows the data to be
integrated using the methods implemented in this package.check_balance function was added to check the balance in the
totals of the variables based on the weighted weights between the
non-probability and probability samples.na_action with default na.omitweights -- returns IPW weightsupdate -- allows to update the nonprob class objectmethod_ps -- for modelling propensity scoremethod_glm -- for modelling y using glm functionmethod_nn -- for the NN methodmethod_pmm -- for the PMM methodmethod_npar -- for the non-parametric methodprint.nonprob, summary.nonprob and print.nonprob_summary
methods> result_mi
A nonprob object
- estimator type: mass imputation
- method: glm (gaussian)
- auxiliary variables source: survey
- vars selection: false
- variance estimator: analytic
- population size fixed: false
- naive (uncorrected) estimators:
- variable y1: 3.1817
- variable y2: 1.8087
- selected estimators:
- variable y1: 2.9498 (se=0.0420, ci=(2.8674, 3.0322))
- variable y2: 1.5760 (se=0.0326, ci=(1.5122, 1.6399))
number of digits can be changed using print(x, digits) as shown below
> print(result_mi,2)
A nonprob object
- estimator type: mass imputation
- method: glm (gaussian)
- auxiliary variables source: survey
- vars selection: false
- variance estimator: analytic
- population size fixed: false
- naive (uncorrected) estimators:
- variable y1: 3.18
- variable y2: 1.81
- selected estimators:
- variable y1: 2.95 (se=0.04, ci=(2.87, 3.03))
- variable y2: 1.58 (se=0.03, ci=(1.51, 1.64))
> summary(result_mi) |> print(digits=2)
A nonprob_summary object
- call: nonprob(data = subset(population, flag_bd1 == 1), outcome = y1 +
y2 ~ x1 + x2, svydesign = sample_prob)
- estimator type: mass imputation
- nonprob sample size: 693011 (69.3%)
- prob sample size: 1000 (0.1%)
- population size: 1000000 (fixed: false)
- detailed information about models are stored in list element(s): "outcome"
----------------------------------------------------------------
- distribution of outcome residuals:
- y1: min: -4.79; mean: 0.00; median: 0.00; max: 4.54
- y2: min: -4.96; mean: -0.00; median: -0.07; max: 12.25
- distribution of outcome predictions (nonprob sample):
- y1: min: -2.72; mean: 3.18; median: 3.04; max: 16.28
- y2: min: -1.55; mean: 1.81; median: 1.58; max: 13.92
- distribution of outcome predictions (prob sample):
- y1: min: -0.46; mean: 2.95; median: 2.84; max: 10.31
- y2: min: -0.58; mean: 1.58; median: 1.39; max: 7.87
----------------------------------------------------------------
formula.toolsstrata is not supported for the time being.maxit argument from controlSel
function to internally used nleqslv functionvector in model_frame when predicting
y_hat in mass imputation glm model when X is based in one
auxiliary variable only - fix provided converting it to data.frame
object.summary about quality of estimation basing on
difference between estimated and known total values of auxiliary
variablescontrolOut function by switching values
for predictive_match argument. From now on, the
predictive_match = 1 means $\hat{y}-\hat{y}$ in predictive mean
matching imputation and predictive_match = 2 corresponds to
$\hat{y}-y$ matching.div option when variable selection (more in
documentation) for doubly robust estimation.nonprob output such as gradient, hessian
and jacobian derived from IPW estimation for mle and gee methods
when IPW or DR model executed.nonprob output when
IPW or DR model executed.model_frame matrix data from probability sample used for
mass imputation to nonprob when MI or DR model executed.logit, complementary log-log and probit link functions.generalized linear models, nearest neighbours and
predictive mean matching methods for Mass ImputationSCAD, LASSO and MCP
penalization equationsanalytic and bootstrap (with parallel computation -
doSNOW package) variance for described estimatorsnonprob class such as
nobs for samples sizepop.size for population size estimationresiduals for residuals of the inverse probability weighting
modelcooks.distance for identifying influential observations that
have a significant impact on the parameter estimateshatvalues for measuring the leverage of individual
observationslogLik for computing the log-likelihood of the model,AIC (Akaike Information Criterion) for evaluating the model
based on the trade-off between goodness of fit and complexity,
helping in model selectionBIC (Bayesian Information Criterion) for a similar purpose as
AIC but with a stronger penalty for model complexityconfint for calculating confidence intervals around parameter
estimatesvcov for obtaining the variance-covariance matrix of the
parameter estimatesdeviance for assessing the goodness of fit of the modelR-cmd checknonprob function.