Changes in version 0.3.0                        

  - print() and summary() now describe the IPW point estimator
    concisely: the estimator-type line reads IPW (Hajek, denominator:
    <N>) or IPW (HT, denominator: <N>) — HT only when pop_size is
    user-specified (i.e. pop_size_fixed = TRUE), otherwise Hajek with
    denominator round(sum(ipw_weights)); the separate IPW point
    estimator: line is removed for pure IPW objects and kept (corrected)
    only for doubly robust objects
  - fixed control_out() treetype choices: corrected from c("kd", "rp",
    "ball") to c("kd", "bd") to match what RANN::nn2() actually accepts;
    "rp" and "ball" were not valid RANN tree types and caused a crash,
    while "bd" (a legitimate RANN option) was wrongly rejected
  - changed the default of num_boot in control_inf() from 500 to 100 and
    nlambda in control_out() from 100 to 50 to align with the
    documentation
  - vignette based on the JSS paper added
  - documentation notation adjusted to match the JSS paper
  - we thank the authors of StatsClaw.ai, the tool that allowed us to
    identify bugs and improve the code
  - documented and stabilised the doubly robust analytic variance for
    non-logit propensity links: the plug-in doubly robust variance
    (Chen, Li & Wu 2020, Theorem 2, eq. 14) is derived under the
    logistic model, where the probability-sample (design) variance
    factor pi is bounded by 1; for probit (inverse-Mills factor phi / (1
    - pi)) and cloglog (factor log(1 - pi) == -exp(eta)) it is
    conservative (over-estimates the standard error) and previously
    could overflow to a huge or non-finite SE when a fitted propensity
    approached 0 or 1 (in a small-propensity simulation the probit
    analytic SE reached ~1e27). The non-logit doubly robust variance
    terms (t_vec, var_nonprob, and the b-vector weights psd / pi^2 and
    (1 - pi) / pi^2 * exp(eta)) now floor the fitted propensity away
    from {0, 1} at 1 / sqrt(N) so they stay finite, a one-time note
    recommends control_inf(var_method = "bootstrap") for probit/cloglog
    doubly robust inference, and the documentation states the
    conservativeness. The logistic path is left untouched (its standard
    error is unchanged). The previously commented-out probit/cloglog
    doubly robust tests are re-enabled (point estimates pinned, analytic
    SE asserted finite), validated by a coverage simulation in which
    logit stays at nominal coverage (~0.95, SE/SD ~1) and the
    probit/cloglog SE no longer explodes
  - added input validation at the boundary: a factor or character
    outcome/target now raises a clear error (instead of silently
    returning NA), and the not-yet-implemented overlap controls
    control_sel(dependence = TRUE) / key = ... now error with a clear
    message instead of being silently accepted
  - fixed a crash in IPW/DR variable selection on the
    population-totals-only path (no svydesign): once cross-validation
    dropped a covariate, the response was re-extracted by routing the
    still-full outcome formula through the population-totals model
    frame, whose name check then failed with "Selection and population
    totals have different names."; the response is now read directly
    from the data, independent of the reduced totals. A warning is also
    emitted flagging this path as experimental, because the penalty
    (lambda) selection is unreliable here and can collapse to an
    over-sparse (even intercept-only) model that biases the estimate --
    supplying a unit-level svydesign or a fixed control_sel(lambda =
    ...) is recommended instead
  - fixed the cross-validation loss for IPW variable selection in the
    population-totals-only path: the known population totals were
    normalised by sum(weights) (which equals the non-probability sample
    size when no probability sample is present) instead of sum(1/pi_hat)
    (the HT estimate of N), mis-scaling the calibration target by ~N/n
    and mis-directing the penalty (lambda) selection; the totals term
    now uses the same N_nons normalisation as the estimating equation
  - stabilised the nonparametric (npar) mass-imputation analytic
    variance: the loess inclusion-propensity proxy can return improper
    fitted values (<= 0 or near zero), which made the 1/pi_hat^2 weight
    spurious or non-finite; the propensity is now floored at 1/sqrt(N)
    (so each unit's inverse-propensity weight is capped at N) before
    inverting
  - corrected the documented general c-vector formula in the GLM
    mass-imputation analytic variance (method_glm()): the equation had
    been transcribed from Kim et al. (2021, eq. 9) in that paper's
    labelling, where sample B is the non-probability sample and A the
    probability sample -- the opposite of this package's convention (A =
    non-probability). The sample roles in the general formula are now
    swapped to match the package convention, the linear special case,
    and the (already correct) implementation; this is a
    documentation-only fix with no change to computed variances
  - documentation polishing: corrected the documented epsilon defaults
    in control_out() (1e-8) and control_sel() (1e-4) to match the code,
    fixed a var_tot -> var_total return-value name in the method_nn()
    documentation, removed a stray character in the method_npar()
    family_outcome parameter doc, and dropped a redundant sum() in the
    printed IPW-weights total
  - validated that a supplied pop_totals is a named vector whose first
    element is (Intercept), erroring early with a clear message instead
    of silently producing an NA population size and NaN standard errors
  - fixed confint() to honor the requested level regardless of the fit's
    control_inf(alpha); previously confint(level = 0.95) returned the
    stored interval built with the fit's alpha, so a model fitted with
    e.g. alpha = 0.1 had its 90% interval mislabelled as 95%
  - enabled the Rcpp variable-selection non-convergence warning
    (previously dead code) and fixed an off-by-one so the solver runs
    the full maxit iterations; the warning now fires under verbose =
    TRUE
  - corrected the displayed mass-imputation GLM V_1 variance formula in
    the documentation (squared residual and squared bracket); the
    implementation was already correct
  - fixed the doubly robust analytic standard error under
    control_inf(bias_correction = TRUE): the non-probability variance
    component (Yang-Kim-Song 2020, eq. 25) used the frequency
    case_weights instead of the inverse-probability weights 1/pi, which
    made the standard error anticonservative (95% confidence-interval
    coverage ~0.87 instead of 0.95 in simulation); it now uses the
    bias-corrected inverse-probability weights and is computed per
    outcome so an outcome's SE no longer depends on the other outcomes
    fitted alongside it
  - fixed the probit propensity-score analytic variance under
    control_sel(est_method = "gee"): a scalar ifelse() truncated the
    propensity-score derivative vectors (ps_nons_der, est_ps_rand_der)
    to length 1, which corrupted the probability-sample variance
    component and inflated the standard error; replaced it with a scalar
    if/else and added a probit-GEE analytic-SE regression test
  - warm-started the single-core IPW pop_totals GEE bootstrap to match
    the multicore path (closes #120)
  - warm-started the DR bias_correction bootstrap solver from the
    original-data fit, speeding it up with identical estimates (closes
    #119)
  - added control_out(pmm_k_max = ...) to cap the pmm_k_choice =
    "min_var" search grid and documented its cost (closes #116)
  - added a regression test for finite analytic SE with pop_totals and a
    non-gaussian family_outcome (closes #71)
  - fixed parallel bootstrap (cores > 1) so set.seed() makes results
    reproducible by switching the MI, IPW, and DR multicore loops from
    foreach::%dopar% to doRNG::%dorng%; this also seeds NN/PMM
    tie-breaking inside workers (closes #115)
  - fixed an indefinite hang in the MI bootstrap variance path for
    method_outcome %in% c("nn", "pmm"): boot_mi() previously
    hand-mutated a subset of the svyrep.design slots, which left
    $selfrep at the original length and made survey::svytotal() inside
    method_nn()/method_pmm() fail deterministically with "(subscript)
    logical subscript too long"; a silent tryCatch around the
    per-replicate loop swallowed the error without advancing the
    counter, so the call ran forever -- the subset now goes through
    [.svyrep.design and a bounded retry surfaces any deterministic
    failure as a proper error
  - implemented the Kim & Haziza (2014) low-dimensional joint estimator
    for control_inf(bias_correction = TRUE) without variable selection,
    replacing the previous object 'bias_corr_ys_rand_pred' not found
    crash; the DR analytic and bootstrap variance paths now route
    through the same joint-estimation helper used by the Yang-Kim-Song
    (2020) high-dimensional path, and bias_correction = TRUE without
    svydesign is rejected up front (closes #114) -- thanks to Tommy
    Nyberg for reporting.
  - shortened CRAN-facing examples, added spelling wordlist entries, and
    made R CMD check run a fast tinytest subset by default while keeping
    the full suite available on GitHub via NONPROBSVY_FULL_TESTS=true
  - corrected IPW-GEE analytic variance components for h(x)=x/pi, the
    GEE b-vector sign, and probit/cloglog probability-sample scaling for
    h(x)=x, and added plain-R/Rcpp GEE formula checks
  - aligned MLE IPW analytic variance components for logit, probit, and
    cloglog propensity models with the shared pi_dot formula and
    nonprobability-sample IPW plug-in scale, including the known-N
    probit adjustment
  - allowed outcome variable-selection models to use a user-specified
    control_out(lambda = ...) value instead of always running
    cross-validation (closes #66)
  - returned bootstrap IPW weights in boot_ipw_weights when bootstrap
    variance estimation is used and bootstrap results are kept (closes
    #76)
  - improved the Rcpp variable-selection cross-validation code, fixed
    SCAD penalty tapering, switched fold sampling to R's RNG for
    set.seed() reproducibility, and added benchmark evidence for the
    speedup (closes #103)
  - added DR regression tests for one-outcome versus multi-outcome
    analytic uncertainty-component consistency and multi-outcome
    bootstrap output shapes (addresses part of #101)
  - added MI regression tests for one-outcome versus multi-outcome
    output, confidence-interval, and uncertainty-component consistency
    across GLM, NN, PMM, and NPAR backends (addresses part of #101)
  - added IPW regression tests for one-outcome versus multi-outcome
    analytic variance, HT versus Hajek denominator metadata, and
    probit/cloglog variable-selection smoke coverage (addresses part of
    #101)
  - made make_outcomes() return names explicit so callers no longer rely
    on partial $ matching (closes #100)
  - stabilized extreme-propensity IPW computations by clamping fitted
    probabilities, using tail-stable log-likelihood formulas, and
    guarding Rcpp variable-selection weights (closes #102)
  - clarified supported data structures, estimator-by-outcome scope,
    README examples, and verbose argument documentation (closes #95,
    #97, #98)
  - tightened validation for pop_means, case_weights, and logical
    inference-control flags so unsupported inputs fail early with
    clearer messages (closes #96)
  - fixed multicore IPW bootstrap for population-totals-only runs and
    kept bootstrap replicate output shapes consistent (closes #94)
  - aligned IPW point-estimator denominators with Horvitz-Thompson vs
    Hajek estimator behavior, added estimator-family metadata and print
    output, and documented when each estimator is used (closes #89)
  - fixed incorrect analytic uncertainty for multi-outcome IPW and DR
    fits by aligning outcome-specific variance and confidence-interval
    indexing (closes #87, #88)
  - fixed the NN mass-imputation k = 1 path by avoiding dimension drop
    errors and using leave-one-out matching for the non-probability
    variance proxy (closes #92)
  - fixed nn_exact_se = TRUE so the mini-bootstrap uses
    bootstrap-specific nearest-neighbor matches instead of reusing the
    original donor matches (closes #91)
  - fixed PMM pmm_k_choice = "min_var" so the best k found so far is
    returned instead of the first non-improving k (closes #93)
  - fixed nonprob_mi() model-frame construction to pass case_weights
    instead of an undefined internal weights symbol; this is a plumbing
    fix and does not otherwise change current case-weight semantics
    (closes #99)
  - reused nearest-neighbor search results across outcomes for
    multi-outcome NN mass-imputation fits (closes #104)
  - randomized equal-distance nearest-neighbor tie handling before donor
    aggregation, including hidden cutoff ties beyond the neighbours
    initially returned by RANN::nn2() (closes #105)
  - changed PMM pmm_k_choice = "min_var" to perform a full search over
    the candidate k grid rather than stopping at the first non-improving
    value (closes #106)
  - aligned NN and PMM point estimates and probability-side analytic
    variances with known-N Horvitz-Thompson means when pop_size is
    supplied (closes #107)
  - updated NN/PMM function documentation for known-N means,
    leave-one-out NN variance proxy, weighted mini-bootstrap, random tie
    handling, and full-grid PMM k selection (closes #108)
  - added regression tests for multi-outcome analytic variance, NN k
    handling, bootstrap-based NN exact SE, and PMM k choice

                 Changes in version 0.2.3 (2025-08-20)                  

  - changes to the documentation to meet the JSS requirements
  - documentation polishing

                 Changes in version 0.2.2 (2025-05-24)                  

  - new hex sticker by Oliwia Awuku
  - minor changes to the code, e.g. control_out(eps=1e-8)
  - fixing a bug in the bootstrap variance estimator the method_nn and
    method_pmm
  - fixing bootstrap for doubly robust estimators
  - more unit-tests for doubly robust estimators and other methods
  - more informative vignette for method_glm

                 Changes in version 0.2.1 (2025-04-22)                  

  - titles corrected
  - new S3 method extract added which allows to extract results from the
    nonprob object
  - new S3 method coef added which allows to obtain the coefficients of
    underlying models (if possible)
  - fixed CRAN notes (unit tests for the IPW estimator cloglog)
  - removed sampling package from suggested package
  - added simple plot method
  - improvements in the linear algebra
  - corrected the check_balance error (closes #75)
  - code cleaning

                 Changes in version 0.2.0 (2025-03-27)                  

Breaking changes

  - functions pop.size, controlSel, controlOut and controlInf were
    renamed to pop_size, control_sel, control_out and control_inf
    respectively.
  - function genSimData removed completely as it is not used anywhere in
    the package.
  - argument maxLik_method renamed to maxlik_method in the control_sel
    function.
  - control_out function:
      - predictive_match renamed to pmm_match_type to align with the PMM
        (Predictive Mean Matching) estimator naming convention, where
        all related parameters start with pmm_
  - control_sel function:
      - argument method removed as it was not used
      - argument est_method_sel renamed to est_method
      - argument h renamed to gee_h_fun to make this more readable to
        the user
      - start_type now accepts only zero and mle (for gee models only).
  - control_inf function:
      - bias_inf renamed to vars_combine and type changed to logical.
        TRUE if variables (its levels) should be combined after variable
        selection algorithm for the doubly robust approach.
      - pi_ij -- argument removed as it is not used.
  - nonprobsvy class renamed to nonprob and all related method adjusted
    to this change
  - functions logit_model_nonprobsvy, probit_model_nonprobsvy and
    cloglog_model_nonprobsvy removed in the favour of more readable
    method_ps function that specifies the propensity score model
  - new option control_inference=control_inf(vars_combine=TRUE) which
    allows doubly robust estimator to combine variables prior estimation
    i.e. if selection=~x1+x2 and y~x1+x3 then the following models are
    fitted selection=~x1+x2+x3 and y~x1+x2+x3. By default we set
    control_inference=control_inf(vars_combine=FALSE). Note that this
    behaviour is assumed independently from variable selection.
  - argument nonprob(weights=NULL) replaced to
    nonprob(case_weights=NULL) to stress that this refer to case weights
    not sampling or other weights in non-probability sample

Features

  - two additional datasets have been included: jvs (Job Vacancy Survey;
    a probability sample survey) and admin (Central Job Offers Database;
    a non-probability sample survey). The units and auxiliary variables
    have been aligned in a way that allows the data to be integrated
    using the methods implemented in this package.
  - a check_balance function was added to check the balance in the
    totals of the variables based on the weighted weights between the
    non-probability and probability samples.
  - citation file added.
  - argument na_action with default na.omit
  - new generic methods added:
      - weights -- returns IPW weights
      - update -- allows to update the nonprob class object
  - new functions added and exported:
      - method_ps -- for modelling propensity score
      - method_glm -- for modelling y using glm function
      - method_nn -- for the NN method
      - method_pmm -- for the PMM method
      - method_npar -- for the non-parametric method
  - new print.nonprob, summary.nonprob and print.nonprob_summary methods

> result_mi
A nonprob object
 - estimator type: mass imputation
 - method: glm (gaussian)
 - auxiliary variables source: survey
 - vars selection: false
 - variance estimator: analytic
 - population size fixed: false
 - naive (uncorrected) estimators:
   - variable y1: 3.1817
   - variable y2: 1.8087
 - selected estimators:
   - variable y1: 2.9498 (se=0.0420, ci=(2.8674, 3.0322))
   - variable y2: 1.5760 (se=0.0326, ci=(1.5122, 1.6399))

number of digits can be changed using print(x, digits) as shown below

> print(result_mi,2)
A nonprob object
 - estimator type: mass imputation
 - method: glm (gaussian)
 - auxiliary variables source: survey
 - vars selection: false
 - variance estimator: analytic
 - population size fixed: false
 - naive (uncorrected) estimators:
   - variable y1: 3.18
   - variable y2: 1.81
 - selected estimators:
   - variable y1: 2.95 (se=0.04, ci=(2.87, 3.03))
   - variable y2: 1.58 (se=0.03, ci=(1.51, 1.64))

> summary(result_mi) |> print(digits=2)
A nonprob_summary object
 - call: nonprob(data = subset(population, flag_bd1 == 1), outcome = y1 + 
    y2 ~ x1 + x2, svydesign = sample_prob)
 - estimator type: mass imputation
 - nonprob sample size: 693011 (69.3%)
 - prob sample size: 1000 (0.1%)
 - population size: 1000000 (fixed: false)
 - detailed information about models are stored in list element(s): "outcome"
----------------------------------------------------------------
 - distribution of outcome residuals:
   - y1: min: -4.79; mean: 0.00; median: 0.00; max: 4.54
   - y2: min: -4.96; mean: -0.00; median: -0.07; max: 12.25
 - distribution of outcome predictions (nonprob sample):
   - y1: min: -2.72; mean: 3.18; median: 3.04; max: 16.28
   - y2: min: -1.55; mean: 1.81; median: 1.58; max: 13.92
 - distribution of outcome predictions (prob sample):
   - y1: min: -0.46; mean: 2.95; median: 2.84; max: 10.31
   - y2: min: -0.58; mean: 1.58; median: 1.39; max: 7.87
----------------------------------------------------------------

Bugfixes

  - basic methods and functions related to variance estimation, weights
    and probability linking methods have been rewritten in a more
    optimal and readable way.

Other

  - more informative error messages added.
  - documentation improved.
  - switching completely to snake_case.
  - extensive cleaning of the code.
  - more unit-tests added.
  - new dependencies: formula.tools

Documentation

  - annotation has been added that argument strata is not supported for
    the time being.

Replication materials

  - to verify the quality of the software please refer to the
    replication materials available here:
    https://github.com/ncn-foreigners/software-tutorials

                 Changes in version 0.1.1 (2024-11-14)                  

Bugfixes

  - bug Fix occurring when estimation was based on auxiliary variable,
    which led to compression of the data from the frame to the vector.
  - bug Fix related to not passing maxit argument from controlSel
    function to internally used nleqslv function
  - bug Fix related to storing vector in model_frame when predicting
    y_hat in mass imputation glm model when X is based in one auxiliary
    variable only - fix provided converting it to data.frame object.

Features

  - added information to summary about quality of estimation basing on
    difference between estimated and known total values of auxiliary
    variables
  - added estimation of exact standard error for k-nearest neighbor
    estimator.
  - added breaking change to controlOut function by switching values for
    predictive_match argument. From now on, the predictive_match = 1
    means $\hat{y}-\hat{y}$ in predictive mean matching imputation and
    predictive_match = 2 corresponds to $\hat{y}-y$ matching.
  - implemented div option when variable selection (more in
    documentation) for doubly robust estimation.
  - added more insights to nonprob output such as gradient, hessian and
    jacobian derived from IPW estimation for mle and gee methods when
    IPW or DR model executed.
  - added estimated inclusion probabilities and its derivatives for
    probability and non-probability samples to nonprob output when IPW
    or DR model executed.
  - added model_frame matrix data from probability sample used for mass
    imputation to nonprob when MI or DR model executed.

Unit tests

  - added unit tests for variable selection models and mi estimation
    with vector of population totals available

                 Changes in version 0.1.0 (2024-04-04)                  

Features

  - implemented population mean estimation using doubly robust, inverse
    probability weighting and mass imputation methods
  - implemented inverse probability weighting models with Maximum
    Likelihood Estimation and Generalized Estimating Equations methods
    with logit, complementary log-log and probit link functions.
  - implemented generalized linear models, nearest neighbours and
    predictive mean matching methods for Mass Imputation
  - implemented bias correction estimators for doubly-robust approach
  - implemented estimation methods when vector of population
    means/totals is available
  - implemented variables selection with SCAD, LASSO and MCP
    penalization equations
  - implemented analytic and bootstrap (with parallel computation -
    doSNOW package) variance for described estimators
  - added control parameters for models
  - added S3 methods for object of nonprob class such as
      - nobs for samples size
      - pop.size for population size estimation
      - residuals for residuals of the inverse probability weighting
        model
      - cooks.distance for identifying influential observations that
        have a significant impact on the parameter estimates
      - hatvalues for measuring the leverage of individual observations
      - logLik for computing the log-likelihood of the model,
      - AIC (Akaike Information Criterion) for evaluating the model
        based on the trade-off between goodness of fit and complexity,
        helping in model selection
      - BIC (Bayesian Information Criterion) for a similar purpose as
        AIC but with a stronger penalty for model complexity
      - confint for calculating confidence intervals around parameter
        estimates
      - vcov for obtaining the variance-covariance matrix of the
        parameter estimates
      - deviance for assessing the goodness of fit of the model

Unit tests

  - added unit tests for IPW estimators.

Github repository

  - added automated R-cmd check

Documentation

  - added documentation for nonprob function.