Calibration weighting is a method commonly used in survey sampling to adjust original design weights for sampled elements to reproduce known population totals for all auxiliary variables (Deville and Särndal 1992). Following the calibration paradigm, it can also be used to reproduce known population quantiles for all benchmark variables (Harms and Duchesne 2006). This technique is also used in surveys to compensate for nonsampling errors, such as nonresponse or coverage errors (Särndal and Lundström 2005). By appropriately adjusting the weights, it is not only possible to ensure consistency with known structures for key variables from other data sources, such as censuses or registers, but also to reduce the bias and improve the precision of final estimates. %Calibration weighting is also used in surveys in which the analysed features have asymmetric distributions (in the presence of outliers) to compensate for their negative impact on the final estimates: calibration weights provide robustness while meeting constraints on the calibration variables and the weights (Duchesne 1999).
In this article, we propose a joint calibration approach to estimate the total or quantile of order α for the variable of interest y. Final calibration weights wk reproduce known population totals and quantiles for all auxiliary variables. At the same time, they help to reduce the bias and improve the precision of estimates. The proposed method is based on the classic approach to calibration and simultaneously takes into account calibration equations for totals and quantiles of all auxiliary variables.
In most applications the goal is to estimate a finite population total τy = ∑k ∈ Uyk or the mean τ̄y = τy/N of the variable of interest y, where U is the population of size N. The well-known estimator of a finite population total is the Horvitz-Thompson estimator, which is expressed as $\hat{\tau}_{y\pi}=\sum_{k=1}^{n}d_{k}y_{k}=\sum_{k\in s}{d_{k}y_{k}}$, where s denotes a probability sample of size n, dk = 1/πk is a design weight and πk is the first-order inclusion probability of the i-th element of the population U. This estimator is unbiased for τY i.e. E(τ̂yπ) = τY.
Let xk∘ be a J1-dimensional vector of auxiliary variables (benchmark variables) for which τx = ∑k ∈ Uxk∘ = (∑k ∈ Uxk1, …, ∑k ∈ UxkJ1)T is assumed to be known. In most cases in practice the dk weights do not reproduce known population totals for benchmark variables xk∘. It means that the resulting estimate τ̂xπ = ∑k ∈ sdkxk∘ is not equal to τx. The main idea of calibration is to look for new calibration weights wk which are as close as possible to original design weights dk and reproduce known population totals τx exactly. In other words, in order to find new calibration weights wk we have to minimise a distance function $D\left(\mathbf{d},\mathbf{v}\right)=\sum _{k\in s}d_{k}\hspace{2pt} G\hspace{0pt}\left(\frac{v_{k}}{d_{k}}\right) \to \textrm{min}$ to fulfil calibration equations ∑k ∈ svkxk∘ = ∑k ∈ Uxk∘, where d = (d1, …, dn)T, v = (v1, …, vn)T and G(⋅) is a function which must satisfy some regularity conditions: G(⋅) is strictly convex and twice continuously differentiable, G(⋅) ≥ 0, G(1) = 0, G′(1) = 0 and G″(1) = 1. Examples of G(⋅) functions are given by Deville and Särndal (1992). For instance, if $G\left(x\right)=\frac{\left(x-1\right)^{2}}{2}$, then using the method of Lagrange multipliers the final calibration weights wk can be expressed as wk = dk + dk(τx − τ̂xπ)T(∑j ∈ sdjxj∘xj∘T)−1xk∘. It is worth adding that in order to avoid negative or large wk weights in the process of minimising the D(⋅) function, one can consider some boundary constraints $L\leq \frac{w_{k}}{d_{k}}\leq U$, where 0 ≤ L ≤ 1 ≤ U, k = 1, …, n. The final calibration estimator of a population total τy can be expressed as τ̂yx = ∑k ∈ swkyk, where wk are calibration weights obtained under a specific chosen G(⋅) function.
Harms and Duchesne (2006) considered the estimation of quantiles using the calibration approach in a way very similar to the what Deville and Särndal (1992) proposed for a finite population total τy. By analogy, in their approach it is not necessary to know values for all auxiliary variables for all units in the population. It is enough to know the corresponding quantiles for the benchmark variables. We will briefly discuss the problem of finding calibration weights in this setup.
We want to estimate a quantile Qy, α of order α ∈ (0, 1) of the variable of interest y, which can be expressed as Qy, α = inf{t|Fy(t) ≥ α}, where Fy(t) = N−1∑k ∈ UH(t − yk) and the Heavyside function is given by
We assume that Qx, α = (Qx1, α, …, QxJ2, α)T is a vector of known population quantiles of order α for a vector of auxiliary variables xk*, where α ∈ (0, 1) and xk* is a J2-dimensional vector of auxiliary variables. It is worth noting that, in general, the numbers J1 and J2 of the auxiliary variables are different. It may happen that for a specific auxiliary variable its population total and the corresponding quantile of order α will be known. However, in most cases quantiles will be known for continuous auxiliary variables, unlike totals, which will generally be known for categorical variables. In order to find new calibration weights wk which reproduce known population quantiles in a vector Qx, α, an interpolated distribution function estimator of Fy(t) is defined as $\hat{F}_{y,cal}(t)=\frac{\sum_{k \in s} w_{k} H_{y, s}\left(t, y_{k}\right)}{\sum_{k \in s} w_{k}}$, where the Heavyside function in formula () is replaced by the modified function Hy, s(t, yk) given by
where Ly, s(t) = max {{yk, k ∈ s ∣ yk ≤ t} ∪ {−∞}}, Uy, s(t) = min {{yk, k ∈ s ∣ yk > t} ∪ {∞}} and $\beta_{y, s}\left(t\right)=\frac{t-L_{y, s}\left(t\right)}{U_{y, s}\left(t\right)-L_{y, s}\left(t\right)}$ for k = 1, …, n, t ∈ ℝ. A calibration estimator of quantile Qy, α of order α for variable y is defined as Q̂y, cal, α = F̂y, cal−1(α), where a vector w = (w1, …, wn)T is a solution of optimization problem $D\left(\mathbf{d},\mathbf{v}\right)=\sum _{k\in s}d_{k}\hspace{2pt} G\hspace{0pt}\left(\frac{v_{k}}{d_{k}}\right) \to \textrm{min}$ subject to the calibration constraints ∑k ∈ svk = N and $\hat{\mathbf{Q}}_{\mathbf{x},cal,\alpha}=\left(\hat{Q}_{x_{1},cal,\alpha},\ldots,\hat{Q}_{x_{J_{2}},cal,\alpha}\right)^{T}=\mathbf{Q}_{\mathbf{x},\alpha}$ or equivalently F̂xj, cal(Qxj, α) = α, where j = 1, …, J2.
As in the previous case, if $G\left(x\right)=\frac{\left(x-1\right)^{2}}{2}$ then using the method of Lagrange multipliers the final calibration weights wk can be expressed as wk = dk + dk(Ta − ∑k ∈ sdkak)T(∑j ∈ sdjajajT)−1ak, where Ta = (N, α, …, α)T and the elements of ak = (1, ak1, …, akJ2)T are given by
with j = 1, …, J2.
Assuming that y1 ≤ y2… ≤ yn it can be shown that if there exists p ∈ {1, …, n − 1} such that F̂y, cal(yp) ≤ α, F̂y, cal(yp + 1) > α and F̂y, cal is invertible at point Q̂y, cal, α then the calibration estimator Q̂y, cal, α of quantile Qy, α of order α ∈ (0, 1) can be expressed as $\hat{Q}_{y,cal,\alpha}=y_{p}+\frac{N\alpha-\sum_{i=1}^{p}{w_{i}}}{w_{p+1}}\left(y_{p+1}-y_{p}\right)$.
We propose a simple method that jointly calibrates weights for totals and quantiles. The resulting calibrated weights wk will allow us to retrieve known population totals and quantiles of auxiliary variables simultaneously. In the case of a single scalar auxiliary variable x, the final calibration estimator based on weights wk delivers an exact population total and quantile for variable y when the relationship between y and x is exactly linear i.e. when yk = βxk for all k ∈ U.
Let us assume that we are interested in estimating a population total τy and/or quantile Qy, α of order α, where α ∈ (0, 1) for variable of interest y. Let $\mathbf{x}_{k}=\left(\begin{smallmatrix}\mathbf{x}_{k}^{\circ}\\1\\\mathbf{x}_{k}^{*}\end{smallmatrix}\right)$ be a J + 1-dimensional vector of auxiliary variables, where J = J1 + J2. We assume that for J1 variables a vector of population totals τx is known and for J2 variables a vector Qx, α of population quantiles is known. In practice it may happen that for the same auxiliary variable we know its population total and quantile. We do not require that the complete auxiliary information described by the vector xk is known for all k ∈ U; however, for some auxiliary variables unit-population data would be necessary, because accurate quantiles are not likely to be known from other sources (Särndal 2007).
Our main aim is to find new calibration weights wk which are as close as possible to original design weights dk and for some auxiliary variables reproduce known population totals and for the remaining benchmark variables – reproduce known population quantiles exactly. In our joint approach we are looking for a vector w = (w1, …, wn)T which is a solution of the optimization problem $D\left(\mathbf{d},\mathbf{v}\right)=\sum _{k\in s}d_{k}\hspace{2pt} G\hspace{0pt}\left(\frac{v_{k}}{d_{k}}\right) \to \textrm{min}$ subject to the calibration constraints ∑k ∈ svk = N, ∑k ∈ svkxk∘ = τx and $\hat{\mathbf{Q}}_{\mathbf{x},cal,\alpha}=\mathbf{Q}_{\mathbf{x},\alpha}$. In general, J + 1 calibration equations have to be fulfilled. Alternatively, calibration equations can be expressed as ∑k ∈ svkxk∘ = τx and ∑k ∈ svkak = Ta with possible boundary constraints on calibration weights.
Assuming a quadratic metric D(⋅), which is based on $G\left(x\right)=\frac{\left(x-1\right)^{2}}{2}$ function, an explicit solution of the above optimization problem can be derived. This solution is similar to the calibration weights for totals and quantiles. Let $\mathbf{h}_{\mathbf{x}}=\binom{\tau_{\mathbf{x}}}{\mathbf{T_{a}}}$ and $\hat{\mathbf{h}}_{\mathbf{x}}=\binom{\sum_{k\in s}d_{k}\mathbf{x}_{k}^{\circ}}{\sum_{k\in s}d_{k}\mathbf{a}_{k}}$. Then the vector of calibration weights w = (w1, …, wn)T which solves the above optimization problem satisfies the relation:
Remark 1: In our proposed method we assume that we reproduce known population totals and known population quantiles for a set of benchmark variables simultaneously. This approach can be easily extended by assuming that for estimated totals or quantiles are reproduced for some auxiliary variables. Moreover, we assumed that the process of calibration is based on a particular quantile (of order α). For instance, it could be a median i.e. α = 0.5. Nothing stands in the way of searching for calibration weights which reproduce population totals and a set of population quantiles (for example quartiles) for a chosen set of auxiliary variables. Moreover, the proposed method can be easily extended for the generalized calibration, in particular, for not missing at random non-response (Kott and Chang 2010), or empirical likelihood by adding additional constraints on quantiles, i.e. $\sum_{k \in s} p_{k}a_{kj} = \frac{\alpha}{N}$, where j = 1, …, J2 and pk are elements of the vector p = (p1, …, pn)T which is is a discrete probability measure over the sample s (Wu and Thompson 2020).
Remark 2: In our approach we use an interpolated distribution function Hy, s, which is a simple modification of the Heavyside function defined in (). From a practical point of view a smooth approximation to the step function, based on the logistic function can be used i.e. $H\left(x\right)\approx\frac{1}{2}+\frac{1}{2}\tanh{kx}=\frac{1}{1+e^{-2kx}}$, where a larger value of k corresponds to a sharper transition at x = 0.
Remark 3: The proposed method can be easily applied to data from household surveys, where integrated calibration is applied, i.e. when the weights of particular household members should be equal. This can be in particularly useful in the case of EU-SILC, where information from administrative data can be used to provide population distributions for auxiliary variables.