Simulation Study – Coverage Probability of 95% Confidence Intervals of the Horvitz-Thompson Estimator

2 minute read

This short simulation study examines the coverage probability of 95% confidence intervals of the Horvitz-Thompson estimator of the population mean under simple random sampling without replacement.

95% Confidence Intervals

The Horvitz-Thompson estimator of the population mean:

\[\begin{align*} \bar{y} = \frac{1}{n}\sum_{k \in S} y_k \end{align*}\]

where \(S\) is the set of samples and \(n\) is the sample size.

A corresponding 95% confidence interval:

\[\begin{align*} \left( \bar{y} \pm 1.96\sqrt{\hat{Var}[\bar{y}]} \right) \end{align*}\]

where the variance estimator is

\[\begin{align*} \hat{Var}[\bar{y}] = \left(1 - \frac{n}{N} \right) \frac{s_y^2}{n} \end{align*}\]

and \(s_y^2\) is the sample variance.

Simulation

I start with generating the following populations of survey variable \(y\). Each population is of size N = 10000.

  1. N(10, 25)

  2. Weibull(shape = 0.5, scale = 4)

The simulation contains R = 50000 replication.

simulation_function.R contains a function to execute this simulation.

source("simulation_function.R")

Set the population size.

N = 10000

Simulation: N(10, 25)

set.seed(123)
y.population.1 = rnorm(N, mean = 10, sd = 5)
output.1 = simulation(y.population.1, R = 50000)
output.1
sample.size Rel.Bias.Point.Est Var.Point.Est Rel.Bias.Var.Est Coverage.Prob
1 n=10 0.556 25.574 0.27 0.922
2 n=25 0.116 10.176 -0.104 0.941
3 n=50 0.104 5.067 -0.264 0.947
4 n=100 -0.016 2.524 -0.121 0.949
5 n=500 0.031 0.485 0.042 0.951

Simulation: Weibull(shape = 0.5, scale = 1)

set.seed(123)
y.population.2 = rweibull(N, shape = 0.5, scale = 4)
output.2 = simulation(y.population.2, R = 50000)
output.2
sample.size Rel.Bias.Point.Est Var.Point.Est Rel.Bias.Var.Est Coverage.Prob
1 n=10 0.704 31.57 1.839 0.733
2 n=25 -0.693 12.117 -2.132 0.818
3 n=50 -0.104 6.151 -0.395 0.867
4 n=100 -0.095 3.060 -0.402 0.901
5 n=500 0.095 0.591 0.305 0.938

Observations

  • As sample size \(n\) increases, in general, the magnitudes of the relative biases (the second and the fourth columns) of both point estimators \(\bar{y}\) and \(\hat{Var}[\bar{y}]\) decrease and approach 0. This phenomenon is consistent with law of large numbers. The magnitudes of biases of both estimators shrinks towards zero as we collect more samples.

  • The point estimates from \(\hat{Var}[\bar{y}]\) decrease as sample size \(n\) increases, reflecting the reduction in the typical error of \(\bar{y}\) as we collect more samples. In other words, \(\bar{y}\) is more accurate with more samples (note that \(\bar{y}\) is unbiased, so variance is equal to MSE). This observation is consistent with the above point.

  • The coverage probability of the two-sided confidence interval of \(\bar{Y}\) increases and approaches 0.95 as \(n\) increases. That is, as we collect more samples, in the simulation, the proportion of the confidence intervals computed containing the population mean increases with the proportion approaching 95%. This observation is expected because of a central limit theorem and 1.96 being the 97.5% quantile of the standard normal distribution.

  • Note that as \(n\) increases, the width of the confidence interval decreases because of decreasing \(\hat{Var}[\bar{y}]\). Despite this, the coverage probability increases and approaches 95%. This phenomenon is consistent with the decreasing relative bias of \(\bar{y}\) and \(\bar{y}\) getting closer to the population mean.

  • When \(n = 10\), the coverage probability corresponding to the normal population is much higher than that to the Weibull population (0.92 vs. 0.81), because the Weibull population is heavily-skewed while the normal population is much closer to the standard normal.

  • As \(n\) increases, a central limit theorem starts to apply and hence coverage probabilities of both population approaches 0.95. Note that when \(n = 50\), the coverage probability corresponding to the normal population is very close to 0.95 (0.947) while that to the Weibull population is still 0.867. When \(n=500\), the coverage probability corresponding to the normal population reaches 0.95 while that to the Weibull is still 0.94. Therefore, it takes much more samples for the Weibull distribution to achieve 0.95 through a central limit theorem.