Interval Estimation

Learning Objectives

After this unit, students should be able to

describe the idea of confidence intervals.
estimate confidence intervals for the presented use case.

Motivation

Consider the following example. As a student of IT5006, your assignment is to conduct a survey and find the average salary of graduates at NUS. What methodology would you follow? What assumptions would it rest on?

Let's assume that the graduate salary is a random variable that is sampled from an unknown distribution with mean \(\mu\) and standard deviation \(\sigma\).
Since it is not feasible to interview every graduate student, we randomly choose a sufficiently large sample.
Compute the average salary of the sampled students, i.e. the sample mean.

Can we directly report this sample mean to be a proxy for the population mean? Would you say this with absolute certainty?

Interval Estimation

We have been in pursuit of estimating the properties of a population using observations from samples. Our study of sampling distributions in Unit 8 taught us that the sample mean serves as a reliable approximation, in expectation, for the population mean. However, the term "in expectation" warrants clarification. It means that if we were to draw (infinitely!) many samples from the population and compute the mean of each, the average of those sample means would converge to the population mean. In practice, we typically work with observations from a single sample. So, what are the odds that this one sample yields a value close to the true unknown quantity? Can we offer a more quantified statement?
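The meaning of "in expectation" can be checked with a small simulation. The sketch below is illustrative only: the exponential population, its mean of 4000, the sample size and the number of repetitions are all assumed values, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: exponentially distributed salaries (assumption).
population_mean = 4000.0
n = 50                # observations per sample
num_samples = 20000   # number of repeated samples

# Each row is one sample of n observations; take the mean of every row.
samples = rng.exponential(scale=population_mean, size=(num_samples, n))
sample_means = samples.mean(axis=1)

# A single sample mean may be noticeably off ...
first_sample_mean = sample_means[0]
# ... while the average of many sample means sits close to the population mean.
average_of_means = sample_means.mean()

print("one sample mean:        ", round(first_sample_mean, 1))
print("average of sample means:", round(average_of_means, 1))
```

The gap between the two printed numbers is exactly what an interval estimate tries to quantify for the single sample we actually have.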

Definition: Confidence Interval

\([a, b]\) is said to be a \(\gamma\%\) confidence interval for a random variable \(X\) if and only if

\[ Pr[a \leq X \leq b] = (\gamma / 100) \]

Let us solve the problem stated earlier in this unit.

The salaries of graduates should not significantly vary within a cohort. Thus we assume a known standard deviation \(\sigma\).
The central limit theorem tells us that, for a sufficiently large sample of size \(n\), the sample mean \(\bar{X}\) approximately follows a normal distribution.
By the central limit theorem, the standardised sample mean therefore satisfies (approximately)

\[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1) \]

A \(95\%\) confidence interval for the population mean, centred on the sample average salary, can be derived as:

\[ \begin{aligned} Pr\left[a \leq \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \leq b\right] &= 0.95 \\ Pr\left[\left(\bar{X} - b \frac{\sigma}{\sqrt{n}} \right) \leq \mu \leq \left(\bar{X} - a \frac{\sigma}{\sqrt{n}}\right) \right] &=0.95 \end{aligned} \]

We know the values of all the quantities needed to compute the confidence interval (\(\bar{X}\), \(\sigma\) and \(n\)) except \(a\) and \(b\). How do we compute \(a\) and \(b\)?

Computing the confidence interval

The choice of \(a\) and \(b\) is not unique, and the computation depends on the sampling distribution. We show the procedure for the case where the sampling distribution is the standard normal distribution. Computing a \(95\%\) confidence interval means that the area under the density curve between \(a\) and \(b\) is \(0.95\). Since the standard normal distribution is symmetric, the conventional choice is to split the remaining area of \(0.05\) equally between the two tails, as shown in the diagram below.

[Figure: confidence interval]

If \(F^{-1}\) denotes the inverse cumulative distribution function (refer to Unit 7) of the standard normal distribution, \(a\) and \(b\) can be computed as follows:

\[ \begin{aligned} a &= F^{-1}(0.025) = -1.96\\ b &= F^{-1}(1 - 0.025) = F^{-1}(0.975) = 1.96 \end{aligned} \]

We can use the following snippet to compute \(a\) and \(b\) in Python.

>>> from scipy.stats import norm
>>> mu, sigma = 0, 1
>>> norm.interval(0.95, loc = mu, scale = sigma)
(-1.959963984540054, 1.959963984540054)
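With the two quantiles in hand, the interval for the mean salary follows by scaling them with \(\sigma/\sqrt{n}\) and centring on the sample mean. The sketch below assumes a known \(\sigma\), as in this unit; the helper name `mean_confidence_interval` and the numbers (a sample mean of 4200, \(\sigma = 800\), \(n = 100\)) are hypothetical and chosen only for illustration.

```python
import math
from scipy.stats import norm

def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Interval for the population mean when sigma is assumed known."""
    tail = (1.0 - confidence) / 2.0          # area left in each tail
    z = norm.ppf(1.0 - tail)                 # upper quantile of N(0, 1)
    half_width = z * sigma / math.sqrt(n)    # z * sigma / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical survey summary (illustrative values, not real data).
sample_mean, sigma, n = 4200.0, 800.0, 100
low, high = mean_confidence_interval(sample_mean, sigma, n)
print(round(low, 1), round(high, 1))

# The same interval via norm.interval, using the standard error as the scale.
low2, high2 = norm.interval(0.95, loc=sample_mean, scale=sigma / math.sqrt(n))
```

Passing `loc` and `scale` to `norm.interval` simply performs the shift and scaling in one call, so both routes should agree up to floating-point error.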

Analysing the confidence interval

The confidence interval offers a quantitative statement about the inference of population parameters from the sample observation. A \(95\%\) confidence interval signifies that, across repeated samples drawn from the population, about \(95\%\) of the intervals constructed in this way contain the population mean. It indirectly puts an upper bound on the error (in this case, \(5\%\)) made while estimating the population parameter. Confidence intervals are often misinterpreted as intervals that contain the data points with probability \(\gamma\%\).
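The repeated-sampling reading of a \(95\%\) interval can be checked by simulation. This is a minimal sketch under assumed values: a normal population with \(\mu = 4000\) and known \(\sigma = 800\), samples of size 100, and 10000 repetitions, none of which come from the survey itself.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=1)

# Hypothetical population parameters (assumptions for the simulation).
mu, sigma = 4000.0, 800.0
n, num_trials = 100, 10000

z = norm.ppf(0.975)                    # about 1.96
half_width = z * sigma / np.sqrt(n)    # same for every trial, since sigma is known

# One sample mean per trial.
sample_means = rng.normal(loc=mu, scale=sigma, size=(num_trials, n)).mean(axis=1)

# Does each trial's interval contain the true mean?
covered = (sample_means - half_width <= mu) & (mu <= sample_means + half_width)
coverage = covered.mean()
print("fraction of intervals containing mu:", coverage)
```

The printed fraction should land near 0.95. Note that it counts intervals that contain \(\mu\), not data points that fall inside an interval.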

The confidence interval depends not only on the confidence level but also on the sample size. Because the half-width of the interval is proportional to \(\sigma/\sqrt{n}\), smaller sample sizes correspond to wider confidence intervals (a looser bound on the estimate), whereas larger sample sizes correspond to narrower confidence intervals.
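Since the interval has width \(2 z \sigma/\sqrt{n}\), where \(z\) is the upper quantile for the chosen confidence level, its dependence on \(n\) can be tabulated directly. The sketch below uses a hypothetical \(\sigma = 800\) and a helper named `interval_width` introduced only for this illustration.

```python
import math
from scipy.stats import norm

def interval_width(sigma, n, confidence=0.95):
    """Width of the known-sigma interval for the mean: 2 * z * sigma / sqrt(n)."""
    z = norm.ppf(1.0 - (1.0 - confidence) / 2.0)
    return 2.0 * z * sigma / math.sqrt(n)

sigma = 800.0  # hypothetical known standard deviation
widths = {n: interval_width(sigma, n) for n in (25, 100, 400)}

for n, width in widths.items():
    print(f"n = {n:4d}  width = {width:.1f}")

# A higher confidence level widens the interval for the same n.
width_99 = interval_width(sigma, 100, confidence=0.99)
```

Quadrupling the sample size halves the width, and raising the confidence level at a fixed sample size widens the interval.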