Hypothesis Testing

Learning Objectives

After this unit, students should be able to

describe the framework of hypothesis testing.
explain the significance of the $p$-value.
formulate null and alternative hypotheses.
conduct hypothesis tests.
interpret the result of the test.

Imagine an e-commerce enterprise that wants to know whether the mean purchase amount on its site differs significantly from a predefined target value of $\$50$. It gathers a random sample of $50$ transactions and finds that the average purchase value is $\$52.5$; the population standard deviation is known to be $\$8$. Is it statistically sound to claim that the mean purchase amount on the platform really differs from the target? Or is this observation merely a result of the sampling process?

Can you solve this problem using confidence intervals?

Let's assume that the widely accepted average is $\mu = 50$.
Given quantities: $\bar{X} = 52.5, n = 50, \sigma = 8$.
Let's construct a $95\%$ confidence interval around the population mean:

\[ \begin{aligned} Pr\left[\left(50 - 1.96 \frac{8}{\sqrt{50}} \right) \leq \bar{X} \leq \left(50 + 1.96 \frac{8}{\sqrt{50}}\right) \right] &=0.95\\ Pr[47.78 \leq \bar{X} \leq 52.22] &= 0.95 \end{aligned} \]

The confidence interval says that if the population mean is truly $50$, then in $95\%$ of the samples of size $50$ drawn from the population the sample mean would lie between $47.78$ and $52.22$.
Since the observed sample mean $52.5$ is outside this interval, it is unlikely that the population mean is $50$.
We infer that the population mean is significantly different from $\$50$, accepting up to a $5\%$ chance of reaching this conclusion when the population mean is in fact $\$50$.
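The interval check above can be reproduced in a few lines. The following is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the original material.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 50.0        # assumed population mean
sigma = 8.0        # known population standard deviation
n = 50             # sample size
x_bar = 52.5       # observed sample mean
confidence = 0.95  # confidence level

# Standard deviation of the sample mean (standard error).
std_error = sigma / sqrt(n)

# Two-sided critical value of the standard normal (about 1.96 for 95%).
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

lower = mu_0 - z_crit * std_error
upper = mu_0 + z_crit * std_error
inside = lower <= x_bar <= upper

print(f"interval: [{lower:.2f}, {upper:.2f}]")
print(f"sample mean inside interval: {inside}")
```

Changing `confidence` to `0.99` widens the interval, which makes it harder for the same sample mean to fall outside it.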

Formalism

Hypothesis testing is a framework for conducting statistical inference. Each test starts by identifying the hypotheses in the experiment. They are termed as follows:

Null Hypothesis ($\mathcal{H}_0$). It is a well-established belief that you want to challenge through the experiment.
Alternative Hypothesis ($\mathcal{H}_1$). It is a belief, contrary to the null hypothesis, that the experiment wants to test.

The result of a hypothesis test fits in one of the cells of the confusion matrix given below. A wrong inference leads to one of two kinds of errors, termed Type I error and Type II error; their probabilities are denoted by $\alpha$ and $\beta$ respectively. Analysts conduct hypothesis tests with a predetermined Type I error rate, typically set at 5%. This allows the analyst to quantify the level of acceptable error in their decision-making process when they reject the null hypothesis. $\alpha$ is widely known as the significance level.

	$\mathcal{H}_0$ is True	$\mathcal{H}_0$ is False
Accept $\mathcal{H}_0$	No error	Type II error ($\beta$)
Reject $\mathcal{H}_0$	Type I error ($\alpha$)	No error

We never accept $\mathcal{H}_0$!

The aim of hypothesis testing is to challenge the established belief. If the analyst rejects $\mathcal{H}_0$, the probability of an error in that judgement is upper bounded by the preset $\alpha$. If the analyst is unable to reject the established belief, we give them the benefit of the doubt and attribute the failure to the particular sample chosen for the experiment. Therefore, it is a convention to say "Do not have sufficient evidence to reject $\mathcal{H}_0$" instead of "Accept $\mathcal{H}_0$".
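To see what it means for the Type I error to be bounded by $\alpha$, one can simulate many experiments in which the null hypothesis really is true and count how often it is rejected anyway. The sketch below assumes a normally distributed population with the parameters of the running example and a two-tailed test; the function name, the number of trials, and the seed are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def type_one_error_rate(mu_0=50.0, sigma=8.0, n=50, alpha=0.05,
                        trials=10_000, seed=0):
    """Estimate how often a true null hypothesis gets rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_error = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population in which H0 really holds.
        sample = [rng.gauss(mu_0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu_0) / std_error
        # Two-tailed rejection rule at significance level alpha.
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = type_one_error_rate()
print(f"estimated Type I error rate: {rate:.3f}")
```

The estimated rate should come out close to the chosen $\alpha$, up to simulation noise.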

Relationship between Confidence and Significance

Astute readers may have sensed some relationship between confidence level $\gamma$ and significance level $\alpha$ based on the choices of $0.95$ and $0.05$ respectively. They are indeed related to each other by the following equation; but can we logically reason about it?

\[ \gamma = 1 - \alpha \]

Let's revisit the solution for the problem at the beginning of this unit using the confidence interval. When we constructed a $95\%$ confidence interval around the population mean, we assumed the correctness of the established belief. According to this assumption, $95\%$ of samples drawn from the population should have their sample means fall within the confidence interval. However, if the sample mean for a particular sample falls outside the confidence interval (as was the case in our example), it suggests that the established belief is likely to be incorrect. Consequently, we reject the established belief, acknowledging a quantified Type I error rate of $5\%$.

We can indeed use a confidence interval to perform hypothesis testing. The region outside the confidence interval is termed the critical region. If the observation lies in the critical region, we reject the null hypothesis.

$p$-value

The $p$-value is the probability of obtaining a result at least as extreme as the observation, under the assumption that the null hypothesis is correct. A pre-chosen significance level determines an acceptable bound on the Type I error. Thus, inference for the hypothesis testing is performed as follows:

$p \leq \alpha$: Reject the null hypothesis.
$p > \alpha$: Insufficient evidence to reject the null hypothesis.
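The $p$-value rule and the critical-region rule lead to the same decision. The sketch below checks this for a handful of hypothetical sample means, assuming a two-tailed test on a mean with known standard deviation (the setting of the running example); the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_p_value(x_bar, mu_0, sigma, n):
    """p-value of a two-tailed z-test on the mean."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

def in_critical_region(x_bar, mu_0, sigma, n, alpha):
    """True if x_bar lies outside the (1 - alpha) interval around mu_0."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * sigma / sqrt(n)
    return abs(x_bar - mu_0) >= margin

alpha = 0.05
results = []
for x_bar in (49.0, 51.0, 52.0, 52.5, 53.0):
    p = two_tailed_p_value(x_bar, 50.0, 8.0, 50)
    reject_by_p = p <= alpha
    reject_by_region = in_critical_region(x_bar, 50.0, 8.0, 50, alpha)
    results.append((x_bar, reject_by_p, reject_by_region))
    print(f"x_bar={x_bar}: p={p:.4f}, "
          f"reject by p-value={reject_by_p}, reject by region={reject_by_region}")
```

Both columns agree for every sample mean, which is the relationship $\gamma = 1 - \alpha$ seen from the decision side.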

One-Tailed versus Two-Tailed Hypothesis

Let's say we want to test the average value of a certain quantity. There are three variants of the hypotheses, as shown in the table below. The first variant is called a two-tailed test, whereas the second and the third variants are called one-tailed tests. The names are derived from the shape of the critical region for each of the hypotheses.

Established Notion	Hypotheses	$p$-value
The average is $a$. Experiment on a sample yields the sample mean $\bar{X} = t$.	$\mathcal{H}_0: \mu = a, ~ \mathcal{H}_1: \mu \neq a$	$2 * Pr[\bar{X} > t]$ if $t > a$, or $2 * Pr[\bar{X} < t]$ if $t < a$
The average is at most $a$. Experiment on a sample yields the sample mean $\bar{X} = t$.	$\mathcal{H}_0: \mu \leq a, ~ \mathcal{H}_1: \mu > a$	$Pr[\bar{X} > t]$
The average is at least $a$. Experiment on a sample yields the sample mean $\bar{X} = t$.	$\mathcal{H}_0: \mu \geq a, ~ \mathcal{H}_1: \mu < a$	$Pr[\bar{X} < t]$
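The three rows of the table translate into one small function. This is a sketch under the assumption that the sample mean is normally distributed with known standard deviation; the function name and the `alternative` labels are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(t, a, sigma, n, alternative="two-sided"):
    """p-value for an observed sample mean t when H0 puts the mean at a.

    alternative: "two-sided" (H1: mu != a), "greater" (H1: mu > a)
                 or "less" (H1: mu < a).
    """
    # Sampling distribution of the sample mean under H0.
    sampling_dist = NormalDist(mu=a, sigma=sigma / sqrt(n))
    lower_tail = sampling_dist.cdf(t)      # Pr[X_bar < t]
    upper_tail = 1 - lower_tail            # Pr[X_bar > t]
    if alternative == "two-sided":
        # Double the tail on the side where the observation fell.
        return 2 * min(lower_tail, upper_tail)
    if alternative == "greater":
        return upper_tail
    if alternative == "less":
        return lower_tail
    raise ValueError(f"unknown alternative: {alternative!r}")

for alt in ("two-sided", "greater", "less"):
    print(alt, round(z_test_p_value(52.5, 50, 8, 50, alt), 4))
```

Taking the smaller of the two tails in the two-sided case picks the correct one of the two expressions in the first row of the table automatically.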

What is the $p$-value for the example above?

We assume that the null hypothesis holds: $\mu = 50$.
Given data: $\bar{X} = 52.5, \sigma = 8, n = 50.$
Using the central limit theorem, $\bar{X}$ is approximately normally distributed with mean $50$ and standard deviation $8/\sqrt{50}$.
The probability of observing an outcome at least as extreme as the observed one on the same side is $Pr[\bar{X} > 52.5] \approx 0.0136$.
This is a two-tailed test. Thus, $p$-value $= 2 * 0.0136 \approx 0.027$.
Since the $p$-value is less than $0.05$, we reject the null hypothesis.
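The tail probability used above comes from standardising the sample mean. As a worked sketch, with $Z$ denoting a standard normal variable (a symbol introduced here for illustration) and values rounded:

\[ \begin{aligned} z &= \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{52.5 - 50}{8/\sqrt{50}} \approx 2.21\\ Pr[\bar{X} > 52.5] &= Pr[Z > 2.21] \approx 0.0136\\ p\text{-value} &= 2 \cdot Pr[Z > 2.21] \approx 0.027 \end{aligned} \]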

Recipe of Hypothesis Testing

Set up the null and alternative hypotheses. Identify whether the test is two-tailed or one-tailed.
Choose a significance level $\alpha$.
Identify the test statistic and sampling distribution.
Compute $p$-value.
If the $p$-value is at most $\alpha$, reject the null hypothesis. Otherwise, there is a lack of evidence to reject the null hypothesis.
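The recipe can be sketched end to end for the simplest case, a test on a mean with known standard deviation. The class and function names below, and the `tails` labels, are hypothetical; other settings would need a different test statistic and sampling distribution in step 3.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist

@dataclass
class TestResult:
    z: float          # test statistic
    p_value: float
    reject_h0: bool

def run_z_test(x_bar, mu_0, sigma, n, alpha=0.05, tails="two"):
    # Step 1: H0 is fixed by mu_0; `tails` encodes the alternative
    #         ("two": mu != mu_0, "right": mu > mu_0, "left": mu < mu_0).
    # Step 2: the significance level is the `alpha` argument.
    # Step 3: the test statistic is the standardised sample mean, which
    #         follows a standard normal distribution under H0.
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    # Step 4: compute the p-value for the chosen alternative.
    if tails == "two":
        p = 2 * (1 - std_normal.cdf(abs(z)))
    elif tails == "right":
        p = 1 - std_normal.cdf(z)
    elif tails == "left":
        p = std_normal.cdf(z)
    else:
        raise ValueError(f"unknown tails: {tails!r}")
    # Step 5: reject H0 only if the p-value is at most alpha.
    return TestResult(z=z, p_value=p, reject_h0=p <= alpha)

result = run_z_test(x_bar=52.5, mu_0=50, sigma=8, n=50)
verdict = ("reject H0" if result.reject_h0
           else "insufficient evidence to reject H0")
print(f"z={result.z:.2f}, p={result.p_value:.3f}: {verdict}")
```

Note that the function never reports "accept"; when the $p$-value exceeds $\alpha$ it only reports a lack of evidence, in line with the convention described earlier.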
