Why do we need statistics?

Learning Objectives

After this unit, students should be able to

differentiate between population and sample.
explain the need for sampling.
appreciate the need for statistical inference.
explain the pitfalls of sampling bias.

Consider the following practical questions and the dilemmas they pose to the analyst.

A real estate agent wants to estimate the average rental price of the residential properties in a town. Is it practical for the agent to knock on the door of every household in the town and collect the rental prices? How faithful are those prices?
A manufacturing plant producing light-bulbs wants to estimate the average life of its light-bulbs. Is it pragmatic to burn every light-bulb produced in the plant in order to compute its life?
A pharmaceutical company wants to estimate the effectiveness of its new vaccine. Is it advisable to vaccinate every person in order to assess the effectiveness of the vaccine?

In everyday life, one may come across numerous such questions, whose solutions require field experiments. It is not always possible or practical to conduct these experiments. At times, the experiments may even raise ethical concerns. How do we answer such questions?

Statistical Inference

Let us take a detour and introduce some terminology.

Definitions

Population refers to the entire group of individuals, objects or events to be studied.
Sample refers to any subset of the population.
Statistic is any quantity computed on the sample. Mathematically, a statistic is any real-valued function over the sample.

In the context of the real-estate agent, the population consists of the rental prices of all apartments in the town. A sample, on the other hand, could consist of the rental prices of apartments with odd-numbered zip codes. The average and the maximum of the sampled prices are examples of statistics.
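The three definitions can be made concrete with a minimal Python sketch. The town, its 1,000 apartments, and the rent values are all hypothetical and generated synthetically; only the roles of population, sample, and statistic matter here.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: monthly rents (in dollars) of all 1,000 apartments
# in a town. The values are synthetic and only serve the illustration.
population = [round(rng.gauss(4000, 600)) for _ in range(1000)]

# A sample is any subset of the population; here, 50 apartments picked at random.
sample = rng.sample(population, 50)

# A statistic is any real-valued function of the sample.
sample_mean = statistics.mean(sample)
sample_max = max(sample)

print(f"sample mean = {sample_mean:.0f}, sample max = {sample_max}")
print(f"population mean = {statistics.mean(population):.0f}")
```

In practice the population mean on the last line is exactly the quantity the agent cannot compute directly; it is printed here only because the population is simulated.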

Statistical inference is the process of using statistical tools, which we will learn in the next couple of units, to make propositions about the population using measurements on the sample. These propositions are quantified statements: they come with a measure of their uncertainty. For instance, using a sample average of $4000, we may estimate the average rental price in the town to be $4000 with a 5% margin of error.

Although statistical inference provides a promising way to address such practical problems, we need to be aware of sampling bias. Statistical inference is faithful only if the sample truly represents the population. Incorrectly drawn samples can lead to wrong inferences. In the context of the real-estate agent, a sample that includes only the rental prices of apartments in a single block is a biased sample. If the block in question is close to the metro station, its apartments will naturally have higher rental prices, so they will not accurately represent all apartments in the town.
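The effect of a biased sample can be seen in a small simulation. The sketch below assumes a hypothetical town of 20 blocks with 50 apartments each, where block 0 stands for the block near the metro station and carries an assumed rent premium of $1,500; all numbers are synthetic.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical town: 20 blocks of 50 apartments each. Block 0 is assumed to be
# next to the metro station, so its rents are shifted upward by $1,500.
town = {}
for block in range(20):
    premium = 1500 if block == 0 else 0
    town[block] = [rng.gauss(4000 + premium, 400) for _ in range(50)]

population = [rent for rents in town.values() for rent in rents]

biased_sample = town[0]                     # every apartment in a single block
random_sample = rng.sample(population, 50)  # 50 apartments from the whole town

pop_mean = statistics.mean(population)
biased_mean = statistics.mean(biased_sample)
random_mean = statistics.mean(random_sample)

print(f"population mean:    {pop_mean:.0f}")
print(f"single-block mean:  {biased_mean:.0f}")
print(f"random-sample mean: {random_mean:.0f}")
```

Both samples contain 50 apartments, yet the single-block average lands far above the town-wide average, while the randomly drawn sample stays much closer to it. Sample size alone does not fix a sample that is drawn from the wrong place.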

i.i.d. sample

In statistics, data is often assumed to consist of i.i.d. samples drawn from the population. i.i.d. stands for independent and identically distributed. It means that every sample follows the same (identical) underlying population distribution, and that the draw of a sample from the population does not depend on the draw of any previously drawn sample.
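One simple way to obtain i.i.d. draws from a finite population is to draw with replacement. The sketch below is an illustration under assumptions: the population of light-bulb lifetimes is synthetic, and the exponential shape with a mean of 1,000 hours is chosen only for convenience.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: lifetimes (in hours) of 10,000 light-bulbs.
# The exponential distribution with mean 1,000 hours is an assumption.
population = [rng.expovariate(1 / 1000) for _ in range(10_000)]

# Drawing with replacement: every draw comes from the same population
# distribution and does not depend on earlier draws, so the draws are i.i.d.
iid_sample = rng.choices(population, k=100)

# Drawing without replacement from a finite population: each draw removes an
# item, so later draws depend (slightly) on the earlier ones.
without_replacement = rng.sample(population, 100)

print(len(iid_sample), len(without_replacement))
```

When the population is much larger than the sample, as it is here, the dependence introduced by drawing without replacement is small, which is one reason the i.i.d. assumption is often treated as a reasonable approximation.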

Diagnostic

How would you define the population in the examples of the manufacturing plant and the pharmaceutical company? Can you suggest some samples and statistics for each of these examples?