Descriptive Statistics

Learning Objectives

After this unit, students should be able to

describe the measures of location.
describe the measures of dispersion.
describe the measures of association.

A descriptive statistic is a summary quantifier for a collection of information. Mathematically, it is a real-valued function over a set of observations. Unlike the inferential statistics in Unit 3, descriptive statistics are not concerned with whether the set of observations constitutes a population or a sample. Descriptive statistics provide us with a toolbox for exploratory data analysis. Given a dataset, they help us summarise the data in terms of location (where most of the data is situated), dispersion (how spread out the data is) and association (how various attributes in the data are related to each other). Before we learn these measures, let us fix the notation that we will use in the course.

A population of a dataset \(X\) comprises \(N\) datapoints: \(x_1, x_2, x_3, \ldots, x_N\).
A sample is any subset of size \(n < N\).
We use Greek symbols (such as \(\mu, \sigma\)) to denote measures on the population.
We use Roman symbols (such as \(\bar{X}, S\)) to denote measures on the sample.

Measures of Location

Measures of location are commonly known as measures of central tendency of data. They quantify the location where most of the data is situated in the dataset. The most popular measures are listed in the table below.

Measure	Description	Comments
Mean	\(\mu = \frac{1}{N} \sum_{i=1}^N x_i\)	Mean can be computed for numerical data.
Median	It equals the middle value in the sorted dataset.	Median can be computed for ordinal data.
\(k^{th}\) percentile	It equals the value below which \(k\%\) of the data lies.	Percentile can be computed for ordinal data. Median is the \(50^{th}\) percentile.
Mode	It equals the most frequent value in the dataset.	Mode can be computed for both nominal and ordinal data.
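The measures in the table can be tried out on a small dataset. The following is a minimal sketch using Python's standard `statistics` module. The `scores` list is made-up data, and `percentile_nearest_rank` is a hypothetical helper that assumes the nearest-rank convention; other tools use interpolating definitions of the percentile and may return slightly different values.

```python
import math
import statistics


def percentile_nearest_rank(values, k):
    """Return the k-th percentile (0 < k <= 100) using the nearest-rank rule.

    This is one of several percentile conventions (an assumption of this sketch).
    """
    ordered = sorted(values)
    # Rank of the smallest value with at least k% of the data at or below it.
    rank = max(1, math.ceil(k / 100 * len(ordered)))
    return ordered[rank - 1]


scores = [2, 3, 3, 5, 7, 8, 9]  # made-up numerical data

mean = statistics.mean(scores)      # sum of the values divided by N
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequent value
p25 = percentile_nearest_rank(scores, 25)
p50 = percentile_nearest_rank(scores, 50)

print(f"mean = {mean:.3f}, median = {median}, mode = {mode}")
print(f"25th percentile = {p25}, 50th percentile = {p50}")
```

With this convention, the 50th percentile of `scores` coincides with the median, as the table states.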

The relationship between the mean and the median tells us more about the shape of the distribution. If the distribution is symmetric, the mean equals the median. If the mean is greater than the median, the data is typically skewed towards higher values (a long right tail). If the mean is smaller than the median, the data is typically skewed towards smaller values (a long left tail). This is depicted in the figure below.

Figure: Mean_Median (mean versus median for symmetric and skewed data)
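The same relationship can be checked numerically. The sketch below compares the mean and the median of three small made-up datasets; the values are chosen only for illustration and are not taken from the figure.

```python
import statistics

# Made-up datasets: one symmetric, one with a long right tail, one with a long left tail.
datasets = {
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right-skewed": [1, 1, 2, 2, 3, 5, 9],
    "left-skewed": [1, 5, 7, 8, 8, 9, 9],
}

summary = {}
for name, values in datasets.items():
    mean = statistics.mean(values)
    median = statistics.median(values)
    summary[name] = (mean, median)
    print(f"{name:>12}: mean = {mean:.2f}, median = {median}")
```

For these datasets the mean equals the median in the symmetric case, exceeds it in the right-skewed case and falls below it in the left-skewed case.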

Diagnostic

Which measures of location are sensitive to outliers in the data?

Measures of Dispersion

Measures of dispersion quantify the spread of the data around their central tendencies. The most popular measures are listed in the table below.

Measure	Description	Comments
Variance	\(\sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2\)	Variance can be computed for numerical data.
Standard deviation	Denoted by \(\sigma\), it equals the positive square root of the variance.	Unlike the variance, the standard deviation has the same unit as the measurements in the data.
Range	It equals the difference between the maximum and minimum values in the dataset.	Range can be computed for numerical data.
Inter-quartile Range (IQR)	It equals the difference between the \(75^{th}\) and \(25^{th}\) percentiles of the dataset.	IQR can be computed for numerical data.
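As a minimal sketch, the four measures of dispersion can be computed with the standard `statistics` module. The `weights_kg` list is made-up data. Quartile conventions differ between tools, so the `method="inclusive"` choice below is an assumption of this example rather than the only valid definition.

```python
import statistics

weights_kg = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up measurements in kg

variance = statistics.pvariance(weights_kg)  # population variance, in kg^2
std_dev = statistics.pstdev(weights_kg)      # population standard deviation, in kg
data_range = max(weights_kg) - min(weights_kg)

# quantiles(..., n=4) returns the three cut points: Q1, Q2 (median) and Q3.
q1, q2, q3 = statistics.quantiles(weights_kg, n=4, method="inclusive")
iqr = q3 - q1

print(f"variance = {variance} kg^2, standard deviation = {std_dev} kg")
print(f"range = {data_range} kg, IQR = {iqr} kg")
```

Note that the variance is expressed in squared units, whereas the standard deviation, range and IQR are in the unit of the data.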

Diagnostic

Which measures of dispersion are sensitive to outliers in the data?

Measures of Association

Measures of association quantify the joint variability of two datasets.

Covariance is the most popular measure of association. It quantifies the extent of the linear relationship between two datasets. For two datasets \(X\) and \(Y\), it is computed using the following formula:

\[ cov(X, Y) = \frac{1}{N} \sum_{i=1}^N (x_i - \mu_x) (y_i - \mu_y) \]

where \(\mu_x\) and \(\mu_y\) denote the means of datasets \(X\) and \(Y\) respectively. The unit of the covariance is the product of the units of measurement of the individual datasets. Therefore, the magnitude of the covariance is hard to interpret without the help of a domain expert.
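The following sketch implements the population covariance formula directly and illustrates its dependence on units. The function name `population_covariance` and the two lists are hypothetical, made-up illustrations.

```python
def population_covariance(xs, ys):
    """Population covariance: the average product of deviations from the means."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n_points = len(xs)
    mu_x = sum(xs) / n_points
    mu_y = sum(ys) / n_points
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n_points


hours_studied = [1, 2, 3, 4, 5]     # made-up data, in hours
exam_score = [52, 55, 61, 64, 68]   # made-up data, in points

cov_hours = population_covariance(hours_studied, exam_score)

# Expressing the same study time in minutes rescales the covariance by 60.
minutes_studied = [60 * h for h in hours_studied]
cov_minutes = population_covariance(minutes_studied, exam_score)

print(f"cov (hours x points)   = {cov_hours:.1f}")
print(f"cov (minutes x points) = {cov_minutes:.1f}")
```

The association between the two attributes is the same in both cases, yet the covariance changes with the unit, which is why its raw magnitude is difficult to read on its own.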

The correlation coefficient normalises the covariance to provide a unit-less measure of association. It is computed as follows:

\[ \rho_{XY} = \frac{cov(X, Y)}{\sigma_X \sigma_Y} \]

where \(\sigma_X\) and \(\sigma_Y\) denote the standard deviations of datasets \(X\) and \(Y\) respectively. The value of the correlation coefficient lies between \(-1\) and \(1\). A correlation coefficient of

\(-1\) quantifies perfect anti-correlation between the datasets. Increasing values of \(X\) are related to decreasing values of \(Y\).
\(0\) quantifies the absence of linear correlation between the datasets.
\(1\) quantifies perfect correlation between the datasets. Increasing values of \(X\) are related to increasing values of \(Y\).
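The two extreme cases can be reproduced with a short sketch. The function `population_correlation` is a hypothetical helper written for this illustration; it divides the population covariance by the product of the population standard deviations, and the data are synthetic points lying exactly on a straight line.

```python
import math


def population_correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    n_points = len(xs)
    mu_x = sum(xs) / n_points
    mu_y = sum(ys) / n_points
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n_points
    sigma_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n_points)
    sigma_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n_points)
    return cov / (sigma_x * sigma_y)


x = [1, 2, 3, 4, 5]
y_increasing = [2 * v + 1 for v in x]    # points on an increasing line
y_decreasing = [10 - 3 * v for v in x]   # points on a decreasing line

rho_up = population_correlation(x, y_increasing)
rho_down = population_correlation(x, y_decreasing)

print(f"increasing line: rho = {rho_up:.3f}")
print(f"decreasing line: rho = {rho_down:.3f}")
```

Up to floating-point rounding, the increasing line gives a coefficient of \(1\) and the decreasing line gives \(-1\), regardless of the slope.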

Remember that the correlation coefficient quantifies only the linear relationship between the datasets. The existence of a non-linear relationship cannot be quantified using this measure. Consider the following figure for an illustration. There is clearly a quadratic pattern in sub-figure (f), but the correlation coefficient fails to capture it.

Figure: Linear Correlation
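A quadratic pattern of this kind is easy to reproduce. The sketch below uses synthetic data (not the data behind the figure) in which \(Y\) is fully determined by \(X\), and computes the correlation coefficient from its definition.

```python
import math

# Synthetic data: y is an exact quadratic function of x, symmetric around x = 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

n_points = len(x)
mu_x = sum(x) / n_points
mu_y = sum(y) / n_points

cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n_points
sigma_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / n_points)
sigma_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / n_points)
rho = cov / (sigma_x * sigma_y)

print(f"rho = {rho:.3f}")
```

The positive and negative halves of the parabola cancel each other in the covariance, so the coefficient is zero even though the two attributes are perfectly dependent.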

Correlation does not imply causation!

Mistaking correlation for causation is one of the most common rookie mistakes an analyst can make. If we observe a strong positive correlation between ice-cream sales and shark attacks in a beach town, it would be absurd to infer that the shark attacks are responsible for the boost in ice-cream sales. In such cases, the positive correlation can be attributed to a third variable, known as a confounding variable, which is not part of the analysis. In the case of the beach town, it could be high temperatures.

Standard Score

Standard score (also known as \(Z\)-score) is a compound measure constructed using mean and standard deviation. Standard score \(z_i\) for a datapoint \(x_i\) is computed as follows:

\[ z_i = \frac{x_i - \mu}{\sigma} \]

where \(\mu\) and \(\sigma\) denote the mean and standard deviation of the dataset. The standard score lies between \(-\infty\) and \(\infty\). A datapoint with a standard score of \(0\) lies at the mean of the dataset. A datapoint with a standard score of \(z\) lies \(z\) standard deviations away from the mean of the dataset; it lies above the mean if \(z\) is positive and below the mean if \(z\) is negative.
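A minimal sketch of the computation is given below. The helper `standard_scores` and the data are hypothetical; the sketch treats the list as a population and therefore uses the population standard deviation.

```python
import statistics


def standard_scores(values):
    """Return the standard score of every datapoint in `values`."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("standard score is undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data: mean 5, standard deviation 2
z = standard_scores(data)

for value, score in zip(data, z):
    print(f"x = {value}, z = {score:+.2f}")
```

Here the datapoint \(9\) receives a standard score of \(2\) because it lies two standard deviations above the mean, while the datapoints equal to \(5\) receive a score of \(0\).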

Descriptive statistics for the Sample

The measures for a sample follow the same formulae as those for the population, with a slight variation for the variance and the covariance: the sum is divided by \(n - 1\) instead of \(n\).

Measure	Formula
Sample mean	\(\bar{X} = \frac{1}{n} \sum_{i=1}^n x_i\)
Sample variance	\(S^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{X})^2\)
Covariance	\(cov(X, Y) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{X})(y_i - \bar{Y})\)
Correlation coefficient	\(r_{XY} = \frac{cov(X, Y)}{S_X S_Y}\)
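The sample formulae can be sketched as follows. The helper names and the data are hypothetical; the code divides by \(n - 1\) as in the table and obtains the sample variance as the sample covariance of a dataset with itself.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance: the sum of products of deviations divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two datasets of equal length with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Sample correlation coefficient r_XY = cov(X, Y) / (S_X * S_Y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # S_X^2 equals cov(X, X)
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


hours_studied = [1, 2, 3, 4, 5]     # made-up sample
exam_score = [52, 55, 61, 64, 68]   # made-up sample

s2_hours = sample_covariance(hours_studied, hours_studied)
cov_xy = sample_covariance(hours_studied, exam_score)
r_xy = sample_correlation(hours_studied, exam_score)

print(f"sample variance of hours = {s2_hours}")
print(f"sample covariance = {cov_xy}")
print(f"sample correlation = {r_xy:.4f}")
```

Because the factor \(n - 1\) appears in both the numerator and the denominator of \(r_{XY}\), it cancels out, so the correlation coefficient takes the same value whether the sample or the population formulae are used.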