Linear Regression
Learning Objectives
After this unit, students should be able to
- derive the ordinary least squares (OLS) regression model.
Revision: Equation of a Line
Do you remember the equation of a line from coordinate geometry? The equation of a line in two-dimensional Euclidean space is given as follows:
\[ y = a + bx, \]
where \(a\) is the intercept and \(b\) is the slope of the line. We can extend this equation to higher dimensions. The equation of a hyperplane is given as follows:
\[ y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_m x_m. \]
We can write this succinctly in vector notation as \(y = \mathbf{b}^T \mathbf{x}\), where \(\mathbf{b} = (b_0, b_1, \dots, b_m)\) and \(\mathbf{x} = (1, x_1, \dots, x_m)\) are \((m+1)\)-dimensional real vectors, i.e. \(\mathbf{b}, \mathbf{x} \in \mathbb{R}^{m+1}\). Prepending the constant \(1\) to \(\mathbf{x}\) absorbs the intercept \(b_0\) into the dot product.
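To make the augmented-vector convention concrete, here is a minimal NumPy sketch (not from the original text; the coefficient and datapoint values are made up) that evaluates \(y = \mathbf{b}^T \mathbf{x}\):

```python
import numpy as np

# Hypothetical coefficients: intercept b0 = 2.0 and slopes b1 = 0.5, b2 = -1.0
b = np.array([2.0, 0.5, -1.0])

# A single datapoint with m = 2 features, prepended with the constant 1
x = np.array([1.0, 3.0, 4.0])

# y = b^T x, i.e. the dot product of the two vectors
y = b @ x
print(y)  # 2.0 + 0.5 * 3.0 - 1.0 * 4.0 = -0.5
```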
Definition
Given a labeled dataset \(D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}\) of \(n\) points, where \(\mathbf{x}_i \in \mathbb{R}^m\) and \(y_i \in \mathbb{R}\), the linear regression problem is to find a linear approximation \(\mathbf{b}^* \in \mathbb{R}^{m+1}\) such that:
\[ y_i \approx b_0^* + b_1^* x_{i1} + \dots + b_m^* x_{im}, \quad i = 1, \dots, n. \]
This is known as the linear hypothesis, since it is an a priori assumption of a linear relationship between the data and the labels. We can write the linear hypothesis succinctly in matrix-vector notation as follows:
\[ \mathbf{y} \approx X \mathbf{b}^*, \]
where the \(i\)-th row of \(X\) is the datapoint \(\mathbf{x}_i\) prepended with a constant \(1\).
Colloquially, \(X \in \mathbb{R}^{n \times (m + 1)}\) is called the data matrix and \(\mathbf{b}^* \in \mathbb{R}^{m+1}\) is called the vector of regression coefficients.
Linear regression aims to find the "true" coefficient vector \(\mathbf{b}^*\); but it is not always possible to draw a single line that passes through all datapoints in the given dataset. The line may pass through some of them, while for others it incurs an error, called the residual: \(e_i = y_i - \hat{y}_i\), where \(\hat{y}_i = \hat{\mathbf{b}}^T \mathbf{x}_i\) is the predicted value. It is a convention to denote estimated values using the hat \((\hat{\cdot})\) notation.
We use the mean squared error (MSE), which averages the squared residuals over the available dataset. It is defined as follows:
\[ \mathrm{MSE}(\mathbf{b}) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \mathbf{b}^T \mathbf{x}_i \right)^2 = \frac{1}{n} \lVert \mathbf{y} - X\mathbf{b} \rVert^2. \]
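As an illustration, the following NumPy sketch computes the mean squared error for a made-up toy dataset, building the data matrix by prepending a column of ones as described above:

```python
import numpy as np

# Made-up toy dataset: n = 4 datapoints with m = 1 feature
x_raw = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Data matrix: each row is a datapoint prepended with a constant 1
X = np.column_stack([np.ones(len(x_raw)), x_raw])

# A candidate coefficient vector (intercept, slope), chosen arbitrarily
b = np.array([0.0, 2.0])

residuals = y - X @ b          # e_i = y_i - b^T x_i
mse = np.mean(residuals ** 2)  # average of the squared residuals
print(mse)                     # 0.0175 for this toy data
```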
Diagnostic
What are the hypothesis set and the goodness-of-fit function for linear regression?
Thus, we want to find the linear approximation that minimises the mean squared error on the given dataset \(D\). The following variant of the linear regression model is called ordinary least squares (OLS) regression:
\[ \hat{\mathbf{b}} = \arg\min_{\mathbf{b} \in \mathbb{R}^{m+1}} \frac{1}{n} \lVert \mathbf{y} - X\mathbf{b} \rVert^2. \]
Derivation
The loss function, the mean squared error, is convex and differentiable in \(\mathbf{b}\). For such functions, any point at which the gradient vanishes is a global minimiser. Therefore, setting the gradient of the loss function to zero yields the estimate \(\hat{\mathbf{b}}\) that minimises it. It is derived as follows:
\[ \nabla_{\mathbf{b}} \, \frac{1}{n} \lVert \mathbf{y} - X\mathbf{b} \rVert^2 = -\frac{2}{n} X^T (\mathbf{y} - X\mathbf{b}) = 0 \;\;\Longrightarrow\;\; X^T X \hat{\mathbf{b}} = X^T \mathbf{y} \;\;\Longrightarrow\;\; \hat{\mathbf{b}} = (X^T X)^{-1} X^T \mathbf{y}. \]
Voilà! We have a closed-form solution to the seemingly complicated problem. This equation is known as the normal equation of linear regression.
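The normal equation translates directly into code. Below is a minimal NumPy sketch on a made-up toy dataset; in practice one solves the linear system \(X^T X \mathbf{b} = X^T \mathbf{y}\) (or uses a least-squares routine such as np.linalg.lstsq) rather than forming the inverse explicitly:

```python
import numpy as np

# Made-up toy dataset: n = 5 datapoints with m = 1 feature
x_raw = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.9, 5.1, 7.2, 8.8, 11.1])

# Data matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(len(x_raw)), x_raw])

# Normal equation: solve X^T X b = X^T y instead of forming the inverse
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(b_hat)  # [intercept, slope]

# First-order condition from the derivation: X^T (y - X b_hat) should be ~0
print(X.T @ (y - X @ b_hat))

# Numerically more robust alternative: a dedicated least-squares routine
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_lstsq)
```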
Comments¹
Several important considerations arise when examining the closed-form solution. Specifically, the solution requires the inverse of the matrix \(X^T X\). However, not all matrices are invertible! The solution exists only when this matrix is invertible. Non-singular matrices, i.e. those with a non-zero determinant (equivalently, of full rank), are invertible. For \(X^T X\) to have full rank, the data matrix must satisfy the following properties (see the sketch after this list):
- The number of rows must be at least the number of columns of the matrix. This translates to: the number of datapoints must be at least the number of features in the dataset (plus one, for the intercept).
- No two rows are linearly related to each other. This translates to: no two datapoints can be exactly the same, or the same up to a multiplicative scale.
- No two columns are perfectly correlated. This translates to: no two features are perfectly correlated with each other.
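As a quick sanity check of these conditions, one can inspect the rank and condition number of \(X^T X\) before invoking the closed form. A sketch with made-up data in which two features are perfectly correlated:

```python
import numpy as np

# Made-up data matrix in which the second feature is exactly twice the first,
# so the columns of X are linearly dependent and X^T X is singular
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1
X = np.column_stack([np.ones(len(x1)), x1, x2])

gram = X.T @ X
print(np.linalg.matrix_rank(gram))  # 2, i.e. less than 3: not full rank
print(np.linalg.cond(gram))         # very large (or inf): singular matrix

# Solving the normal equation with this X would raise LinAlgError or return
# a meaningless answer; dropping the redundant feature, regularisation, or
# the pseudo-inverse (np.linalg.pinv) are the usual remedies.
```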
¹ Readers without a background in Linear Algebra may find this discussion hard to follow; they can skip the jargon and focus on the key takeaways at the end of the section.