
Evaluation Metrics for Linear Regression

Learning Objectives

After this unit, students should be able to

  • state the different kinds of errors in linear regression.
  • interpret the \(R^2\) metric for the quality of the regression.
  • interpret the adjusted \(R^2\) metric for the quality of the regression.
  • evaluate the significance of regression coefficients.

Revise Unit 17

Readers should carefully revise Unit 17 before proceeding with this unit.

[Figure: scatter plot of test scores versus GPA, showing the regression fit (solid line) and the average score (dashed line)]

Consider the case of simple linear regression to predict the test score out of \(20\) based on the GPA of the student. The adjoining scatter plot shows the toy dataset. The solid line shows the regression fit, whereas the dashed line shows the average score. We have three kinds of notation for the response at a predictor value \(x_i\).

  • \(y_i\) denotes the actual response in the dataset.
  • \(\hat{y_i}\) denotes the predicted response as per the regression line.
  • \(\bar{y}\) denotes the average value of the response.

The difference between the actual and predicted response, \(y_i - \hat{y_i}\), is called the residual.

Based on this terminology, three kinds of errors are defined for a linear regression model as follows:

  • Residual Sum of Squares (RSS). It is the sum of squares of the residuals, computed using the following formula. If you recall, this is precisely the quantity that OLS regression minimises.
\[ RSS = \sum_{i=1}^n (y_i - \hat{y_i})^2 \]
  • Explained Sum of Squares (ESS). A naive regression model always predicts the average value of the response. The explained sum of squares quantifies how much better the trained model is compared to this naive model.
\[ ESS = \sum_{i=1}^n (\bar{y} - \hat{y_i})^2 \]
  • Total Sum of Squares (TSS). The total sum of squares quantifies the inherent variability of the response values in the dataset. One can show that the total sum of squares equals the sum of the residual sum of squares and the explained sum of squares (a numerical check is sketched after this list).
\[ TSS = \sum_{i=1}^n (\bar{y} - y_i)^2 \]
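As a quick numerical check, the following sketch computes the three error terms on a small synthetic GPA/score dataset (the values are illustrative, not the unit's actual toy data) and verifies that \(TSS = RSS + ESS\) for an OLS fit with an intercept.

```python
import numpy as np

# Illustrative toy data: GPA (predictor) and test score out of 20 (response).
x = np.array([2.0, 2.5, 3.0, 3.2, 3.5, 3.8, 4.0])
y = np.array([8.0, 10.0, 11.0, 13.0, 14.0, 16.0, 17.0])

# Fit a simple OLS line: y_hat = a + b * x
b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x
y_bar = y.mean()

rss = np.sum((y - y_hat) ** 2)      # Residual Sum of Squares
ess = np.sum((y_hat - y_bar) ** 2)  # Explained Sum of Squares
tss = np.sum((y - y_bar) ** 2)      # Total Sum of Squares

print(rss, ess, tss)
print(np.isclose(tss, rss + ess))   # TSS = RSS + ESS holds for an OLS fit with an intercept
```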

Coefficient of Determination \((R^2)\)

Coefficient of Determination, popularly known as \(R^2\), is a key metric in the analysis of regression. It quantifies the proportion of variance in the response that is explained by the regression model. It is a compound metric defined using various kinds of errors in the regression model as follows:

\[ R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS} \]

\(R^2\) values range from \(0\) to \(1\). An \(R^2\) of \(0\) means the model explains none of the variability in the response, whereas an \(R^2\) of \(1\) means the model explains it perfectly. Thus, one desires the value of \(R^2\) to be close to \(1\).
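As a small sketch (the actual and predicted values below are synthetic and purely illustrative), \(R^2\) can be computed directly from its definition and cross-checked against sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([8.0, 10.0, 11.0, 13.0, 14.0, 16.0, 17.0])      # actual responses (illustrative)
y_hat = np.array([8.4, 9.9, 11.5, 12.1, 13.6, 15.9, 17.1])   # predictions from some fitted model

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(1 - rss / tss)        # R^2 from the definition
print(r2_score(y, y_hat))   # the same value via sklearn
```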

Are small \(R^2\) values really bad?

Small values of \(R^2\) are not necessarily bad; their interpretation depends on the context of the analysis. If we are using the regression for predictive analysis, models with small values of \(R^2\) are not useful. If we are using the regression to explore relationships among various features in the data, models with small values of \(R^2\) might still be useful. It is very common to observe small \(R^2\) values when working with social and behavioural science datasets due to unpredictable factors influenced by human behaviour. Training on small datasets might also lead to small values of \(R^2\). Even with a low \(R^2\) value, the individual predictors may still be statistically significant, indicating a meaningful relationship between the response and the predictors.

Adjusted \(R^2\)

Adding predictors to a regression model typically increases its \(R^2\) value. This increase may be artificial, since the newly added predictors may not explain the data well. Adjusted \(R^2\) modifies \(R^2\) by accounting for the number of predictors in the model, as defined in the following equation (\(n\) denotes the number of data points and \(m\) denotes the number of predictors). Adjusted \(R^2\) increases only if a new predictor improves the model more than would be expected by chance.

\[ R^2_{adj} = 1 - \frac{(1-R^2) (n-1)}{n - m - 1} \]
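A one-line helper makes the adjustment explicit; the numbers below are only an example.

```python
def adjusted_r2(r2: float, n: int, m: int) -> float:
    """Adjusted R^2 for a model with m predictors fitted on n data points."""
    return 1 - (1 - r2) * (n - 1) / (n - m - 1)

# Example: R^2 = 0.85 on n = 50 data points with m = 4 predictors.
print(adjusted_r2(0.85, n=50, m=4))   # approximately 0.837
```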

Adjusted \(R^2\) is very useful while performing model selection.

  • Simple vs. Multiple. We can use adjusted \(R^2\) to choose between simple and multiple regression. Let Model A be a simple linear regression, \(y = a + bx_1\), and Model B be a multiple linear regression, \(y = a + bx_1 + cx_2\). We can compute the adjusted \(R^2\) value for each model. If the adjusted \(R^2\) for Model B is not significantly larger than that of Model A, one may opt for Model A due to its simplicity (a small comparison is sketched after this list).

  • Choosing Predictors. We can use adjusted \(R^2\) to decide which predictors should be included in the model. Let Model A be \(y = a + bx_1 + cx_2\) and Model B be \(y = a + bx_1 + cx_3\). We can choose the model with the higher adjusted \(R^2\).

  • Evaluation Across Different Datasets. We can use adjusted \(R^2\) to assess the generalisability of the model. To do so, we can train the same model on two datasets. If we observe a sudden drop in the value of adjusted \(R^2\), it indicates the lack of generalisability of the model.
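The following sketch illustrates the first use case on synthetic data (the variable names and the data-generating process are assumptions made for this example): Model B adds a predictor that barely influences the response, so its adjusted \(R^2\) should not be much higher than Model A's.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2.0 + 1.5 * df["x1"] + 0.05 * df["x2"] + rng.normal(scale=1.0, size=100)

# Model A: y = a + b*x1          Model B: y = a + b*x1 + c*x2
model_a = sm.OLS(df["y"], sm.add_constant(df[["x1"]])).fit()
model_b = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

print(model_a.rsquared_adj, model_b.rsquared_adj)
# If Model B's adjusted R^2 is not meaningfully higher, prefer the simpler Model A.
```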

\(R^2\) quantifies the strength of linear relationships.

A high value of \(R^2\) should not be misconstrued as a perfect model. \(R^2\) quantifies the strength of linear relationships only. The following example shows scatter plots of two datasets, both of which inherently possess a non-linear relationship: the \(R^2\) value for one of them is very small, whereas for the other it is very high. Thus, \(R^2\) should not be used as the sole metric for the accuracy of a model fit. We will learn about residual plots in the next unit; they help highlight these issues.
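The effect is easy to reproduce on synthetic data: a straight-line fit to a clearly non-linear relationship can still report a high \(R^2\) (the data below are made up for illustration).

```python
import numpy as np

# A non-linear relationship (y = x^2) fitted with a straight line.
x = np.linspace(0, 10, 100)
y = x ** 2

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)   # roughly 0.94, even though the linear model is the wrong functional form
```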

Significance of Coefficients

We typically fit a regression model on the dataset available to us, which is likely a small sample taken from an unknown population. Although we fit the model on the available dataset, its real purpose is predictive power on unseen data. Ideally, therefore, we would like the learned regression model to match the one we would obtain from the entire population.

[Figure: regression lines fitted on samples (solid lines) compared with the line fitted on the entire population (dashed line)]

The accompanying figure illustrates the difference between a line trained on the entire population (represented by the dashed line) and lines trained on smaller samples (represented by solid lines). We observe that as the sample size increases, the lines trained on samples tend to align with the population line. Therefore, the coefficients of the linear regression behave as random variables, with the randomness arising from the sampling process.
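A small simulation (with an assumed population relationship and made-up coefficients) makes this concrete: fitting the same simple regression on many independent samples yields a distribution of slopes centred near the population slope.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed "population" relationship: score = 2 + 3 * gpa + noise (illustrative numbers).
def draw_sample(n):
    gpa = rng.uniform(2.0, 4.0, size=n)
    score = 2.0 + 3.0 * gpa + rng.normal(scale=1.0, size=n)
    return gpa, score

# Fit a line on many independent samples and inspect the spread of the estimated slope.
slopes = []
for _ in range(1000):
    gpa, score = draw_sample(n=30)
    slope, intercept = np.polyfit(gpa, score, deg=1)
    slopes.append(slope)

print(np.mean(slopes), np.std(slopes))   # centred near the population slope (3), with sampling spread
```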

\(t\)-test. One can conduct a \(t\)-test to assess the significance of each estimated regression coefficient. For every coefficient \(b_i\), the null and alternative hypotheses are defined as below:

\[ \begin{aligned} \mathcal{H}_0: &b_i = 0 \\ \mathcal{H}_1: &b_i \neq 0 \end{aligned} \]

If the \(p\)-value is less than \(0.05\), we can say with \(95\%\) confidence that the respective coefficient is significant in explaining the response.
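With statsmodels, the per-coefficient \(t\)-tests and their \(p\)-values are available directly on the fitted results; the synthetic data below are illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(2.0, 4.0, size=40)                  # e.g. GPA values
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=40)  # e.g. test scores

results = sm.OLS(y, sm.add_constant(x)).fit()
print(results.params)    # estimated coefficients (intercept and slope)
print(results.pvalues)   # p-values of the t-test H0: b_i = 0 for each coefficient
```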

 

\(F\)-test. The \(F\)-test in multiple linear regression is used to determine whether there is a significant relationship between the response and predictors. It tests the overall significance of the model by comparing a model with no predictors (the null model) to the specified regression model with multiple predictors. It assesses the following null hypothesis:

\[ \begin{aligned} \mathcal{H}_0: &b_1 = b_2 = b_3 = \dots = 0 \\ \mathcal{H}_1: &\text{at least one } b_i \neq 0 \end{aligned} \]

If the \(p\)-value is less than \(0.05\), we can say with \(95\%\) confidence that at least one predictor explains the response. At first glance, this test may seem unhelpful, as it does not identify which predictors are important. However, the \(F\)-test can be useful for selecting features via a trial-and-error strategy when the dataset has a large number of predictors.
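The overall \(F\)-test is also reported by statsmodels. The sketch below uses synthetic data in which only the first of three candidate predictors actually influences the response.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))                     # three candidate predictors
y = 1.0 + 0.8 * X[:, 0] + rng.normal(size=60)    # only the first predictor matters

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.fvalue, results.f_pvalue)          # overall F-test: is at least one slope non-zero?
```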

Linear Regression in Python

sklearn is a Python library offering support for various machine learning models. One such model is LinearRegression, found in the sklearn.linear_model module. While sklearn provides convenient APIs for training a regression model, it lacks a straightforward interface for comprehensive statistical analysis. For this purpose, one can utilize OLS regression from the statsmodels library. A typical output of linear regression performed with statsmodels presents useful evaluation metrics in a single snapshot.
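A minimal end-to-end sketch on synthetic data (the values and variable names are illustrative) showing both interfaces:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.uniform(2.0, 4.0, size=(50, 1))                  # e.g. GPA values
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=1.0, size=50) # e.g. test scores

# sklearn: convenient for fitting and prediction.
sk_model = LinearRegression().fit(X, y)
print(sk_model.intercept_, sk_model.coef_, sk_model.score(X, y))   # score() returns R^2

# statsmodels: detailed statistical summary (R^2, adjusted R^2, t-tests, F-test) in one snapshot.
sm_model = sm.OLS(y, sm.add_constant(X)).fit()
print(sm_model.summary())
```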

[Figure: a typical statsmodels OLS regression summary output]