
Assumptions of Linear Regression

Learning Objectives

After this unit, students should be able to

  • state the assumptions of linear regression.
  • assess if the model satisfies the assumptions.
  • explain the repercussions of violations of the assumptions.
  • validate the linear regression model.

Revise Unit 17

Readers should carefully revise Unit 17 before proceeding with this unit.

Linear regression models rely on several key assumptions to produce valid and reliable results. These assumptions are crucial for ensuring the accuracy of the estimates, the validity of hypothesis tests, and the generalisability of the model. Here are the primary assumptions of linear regression models:

  • Linearity. The relationship between the predictors and the response is linear. This means that the change in response is proportional to the change in the predictors.

  • Independence. The residuals should be independent of each other. This implies that the error term of one observation is not correlated with the error term of another observation.

  • Homoscedasticity. The residuals should have constant variance at all values of predictors. This means that the spread of the residuals should be approximately the same for all fitted values of the dependent variable. When this assumption is violated, it is called heteroscedasticity.

  • Normality. The residuals should be approximately normally distributed. This assumption is particularly important for hypothesis testing, such as t-tests for the coefficients.

The first assumption lies at the heart of linear regression. The remaining assumptions are a manifestation of the Bayesian interpretation of linear regression (Unit 20).

Residual Plots

Residual plots are graphical tools used to diagnose various aspects of a regression model, particularly the assumptions underlying the model. They plot the residuals (the differences between the observed values and the values predicted by the regression model) against various quantities. The two most popular types are:

  • Residual vs Fitted Plot. This is the plot of residuals against the fitted (or predicted) values.
  • Residual vs Predictor Plot. This is the plot of residuals against the predictor values in the training data.

We will now see how we can use residual plots to assess the assumptions of the linear regression model.
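
As a minimal sketch of how both plots can be produced (the simulated data below are assumed purely for illustration; the unit does not prescribe a library, and statsmodels with matplotlib is just one option):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data for illustration only (a genuinely linear relationship).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(0, 1, 100)

X = sm.add_constant(x)              # design matrix with an intercept column
model = sm.OLS(y, X).fit()          # ordinary least squares fit

fitted = model.fittedvalues         # predicted values on the training data
resid = model.resid                 # residuals = observed - predicted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residual vs fitted plot
ax1.scatter(fitted, resid)
ax1.axhline(0, color="grey", ls="--")
ax1.set(xlabel="Fitted values", ylabel="Residuals", title="Residual vs Fitted")

# Residual vs predictor plot
ax2.scatter(x, resid)
ax2.axhline(0, color="grey", ls="--")
ax2.set(xlabel="Predictor x", ylabel="Residuals", title="Residual vs Predictor")

plt.tight_layout()
plt.show()
```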

Assessing Linearity

We can use a residual versus fitted plot to evaluate the linearity assumption. Ideally, the residuals should be randomly scattered around zero without any discernible pattern. A curved pattern in this plot suggests that the relationship between the predictors and the response variable is not linear. An example is shown below: the left panel displays the data as a scatter plot with a fitted line, while the right panel shows the residual plot for the model. The pattern in the residual plot indicates non-linearity in the data. This demonstrates that a high \(R^2\) value does not necessarily validate the model.

[Figure: linearity example. Left: scatter plot with fitted line. Right: residual vs fitted plot showing a curved pattern.]
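
A pattern of this kind can be reproduced with a small simulation (a sketch only; the quadratic data below are assumed for illustration and are not the dataset behind the figure):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated non-linear (quadratic) data for illustration only.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 1 + 0.5 * x**2 + rng.normal(0, 2, 100)

model = sm.OLS(y, sm.add_constant(x)).fit()   # force a straight-line fit
print(f"R^2 = {model.rsquared:.3f}")          # typically high despite the misfit

plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="grey", ls="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Curved residual pattern: linearity violated")
plt.show()
```

The printed \(R^2\) is typically high even though the straight-line model is clearly wrong, which is exactly the point made above.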

Assessing Independence

Unless explicitly stated, we always assume that the residuals are independent of each other. This assumption is usually violated for time series data, where future responses depend on past predictors as well as past responses. The Durbin-Watson test is a hypothesis test for the presence of autocorrelation in the residuals; its null hypothesis is that there is no autocorrelation. Rejecting the null hypothesis (i.e. a \(p\)-value smaller than \(0.05\)) therefore signals that the independence assumption is violated.
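
As a minimal sketch, the Durbin-Watson statistic can be computed from the residuals of a fitted model using statsmodels (the simulated time-ordered data are assumed; note that the `durbin_watson` helper returns the test statistic rather than a \(p\)-value, with values near 2 indicating no autocorrelation):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Simulated time-ordered data for illustration only.
rng = np.random.default_rng(2)
t = np.arange(100, dtype=float)
y = 5 + 0.3 * t + rng.normal(0, 1, 100)

model = sm.OLS(y, sm.add_constant(t)).fit()
dw = durbin_watson(model.resid)
print(f"Durbin-Watson statistic: {dw:.2f}")
# ~2  : no evidence of autocorrelation (independence plausible)
# << 2: positive autocorrelation; >> 2: negative autocorrelation
```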

Assessing Homoscedasticity

We can use a residual versus fitted plot to evaluate the homoscedasticity assumption. The spread of the residuals should be roughly constant across all fitted values. A "fan" shape pattern (increasing or decreasing spread) suggests heteroscedasticity (non-constant variance of errors). An example is shown below: the left panel displays the data as a scatter plot with a fitted line, while the right panel shows the residual plot for the model. The pattern in the residual plot indicates heteroscedasticity in the data. This demonstrates that a high \(R^2\) value does not necessarily validate the model.

[Figure: homoscedasticity example. Left: scatter plot with fitted line. Right: residual vs fitted plot showing a fan-shaped spread.]
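
A fan-shaped residual plot of this kind can be reproduced with a small simulation (a sketch only; the data below, whose noise grows with the predictor, are assumed for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data whose noise grows with the predictor (heteroscedastic errors).
rng = np.random.default_rng(3)
x = np.linspace(1, 10, 200)
y = 2 + 1.5 * x + rng.normal(0, 0.4 * x)

model = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="grey", ls="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Fan-shaped spread: heteroscedasticity")
plt.show()
```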

Assessing Normality

We can use a Quantile-Quantile (QQ) plot to evaluate the normality assumption. A QQ plot compares the quantiles of the data against the theoretical quantiles of an expected probability distribution. For example, to assess normality, we would plot the quantiles of the residuals against the theoretical quantiles of a normal distribution. If the points fall along the 45-degree reference line, it indicates that the normality assumption holds for the model.
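
A minimal sketch of a QQ plot for the residuals of a fitted model, using statsmodels (the simulated data are assumed for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data for illustration only.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(0, 1, 100)

model = sm.OLS(y, sm.add_constant(x)).fit()

# fit=True standardises the residuals so the 45-degree reference line applies.
sm.qqplot(model.resid, fit=True, line="45")
plt.title("QQ plot of residuals")
plt.show()
```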

Multicollinearity

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, meaning they have a strong linear relationship with each other. This correlation can make it difficult to isolate the individual effect of each predictor on the response. Perfect multicollinearity occurs when one predictor is perfectly correlated with one or more other predictors. It is said to be imperfect multicollinearity if the correlation is very high but not perfect.

Multicollinearity may not significantly affect the quantitative predictive performance of the model, but it severely affects the qualitative interpretation of the model. Suppose a response \(y\) relates to a predictor \(x_1\) as \(y = 2x_1\). Let \(x_2\) be another predictor that is perfectly correlated with \(x_1\) (say, \(x_2 = x_1\)). If we fit a multiple linear regression model, any of the following models (and many more) are plausible answers with different physical interpretations.

\[ \begin{aligned} y &= x_1 + x_2 \\ y &= 0.8x_1 + 1.2x_2 \\ y &= 2x_2 \\ &\;\vdots \end{aligned} \]

Perfect multicollinearity is equally troublesome for coefficient estimation. If two predictors are perfectly correlated, the closed-form solution for the regression coefficients (refer to Unit 17) does not exist because the matrix \(X^\top X\) that must be inverted is singular. Therefore, we have to assess the presence of multicollinearity before we train the model. Multicollinearity can be detected in one of the following ways:

  • Correlation matrix. We can compute the correlation matrix among all predictors in the dataset. A very high correlation coefficient (\(|r| > 0.9\)) between a pair of predictors is a clear indicator of multicollinearity.
  • Variance Inflation Factor (VIF). VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity. The VIF for the \(i^{th}\) predictor is computed by regressing the \(i^{th}\) predictor on the remaining predictors and setting \(\text{VIF}_i = 1/(1 - R_i^2)\). A value larger than \(10\) is a clear indication of multicollinearity. Both checks are illustrated in the sketch after this list.
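
The following sketch illustrates both checks on simulated data (the predictors and their names are assumed purely for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated predictors; x2 is almost a copy of x1, x3 is independent.
rng = np.random.default_rng(5)
x1 = rng.normal(0, 1, 200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(0, 0.05, 200),
    "x3": rng.normal(0, 1, 200),
})

# Correlation matrix: |r| > 0.9 between x1 and x2 flags multicollinearity.
print(X.corr().round(2))

# Variance inflation factors, computed on the design matrix with an intercept.
Xc = sm.add_constant(X)
for i, name in enumerate(Xc.columns):
    if name == "const":
        continue
    print(f"VIF({name}) = {variance_inflation_factor(Xc.values, i):.1f}")
```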

As mentioned earlier, multicollinearity does not necessarily harm the predictive power of the model, but it does lead to unstable coefficients and hence to their misinterpretation. Multicollinearity in the dataset can be removed by dropping one of the correlated features, or by using a dimensionality reduction technique such as PCA to replace the correlated features with uncorrelated components. A sketch of both remedies is given below.
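
A minimal sketch of both remedies, using pandas and scikit-learn on hypothetical predictors (the data and column names are assumed for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical predictors with x2 nearly duplicating x1 (assumed data).
rng = np.random.default_rng(6)
x1 = rng.normal(0, 1, 200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(0, 0.05, 200),
    "x3": rng.normal(0, 1, 200),
})

# Remedy 1: simply drop one of the correlated features.
X_dropped = X.drop(columns=["x2"])

# Remedy 2: replace the predictors with uncorrelated principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)            # columns of X_pca are uncorrelated
print(pca.explained_variance_ratio_)
```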