Bayesian Regression

Learning Objectives

After this unit, students should be able to

  • compute the likelihood of data under a probabilistic model.
  • derive Bayesian regression.
  • describe the assumptions of the regression model.
  • comprehend the duality between the Bayesian and classical pictures.

The formulation of the linear regression model from Unit 17 does not involve any probabilistic modeling, which the Bayesian framework requires. We therefore introduce an element of randomness into the linear hypothesis for regression as follows.

Noisy Data Assumption

For a labeled dataset \(D = \{(\mathbf{x}_i, y_i)\}\) of \(n\) points where \(\mathbf{x}_i \in \mathbb{R}^m, y_i \in \mathbb{R}\),

Figure: noisy data

the linear noisy-data hypothesis is the existence of \(\mathbf{b} \in \mathbb{R}^{m+1}\) such that for every datapoint \(i \in \{1, 2, \ldots, n\}\)

\[ y_i = \left( b_0 + \sum_{j=1}^m b_j x_{ij} \right) + \epsilon_i \]

where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\). We also assume that the \(\epsilon_i\) are independent given the datapoints \(\mathbf{x}_i\), and that all \(\epsilon_i\)s follow a normal distribution with the same constant variance \(\sigma^2\). We will revisit these assumptions at the end of this unit.
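
As a concrete illustration, here is a minimal sketch in Python (with NumPy) of data generated under this hypothesis; the coefficients and noise level below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 100, 2                          # number of datapoints and number of features
b = np.array([1.0, 2.0, -0.5])         # [b_0, b_1, b_2], chosen arbitrarily for illustration
sigma = 0.3                            # noise standard deviation

X = rng.normal(size=(n, m))            # datapoints x_i stored as rows
eps = rng.normal(0.0, sigma, size=n)   # epsilon_i ~ N(0, sigma^2), independent

# y_i = b_0 + sum_j b_j * x_ij + eps_i
y = b[0] + X @ b[1:] + eps
```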

Derivation

Let us define \(\mathbf{x}_i'\) as the augmented vector \([1 ~~ \mathbf{x}_i]\) for each datapoint \(\mathbf{x}_i\), for the sake of succinct notation. Using this notation, the noisy hypothesis can be rewritten as follows:

\[ y_i = \mathbf{b}^T\mathbf{x}_i' + \epsilon_i \]

Thus, \((y_i - \mathbf{b}^T\mathbf{x}_i')\) follows a normal distribution with mean \(0\) and constant variance \(\sigma^2\).
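
Spelling out the corresponding conditional density (the standard Gaussian density, stated here to make the next step explicit):

\[ Pr[y_i | \mathbf{x}_i, \mathbf{b}] = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mathbf{b}^T\mathbf{x}_i')^2}{2\sigma^2} \right) \]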

Using the independence assumption, we can write the likelihood of the data as follows:

\[ \begin{aligned} \ell_D(\mathbf{b}) &= \prod_{i=1}^n Pr[y_i | \mathbf{x}_i, \mathbf{b}] \\ L_D(\mathbf{b}) &= \sum_{i=1}^n \log Pr[y_i | \mathbf{x}_i, \mathbf{b}] && (\text{Log-likelihood}) \\ &\propto - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mathbf{b}^T\mathbf{x}_i')^2 \\ &\propto - (\mathbf{y} - X\mathbf{b})^T(\mathbf{y} - X\mathbf{b}) &&(\text{Using the data matrix notation from Unit 16}) \end{aligned} \]
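
As a sanity check, here is a minimal sketch of the log-likelihood up to additive constants (reusing the synthetic `X`, `y`, and the made-up `sigma` from above; the helper name `log_likelihood` is our own):

```python
def log_likelihood(b, X, y, sigma):
    """Gaussian log-likelihood L_D(b), dropping terms that do not depend on b."""
    X_aug = np.column_stack([np.ones(len(y)), X])   # data matrix with rows x_i'
    residuals = y - X_aug @ b                       # y_i - b^T x_i'
    return -np.sum(residuals ** 2) / (2 * sigma ** 2)
```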

The maximum likelihood estimate \(\hat{\mathbf{b}}_{ML}\) is computed as follows:

\[ \begin{aligned} \hat{\mathbf{b}}_{ML} &= \underset{\mathbf{b} \in \mathbb{R}^{m+1}}{\operatorname{\arg \max}} ~~ L_D(\mathbf{b}) \\ &= \underset{\mathbf{b} \in \mathbb{R}^{m+1}}{\operatorname{\arg \max}} ~~ - (\mathbf{y} - X\mathbf{b})^T(\mathbf{y} - X\mathbf{b}) \\ &= \underset{\mathbf{b} \in \mathbb{R}^{m+1}}{\operatorname{\arg \min}} ~~ (\mathbf{y} - X\mathbf{b})^T(\mathbf{y} - X\mathbf{b}) \\ \end{aligned} \]

The last line in the derivation corresponds to minimising the mean squared loss on the dataset. Therefore, we observe that \(\hat{\mathbf{b}}_{ML}\) equals the OLS regression estimate (refer to Unit 17).
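
A minimal numerical check (assuming the synthetic data above) that maximising the likelihood and solving least squares yield the same estimate:

```python
X_aug = np.column_stack([np.ones(len(y)), X])

# OLS estimate via the normal equations
b_ols = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Least-squares solver, i.e. the minimiser of (y - Xb)^T (y - Xb)
b_ml, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(np.allclose(b_ols, b_ml))   # True: the two estimates coincide
```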

Assumptions

It is now the right time to explicitly state the assumptions of the linear regression model. They are as follows:

  1. Linearity. Linear regression assumes that the label \(y_i\) is linearly related to \(\mathbf{x}_i\) for any datapoint \(i\).
  2. Independence. All error terms \(\epsilon_i\)s are independent of each other.
  3. Normality. All error terms \(\epsilon_i\)s follow a normal distribution.
  4. Homoscedasticity. All error terms \(\epsilon_i\)s have the same constant variance \(\sigma^2\) (see the residual-check sketch after this list).
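
As one informal way of probing assumptions 3 and 4 in practice, here is a minimal sketch of residual diagnostics (assuming the synthetic data and the fitted estimate from above, and matplotlib for plotting):

```python
import matplotlib.pyplot as plt

residuals = y - X_aug @ b_ml            # estimated error terms

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(X_aug @ b_ml, residuals, s=10)
ax1.set(xlabel="fitted values", ylabel="residuals")   # spread should look roughly constant
ax2.hist(residuals, bins=20)
ax2.set(xlabel="residuals", ylabel="count")           # shape should look roughly normal
plt.tight_layout()
plt.show()
```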

Diagnostic

Can you spot the places in the derivation where these assumptions are used?