
Bayesian Learning

Learning Objectives

After this unit, students should be able to

  • describe the framework of Bayesian learning.
  • describe the principle of maximum likelihood estimation.
  • describe the principle of maximum a posteriori estimation.

Bayes' Rule

Please ensure that you have reviewed Bayes' Rule before reading this chapter.

Consider a random experiment and let \(\mathcal{O}\) denote the random variable defined over the sample space of observations of the random experiment. Let \(\mathcal{R}\) denote the random variable over the space of reasons or drivers that could have given rise to the observation. Following are a few examples of such random experiments.

  • The random experiment is to predict the stock price of a certain company. The sample space comprises all possible stock prices. \(\mathcal{R}\) can contain abstract things such as random fluctuations, socio-economic events, the history of the company, etc.
  • The random experiment is to predict whether a fruit falls from the tree. The sample space comprises two possibilities, namely it falls and it does not fall. \(\mathcal{R}\) can contain abstract things such as gravitational force, external forces, the ripeness of the fruit, etc.
  • The random experiment is to predict the outcome of a coin toss. The sample space comprises two possibilities, namely head and tail. \(\mathcal{R}\) can contain all possible values of \(p\), where \(p\) denotes the probability of getting a head.

Imagine we made an observation \(o\) and our interest is to find the probability that reason \(r\) is responsible for the observation. We can use Bayes' rule as follows:

\[ Pr[\mathcal{R} = r | \mathcal{O} = o] = \frac{Pr[\mathcal{O} = o | \mathcal{R} = r] Pr[\mathcal{R} = r]}{Pr[\mathcal{O} = o]} \]

This equation, when interpreted deeply, reflects the essence of scientific inquiry. We make observations in the world and form hypotheses that could explain them. Bayes' rule offers a scientific framework to determine which hypothesis is most probable based on the evidence. Isn't this precisely the goal of data analysis? We analyze data, which might hold hidden patterns that caused the observed results. The pursuit of analytics is to discover and quantify these hidden patterns.
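The coin example above can be made concrete with a short numeric sketch. Suppose (purely for illustration) that an observed head could have come from one of two reasons, a fair coin or a biased coin; Bayes' rule then weighs these reasons against each other:

```python
# Hypothetical numbers: two candidate "reasons" for observing a head --
# a fair coin or a biased coin -- with a prior belief over the reasons.
prior = {"fair": 0.5, "biased": 0.5}        # Pr[R = r]
likelihood = {"fair": 0.5, "biased": 0.9}   # Pr[O = head | R = r]

# Pr[O = head], obtained by summing over all reasons.
evidence = sum(prior[r] * likelihood[r] for r in prior)

# Posterior Pr[R = r | O = head] by Bayes' rule.
posterior = {r: prior[r] * likelihood[r] / evidence for r in prior}

print(posterior)  # the biased coin becomes the more probable reason
```

Note how the posterior reweights the equal prior using the evidence: the head is more consistent with the biased coin, so that reason gains probability.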

Defining Bayesian Learning

We assume that the observed data comes from an unknown data-generating distribution. Let \(\mathcal{D}\) denote the random variable that follows that unknown distribution. Let \(\mathcal{H}\) denote the random variable over the hypothesis set. In machine learning, we are interested in finding the most likely hypothesis that would result in the data. Thus, for a given dataset we can redefine machine learning as follows:

\[ \hat{h} = \underset{\mathcal{H}}{\operatorname{\arg \max}} ~~ Pr[\mathcal{H}|\mathcal{D} = D] \]

Conventionally, the Bayesian community adopts the notation \(\Theta\) in place of \(\mathcal{H}\); it denotes the space of parameters. We will use this notation for Bayesian learning. Using Bayes' rule,

\[ Pr[\Theta = \theta | \mathcal{D} = D] = \frac{Pr[\mathcal{D} = D | \Theta = \theta] Pr[\Theta = \theta]}{Pr[\mathcal{D} = D]} \]

Each of the probabilities in this equation has a special name.

  • \(Pr[\Theta = \theta | \mathcal{D} = D]\) is called the posterior probability.
  • \(Pr[\mathcal{D} = D | \Theta = \theta]\) is called the likelihood probability.
  • \(Pr[\Theta = \theta]\) is called the prior probability.
  • \(Pr[\mathcal{D} = D]\) is called the evidence probability.

The framework of learning itself does not make any assumptions about the probability distributions of either the parameters or the data. Various assumptions and restrictions on this general framework give rise to various models.
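Before adding any restrictions, the general framework can be sketched numerically for the coin example: place the parameter \(\theta\) (the probability of a head) on a grid and compute each of the four named probabilities. The dataset and grid below are assumptions for illustration.

```python
# Hypothetical dataset: 7 heads in 10 tosses of a coin.
n_heads, n_tails = 7, 3

grid = [i / 100 for i in range(101)]          # candidate theta values
prior = [1 / len(grid)] * len(grid)           # prior Pr[Theta = theta] (uniform here)
likelihood = [t**n_heads * (1 - t)**n_tails   # likelihood Pr[D | Theta = theta]
              for t in grid]
evidence = sum(p * l for p, l in zip(prior, likelihood))   # evidence Pr[D]
posterior = [p * l / evidence                 # posterior Pr[Theta = theta | D]
             for p, l in zip(prior, likelihood)]
```

The posterior sums to one over the grid, and with this uniform prior its peak coincides with the peak of the likelihood.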

Maximum Likelihood Estimation (ML)

Maximum likelihood estimation adds two restrictions to the general framework:

  • Parameters follow a uniform distribution - i.e. all parameter values are equally likely. Thus, \(Pr[\Theta = \theta]\) is a fixed constant value for all parameter values.
  • Since we are learning patterns from a single available dataset, \(Pr[\mathcal{D} = D]\) is a fixed constant for all hypotheses.

With these restrictions, the maximum likelihood estimation is defined as the following optimisation problem.

\[ \hat{\theta}_{ML} = \underset{\Theta}{\operatorname{\arg \max}} ~~ \ell(\theta) = \underset{\Theta}{\operatorname{\arg \max}} ~~ Pr[\mathcal{D} = D | \Theta = \theta] \]

For a given dataset, the likelihood becomes a function of the parameters, denoted \(\ell(\theta)\). Assuming that the dataset consists of an i.i.d. sample of \(n\) datapoints, we can compute the likelihood of the dataset as follows:

\[ Pr[\mathcal{D} = D | \Theta = \theta] = \prod_{i=1}^n Pr[\mathcal{D} = d_i | \Theta = \theta] \]
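Under the i.i.d. assumption, the likelihood of a coin-toss dataset is a product of per-point Bernoulli probabilities. A minimal sketch (the dataset below is hypothetical):

```python
import math

def bernoulli_likelihood(data, theta):
    """Pr[D | theta] as a product of per-point likelihoods.

    data: list of tosses, 1 = head, 0 = tail; theta = Pr[head].
    """
    return math.prod(theta if d == 1 else 1 - theta for d in data)

data = [1, 0, 1, 1]                     # hypothetical tosses
print(bernoulli_likelihood(data, 0.5))  # 0.5^4 = 0.0625
```

Evaluating this function for different values of \(\theta\) and keeping the maximiser is exactly the optimisation problem \(\hat{\theta}_{ML}\) above.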

Log-likelihood

Likelihood is a probability, a small value ranging from \(0\) to \(1\). To calculate the overall likelihood for a dataset, the i.i.d. assumption requires multiplying the likelihoods of individual data points. Multiplying these small numbers results in even smaller values. Since floating-point numbers can only be approximately represented in a computer, algorithms tend to lose precision as these values shrink. The logarithm is a monotonically increasing function, so maximising \(f(x)\) is the same as maximising \(\log{f(x)}\). Therefore, it is customary to maximise the logarithm of the likelihood function instead of the likelihood function itself. The log-likelihood function is denoted \(L(\theta)\).
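The precision argument can be demonstrated directly: a product of many small per-point likelihoods underflows to zero in floating point, while the sum of their logarithms stays finite. A sketch with synthetic values (not from the text):

```python
import math

# 200 i.i.d. datapoints, each with a per-point likelihood of 0.01.
probs = [0.01] * 200

likelihood = math.prod(probs)                    # 0.01**200 underflows to 0.0
log_likelihood = sum(math.log(p) for p in probs) # 200 * log(0.01), finite

print(likelihood)       # 0.0 -- all precision lost
print(log_likelihood)   # a finite, usable number
```

The true value \(0.01^{200} = 10^{-400}\) is below the smallest representable double, so the product collapses to zero while the log-likelihood remains perfectly well-behaved.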

Maximum A Posteriori Estimation (MAP)

Maximum a posteriori estimation lifts the restriction of the uniform prior distribution from maximum likelihood estimation. Thus, it is defined as the following optimisation problem:

\[ \hat{\theta}_{MAP} = \underset{\Theta}{\operatorname{\arg \max}} ~~ Pr[\mathcal{D} = D | \Theta = \theta] Pr[\Theta = \theta] \]

Although the assumption of a uniform prior simplifies the optimisation problem, it is often impractical. For instance, in the random experiment of tossing a coin, the uniform prior says that every value of \(p\), the probability of getting a head, between \(0\) and \(1\) (inclusive) is equally likely. In reality, \(p = 0.5\) is the most probable value. We can incorporate such beliefs by using MAP estimation.
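For the Bernoulli coin model with a Beta prior centred at \(0.5\), the MAP estimate has a well-known closed form, which makes the pull of the prior easy to see. The prior hyperparameters below are an assumption for illustration:

```python
def ml_estimate(heads, n):
    # ML for a Bernoulli model: just the observed fraction of heads.
    return heads / n

def map_estimate(heads, n, a=5, b=5):
    # MAP under a Beta(a, b) prior: the mode of the Beta posterior,
    # (heads + a - 1) / (n + a + b - 2). With a = b, belief is centred at 0.5.
    return (heads + a - 1) / (n + a + b - 2)

heads, n = 9, 10                 # hypothetical: 9 heads in 10 tosses
print(ml_estimate(heads, n))     # 0.9 -- trusts the data alone
print(map_estimate(heads, n))    # pulled toward 0.5 by the prior
```

With only ten tosses, the prior belief that coins are roughly fair drags the estimate well below the raw frequency; as \(n\) grows, the data dominate and the two estimates converge.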

Do you observe the duality?

Minimisation of loss function in the classical picture is synonymous with the maximisation of probability in the Bayesian picture.

Generative vs Discriminative Learning

The Bayesian framework of machine learning gives rise to an alternative way of classifying Bayesian models. Let \(X\) and \(Y\) denote the random variables of the features and labels in a dataset, respectively. Generative learning models the joint probability distribution \(P(X, Y)\), whereas discriminative learning models the conditional probability distribution \(P(Y|X)\).

Let us understand these approaches through the analogy of a child learning a language. In this case, \(Y\) are words whereas \(X\) are abstract concepts in the language (such as syntactic categories like nouns and verbs, or even semantic topics such as fruits, flowers, etc.).

Generative Method

  • Analogy: Imagine a child learning to speak by understanding the entire context of their surroundings and the language spoken around them. The child absorbs the full range of information: not just the words (output \(Y\)) and their meanings but also the various contexts and situations (input \(X\)) in which these words are used. Over time, the child builds a comprehensive model of how language works and can generate new sentences by mimicking the structure and context they have learned.
  • Process: The child learns to recognize that cat is associated with a small, furry animal often seen in the house (input) and can then say cat (output) when they see a similar animal.
  • Capability: Because the child understands the broader context, they can also generate new sentences, such as describing a new scene they have never encountered before by combining known elements in novel ways.

Discriminative Method

  • Analogy: Imagine a child learning to speak by focusing primarily on distinguishing between specific words and their meanings without necessarily understanding the full context. The child learns by repeatedly hearing and practicing phrases and sentences associated with specific outcomes.
  • Process: The child learns to say cat when shown a picture of a cat by repeatedly seeing images labeled as cat and being corrected or reinforced accordingly. The focus is on mapping specific inputs (pictures) directly to outputs (words).
  • Capability: The child becomes very good at correctly identifying and labeling objects they have seen before but might struggle to generate new sentences in unfamiliar contexts because their learning was more focused on direct associations rather than understanding the full context. For instance, a child trained on pictures of white cats may not be able to fathom the existence of black cats.

In summary, while a generative approach involves a deeper, more comprehensive learning process that enables the generation of new and contextually appropriate outputs, a discriminative approach focuses on accurately mapping inputs to outputs, often yielding better performance for specific classification tasks but with limited generative capabilities.

Since generative methods model the joint distribution, they learn both the conditional and the prior distributions1. Generative methods have a generative process at their heart. A template generative process can be written as follows:

  1. For each datapoint \((x, y)\):

    1.1 Sample the feature values \(x\) from the prior distribution \(P(X)\).

    1.2 For the sampled value, generate the label \(y\) as a sample from the conditional probability distribution \(P(Y | X = x)\).
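The two-step template above can be written as a runnable sketch; the discrete distributions below are toy values assumed purely for illustration:

```python
import random

random.seed(0)  # for reproducibility of the sketch

# Toy prior P(X) over abstract concepts and conditional P(Y | X) over words.
p_x = {"animal": 0.4, "fruit": 0.6}
p_y_given_x = {"animal": {"cat": 0.9, "mango": 0.1},
               "fruit": {"cat": 0.05, "mango": 0.95}}

def sample(dist):
    # Draw one outcome from a {value: probability} dictionary.
    return random.choices(list(dist), weights=list(dist.values()))[0]

dataset = []
for _ in range(5):                  # step 1: for each datapoint (x, y)
    x = sample(p_x)                 # step 1.1: sample x ~ P(X)
    y = sample(p_y_given_x[x])      # step 1.2: sample y ~ P(Y | X = x)
    dataset.append((x, y))
```

Running the loop produces joint samples \((x, y)\) distributed according to \(P(X, Y) = P(Y|X)P(X)\), which is exactly what a generative model captures.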


  1. The joint distribution can be computed using the product rule of probability as follows: \(P(X, Y) = P(Y | X) P(X)\)