Bayesian Modelling of Bernoulli Trials
Learning Objectives
After this unit, students should be able to:
- formulate a Bayesian model for real-world events.
- perform ML estimation for Bernoulli trials.
- perform MAP estimation for Bernoulli trials.
- explain the need for conjugate priors.
We have learnt about Bernoulli trials in Unit 7. Bernoulli trials model random experiments with exactly two outcomes. Suppose you have a dataset that records observations from \(n\) independent Bernoulli trials. Our goal is to estimate the parameter \(\theta\), the probability of the "successful" outcome, that would have generated the observed dataset. For example:
- A coin is tossed \(n\) times and the dataset contains a sequence of \(H\)s and \(T\)s (heads and tails). We want to estimate the probability of observing a head for this coin.
- In a manufacturing process, every manufactured product is checked for defects. The dataset comprises the observations for a batch of \(n\) products. We want to estimate the effectiveness of the plant, i.e., the probability that a product is not defective.
- In a pre-election survey conducted by a political party, responses have been recorded as yes or no. The party wants to estimate the underlying probability that a person will vote for it.
ML Estimation
Without loss of generality, outcomes of any Bernoulli trial (\(Bernoulli(\theta)\)) can be mapped to \(0\) and \(1\), where \(1\) signifies the successful event. The dataset of \(n\) Bernoulli trials thus consists of a stream of \(1\)s and \(0\)s. Let \(n_1\) denote the number of \(1\)s in the dataset. If we assume a uniform prior, all values of the parameter \(\theta\) between \(0\) and \(1\) are equally likely. We can write the likelihood of such a dataset as follows:
\[
\ell_D(\theta) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{n_1} (1-\theta)^{n-n_1},
\]
where \(x_i \in \{0, 1\}\) denotes the outcome of the \(i\)-th trial.
By differentiating the likelihood function with respect to \(\theta\), setting the derivative to zero, and solving for \(\theta\), we obtain the ML estimate as follows:
\[
\hat{\theta}_{ML} = \frac{n_1}{n}
\]
The ML estimate equals the fraction of successful events in the dataset. What if the available dataset does not contain any successful outcome? If a dataset of coin tosses does not contain a single \(H\), is it accurate to infer that the probability of observing a head is \(0\)?
Black swan paradox
If a person has never spotted a black swan in their lifetime, is "black swans do not exist" an accurate inference? A black swan refers to a statistically unexpected or rare event.
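To make the computation concrete, here is a minimal sketch of the ML estimate in Python (the dataset values are hypothetical). It also shows how the estimate collapses to \(0\) when the sample happens to contain no successes at all.

```python
import numpy as np

# Hypothetical outcomes of n = 10 coin tosses, encoded as 1 (head) and 0 (tail).
data = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])

# ML estimate: fraction of successful events, n1 / n.
theta_ml = data.sum() / len(data)
print(f"ML estimate: {theta_ml:.2f}")        # 0.60

# A sample with no heads at all drives the ML estimate to exactly 0.
no_heads = np.zeros(10, dtype=int)
print(f"ML estimate (no heads): {no_heads.sum() / len(no_heads):.2f}")   # 0.00
```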
Prior Distribution
The use of a prior distribution in Bayesian estimation protects us from this kind of sampling bias: we can incorporate domain knowledge through the prior distribution.
Conjugate Prior. To simplify the mathematics of MAP estimation, the prior probability distribution is selected such that the posterior distribution belongs to the same family as the prior. We call such a prior distribution conjugate to the likelihood. There are some well-known conjugate pairs of distributions. For example:
- Normal likelihood (with known variance) and a normal prior on the mean yield a normal posterior distribution.
- Poisson likelihood and gamma prior distribution yield a gamma posterior distribution.
- Binomial likelihood and beta prior distribution yield a beta posterior distribution.
Beta Distribution
The beta distribution is a continuous probability distribution used to model a random variable taking values in the interval \([0, 1]\). Parametrised by two positive numbers \(a\) and \(b\), its probability density function is given as follows:
\[
f(x; a, b) = \frac{x^{a-1} (1-x)^{b-1}}{B(a, b)}, \qquad 0 \le x \le 1,
\]
where \(B(a, b)\) is the beta function, which acts as the normalising constant.
The following plot shows the shape of the density function for various values of parameters \(a\) and \(b\).
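These shapes are easy to reproduce numerically. Below is a minimal sketch using SciPy and Matplotlib; the particular \((a, b)\) pairs are illustrative choices.

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 500)

# Illustrative (a, b) pairs: uniform, symmetric around 0.5, mass below 0.5, mass above 0.5.
for a, b in [(1, 1), (2, 2), (2, 5), (5, 2)]:
    plt.plot(x, beta.pdf(x, a, b), label=f"a={a}, b={b}")

plt.xlabel("x")
plt.ylabel("density")
plt.title("Beta density for various values of a and b")
plt.legend()
plt.show()
```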
MAP Estimation
We know that the number of successes in \(n\) independent Bernoulli trials follows the binomial distribution. Therefore, we can use the beta distribution as a conjugate prior. Mathematically,
\[
P(\theta \mid D) \propto \ell_D(\theta)\, P(\theta) \propto \theta^{n_1} (1-\theta)^{n-n_1} \cdot \theta^{a-1} (1-\theta)^{b-1} = \theta^{n_1 + a - 1} (1-\theta)^{n - n_1 + b - 1},
\]
so the posterior distribution is \(Beta(n_1 + a,\; n - n_1 + b)\).
By differentiating the posterior density with respect to \(\theta\), setting the derivative to zero, and solving for \(\theta\), we obtain the MAP estimate as follows:
\[
\hat{\theta}_{MAP} = \frac{n_1 + a - 1}{n + a + b - 2}
\]
Provided \(a, b > 1\), the MAP estimate does not equal \(0\) even when the dataset does not contain any successful events.
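A minimal sketch of the MAP computation (with hypothetical data and a \(Beta(2, 2)\) prior as an illustrative choice) shows that the estimate stays away from \(0\) even for a sample with no successes.

```python
import numpy as np

def map_estimate(data, a, b):
    """MAP estimate of theta under a Beta(a, b) prior: (n1 + a - 1) / (n + a + b - 2)."""
    n, n1 = len(data), int(np.sum(data))
    return (n1 + a - 1) / (n + a + b - 2)

# Hypothetical sample of 10 trials with no successful outcome at all.
data = np.zeros(10, dtype=int)

print(map_estimate(data, a=2, b=2))   # 1/12 ~ 0.083, not 0
print(data.sum() / len(data))         # ML estimate: 0.0
```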
How do we use the beta distribution to encode prior knowledge? The following observations can be made by looking at the shape of the density function of the beta distribution for various values of the parameters.
- For \(a=b=1\), all values of \(x\) between \(0\) and \(1\) are equally likely.
- For \(a = b > 1\), we observe higher probability for values of \(x\) in the vicinity of \(0.5\).
- For \(a < b\), we observe higher probability for values of \(x < 0.5\).
- For \(a > b\), we observe higher probability for values of \(x > 0.5\).
We can encode our beliefs in the values of \(a\) and \(b\). For instance, if we know from historical data that the machine produces a non-defective product \(80\%\) of the time, we can set \(a = 80, b = 20\), so that the prior mean \(a/(a+b)\) equals \(0.8\).
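As a sketch of how such a prior combines with data (the batch counts below are hypothetical), the conjugacy result above gives the posterior \(Beta(a + n_1,\; b + n - n_1)\) directly.

```python
from scipy.stats import beta

# Historical belief: the machine produces a good item about 80% of the time.
a, b = 80, 20                 # prior mean a / (a + b) = 0.8

# Hypothetical new batch: n = 50 products, of which n1 = 35 are good.
n, n1 = 50, 35

# Conjugacy: beta prior + binomial likelihood => beta posterior.
posterior = beta(a + n1, b + n - n1)
print(posterior.mean())                    # ~0.767, pulled from 0.8 towards 35/50 = 0.7
print((n1 + a - 1) / (n + a + b - 2))      # MAP estimate ~0.770
```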
Training versus Tuning!
For a given dataset, the likelihood \(\ell_D(\theta)\) is a function of the parameters of the model. The parameters of the prior distribution are referred to as hyperparameters in a Bayesian model. In machine learning, the goal is to optimize an objective function (by minimizing a loss function, maximizing the posterior, or maximizing the likelihood) to estimate the parameters, a process known as parameter training. Hyperparameters represent prior beliefs and are typically assigned by an expert. In the absence of an expert, hyperparameters are often chosen from a predefined set of values, a process known as hyperparameter tuning.
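As a rough sketch of what such tuning could look like in this setting, the snippet below picks \((a, b)\) from a predefined candidate set by scoring each candidate on held-out data. The train/validation split and the scoring criterion are illustrative assumptions, not prescribed by this unit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical train/validation split of Bernoulli outcomes.
train = rng.binomial(1, 0.3, size=40)
valid = rng.binomial(1, 0.3, size=20)

def heldout_log_likelihood(a, b):
    """Score a candidate (a, b): log-likelihood of the validation data
    under the posterior-mean estimate of theta (an illustrative criterion)."""
    n, n1 = len(train), train.sum()
    theta = (a + n1) / (a + b + n)            # mean of Beta(a + n1, b + n - n1)
    k = valid.sum()
    return k * np.log(theta) + (len(valid) - k) * np.log(1 - theta)

# Hyperparameter tuning: choose from a predefined set of candidate values.
candidates = [(1, 1), (2, 2), (5, 5), (80, 20)]
best = max(candidates, key=lambda ab: heldout_log_likelihood(*ab))
print("selected hyperparameters:", best)
```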