Naive Bayes Classifier
Learning Objectives
After this unit, students should be able to:
- describe the naive Bayes classifier.
- list the assumptions in the naive Bayes classifier.
- analyse the results of a naive Bayes classifier.
Consider the following sample from a credit-card approval dataset.

Age | Education | Marital Status | Annual Income | Credit Approved |
---|---|---|---|---|
23 | Masters | Single | 75k | Yes |
35 | Bachelor | Married | 50k | No |
26 | Masters | Single | 70k | Yes |
41 | PhD | Single | 95k | Yes |
18 | Bachelor | Single | 40k | No |
55 | Masters | Married | 85k | No |
30 | Bachelor | Single | 60k | No |
35 | PhD | Married | 60k | Yes |
28 | PhD | Married | 65k | Yes |

We want to answer questions such as:
- Will a 24-year-old unmarried bachelor's degree holder with an annual income of 50k get the credit card approved?
- Will a 45-year-old single PhD holder with an annual income of 95k be eligible for the credit card?
To solve this problem, we can apply any of the linear classifiers we have described in previous chapters. However, since these classifiers require numerical input, we must preprocess the dataset by converting all categorical variables into numerical equivalents. When the dataset contains a large number of categorical variables, each with many distinct classes, this preprocessing can result in a high-dimensional dataset. In such cases, it is more practical to use a classifier that is more effective with categorical data.
The naive Bayes classifier is a probabilistic machine learning algorithm based on applying Bayes' theorem with a strong (naive) assumption of conditional independence between the features given the class label. It is widely used for classification tasks, particularly when the dimensionality of the input is high.
Definition
Consider a labeled dataset \(D = \{(x_i, y_i)\}\) of \(n\) points, where \(x_i\) is an \(m\)-vector of predictors. Each datapoint takes a label, \(y_i\), from a finite set of labels \(\mathcal{C} = \{c_1, c_2, ..., c_l\}\). The naive Bayes classifier uses Bayes' rule to assign a label to a datapoint: a datapoint \(\mathbf{x}\) is assigned the label that maximises the conditional probability \(Pr[y \mid \mathbf{x}]\). Mathematically,
\[
\hat{y} = \arg\max_{c \in \mathcal{C}} Pr[y = c \mid \mathbf{x}].
\]
Using Bayes' rule,
\[
Pr[y = c \mid \mathbf{x}] = \frac{Pr[\mathbf{x} \mid y = c]\, Pr[y = c]}{Pr[\mathbf{x}]} \propto Pr[\mathbf{x} \mid y = c]\, Pr[y = c],
\]
since the denominator \(Pr[\mathbf{x}]\) is the same for every class.
How should we compute \(Pr[\mathbf{x} | y = c ]\)?
Conditional Independence
The conditional independence assumption refers to the idea that two random variables are independent of each other given knowledge of a third variable. In other words, once you know the value of that third variable, the occurrence of one event provides no additional information about the occurrence of the other. Mathematically, for three random variables \(X\), \(Y\) and \(Z\), \(X\) and \(Y\) are said to be conditionally independent given \(Z\) if:
\[
Pr[X, Y \mid Z] = Pr[X \mid Z]\, Pr[Y \mid Z].
\]
For instance, consider a doctor performing a diagnosis. A patient describes two different symptoms, which may or may not appear related to each other. Both symptoms are caused by the disease the doctor is trying to diagnose. Given that the patient has the disease, each symptom is already explained by the disease, so the symptoms are conditionally independent of each other.
Conditional Independence versus Independence
Two random variables \(X\) and \(Y\) are said to be independent if the occurrence of one event provides no information about the occurrence of the other. In contrast, the variables are said to be conditionally independent if they become independent once the value of a third variable is known. Conditionally independent variables may or may not be independent in the absence of knowledge of the third variable.
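To make the distinction concrete, the following sketch (with made-up numbers for the doctor-and-symptoms example above) constructs a joint distribution in which two symptoms \(X\) and \(Y\) are conditionally independent given the disease \(Z\), and then checks that they are nevertheless dependent once \(Z\) is marginalised out.

```python
# Illustrative numbers only (not from any dataset): two symptoms X and Y that are
# generated independently given the disease status Z.
p_z = {True: 0.1, False: 0.9}              # Pr[Z = z]
p_x_given_z = {True: 0.8, False: 0.05}     # Pr[X = 1 | Z = z]
p_y_given_z = {True: 0.7, False: 0.10}     # Pr[Y = 1 | Z = z]

# Build the full joint distribution Pr[X, Y, Z] under conditional independence.
joint = {}
for z, pz in p_z.items():
    for x in (0, 1):
        for y in (0, 1):
            px = p_x_given_z[z] if x else 1 - p_x_given_z[z]
            py = p_y_given_z[z] if y else 1 - p_y_given_z[z]
            joint[(x, y, z)] = pz * px * py

# Given Z = True, Pr[X, Y | Z] factorises into Pr[X | Z] * Pr[Y | Z] ...
print(joint[(1, 1, True)] / p_z[True], p_x_given_z[True] * p_y_given_z[True])  # ~0.56 and 0.56

# ... but after marginalising Z out, X and Y are no longer independent.
pr_x = sum(joint[(1, y, z)] for y in (0, 1) for z in (True, False))
pr_y = sum(joint[(x, 1, z)] for x in (0, 1) for z in (True, False))
pr_xy = sum(joint[(1, 1, z)] for z in (True, False))
print(pr_xy, pr_x * pr_y)  # 0.0605 vs 0.02: unequal, so X and Y are marginally dependent
```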
Inference
The naive Bayes model assumes conditional independence among its predictors given the class label of the datapoint. Thus, for an \(m\)-feature vector, the naive Bayes model computes the probability in the following way:
\[
Pr[\mathbf{x} \mid y = c] = \prod_{i=1}^{m} Pr[x_i \mid y = c],
\qquad
\hat{y} = \arg\max_{c \in \mathcal{C}} Pr[y = c] \prod_{i=1}^{m} Pr[x_i \mid y = c].
\]
We can estimate the class probability \(Pr[y = c]\) using the class frequencies in the training dataset. In cases where the data is imbalanced, an explicit prior distribution can be supplied to account for the skewed distribution of datapoints.
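To make the inference concrete, below is a minimal hand-rolled sketch on the credit-card sample from the start of the unit, treating every predictor as categorical. The binning of Age and Income into rough categories, the feature ordering, and the Laplace smoothing constant `alpha` are illustrative assumptions, not part of the original dataset.

```python
# A minimal hand-rolled categorical naive Bayes on the credit-card sample above.
from collections import Counter, defaultdict

data = [  # (age bin, education, marital status, income bin, label) -- bins are illustrative
    ("young", "Masters", "Single", "high", "Yes"),
    ("middle", "Bachelor", "Married", "mid", "No"),
    ("young", "Masters", "Single", "high", "Yes"),
    ("middle", "PhD", "Single", "high", "Yes"),
    ("young", "Bachelor", "Single", "low", "No"),
    ("senior", "Masters", "Married", "high", "No"),
    ("middle", "Bachelor", "Single", "mid", "No"),
    ("middle", "PhD", "Married", "mid", "Yes"),
    ("young", "PhD", "Married", "mid", "Yes"),
]

def fit(rows):
    """Estimate the class prior and the per-feature conditional counts."""
    class_counts = Counter(row[-1] for row in rows)
    cond = defaultdict(Counter)        # (feature index, class) -> value counts
    values = defaultdict(set)          # feature index -> set of observed values
    for *features, label in rows:
        for i, value in enumerate(features):
            cond[(i, label)][value] += 1
            values[i].add(value)
    return class_counts, cond, values

def predict(x, class_counts, cond, values, alpha=1.0):
    """Return the class maximising Pr[y = c] * prod_i Pr[x_i | y = c]."""
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / n                # prior Pr[y = c]
        for i, value in enumerate(x):
            # Laplace-smoothed estimate of Pr[x_i = value | y = c]
            score *= (cond[(i, c)][value] + alpha) / (n_c + alpha * len(values[i]))
        scores[c] = score
    return max(scores, key=scores.get), scores

class_counts, cond, values = fit(data)
# Roughly the first motivating question: a young, single bachelor's degree
# holder with a mid-range income.
print(predict(("young", "Bachelor", "Single", "mid"), class_counts, cond, values))
```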
Types of Naive Bayes Classifiers
The inference equation for the naive Bayes model does not impose any assumptions about the distribution of individual predictors given the class label. Depending on whether the predictor is categorical or numerical, suitable assumptions regarding the data distributions can be made. Different types of naive Bayes classifiers are designed to handle various types of data accordingly.
Type | Description |
---|---|
Gaussian Naive Bayes | Continuous numerical predictors, each assumed to follow a normal distribution within each class. For instance, medical diagnosis based on continuous measurements. |
Multinomial Naive Bayes | Discrete count predictors, such as word frequencies. For instance, spam email detection based on word counts in the message. |
Hybrid Naive Bayes | Combination of continuous and categorical data. The analyst is free to choose a suitable distribution for each predictor. For instance, spam email detection based on both text and numerical features. |
Complement Naive Bayes | Designed for datasets that suffer from class imbalance, i.e. the number of examples of some class is much higher than that of the other classes. Instead of computing the probability of a datapoint belonging to a class, the model computes the probability of the datapoint not belonging to that class. |
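In practice, libraries such as scikit-learn ship these variants ready to use. Below is a minimal sketch assuming scikit-learn and NumPy are installed; the synthetic data and all parameter values are purely illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                     # binary class labels

# Gaussian NB: continuous predictors, one normal per class and feature.
X_cont = rng.normal(loc=y[:, None], scale=1.0, size=(200, 3))
print(GaussianNB().fit(X_cont, y).score(X_cont, y))

# Multinomial NB: non-negative count predictors, e.g. word counts in documents.
X_counts = rng.poisson(lam=2.0 + y[:, None], size=(200, 5))
print(MultinomialNB().fit(X_counts, y).score(X_counts, y))
```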
Naive Bayes is a generative classifier!
Bayesian models can be classified as generative or discriminative based on their learning technique (Unit 18). The naive Bayes classifier learns the joint probability \(P(X, Y)\): it learns both the prior probability (the class distribution) and the likelihood, i.e. the probability of observing the predictors given the class. Although naive Bayes classifiers are primarily used for classification, they can also be used to generate synthetic datapoints. To generate one datapoint (see the sketch after the steps below):
- Sample a class label, \(c\), from the class distribution (the prior).
- For each of the \(i \in \{1, 2, ..., m\}\) predictors, sample a value for \(x_i\) from \(P(X_i \mid Y = c)\).
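A minimal sketch of this generative procedure is given below, assuming a categorical naive Bayes model. The class prior and the two conditional tables are estimated from the sample dataset above, restricted to the Edu and Marital predictors for brevity; the helper `sample_point` is illustrative.

```python
# Generative use of a (categorical) naive Bayes model: sample the class from the
# prior, then each predictor from its class-conditional distribution.
import random

prior = {"Yes": 5 / 9, "No": 4 / 9}            # class distribution Pr[Y = c]
cond = {                                        # Pr[X_i = value | Y = c]
    ("Edu", "Yes"): {"Masters": 0.4, "PhD": 0.6},
    ("Edu", "No"): {"Bachelor": 0.75, "Masters": 0.25},
    ("Marital", "Yes"): {"Single": 0.6, "Married": 0.4},
    ("Marital", "No"): {"Single": 0.5, "Married": 0.5},
}

def sample_point(features=("Edu", "Marital")):
    # Step 1: sample the class label c from the prior.
    c = random.choices(list(prior), weights=list(prior.values()))[0]
    # Step 2: sample each predictor independently from Pr[X_i | Y = c].
    x = {f: random.choices(list(cond[(f, c)]), weights=list(cond[(f, c)].values()))[0]
         for f in features}
    return x, c

print(sample_point())   # e.g. ({'Edu': 'PhD', 'Marital': 'Single'}, 'Yes')
```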