
Logistic Regression

Learning Objectives

After this unit, students should be able to

  • appreciate the use of the sigmoid function.
  • describe the probabilistic interpretation of a classifier.
  • describe the use of the gradient descent technique.
  • describe the binary logistic regression model.

Logistic regression is a widely used classification model that models the log-odds as a linear function of the predictors in the data. Unlike the perceptron and SVM, the output of the logistic regression model is a value between \(0\) and \(1\), which provides a richer probabilistic interpretation of the classifier. In this unit, we focus on binary logistic regression for the sake of simplicity. Towards the end, we outline how to extend these ideas to the multinomial logistic regression model.

Activation Function

Figure: a neuron computing \(\mathbf{w}^T\mathbf{x}\) followed by an activation function.

We can represent machine learning models using the neural-network symbolism shown in the adjoining figure (neural networks are covered in future units). Every neuron computes a weighted linear combination of its inputs, \(\mathbf{w}^T\mathbf{x}\). The output of the neuron is, optionally, passed through an activation function. Activation functions introduce non-linearity, allowing the model to capture more realistic patterns. Consider the following examples of activation functions:

 

  • Identity Function. The identity function is a trivial activation function that simply outputs the weighted combination as is. Thus the adjoining figure with the identity function as the activation represents the linear regression model.

  • Step Function. The step function maps a real number to one of two values (\(\{0, 1\}\) or \(\{-1, 1\}\)) based on a threshold. If the threshold equals zero, the adjoining figure with the step function as the activation represents the perceptron model (see the sketch after this list).
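
As a concrete illustration, here is a minimal NumPy sketch of a single neuron with the identity and step activations; the helper names, weights, and inputs are ours and purely illustrative:

```python
import numpy as np

def neuron(w, x, activation):
    """Weighted linear combination w^T x followed by an activation function."""
    return activation(w @ x)

identity = lambda z: z                     # linear regression output
step = lambda z: 1.0 if z >= 0 else 0.0    # perceptron output (threshold at zero)

w = np.array([0.5, -1.0, 2.0])   # weights; w[0] acts as the bias since x[0] = 1
x = np.array([1.0, 0.3, 0.8])    # input with a leading 1 for the bias term

print(neuron(w, x, identity))    # real-valued output: ~1.8 (linear regression)
print(neuron(w, x, step))        # thresholded output: 1.0 (perceptron)
```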

The identity function fails to capture non-linear patterns, whereas the step function poses analytical difficulties in the computation of the parameters (its non-differentiability offers no convergence guarantees for the solution). Let us study the activation function used by the logistic regression model.

Figure: the sigmoid function.

The sigmoid function, \(\sigma: \mathbb{R} \rightarrow (0, 1)\), is defined as below:

\[ h_\mathbf{w}(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{\exp(\mathbf{w}^T\mathbf{x})}{1 + \exp(\mathbf{w}^T\mathbf{x})} \]

As the magnitude of the weights increases, the sigmoid function approaches the step function. The sigmoid activation provides two advantages. Firstly, its differentiability alleviates the analytical difficulties in computing the parameters. Secondly, its output lying in the range \((0, 1)\) opens the door for a probabilistic interpretation of the model.
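
A minimal NumPy sketch of the sigmoid (the helper name and sample inputs are ours) that also shows how scaling the argument pushes the output towards a step:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: exp(z) / (1 + exp(z)) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))        # ~[0.119, 0.5, 0.881]
print(sigmoid(10 * z))   # ~[0.0, 0.5, 1.0] -- approaches the step function
```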

Definition

Consider a labeled dataset \(D = \{(\mathbf{x}_i, y_i)\}_{i=1}^n\) of \(n\) points where \(\mathbf{x}_i \in \mathbb{R}^m\) and \(y_i \in \{0, 1\}\). Let \(p_i\) denote the probability of the \(i\)-th datapoint being labeled \(1\). The binary logistic regression problem works on the following hypothesis:

\[ \mathbf{p} = \sigma(X\mathbf{w}) = \frac{\exp(X\mathbf{w})}{1 + \exp(X\mathbf{w})} \]

where \(X \in \mathbb{R}^{n \times (m+1)}\) is the data matrix (with a leading column of ones for the bias term), \(\mathbf{w} \in \mathbb{R}^{m+1}\) denotes the vector of parameters, and the sigmoid is applied elementwise. The actual labels for the datapoints are assigned by thresholding \(p_i\): if \(p_i\) is smaller than a threshold \(\alpha\), the datapoint is assigned the label \(0\); otherwise it is assigned the label \(1\).
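
A minimal sketch of this hypothesis in NumPy, assuming the data matrix already carries a leading column of ones for the bias term (all names and values are illustrative):

```python
import numpy as np

def predict_proba(X, w):
    """p_i = sigma(x_i^T w) for every row x_i of the data matrix X."""
    return 1.0 / (1.0 + np.exp(-X @ w))

def predict_label(X, w, alpha=0.5):
    """Assign label 1 when p_i >= alpha, and label 0 otherwise."""
    return (predict_proba(X, w) >= alpha).astype(int)

X = np.array([[1.0, 0.2],
              [1.0, 1.5],
              [1.0, -0.7]])      # first column of ones is the bias term
w = np.array([-0.5, 2.0])        # parameters [w_0, w_1]

print(predict_proba(X, w))       # ~[0.475, 0.924, 0.130]
print(predict_label(X, w))       # [0 1 0]
```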

Maximum Likelihood Estimation

The use of the sigmoid function provides a probabilistic interpretation to the classification problem. For a binary classification problem, every datapoint can be considered an independent Bernoulli trial with parameter \(p_i\). Therefore, the likelihood over the entire dataset can be written as follows:

\[ L(y \mid X, \mathbf{w}) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1-y_i} \]

Taking logarithm on both sides,

\[ \ell_D(\mathbf{w}) = \sum_{i=1}^n y_i \log{p_i} + (1-y_i) \log{(1-p_i)} \]

Thus, logistic regression can be defined as the maximisation of the log-likelihood given by the above equation. The negation of this log-likelihood function is known as the cross-entropy loss; therefore, logistic regression can also be defined as the minimisation of the cross-entropy loss. Plugging \(p_i = \sigma(\mathbf{w}^T\mathbf{x}_i)\) into the equation:

\[ \ell_D(\mathbf{w}) = \sum_{i=1}^n y_i \log{\sigma(\mathbf{w}^T\mathbf{x}_i)} + (1-y_i) \log{(1-\sigma(\mathbf{w}^T\mathbf{x}_i))} \]

In the absence of a closed-form solution, and thanks to the differentiability of the sigmoid function, we can use the gradient descent technique to solve this optimisation problem.
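
For concreteness, here is a small NumPy sketch of the cross-entropy loss defined above; the helper name, the toy data, and the small guard against \(\log 0\) are ours:

```python
import numpy as np

def cross_entropy_loss(w, X, y):
    """Negative log-likelihood (cross-entropy) of labels y under parameters w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # p_i = sigma(w^T x_i)
    eps = 1e-12                        # guard against log(0) at the boundaries
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

X = np.array([[1.0, 0.2],
              [1.0, 1.5],
              [1.0, -0.7]])
y = np.array([0, 1, 0])
w = np.array([-0.5, 2.0])
print(cross_entropy_loss(w, X, y))     # ~0.86
```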

Gradient Descent Algorithm

Gradient descent is an optimisation algorithm used to minimise the loss function of machine learning and deep learning models. It is a fundamental technique for training models, particularly neural networks and linear models. The basic idea is to iteratively adjust the model's parameters (weights and biases) to find the values that minimise the loss function, thereby improving the model's performance.

Let's say we want to find \(\theta\) that minimises the function \(\ell(\theta)\).

  • Randomly initialise \(\theta \leftarrow \theta_0\).
  • Update the parameters in the opposite direction of the gradient of the function. The learning rate \(\alpha\) determines the size of the steps taken towards the minimum.
\[ \theta_{t+1} \leftarrow \theta_t - \alpha \frac{\partial \ell}{\partial \theta} \]
  • Repeat the procedure until the loss function converges to a minimum value (or does not change significantly across consecutive iterations).

Please refer to these lecture notes for further details on gradient descent.
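
Putting the pieces together, here is a minimal gradient-descent sketch for logistic regression. It uses the standard gradient of the cross-entropy loss, \(X^T(\mathbf{p} - \mathbf{y})\) with \(\mathbf{p} = \sigma(X\mathbf{w})\); all names, hyperparameters, and the toy data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Minimise the cross-entropy loss by gradient descent.

    The gradient of the loss with respect to w is X^T (p - y),
    where p = sigma(X w).
    """
    w = np.zeros(X.shape[1])            # initialise the parameters
    for _ in range(n_iters):
        p = sigmoid(X @ w)              # current predicted probabilities
        grad = X.T @ (p - y)            # gradient of the cross-entropy loss
        w -= lr * grad                  # step opposite to the gradient direction
    return w

# Toy dataset: label 1 whenever the single feature exceeds (roughly) 1.
X = np.array([[1.0, 0.2],
              [1.0, 0.8],
              [1.0, 1.3],
              [1.0, 2.0]])              # leading column of ones for the bias
y = np.array([0, 0, 1, 1])

w = fit_logistic_regression(X, y)
print(w)                                # learned parameters [w_0, w_1]
print(sigmoid(X @ w) >= 0.5)            # [False False  True  True]
```

The learning rate and iteration count here are arbitrary; in practice one monitors the loss across iterations and stops when it no longer changes significantly.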

Analysis

If we rearrange the hypothesis equation of the logistic regression, we obtain:

\[ \log \left(\frac{p_i}{1-p_i}\right) = \mathbf{w}^T\mathbf{x}_i = w_0 + w_1x_{i1} + \dots + w_mx_{im} \]

We can interpret this equation as a linear regression model, except that the response is not the label itself but the logarithm of the odds (the log-odds).
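
A quick numerical sanity check of this identity, with illustrative weights and input:

```python
import numpy as np

w = np.array([-1.0, 2.0, 0.5])          # parameters [w_0, w_1, w_2]
x = np.array([1.0, 0.4, 1.2])           # input with a leading 1 for the bias term

p = 1.0 / (1.0 + np.exp(-(w @ x)))      # p = sigma(w^T x)
print(np.log(p / (1 - p)))              # log-odds: ~0.4
print(w @ x)                            # w^T x:    ~0.4
```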

Odds Ratio

The odds ratio (OR) is a measure of association between two events, used particularly in the context of binary outcomes and categorical data. It compares the odds of an event occurring in one group to the odds of it occurring in another group. Mathematically, if \(p\) denotes the probability of an event happening, then the odds for the event are computed as \(p/(1-p)\). For instance, the odds of drawing a spade from a pack of cards are \(1:3\). The odds ratio is interpreted as follows:

  • If the odds ratio equals \(1\), it indicates a lack of any association between the two events.
  • If the odds ratio is greater than \(1\), it indicates that the event in question is more likely to happen.
  • If the odds ratio is less than \(1\), it indicates that the event in question is less likely to happen.

Logistic regression models the log-odds of the response as a linear combination of the predictors, and the coefficients of the model can be exponentiated to yield odds ratios, which are easier to interpret. For instance, in the earlier equation, the odds ratio for the \(j\)-th predictor is computed as \(\exp(w_j)\).

Consider an example wherein we train a logistic regression model on a graduate-admissions dataset. We want to predict whether a person is accepted into the program based on gender (encoded as \(0\) for male and \(1\) for female) and the score in the entrance exam. The fitted model is as follows:

\[ \log\left(\frac{p}{1-p}\right) = -1.34 -0.56(Gender) + 0.05(Score) \]
  • The odds ratio for Gender is \(e^{-0.56} = 0.57\). Holding the score fixed, females are less likely to be accepted into the graduate program: their odds of acceptance are about \(0.57\) times those of males.
  • The odds ratio for Score is \(e^{0.05} = 1.05\). For each extra point in the exam, the odds of being accepted into the graduate program increase by approximately \(5\%\).
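
The odds ratios quoted above are simply the exponentiated coefficients; a quick check with the values from the fitted equation:

```python
import numpy as np

w_gender, w_score = -0.56, 0.05
print(np.exp(w_gender))   # ~0.571: odds of acceptance for females relative to males
print(np.exp(w_score))    # ~1.051: multiplicative change in odds per extra point
```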