Classification Metrics
Learning Objectives
After this unit, students should be able to
- state various classification evaluation metrics.
- reason qualitatively about the use of different metrics.
- choose an appropriate metric based on the use case.
In this unit, we will learn about various classification metrics used to evaluate classification models. These metrics serve the purpose of the goodness-of-fit function defined in the machine learning framework of Unit 16. Unless specified otherwise, we will focus on a binary classifier with two classes: the positive class and the negative class (typically encoded as \(1\) and \(0\), respectively).
Confusion Matrix
A confusion matrix is a table used to evaluate the performance of a classification model, particularly in supervised learning. It provides a comprehensive breakdown of how the model's predictions compare to the actual class labels. The confusion matrix has the following structure.
| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
- True positives (TP) refer to the positive instances that are correctly classified as positive.
- False negatives (FN) refer to the positive instances that are incorrectly classified as negative.
- True negatives (TN) refer to the negative instances that are correctly classified as negative.
- False positives (FP) refer to the negative instances that are incorrectly classified as positive.
The following evaluation metrics are defined using the confusion matrix.
| Name of the Metric | Formula | Description |
|---|---|---|
| Accuracy | \(\frac{TP + TN}{TP + FN + FP + TN}\) | The fraction of overall correct predictions. |
| Precision | \(\frac{TP}{TP + FP}\) | The fraction of positive-labeled datapoints that are actually positive. |
| Recall (Sensitivity or True Positive Rate) | \(\frac{TP}{TP + FN}\) | The fraction of actual positives that are accurately labeled. |
| Specificity (True Negative Rate) | \(\frac{TN}{TN + FP}\) | The fraction of actual negatives that are accurately labeled. |
| \(F1\) score | \(2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\) | The harmonic mean of precision and recall. |
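As a minimal sketch, the code below computes the metrics in the table directly from the four confusion-matrix counts. The function name and the example counts are illustrative, not part of the unit.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics in the table above from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity / true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

# Example with made-up counts: 40 TP, 10 FP, 5 FN, 45 TN
print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
```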
Why do we need so many metrics?
- Example 1. Consider a test to determine whether a person suffers from COVID. Misclassifying a COVID-positive person may be considered worse than misclassifying a healthy person, so we need a classifier with \(Recall > Precision\).
- Example 2. Consider an image search engine where I search for images with certain keywords. Let's say I am searching for images of cats. Showing an image of a dog is worse than not showing all possible images of cats, so we need a classifier with \(Precision > Recall\).
ROC Curve
For every datapoint, a probabilistic classifier outputs a probability distribution over the labels. Probabilistic classifiers therefore require thresholding in order to assign a label to each datapoint. For instance, the following table shows the output of a probabilistic binary classifier on a toy dataset. If the output of the classifier is less than the threshold, the label is \(0\); otherwise it is \(1\).
| Actual Label | Classifier output | Label with threshold \(0.5\) | Label with threshold \(0.43\) |
|---|---|---|---|
| \(1\) | \(0.45\) | \(0\) | \(1\) |
| \(0\) | \(0.30\) | \(0\) | \(0\) |
| \(0\) | \(0.55\) | \(1\) | \(1\) |
| \(0\) | \(0.25\) | \(0\) | \(0\) |
| \(1\) | \(0.35\) | \(0\) | \(0\) |
| \(1\) | \(0.55\) | \(1\) | \(1\) |
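The thresholding rule is easy to reproduce in code. The sketch below applies the two candidate thresholds from the table to the toy outputs and tallies the resulting confusion-matrix counts; the variable and function names are ours.

```python
# Toy dataset from the table: (actual label, classifier output)
data = [(1, 0.45), (0, 0.30), (0, 0.55), (0, 0.25), (1, 0.35), (1, 0.55)]

def apply_threshold(score, threshold):
    # Label is 1 if the classifier output is at least the threshold, else 0.
    return 1 if score >= threshold else 0

def confusion_counts(data, threshold):
    tp = fp = fn = tn = 0
    for actual, score in data:
        predicted = apply_threshold(score, threshold)
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

for t in (0.5, 0.43):
    print(t, confusion_counts(data, t))
```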
Each possible threshold value results in a different confusion matrix, which in turn affects the evaluation metrics. This leads to two important questions: which threshold should be selected, and is the classifier generally effective? These questions can be addressed using the ROC curve.
The ROC (Receiver Operating Characteristic) curve is a plot of the true positive rate (\(sensitivity\)) versus the false positive rate (\(1 - specificity\)) across different threshold values. By varying the decision threshold, the ROC curve shows how the model's ability to discriminate between positive and negative classes changes.
The adjacent figure displays a typical ROC plot. Each line represents a smoothed curve derived from varying the thresholds of a classifier and recording the corresponding performance metrics. The red dashed line indicates the performance of a random classifier, which assigns positive or negative labels with equal probability. Any effective classifier should outperform this random baseline. We expect our classifier's curve to fall above and to the left of the dashed line, indicating a higher true positive rate and a lower false positive rate.
AUC (Area Under Curve), also known as AUROC (Area Under ROC), is a comprehensive metric obtained from the ROC curve. It represents the area beneath the ROC curve of the classifier and ranges from \(0\) to \(1\). A random classifier has an AUC of \(0.5\). When comparing the overall performance of two classifiers, the one with the higher AUC is preferred. Thus, of the classifiers shown in the figure, the one with the blue curve is the best.
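To make the construction concrete, the sketch below sweeps thresholds over a set of scored examples, records the \((FPR, TPR)\) pairs, and approximates the AUC with the trapezoidal rule. All names here are illustrative, and the data reuses the toy table above.

```python
def roc_points(labels, scores):
    """Sweep every observed score as a threshold and return (FPR, TPR) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))
    return points

def auc(points):
    """Trapezoidal approximation of the area under the ROC curve."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

labels = [1, 0, 0, 0, 1, 1]
scores = [0.45, 0.30, 0.55, 0.25, 0.35, 0.55]
print(auc(roc_points(labels, scores)))
```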
Multi-class Evaluation
Although we can create a confusion matrix for a multi-class classifier and calculate its accuracy, most other evaluation metrics are designed specifically for binary classification. To achieve a more detailed evaluation of multi-class classification, we can use a one-vs-rest (or one-vs-all) confusion matrix. In this approach, a multi-class problem is broken down into multiple binary classification problems, where each class is compared against all other classes combined. For each class, a separate confusion matrix is created, treating that class as the positive class and all other classes as the negative class.
Consider a confusion matrix for a three-class classifier, where the rows denote the predicted labels and the columns denote the actual labels.
| | Class 2 | Class 1 | Class 0 |
|---|---|---|---|
| Class 2 | 8 | 6 | 0 |
| Class 1 | 3 | 12 | 1 |
| Class 0 | 4 | 2 | 14 |
We can convert this confusion matrix into three one-vs-rest confusion matrices, one for each class label. They are given below:
| | Class 2 | Others |
|---|---|---|
| Class 2 | 8 | 6 |
| Others | 7 | 29 |

| | Class 1 | Others |
|---|---|---|
| Class 1 | 12 | 4 |
| Others | 8 | 26 |

| | Class 0 | Others |
|---|---|---|
| Class 0 | 14 | 6 |
| Others | 1 | 29 |
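The conversion can be automated. The sketch below takes the 3×3 confusion matrix above (rows = predicted, columns = actual) and produces the per-class TP/FP/FN/TN counts shown in the one-vs-rest tables; the function name is ours.

```python
# 3x3 confusion matrix from above; rows = predicted, columns = actual,
# both ordered as [Class 2, Class 1, Class 0].
cm = [[8, 6, 0],
      [3, 12, 1],
      [4, 2, 14]]

def one_vs_rest(cm, k):
    """One-vs-rest counts for the class at row/column index k."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fp = sum(cm[k][j] for j in range(n)) - tp   # predicted k, actually another class
    fn = sum(cm[i][k] for i in range(n)) - tp   # actually k, predicted another class
    tn = total - tp - fp - fn
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

for name, k in [("Class 2", 0), ("Class 1", 1), ("Class 0", 2)]:
    print(name, one_vs_rest(cm, k))
```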
Micro and macro averaging are techniques used to aggregate one-vs-rest matrices across multiple classes. These approaches summarize the performance of a classifier by providing a single overall metric.
- Micro averaging aggregates the contributions of all classes to compute a global performance metric. It treats each instance equally, regardless of its class, and calculates the overall metric by summing up the individual true positives, false positives, false negatives, and true negatives.
- Macro averaging calculates the performance metric for each class individually and then takes the average of these metrics. It treats each class equally, regardless of the number of instances in each class.
For instance, we can compute the micro-averaged and macro-averaged precision for the one-vs-rest confusion matrices in the earlier example as follows:
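Using the per-class counts from the one-vs-rest matrices above (TPs of \(8\), \(12\), \(14\) and FPs of \(6\), \(4\), \(6\)), one way to carry out the computation is:

\[
\text{Micro-averaged precision} = \frac{TP_2 + TP_1 + TP_0}{(TP_2 + TP_1 + TP_0) + (FP_2 + FP_1 + FP_0)} = \frac{8 + 12 + 14}{(8 + 12 + 14) + (6 + 4 + 6)} = \frac{34}{50} = 0.68
\]

\[
\text{Macro-averaged precision} = \frac{1}{3}\left(\frac{8}{8+6} + \frac{12}{12+4} + \frac{14}{14+6}\right) = \frac{1}{3}\left(0.571 + 0.750 + 0.700\right) \approx 0.674
\]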
What is the difference between micro and macro averaging?
Micro-averaging treats all instances equally, whereas macro-averaging treats all classes equally. Micro-averaging is typically dominated by the majority class; thus, macro-averaging is more suitable when class imbalance is present and every class is equally important.