
Support Vector Machines

Learning Objectives

After this unit, students should be able to

  • describe the logic of the perceptron.
  • describe the logic of support vector machines.
  • describe the logic of logistic regression.
  • describe the role of the kernel trick.

We can easily convert a classification problem into a regression problem by encoding the class labels as numbers. Consider data about tumors, each described by its size and malignancy (a binary feature). We can encode the malignancy as \(0\) and \(1\) and fit a linear regression model to the data. A sample is shown in the figure below.

linear_classifier

The main issue with using a regression model for classification is depicted in the figure on the right: the fitted line can be extremely sensitive to the feature values within a single class, so a few points with very large sizes can drag the fit and shift the implied decision threshold even though the class labels do not change.
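As a rough illustration, the sketch below fits ordinary least squares to made-up one-dimensional tumor sizes with 0/1 labels and reads off the size at which the fit crosses \(0.5\). Adding a single very large malignant tumor flattens the line and moves that threshold, even though the new point is easy to classify.

```python
import numpy as np

# Hypothetical 1-D tumor sizes and binary malignancy labels (0 = benign, 1 = malignant).
sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def decision_boundary(x, y):
    """Fit y ~ a*x + b by least squares and return the size where the fit crosses 0.5."""
    a, b = np.polyfit(x, y, deg=1)
    return (0.5 - b) / a

print(decision_boundary(sizes, labels))          # boundary sits between the two classes

# One very large malignant tumor flattens the fit and moves the boundary to the right,
# so a previously well-classified malignant point now falls on the wrong side.
sizes2 = np.append(sizes, 20.0)
labels2 = np.append(labels, 1)
print(decision_boundary(sizes2, labels2))
```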

Perceptron Model

In this section we will focus on a classic first-principles binary classifier known as the perceptron model. Let us formally define the problem.

Given a labeled dataset \(D = \{(x_i, y_i)\}\) of \(n\) points, where \(x_i \in \mathbb{R}^m\) and \(y_i \in \{-1, 1\}\), we want to find a weight vector \(\mathbf{w}^* \in \mathbb{R}^{m+1}\) (the extra dimension absorbs the bias term, with every \(x_i\) augmented by a constant \(1\)) such that:

\[ y_i = \operatorname{sign}(\mathbf{w}^T x_i) \]

where \(\operatorname{sign}(\cdot)\) returns the sign of its argument. We want to find \(\mathbf{w}\) that minimises the misclassification error, defined as follows:

\[ \ell_D(\mathbf{w}) = \frac{1}{n} \sum_i \mathbb{I}(y_i \neq \hat{y_i}) \]

\(\mathbb{I}(\cdot)\) in the equation is an indicator function that equals \(1\) if the condition is true and \(0\) otherwise. As explained in Unit 16, every machine learning model is characterised by a hypothesis, which in our case is a linear function, and a goodness-of-fit function, which here is the misclassification error.
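For concreteness, this loss can be computed with a few lines of NumPy (a minimal sketch, assuming the bias is absorbed into \(\mathbf{w}\) as above):

```python
import numpy as np

def misclassification_error(w, X, y):
    """Fraction of points whose predicted sign disagrees with the label.

    X is an (n, m+1) array with a trailing column of ones for the bias term,
    y holds labels in {-1, +1}, and w is the weight vector.
    """
    y_hat = np.sign(X @ w)
    return np.mean(y_hat != y)
```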

The learning process for the linear separator is based on fundamental concepts from coordinate geometry. Imagine a simple two-dimensional dataset with two classes: blue dots and red squares. We begin with a randomly chosen plane, which is defined by its perpendicular vector, \(\mathbf{w}\). The perceptron classifies a data point \(\mathbf{x}\) by determining the sign of \(\mathbf{w}^T\mathbf{x}\). If the result is positive, the data point lies on the side of the plane in the direction of the perpendicular. This scenario is illustrated in the leftmost plot in the figure below.

perceptron

The perpendicular vector is updated as follows. Each point is classified based on the sign of \(\mathbf{w}^T\mathbf{x}\). If a data point is correctly classified, the perpendicular vector remains unchanged. For every incorrectly classified data point \((\mathbf{x}_i, y_i)\), the perpendicular vector is nudged towards the correct side by setting \(\mathbf{w} \leftarrow \mathbf{w} + y_i\mathbf{x}_i\); for the red square in the figure (a negative example on the positive side), this amounts to subtracting the data point from \(\mathbf{w}\). This adjustment is shown in the middle plot of the figure. Our goal is to have all data points on the side of the plane in the direction of the perpendicular be blue dots. In the current state, the plane incorrectly classifies a red square. After updating, as shown in the rightmost plot, the plane correctly classifies the red square. This process is repeated until all points are correctly classified.
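The update rule can be written in a few lines of NumPy. The sketch below is one minimal implementation on a made-up, linearly separable toy dataset, with the bias absorbed into \(\mathbf{w}\) by appending a constant \(1\) to each input:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron learning: X is (n, d) with a trailing column of ones,
    y contains labels in {-1, +1}. Returns the learned weight vector."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:     # misclassified (or exactly on the plane)
                w = w + y_i * x_i        # nudge the plane towards the correct side
                errors += 1
        if errors == 0:                  # every point classified correctly: stop
            break
    return w

# Toy 2-D example: two well-separated clusters (blue dots = +1, red squares = -1).
rng = np.random.default_rng(0)
blue = rng.normal([2, 2], 0.5, size=(20, 2))
red = rng.normal([-2, -2], 0.5, size=(20, 2))
X = np.vstack([blue, red])
X = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 for the bias term
y = np.array([1] * 20 + [-1] * 20)

w = perceptron(X, y)
print(np.mean(np.sign(X @ w) == y))            # expected: 1.0 on separable data
```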

Is the perceptron algorithm guaranteed to converge?

The perceptron algorithm will only converge within a finite number of steps if the data is linearly separable. In practice, an upper limit on the number of updates is often set to stop the algorithm, but there is no assurance that the algorithm will find an optimal solution within that limit. For data that is inherently non-linearly separable, more advanced models or data transformations are necessary to achieve linear separability.

Support Vector Machine (SVM)

If the data is linearly separable, there may exist multiple linear separators. Some of them are shown in the following figure. Which of them should be considered the best linear separator?

separators

Support vector machines (SVM), in their most basic form, are linear classifiers. They choose the linear separator that maximises the margin between the separator and the classes. The margin is defined by the distance between the hyperplane and the nearest data points from each class, which are known as support vectors. The larger the margin, the better the generalization ability of the model.

We will not focus on the full derivation of the SVM model; instead, we provide a sketch to build intuition behind SVM². The following figure depicts a two-dimensional view of the linear separator.

SVM

Datapoints that lie on an \((m-1)\)-dimensional hyperplane satisfy \(\mathbf{w}^T\mathbf{x} - b = 0\), where \(\mathbf{w}\) is a vector perpendicular to the plane (ref. Equation of a plane). In agreement with the hypothesis of the linear separator,

\[ \begin{aligned} \mathbf{w}^T\mathbf{x} - b &\geq 1 && ~~~~~\text{Points on one side of the plane belong to class } 1 \\ \mathbf{w}^T\mathbf{x} - b &\leq -1 && ~~~~~\text{Points on the other side of the plane belong to class } -1 \\ \end{aligned} \]

Equality holds for the points closest to the separator, i.e. the support vectors, which lie on the two margin planes \(\mathbf{w}^T\mathbf{x} - b = \pm 1\).

Thus, the constraint for the accurate classification of a datapoint \((\mathbf{x}_i, y_i)\) can be succinctly written as \(y_i(\mathbf{w}^T\mathbf{x}_i - b) \geq 1\). The distance between the two margin planes, i.e. the margin, is \(2/\lVert \mathbf{w} \rVert\), where \(\lVert \cdot \rVert\) denotes the norm (magnitude) of the vector (ref. distance between two planes).
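To see why, take any two points \(\mathbf{x}_+\) and \(\mathbf{x}_-\) lying on the two margin planes and project their difference onto the unit normal \(\mathbf{w}/\lVert \mathbf{w} \rVert\):

\[ \begin{aligned} \mathbf{w}^T\mathbf{x}_+ - b &= 1, \qquad \mathbf{w}^T\mathbf{x}_- - b = -1 \\ \mathbf{w}^T(\mathbf{x}_+ - \mathbf{x}_-) &= 2 \\ \text{margin} = \frac{\mathbf{w}^T}{\lVert \mathbf{w} \rVert}(\mathbf{x}_+ - \mathbf{x}_-) &= \frac{2}{\lVert \mathbf{w} \rVert} \end{aligned} \]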

SVM aims to find the linear separator that maximises the margin, which is equivalent to minimising \(\lVert \mathbf{w} \rVert\). Thus, it can be formalised into the following optimisation problem¹:

\[ \begin{aligned} & \hat{\mathbf{w}} = \underset{\mathbf{w},\, b}{\arg\min} ~~ \lVert \mathbf{w} \rVert^2 \\ \text{s. t.} ~~~ & y_i(\mathbf{w}^T\mathbf{x}_i - b) \geq 1 ~~~~~\forall i \in \{1, 2, ..., n\} \end{aligned} \]

This kind of SVM is known as hard-margin SVM. The name derives from the constraint in the problem, which demands accurate classification of every datapoint.
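Because the problem above is a convex quadratic programme, it can be handed directly to a generic convex solver. The sketch below is one possible illustration, assuming the cvxpy package is available and using a made-up, linearly separable toy dataset:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data: class +1 around (2, 2), class -1 around (-2, -2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, size=(20, 2)), rng.normal(-2, 0.5, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

w = cp.Variable(2)
b = cp.Variable()

# Hard-margin SVM: minimise ||w||^2 subject to y_i (w^T x_i - b) >= 1 for all i.
objective = cp.Minimize(cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w - b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin =", 2 / np.linalg.norm(w.value))
```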

In reality, it might not be possible to accurately classify every single datapoint. A non-negative slack \(\zeta_i\) can be defined for every datapoint. It equals zero for an accurately classified point and takes a positive value otherwise. Under this relaxation, the optimisation problem is formulated as:

\[ \begin{aligned} & \hat{\mathbf{w}} = \underset{\mathbf{w},\, b,\, \boldsymbol{\zeta}}{\arg\min} ~~ \lVert \mathbf{w} \rVert^2 + C\sum_{i=1}^n \zeta_i\\ \text{s. t.} ~~~ & y_i(\mathbf{w}^T\mathbf{x}_i - b) \geq 1 - \zeta_i ~~\text{and}~~ \zeta_i \geq 0 ~~ \forall i \in \{1, 2, ..., n\} \\ \end{aligned} \]

This kind of SVM is known as soft-margin SVM. The hyperparameter \(C > 0\) controls the trade-off between maximising the margin and penalising the slack: a large \(C\) penalises violations heavily and behaves close to the hard-margin SVM, while a small \(C\) tolerates more violations in exchange for a wider margin.
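In practice one rarely solves this optimisation problem by hand. The sketch below uses scikit-learn's SVC with a linear kernel, which solves an equivalent soft-margin formulation, on a made-up dataset of two overlapping blobs to show how \(C\) exposes exactly this trade-off:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two overlapping Gaussian blobs, so some points cannot be classified correctly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 1.0, size=(50, 2)), rng.normal(-1, 1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"support vectors = {len(clf.support_)}, "
          f"training accuracy = {clf.score(X, y):.2f}")
```

Smaller values of C typically yield a wider margin and more support vectors; larger values shrink the margin in pursuit of fewer training errors.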

Kernel Trick

The vanilla versions of SVM described earlier provide a solution only if the data is linearly separable. However, not all data is linearly separable in its original feature space. One approach is to make the data separable by mapping it into a higher-dimensional space where a linear separator may exist. Although this solution seems enticing, it suffers from two issues. Firstly, explicitly mapping the original data into a higher-dimensional space can be computationally expensive, especially if the dimensionality of the data is very high. Secondly, there is no trivial way to find a specific transformation function that would achieve linear separability.

The dot product between two vectors (\(\mathbf{w}^T\mathbf{x}\)) is a crucial operation in the computation of linear separators. The kernel trick allows one to directly compute the dot product of two vectors in a higher-dimensional space without ever explicitly computing the coordinates of the data in that space. A kernel function \(K(\mathbf{x}_i, \mathbf{x}_j)\) computes the dot product in the higher-dimensional space. Some popular kernel functions are listed below; a small numerical check of the trick follows the list.

  • Linear Kernel. It is an identity transformation; it does not map the data into a higher-dimensional space.
\[ K_{linear}(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j \]
  • Polynomial kernel. It corresponds to the dot product in a higher-dimensional space of polynomial features of degree up to \(d\).
\[ K_{poly}(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T\mathbf{x}_j + c)^d \]
  • RBF Kernel. It corresponds to the dot product in an infinite-dimensional space of radial basis functions.
\[ K_{RBF}(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{2\sigma^2}\right) \]
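As a quick sanity check of the trick, the sketch below verifies numerically that the degree-2 polynomial kernel on two-dimensional inputs equals an ordinary dot product after an explicit (hand-written, purely illustrative) feature map:

```python
import numpy as np

def poly_kernel(x, z, c=1.0):
    """Degree-2 polynomial kernel: (x . z + c)^2, computed without leaving 2-D."""
    return (x @ z + c) ** 2

def phi(x, c=1.0):
    """Explicit degree-2 feature map for 2-D inputs, so that phi(x) . phi(z) = (x . z + c)^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(poly_kernel(x, z))    # kernel value in the original 2-D space
print(phi(x) @ phi(z))      # same value, via the explicit 6-D feature map
```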

Kernel SVMs use the kernel trick to find optimal hyperplanes in higher-dimensional spaces, enabling them to classify data that is not linearly separable in the original space. The kernel trick is a fundamental concept that enhances the power and flexibility of machine learning algorithms by enabling them to handle non-linear relationships in data without the need for explicit and computationally expensive transformations to higher-dimensional spaces.
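To make this concrete, the sketch below (assuming scikit-learn is available) compares a linear SVM with an RBF-kernel SVM on concentric circles, which no linear separator can handle in the original two-dimensional space:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel).fit(X, y)
    print(f"{kernel:>6} kernel: training accuracy = {clf.score(X, y):.2f}")

# Expected outcome: the linear kernel stays near chance level (~0.5),
# while the RBF kernel separates the circles almost perfectly.
```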


  1. Minimising the squared norm instead of the norm is a common mathematical trick in optimisation. You can read more about it on this discussion page.

  2. Please refer to the notes here for the detailed proof.