
ML Basics (Part-2): Support Vector Machines

What are SVMs and How to Formulate, Build, and Apply SVMs for Supervised Learning

In the previous post we learned about Regression methods. In this article we shall go through a similar but slightly more advanced machine learning method called the "Support Vector Machine" (SVM). Unless you are already familiar with Regression, it is advised that you check out the previous post before reading this article.

ML Basics (Part-1): REGRESSION – A Gateway Method to Machine Learning

Introduction

If you have followed the previous lesson, you know by now that the goal of a machine learning model is to find the parameters which best fit the data. In the case of a linear model (e.g., a line), this means finding the coefficients of the line that best passes through the data points. Since it is usually impossible to fit a line through every point in a scattered cluster, the task becomes finding the line with the shortest distance to most of the points. This is how linear Regression works. The biggest pitfall of such majority-based methods is that they tend to overfit the training examples: they work well on the training set but fail to predict new examples which were not present in it. This lack of generalizability led to a class of methods which work conceptually the opposite way. Instead of finding the best 'fit' for the majority, these methods model the separating space between the two classes. The surface that divides this space is called a "hyperplane". SVM is one such technique: it tries to learn the hyperplane which best separates the two sets of data points.

SVM accomplishes this task by finding the "outliers" in both clusters of data which lie at the boundary of such a hyperplane. Let's look at the example image in figure 2. A majority-based classifier would try to find a single line that separates the blue and red clusters, but it would draw a misleading boundary because it disregards the points farther from the mean of each cluster. When more data becomes available, the mean shifts and those farther points no longer appear to be outliers. Such majority-based methods therefore fail to produce a generalized solution that works for new samples.

An SVM solves this problem by finding the most important points (or vectors, drawn from the origin) in the data: the points which lie at the boundary of the hyperplane. These points are called support vectors, named so because their presence holds up the classification boundary; if they are removed, the separating line between the two clusters changes significantly. The SVM seeks the separating hyperplane whose shortest distance to the nearest points of the two opposite classes is as large as possible. This distance between the two classes is called the margin, and the lines at the extreme ends of the margin form the bounds of the decision surface.

Linear SVM

The main strength of SVMs lies in their ability to find non-linear decision boundaries. However, they are also well suited to linearly separable data. We have just seen a simple example of linearly separable data (i.e., soldiers at war) earlier in this article. If you have followed the Regression article, you will notice that such examples can also be modelled with Logistic Regression. However, an SVM performs better on linear classification tasks when there is a small number of samples, or when missing data makes the feature space sparse. In such cases SVMs perform better because they do not model the majority; they favor the boundary points.

SVM Objective Function

It was mentioned earlier that the objective of an SVM model is to find the hyperplane which maximizes the margin, the distance between the decision boundaries of the two opposite classes. So, we need to define an objective function which finds the best support vectors, thus finding the decision boundary while also maximizing the margin between the two decision boundaries. More specifically, we want to find a weight vector 'w', the normal vector of the hyperplane whose elements are the coefficients of the features, along with a bias 'b'. The shortest distance from a support vector to the decision boundary (i.e., the point-to-line distance) is 1/||w||, and the total distance between the two decision boundaries (i.e., the margin) is 2/||w||. So, if we want to maximize the margin, we need to minimize ||w|| or, equivalently and more conveniently, ||w||². The function f(x) in figure 3 shows this objective function.
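Written out, this is the standard primal formulation (a sketch consistent with the margin argument above; figure 3 in the original shows the equivalent equations):

```latex
\min_{w,\,b} \; f(w) = \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
g_i(w, b):\;\; y_i\,(w \cdot x_i + b) \;\ge\; 1, \qquad i = 1, \dots, n
```

where yᵢ ∈ {−1, +1} is the class label of the data point xᵢ.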

In addition to the objective function, we also need to make sure that the decision boundary is obeyed. For this we must impose a constraint on the objective function such that when the points are of one class the output is >= +1 and when they are of the other class the output is <= -1; compactly, yᵢ(w·xᵢ + b) >= 1. This constraint is represented by the function g(x) in the above set of equations. To solve this problem, we write it in Lagrangian form, a standard formulation for constrained optimization. The new objective function 'L' consists of two parts: the original minimization function f(x) and the constraint function g(x), the decision-boundary constraint. The alpha vector holds the Lagrange multipliers we would like to optimize; however, we also do not yet know w or b. We can simplify this further with the dual Lagrangian formulation.

We achieve the dual Lagrangian formulation by substituting for 'w' and 'b' in the earlier objective function. We obtain 'w' and 'b' by taking the gradient of L with respect to w and b respectively and setting it to zero (recall that the gradient is zero at the optimum). This gives us a simplified objective function which depends only on the alpha parameter for each point in the data-set.
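Carrying out that substitution gives the familiar dual problem (a sketch of the standard result, using the same symbols as the derivation above):

```latex
\nabla_{w} L = 0 \;\Rightarrow\; w = \sum_i \alpha_i\, y_i\, x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i\, y_i = 0
```

```latex
\max_{\alpha} \; L(\alpha) \;=\; \sum_i \alpha_i \;-\; \frac{1}{2} \sum_i \sum_j \alpha_i\, \alpha_j\, y_i\, y_j\, (x_i \cdot x_j)
\quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_i \alpha_i\, y_i = 0
```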

If you have been observant, you will notice that we have introduced another vector Xⱼ in addition to Xᵢ. This happens because substituting w = Σⱼ αⱼyⱼxⱼ expands the weight vector as a combination of the data points themselves, so both vectors come from the same set of data points. In this way, the product xᵢ·xⱼ is just a dot product of one data vector with another. A dot product of two vectors gives a measure of similarity. In other words, we want to find similar vectors from different classes which maximize the margin.

We will do this pair-wise similarity computation between every two data points in the dataset and then find the best value for the alpha vector, which turns out to be sparse: most entries are zero and only a few are non-zero. The non-zero coefficients of the alpha vector let us pick out the support vectors at the end of the optimization process.

Linear SVM Classification using Python

In the previous sections we developed a conceptual understanding of SVMs. We can now use Python's scikit-learn library to build a linear SVM model and apply it to an example dataset. We first create a random dataset and split it into a training set and a test set, as in the figure below.
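A minimal sketch of this step, assuming a synthetic two-cluster dataset from `make_blobs` (the article's actual dataset generation may differ):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Two roughly linearly separable clusters, similar in spirit to the figure.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.2, random_state=42)

# Hold out 25% of the samples as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (150, 2) (50, 2)
```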

We can then fit a linear SVM model to the training examples and use it to classify the test examples. The resulting hyperplane looks something like the following figure.

The extreme lines in figure 7 show the decision boundary, and the points lying close to those lines are the support vectors, depicted in figure 8. Note that these are near-outlier points, yet they are the most crucial ones because their presence or absence changes the decision boundary.
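The fitting step and the support-vector inspection can be sketched as follows, again assuming the `make_blobs` data above (scikit-learn exposes the learned support vectors through the `support_vectors_` attribute):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# A hard-ish margin linear SVM; C controls the margin/violation trade-off.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

# Only a few boundary points survive as support vectors (the sparse alphas).
print("number of support vectors:", clf.support_vectors_.shape[0])
print("test accuracy:", clf.score(X_test, y_test))
```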

The classification results for the test set look something like the figure below. The shaded areas show the decision region of each class: the farther a point lies from the decision boundary, the greater the likelihood of it being classified into the respective class. The points lying on the lighter band of each color are the support vectors.

Non-Linear SVM

In the previous section we successfully built a linear SVM model and used it to classify a dataset. Linear SVMs work only as long as the data is linearly separable, but most real-world data is not. This is where we need a non-linear SVM model for classification.

Non-Linear Objective Function Formulation

In the below figure you can see an example of non-linear data.

If we use the linear SVM model on this dataset, we get a result something like the following figure.

As you can see, the linear SVM model fails to classify the data points accurately, because linear models are not suitable for such data. This is a problem, since we already mentioned that the biggest advantage of SVMs is their ability to classify non-linear samples accurately. We therefore need to reformulate our objective function for non-linear SVMs. Here is what we end up with after accommodating the non-linearity in the data samples.

Note that the major change is the introduction of a kernel function K. If you have followed the Regression lesson, you will be familiar with kernel functions: they project the data into a higher-dimensional feature space. Kernel functions come in different types; some examples are polynomial kernels, radial basis function (RBF) kernels, and sigmoid kernels.

Non-Linear Classification with Kernel Functions

Now if we use this new formulation on the non-linear dataset, we get the output shown in the following figure. We use a radial basis function (RBF) kernel for this dataset because the data samples clearly show a circular pattern. As you can see, it has successfully classified all the examples.
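A sketch of this comparison, assuming concentric-ring data from `make_circles` as a stand-in for the article's circular dataset; switching kernels in scikit-learn is a one-argument change:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear SVM struggles here, while the RBF kernel separates the rings.
lin_clf = SVC(kernel="linear").fit(X_train, y_train)
rbf_clf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X_train, y_train)

print("linear accuracy:", lin_clf.score(X_test, y_test))
print("RBF accuracy:", rbf_clf.score(X_test, y_test))
```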

As was mentioned earlier, different types of data require different kernel functions. We have just seen an example of an RBF kernel; there are situations where a different kernel is more suitable. For instance, take the following data, whose samples do not form a radial pattern.

The two clusters are mostly linearly separable until they coincide and mix at the center. We can draw a dissecting curve to separate them. This is where a polynomial kernel, which uses a polynomial function for the decision boundary, becomes useful. We see in the figure below that classification with such a kernel labels most of the samples accurately.
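This case can be sketched with the interleaving half-moons from `make_moons`, a common stand-in for data that mixes only near the center (the article's dataset may differ; the degree and `coef0` values here are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: mostly separable except where they mix.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A cubic polynomial kernel bends the boundary around the overlap region.
poly_clf = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0)
poly_clf.fit(X_train, y_train)

print("polynomial-kernel accuracy:", poly_clf.score(X_test, y_test))
```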

Final Remarks

In this article, you have learned about Support Vector Machines. You have learned how to formulate objective functions for SVMs and how to build SVM models for linearly separable data. You have also learned how to handle non-linearity in the data by using kernel functions. SVMs are a great tool for many real-world applications and, depending on the data, can in some cases outperform even complicated Neural Network models. In the coming lessons, we shall explore some more Machine Learning concepts.

For Code Follow the Link:

https://www.github.com/azad-academy/MLBasics-SVM

Support Me on Patreon:


Find me on Substack:

Azad Academy

Follow Twitter For Updates:

https://www.twitter.com/@azaditech


Towards Data Science is a community publication.

