Machine Learning Theory

There are some derivations every data scientist should know before applying specific machine learning models. In this article I go through all you need to know about SVMs.
Introduction
Kernel methods are extremely popular in the machine learning world due to their ease of use, interpretability and strong performance in a wide variety of applications. Support Vector Machines (SVMs) in particular are strong when dealing with high dimensional data because of their use of kernels, and they are robust against overfitting, since they regularize themselves by choosing the largest margin possible. In this article I will show you the theory behind SVMs and the important derivations you need to understand to apply them effectively.
SVMs and Machine Learning
Machine learning classifiers can be grouped into many categories. Two popular kinds of classifiers are:
- Estimators of posterior probabilities
- Direct estimators of decision boundaries
Examples of models that estimate posterior probabilities include naive Bayes classifiers, logistic regression and even neural networks. These models attempt to approximate the posterior probability function. The approximation can then be evaluated to determine the probability that a sample belongs to each class, and a decision made accordingly.
Direct estimators of the decision boundary, such as the perceptron and Support Vector Machines (SVMs), do not try to learn a probability function. Instead, they learn a "line", or more generally a high dimensional hyperplane, which determines the class of each sample: if a sample lies on one side of the hyperplane it belongs to one class, otherwise it belongs to the other.
These two approaches are fundamentally different, and the choice between them can affect the results of the classifier.
Kernel Methods
SVMs are a kernel method. Kernel methods use kernels to map the input data into a higher dimensional space where the data is assumed to be linearly separable. In this new space, a linear classifier is trained and used to label the data.
But what are kernels?
Loosely speaking, in the same way a matrix represents a linear map between vector spaces, a kernel represents an inner product between mapped versions of its inputs. Kernels allow us to map the input data into a different space where it is hopefully easier to classify.
For SVMs, it is important for the kernel to be positive semidefinite. This has to do with the kernel trick, which you may have heard of before. If we choose a positive semidefinite kernel, then there exists a feature map ϕ such that the kernel equals the inner product in the extended feature space: k(x, x′) = ϕ(x)ᵀϕ(x′).
This means we can evaluate the inner product of the extended feature space without directly computing it, saving immense amounts of time and allowing us to use very high dimensional spaces in the classification of the data.
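To see the trick concretely, here is a minimal sketch (the feature map `phi`, the kernel function and the numbers are my own illustration, not from the article) showing that the degree-2 polynomial kernel evaluated directly on 2-D inputs matches the inner product in the corresponding 6-D feature space:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel
    k(x, z) = (1 + x.z)^2 on 2-D inputs: a 6-dimensional space."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     x1 ** 2,
                     x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """The same quantity evaluated directly in input space."""
    return (1.0 + x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Both evaluate to 25.0: the kernel computes the 6-D inner product
# without ever constructing the 6-D vectors.
print(poly_kernel(x, z))
print(phi(x) @ phi(z))
```

The saving grows with the dimension: for higher-degree kernels the explicit feature space explodes combinatorially, while the kernel evaluation stays a single dot product in input space.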
How do SVMs Learn?
In this section I’ll discuss how SVMs generally learn. Later I’ll go through the math to see why they learn the way they do.

SVMs are all about defining a decision boundary. A decision boundary is a hyperplane (a line in 2 dimensions) that decides which class a data point belongs to, depending on which side of it the point falls.
To find the best line, SVMs optimize the following criteria:
We want to find the decision boundary that maximises the distance from the closest points to the boundary of each class.
Let me explain: first take an initial boundary, and find the closest points to it from either class. The closest points to the decision boundary are called support vectors, and they are the only points that affect the decision boundary. Once we have the support vectors, we find the line that is furthest away from them.

Above is an illustration of the support vectors, the margins and the decision boundary. The margins are the regions between the decision boundary and the parallel hyperplanes through the support vectors. They are particularly useful when the data is not linearly separable.
Mathematically Expressing the Above
Sadly, we can’t just ask a computer to find the line that meets the criteria above. To implement SVMs, we need to express the optimization problem mathematically, and then solve it. I’ll go through it step by step.
Distance of a point from a line
As you may remember from school, given the equation of a line, plugging in the coordinates of any point above the line yields a value greater than zero, and any point below the line yields a value less than zero.

We can express the same equation of the line in vector form, and as a function of our feature map ϕ.
ax + by + c = 0
wᵀx + b = 0
y(x) = wᵀϕ(x) + b
Of the equations above, the first is the familiar scalar equation of a line, and the second is the same equation in vector form. In the third, the input data x is mapped into the feature space through the feature map ϕ(x).
The third equation expresses our decision boundary y(x) as a function of the weights and biases (w, b) and the feature space of the data ϕ(x).
d = y(x) / ‖w‖ = (wᵀϕ(x) + b) / ‖w‖
d = (ax₀ + by₀ + c) / √(a² + b²)
The distance between a point and a line can be calculated with the expressions above. I've included both the vector-form version and the version you would have come across in school.
If the point is below the line, the distance is negative, and if the point is above the line, the distance is positive.
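A minimal sketch of this formula (the hyperplane and the points are made-up numbers, chosen so the norms work out nicely):

```python
import numpy as np

def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

# A made-up line: 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w = np.array([3.0, 4.0])
b = -5.0

print(signed_distance(w, b, np.array([3.0, 4.0])))  # 4.0, point above the line
print(signed_distance(w, b, np.array([0.0, 0.0])))  # -1.0, point below the line
```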
Problem setting
Let’s bring this back to a classification problem for SVMs.
So we have defined our decision boundary y(x) as a function of our weights and biases (w, b), our kernel function (ϕ) and the data (x).
In any supervised classification problem, we are given the data (x) and the labels of the data (t). For a binary problem, the labels will be either 1 or -1.
tₙ ∈ {−1, +1}
The goal is to find the weights and biases (w, b) such that y(x) > 0 for one of the classes and y(x) < 0 for the other. Remember the image I showed earlier: this means each class sits on its own side of the decision boundary.
Objective Function
This finally leads us to the objective function. This is the function that when solved yields the weights and biases (w, b) of our decision boundary.
(w, b) = arg max_{w,b} { (1/‖w‖) minₙ [ tₙ(wᵀϕ(xₙ) + b) ] }
First look inside the minimization. We are looking for the index n that minimizes the distance between the datapoint xₙ and the decision boundary.
The only difference from a plain distance is that we multiply by the class label tₙ. The label is +1 for one class and −1 for the other, so for a correctly classified point it has the same sign as the signed distance. Multiplying by the class cleverly makes the product positive when the classification is correct and negative when it is wrong.
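This sign trick is easy to check numerically; a small sketch with a made-up hyperplane and labels:

```python
import numpy as np

# Made-up decision function y(x) = w.x + b and labelled samples.
w, b = np.array([3.0, 4.0]), -5.0

def y(x):
    return w @ x + b

# (point, label t_n): the product t_n * y(x_n) is positive exactly
# when the point lies on the side of the boundary its label says.
samples = [(np.array([3.0, 4.0]), +1),   # y = 20, label +1 -> correct
           (np.array([0.0, 0.0]), -1),   # y = -5, label -1 -> correct
           (np.array([3.0, 4.0]), -1)]   # y = 20, label -1 -> wrong

print([t * y(x) > 0 for x, t in samples])  # [True, True, False]
```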
The maximization then finds the weights and biases (w, b) whose decision boundary maximizes the distance to those closest points.
Solving the Optimisation Problem
To find the weights and biases (w, b) we need to solve the optimization problem I showed above.
Directly solving the problem above would be extremely hard. Instead, we can apply some constraints and create an equivalent optimization problem that is a lot easier to solve.
The added constraint simply rescales w and b so that the evaluation of the boundary at the closest points equals 1. Therefore, for every other point in the data it will be greater than or equal to 1:
tₙ(wᵀϕ(xₙ) + b) ≥ 1, with equality for the closest points (the support vectors)
Plugging in the constraint back into our objective function, the optimization problem reduces to:
arg min_{w,b} ½‖w‖², subject to tₙ(wᵀϕ(xₙ) + b) ≥ 1 for all n
The minimization over n disappears because the constraint fixes the value at the closest points to 1, and maximizing 1/‖w‖ is equivalent to minimizing ½‖w‖².
This new optimization problem is solvable. It is a quadratic optimization problem with linear inequality constraints, and can be solved using the method of Lagrange multipliers.
L(w, b, a) = ½‖w‖² − Σₙ aₙ [ tₙ(wᵀϕ(xₙ) + b) − 1 ]
In the method of Lagrange multipliers we look for a stationary point of the Lagrangian above. The first term on the right-hand side is the objective function we are minimizing. The second term folds the inequality constraints into the objective, weighting each constraint by its Lagrange multiplier aₙ. Minimizing over (w, b) while maximizing over the multipliers is equivalent to solving the constrained optimization problem.
To find the stationary point we set the Lagrangian's derivatives with respect to w and b to zero, then plug the results back in.
∂L/∂w = 0  ⟹  w = Σₙ aₙ tₙ ϕ(xₙ)
∂L/∂b = 0  ⟹  Σₙ aₙ tₙ = 0
Plugging w back in we get:
L̃(a) = Σₙ aₙ − ½ Σₙ Σₘ aₙ aₘ tₙ tₘ k(xₙ, xₘ),  where k(xₙ, xₘ) = ϕ(xₙ)ᵀϕ(xₘ)
This dual Lagrangian can be solved, subject to two further constraints: the Lagrange multipliers must be non-negative (aₙ ≥ 0), and they must satisfy Σₙ aₙ tₙ = 0.
When solving, you find that almost all Lagrange multipliers aₙ are exactly 0. The points whose Lagrange multiplier is 0 have no influence on the decision boundary. The only points that influence it are those with aₙ > 0: the points lying on the margin, the support vectors.
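We can see this sparsity in practice. A quick sketch (assuming scikit-learn is installed; the two-blob dataset is made up, and a very large C is used to approximate the hard margin) fits a linear SVM to 200 separable points and counts how many end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs, 100 points per class.
X = np.vstack([rng.normal(-2.0, 0.5, (100, 2)),
               rng.normal(2.0, 0.5, (100, 2))])
t = np.array([-1] * 100 + [1] * 100)

# A very large C approximates the hard-margin SVM derived above.
clf = SVC(kernel="linear", C=1e6).fit(X, t)

# Out of 200 training points, only a handful are support vectors;
# every other point has a Lagrange multiplier of exactly 0.
print(len(clf.support_))
```

Moving any non-support-vector point (without crossing the margin) would leave the fitted boundary unchanged.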
Soft-margin SVM
So far we have assumed the data is linearly separable in the feature space. However, a lot of the time this is not a good assumption. In order to deal with this one can implement a soft-margin SVM.
Soft-margin SVMs allow some of the data to live inside the margin, while applying a small penalty. The derivation is similar: it introduces a slack variable ξₙ for each point, measuring how far that point violates the margin, and penalizes the total violation through a constant C. The end result is also very similar, with an identical dual Lagrangian but different constraints (0 ≤ aₙ ≤ C).
Here is a soft margin SVM to finish off:

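To finish off in code rather than a picture, here is a short sketch (again assuming scikit-learn; the overlapping dataset is synthetic) showing how the C parameter controls the softness of the margin: a smaller C tolerates more margin violations and keeps more support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Overlapping classes: no hyperplane separates them perfectly.
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),
               rng.normal(1.0, 1.0, (100, 2))])
t = np.array([-1] * 100 + [1] * 100)

# Small C -> soft margin, many violations tolerated, many support vectors.
# Large C -> approaches the hard-margin behaviour.
n_support = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    n_support[C] = len(clf.support_)

print(n_support)  # the softer margin keeps far more support vectors
```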
Conclusion
In this article I walked through why SVMs are so powerful and how they learn. Support vector machines are a strong kernel method that can be used to tackle high dimensional problems. They are also robust against overfitting thanks to their margins. Like other kernel methods, SVMs transform the data, through the use of kernels, to a higher dimensional space where it is linearly separable. Finally, through the derivations shown, we learnt that SVMs directly estimate their decision boundary and base it on only a very few points in the data; the rest of the data does not contribute to the boundary at all. Understanding this is important for knowing whether SVMs are the best model to apply to your specific problem.
Support me
Hopefully this helped you, if you enjoyed it you can follow me!
You can also become a medium member using my referral link, get access to all my articles and more: https://diegounzuetaruedas.medium.com/membership