
Introduction
Lately I have been studying Scikit-Learn and taking notes on the tools offered by this amazing library. In this post, let’s understand more about the C-Support Vector Machine classifier.
Support Vector Machine (SVM) is a supervised learning algorithm that can be used for both classification and regression. It can draw both linear and non-linear decision boundaries, which makes it useful for a wide range of problems.
SVMs can be an alternative to other strong algorithms like Decision Trees, Logistic Regression, or Random Forest, so it is a good addition to one’s skill set.
Support Vector Machine
A Support Vector Machine is an algorithm that classifies data points into two categories. It traces candidate separation lines at the edge of the gap between the two classes, with the objective of maximizing the distance (the margin) between the closest points of each class. The line with the largest margin is the best separator, and the closest points that define it are the support vectors that give the algorithm its name.

For linearly separable datasets, the algorithm works very well. But what if our data is not linear? How can we still use it?
If we look at the documentation, we will see that there is a hyperparameter called kernel. The kernel is the function the algorithm uses to measure the similarity between points, and therefore the logic it will use to separate them into different groups and classify them.
kernel: {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable, default='rbf'
Linear
The linear kernel is pretty straightforward. The SVM will create the lines just like the figure previously presented.
SVC(kernel='linear')
Polynomial
The poly option is for a polynomial kernel. If you look at the shapes of polynomial functions, you will see that as the degree increases, the curve bends more and becomes more irregular. So, for an underfitting model, it can be a good idea to increase the polynomial degree, letting the decision boundary bend around more points. C is the regularization hyperparameter, and coef0 balances how much the model is influenced by the high-degree versus low-degree terms of the polynomial.
SVC(kernel='poly', degree=3, coef0=1, C=5)
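To see that effect in practice, here is a minimal sketch that compares cross-validated accuracy across a few polynomial degrees. The toy dataset and the degrees tried are illustrative choices, not values from this post:

```python
# Compare polynomial kernel degrees with cross-validation
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# toy dataset, chosen only for illustration
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=2, random_state=12)

scores = {}
for degree in (2, 3, 5):
    model = make_pipeline(StandardScaler(),
                          SVC(kernel='poly', degree=degree, coef0=1, C=5))
    # mean accuracy over 5 cross-validation folds
    scores[degree] = cross_val_score(model, X, y, cv=5).mean()
    print(f"degree={degree}: mean CV accuracy = {scores[degree]:.3f}")
```

Which degree scores best is, of course, specific to the dataset; the point is that the comparison itself is cheap to run.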
RBF
The rbf kernel stands for Gaussian Radial Basis Function. It places a Gaussian distribution around each point and uses those distributions to determine how the points will be classified. The hyperparameter gamma controls how narrow the Gaussians are: high gamma values make them narrow, so each point’s influence stays local and the decision boundary becomes more irregular (lower bias, but higher variance); low gamma values make them wide, producing a smoother boundary (more bias). So, if your model is underfitting, try increasing gamma; if it’s overfitting, reduce it. C is the regularization hyperparameter and works in a similar fashion: a higher C means less regularization and a boundary that fits the training data more closely.
SVC(kernel='rbf', gamma=4, C=100)
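One way to see gamma’s bias/variance trade-off is to compare train and test scores for a few values. This is a sketch with illustrative gamma values and a toy dataset, not the settings used later in this post:

```python
# How gamma changes the fit: compare train vs. test accuracy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=2, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12)

results = {}
for gamma in (0.01, 1, 100):
    model = make_pipeline(StandardScaler(),
                          SVC(kernel='rbf', gamma=gamma, C=1))
    model.fit(X_train, y_train)
    # a large gap between train and test scores signals overfitting
    results[gamma] = (model.score(X_train, y_train),
                      model.score(X_test, y_test))
    print(f"gamma={gamma}: train={results[gamma][0]:.2f}, "
          f"test={results[gamma][1]:.2f}")
```

Very high gamma values tend to push the training score up while the test score lags behind, which is the overfitting pattern described above.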
Sigmoid
The sigmoid kernel uses a logic similar to Logistic Regression: it is based on an S-shaped curve, so points are pushed toward one class or the other much like observations below and above the 50% probability mark in a Logistic Regression. You can use the gamma hyperparameter to regularize it.
SVC(kernel='sigmoid', gamma=2)
Precomputed
Finally, this last kernel is for a more advanced, customized case: instead of letting SVC compute the kernel function, you compute the kernel (Gram) matrix yourself and pass it to the model.
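As a sketch of how that works: with kernel='precomputed', fit expects the train-vs-train kernel matrix and predict expects the test-vs-train matrix. Here the “custom” kernel is just a hand-computed linear kernel on a toy dataset, used only for illustration:

```python
# Using a precomputed kernel (Gram) matrix with SVC
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=2, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12)

# custom kernel: here simply the linear kernel computed by hand
gram_train = X_train @ X_train.T   # shape (n_train, n_train)
gram_test = X_test @ X_train.T     # shape (n_test, n_train)

svc = SVC(kernel='precomputed')
svc.fit(gram_train, y_train)
preds = svc.predict(gram_test)
print(f"accuracy: {(preds == y_test).mean():.2f}")
```

The same pattern works for any kernel you can express as a matrix of pairwise similarities, which is the whole point of this option.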
Coding
Creating a basic SVM with the SVC class from sklearn does not take much code.
# Imports
import pandas as pd
import seaborn as sns
# Data
from sklearn.datasets import make_classification
# sklearn
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
Creating a dataset and splitting it into train and test sets.
# Dataset
X, y = make_classification(n_classes=2, n_features=6, n_samples=500, n_informative=2, scale=100, random_state=12)
# Dataframe
df = pd.DataFrame(X, columns=['var'+str(i) for i in range(1, X.shape[1]+1)])
df['label'] = y
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
Quick look at the data created.

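One way to take that look, reproducing the dataframe built above, is with head() and value_counts():

```python
# Peek at the generated dataset
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_classes=2, n_features=6, n_samples=500,
                           n_informative=2, scale=100, random_state=12)
df = pd.DataFrame(X, columns=['var' + str(i) for i in range(1, X.shape[1] + 1)])
df['label'] = y

print(df.head())                    # first rows: six features plus the label
print(df['label'].value_counts())   # how the two classes are distributed
```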
Let’s create an SVM using the RBF kernel. It is recommended to scale your data for better results. For this, we can build a Pipeline out of tuples with the name of each step and the function that runs that task. Notice that we are (1) scaling the data and (2) training an SVC with the RBF kernel.
steps = [('scaler', StandardScaler()),
         ('svm_classif', SVC(kernel='rbf', gamma=0.5, C=10))]
# Create Pipeline object
rbf_kernel = Pipeline(steps)
# Run the pipeline: scale the data and fit the model
rbf_kernel.fit(X_train, y_train)
# Predictions
preds = rbf_kernel.predict(X_test)
Let’s look at the performance next.
# performance dataframe
result = pd.DataFrame(X_test, columns=['var'+str(i) for i in range(1, X.shape[1]+1)])
result['preds'] = preds
# Plot var1(on x) and var5(on y)
sns.scatterplot(data=result, x='var1', y='var5', hue='preds');
This yields the plot that follows.

Here is the confusion matrix to see how the model did in terms of classification.
# Confusion Matrix
pd.DataFrame(confusion_matrix(y_test, result.preds, labels=[0,1]))

Very nice! Just 5 false positives and 1 false negative, and the accuracy score is 94%.
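That figure can be checked directly with accuracy_score, reproducing the dataset and pipeline from above:

```python
# Recompute the accuracy of the RBF-kernel pipeline
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_classes=2, n_features=6, n_samples=500,
                           n_informative=2, scale=100, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12)

rbf_kernel = Pipeline([('scaler', StandardScaler()),
                       ('svm_classif', SVC(kernel='rbf', gamma=0.5, C=10))])
rbf_kernel.fit(X_train, y_train)

acc = accuracy_score(y_test, rbf_kernel.predict(X_test))
print(f"accuracy: {acc:.2f}")
```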
If we train a Random Forest classifier, here is the result.
from sklearn.ensemble import RandomForestClassifier
steps = [('scaler', StandardScaler()),
         ('rf_classif', RandomForestClassifier())]
# Create pipeline
rf = Pipeline(steps)
# Fit
rf.fit(X_train,y_train)
# Preds
preds = rf.predict(X_test)
# performance
result = pd.DataFrame(X_test, columns=['var'+str(i) for i in range(1, X.shape[1]+1)])
result['preds'] = preds
# Confusion Matrix
pd.DataFrame(confusion_matrix(y_test, result.preds, labels=[0,1]))

Similar results. The accuracy score here is 92%, slightly lower, but notice that the Random Forest received no tuning at all. It can still be improved.
Before You Go
I believe it is good to know more algorithms and how they work under the hood. Like many things in data science, the best option will not be this or that algorithm, but the one that presents the best result for your problem. So, by knowing one more, you increase your chances of better results.
In this post, we dived into the SVC algorithm, learning how each kernel works, in essence, and how to choose its main hyperparameters.
Remember, the documentation is your friend and can help a lot. Another great resource is the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. I’ve been reading it and enjoying it a lot.
Find the code for this post in this repository on GitHub.
Here is my blog, in case you liked this content and want to follow me, or find me on LinkedIn.
If you want to start a Medium membership, this referral code motivates me with part of your subscription.
References
https://en.wikipedia.org/wiki/Support_vector_machine
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques…
A nice Github page with all the visualizations for each kernel:
https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805