Kernel Support Vector Machines from Scratch

Nishadi de Zoysa
Published in Analytics Vidhya · Jun 19, 2021

The SVM (Support Vector Machine) is a supervised machine learning algorithm typically used for binary classification problems. Because it is supervised, labeled training data is available to train the model.

Three main ideas are behind the SVM:

  • Maximum margin separator: draw the line or hyperplane that maximizes the distance between the separator and the training data, thus introducing a margin slab
  • Soft margin separator: when data with different labels are mixed up, draw the best separator line taking into account the samples within the margin slab
  • Kernel trick: for more complex models in which the data separation boundary is not linear, allow for higher-order polynomials or even non-polynomial functions

In this article, let's discuss using an SVM with a kernel in a descriptive manner.

SVM uses a kernel function to draw support vector classifiers in a higher dimension. The main types of kernel functions are:

  1. Linear
  2. Polynomial
  3. Radial Basis Function (RBF)

The kernel trick actually refers to using efficient and less expensive ways to transform data into higher dimensions.

A kernel function only calculates the relationship between every pair of points as if they were in the higher dimension; it doesn't actually do the transformation. This trick, calculating the high-dimensional relationships without actually transforming the data to the higher dimension, is called the kernel trick.
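A minimal sketch of this idea (a toy example, not the article's code), using an RBF kernel on two made-up points:

```python
# A minimal sketch of the kernel trick: the RBF kernel scores the similarity
# of two points as if they were mapped into a much higher-dimensional space,
# but all of the arithmetic happens in the original space.
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    # K(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

a = np.array([1.0, 2.0])
b = np.array([2.0, 3.0])
print(rbf_kernel(a, b))  # similarity of the pair, no explicit transformation needed
```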

Let’s dive into the coding!

The following steps will be followed.

1. Preprocess the Dataset as Specified in the Data Mining Process

Reading the Dataset.

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’). The goal is to predict if the client will subscribe to a term deposit. The dataset can be accessed using the following link: https://archive.ics.uci.edu/ml/datasets/bank+marketing#

The code fragment below can be used to load the bank data CSV file into a pandas data frame and display the first 10 data points.
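A minimal sketch of the loading step; the file name and the ‘;’ separator are assumptions based on the UCI bank-marketing download:

```python
# A sketch of the loading step; file name and separator are assumptions.
import pandas as pd

df_original = pd.read_csv('bank-additional-full.csv', sep=';')
print(df_original.head(10))  # display the first 10 data points
```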

The dataset can be analyzed further by using ‘df_original.info()’ and ‘df_original.describe()’.

Handle Missing Values and Outliers.

Before making predictions, the data must first be cleansed. As the first step, if there are missing values in the dataset, they are dropped.

Since the output is false, there are no missing values in this dataset. To check for outliers, box plots are plotted first. Outliers are values that lie far away from the rest of the data points. No outliers can be seen here.
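A minimal sketch of these checks (the missing-value test and the box plots of the numeric columns):

```python
# A sketch of the cleansing checks described above.
import matplotlib.pyplot as plt

# False means there are no missing values anywhere in the data frame
print(df_original.isnull().values.any())

# Box plots of the numeric columns to look for outliers
df_original.select_dtypes(include='number').boxplot(figsize=(12, 6), rot=45)
plt.show()
```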

Q-Q Plots and Histograms.

By plotting Q-Q plots and histograms, we can analyze whether a transformation is needed and for which features it is needed.
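For example, a Q-Q plot and a histogram for a single feature (‘age’ is used here as an assumed example) can be drawn as follows:

```python
# A sketch of the diagnostic plots for one feature ('age' is an assumed example).
import matplotlib.pyplot as plt
from scipy import stats

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(df_original['age'], plot=axes[0])  # Q-Q (probability) plot against a normal distribution
df_original['age'].hist(ax=axes[1], bins=30)      # histogram of the same feature
axes[1].set_title('age')
plt.show()
```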

As shown above, probability plots can be drawn for each feature to analyze its behavior, and from the histograms we can see that ‘age’, ‘duration’, ‘campaign’ and ‘previous’ are right-skewed and ‘nr.employed’ is left-skewed. Therefore, a transformation is needed.

Transformations.

To identify whether a transformation is needed, Q-Q plots and histograms can be used; they show whether the data points follow a normal distribution. After plotting the histograms and Q-Q plots, we can see whether a feature is right-skewed or left-skewed. If a feature is right-skewed we use a square root transformation, and if it is left-skewed we use a squared transformation, to adjust it towards a normal distribution.

After the transformation of the age feature, its Q-Q plot and histogram can be seen as follows. Likewise, the other features are normalized as needed.
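A minimal sketch of these transformations, assuming the skewed column lists identified above:

```python
# A sketch of the transformations; the column lists follow the skewness noted in the text.
import numpy as np

right_skewed = ['age', 'duration', 'campaign', 'previous']
left_skewed = ['nr.employed']

df_transformed = df_original.copy()
df_transformed[right_skewed] = np.sqrt(df_transformed[right_skewed])   # square root for right skew
df_transformed[left_skewed] = np.square(df_transformed[left_skewed])   # square for left skew
```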

Feature Coding.

The purpose of feature coding is to convert categorical text data to numerical values. Here we have two methods,

  1. One-hot encoding
  2. Label encoding

In one-hot encoding we use vectors, and in label encoding we use integers as labels. One-hot encoding is best when there are only one or two categories, but it is difficult to use when new categories are added to the dataset. Label encoding is useful if there are many categories, and it is also a popular way of feature coding.

For the feature coding, a separate data frame is created for the categorical features, and the feature ‘y’ has been coded using one-hot encoding. Then the ‘y’ column has been selected as the target to be predicted.
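A sketch of this step; the exact column handling is an assumption, with the categorical predictors one-hot encoded and the target ‘y’ mapped to 1/0:

```python
# A sketch of the feature-coding step; exact column handling is an assumption.
import pandas as pd

categorical_cols = df_transformed.select_dtypes(include='object').columns.drop('y')

# One-hot encode the categorical predictors
df_encoded = pd.get_dummies(df_transformed, columns=list(categorical_cols))

# Encode the target 'y' (yes/no) as 1/0; it will be predicted later
df_encoded['y'] = df_encoded['y'].map({'yes': 1, 'no': 0})
```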

Standardize the Features.

Scaling and/or standardizing the features is done to make the mean 0 and the standard deviation 1. We should not scale categorical variables.
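A minimal sketch of the scaling step, assuming only an illustrative subset of the numeric (non-dummy) columns is standardized:

```python
# A sketch of standardization; the numeric column list is an assumed subset.
from sklearn.preprocessing import StandardScaler

numeric_cols = ['age', 'duration', 'campaign', 'previous', 'nr.employed']
scaler = StandardScaler()
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])
```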

Train-test Split

Before continuing with the rest, let's split the whole dataset into a training dataset and a testing dataset.

The feature that we are supposed to predict should also be split from the rest of the data set. In our case, this will be ‘y’.
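A sketch of the split; the 80/20 ratio and the random_state are assumptions:

```python
# A sketch of the train-test split; test size and random_state are assumptions.
from sklearn.model_selection import train_test_split

X = df_encoded.drop(columns=['y'])  # predictors
y = df_encoded['y']                 # target to be predicted

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```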

Correlation Matrix

Since nr.employed and euribor3m are highly correlated, one of them can be dropped. Therefore, euribor3m has been dropped.
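A sketch of the correlation check and the drop, assuming both columns are still present at this point:

```python
# A sketch of the correlation check; euribor3m is dropped because of its
# high correlation with nr.employed.
corr = X_train.corr()
print(corr.loc['nr.employed', 'euribor3m'])  # expected to be close to 1

X_train = X_train.drop(columns=['euribor3m'])
X_test = X_test.drop(columns=['euribor3m'])
```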

2. Feature Engineering Using PCA

For dimensionality reduction, Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) can be used. Here I have used PCA. PCA projects the dimensions in a dataset to an eigenvector space. Then we can get the variance ratio to decide which features are to be dropped.

By looking at the explained variance ratio of the PCA object, we can decide how many features (components) can be dropped without affecting the actual data. I selected 4 components, for which the explained variance is over 99%.
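A minimal sketch of this step, with 4 components as mentioned above:

```python
# A sketch of the PCA step; 4 components is the choice described in the text.
from sklearn.decomposition import PCA

pca = PCA(n_components=4)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

print(pca.explained_variance_ratio_)        # variance explained by each component
print(pca.explained_variance_ratio_.sum())  # total variance retained
```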

3. Apply the Support Vector Machines (SVM) with Kernels to Predict the Value

I have used the ‘rbf’ kernel with C=1.2 and gamma=0.5 to get a high accuracy measure.

gamma: kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. A higher value of gamma will try to fit the training data set exactly, which hurts generalization and causes an over-fitting problem.

C: Penalty parameter C of the error term. It also controls the trade-off between smooth decision boundaries and classifying the training points correctly.
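A sketch of the model described above, an RBF-kernel SVC with C=1.2 and gamma=0.5, evaluated on the held-out test set:

```python
# A sketch of the final model: RBF-kernel SVC with the parameters given above.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

model = SVC(kernel='rbf', C=1.2, gamma=0.5)
model.fit(X_train_pca, y_train)

y_pred = model.predict(X_test_pca)
print('Test accuracy:', accuracy_score(y_test, y_pred))
```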
