Day 11 of 100DaysofML

Charan Soneji · Published in 100DaysofMLcode · Jun 27, 2020

SVM, or Support Vector Machines.
I thought I would write a bit about this classifier, which quietly gives superb results when we pair it with a validation strategy such as K-Fold cross-validation. I'll say more about the model along with a problem I solved on Kaggle using it. Many people would reach for a neural network for this task, but a model as simple as an SVM worked amazingly well.

So, what is SVM?
SVM is a supervised learning algorithm that can be used for classification as well as regression. The key term to understand here is the hyperplane: a hyperplane is the decision boundary that separates the two classes as cleanly as possible.

Support vectors and hyperplane diagram

Look at the graph above, where points of two different classes (shown in two different colors) are divided by a hyperplane. The whole point of a hyperplane is simply to separate our different classes.

One of the key objectives of SVMs is to maximize the distance, or margin, between the hyperplane and the points closest to it. Have a look at the diagram below:

Margin and hyperplane optimization

Here, the two classes are shown as stars and circles, and line C represents the optimal hyperplane separating them. Lines A and B also separate the two classes, but we need to identify the line that provides the maximum margin between the points and the hyperplane. Hence, C is the optimal one.
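To make the margin idea concrete, here is a minimal sketch (a toy example of my own, not part of the Kaggle problem) that fits a linear SVM with sklearn and reads the margin width back off the learned weights:

import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable blobs of points (toy data)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

w = clf.coef_[0]                # normal vector of the separating hyperplane
margin = 2 / np.linalg.norm(w)  # total width of the margin
print(clf.support_vectors_)     # the points that pin the margin in place
print(margin)

The support vectors are exactly the points sitting on the edge of the margin; moving any other point (without crossing the margin) would not change the hyperplane at all.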

The next diagram will help us understand the usefulness of SVM much better:

Two different classes which are difficult to separate

Here, the stars and circles again represent the two classes, but it is impossible to separate them with a straight line. This is why we need to consider a non-linear boundary, such as a circle, or a projection into a higher dimension, in order to separate the two classes.
Take a look at the diagram given below to understand the hyperplane in such complex cases.

Circular hyperplane

In the given diagram, notice how a circular boundary has been created to separate the two classes.
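To see this in code, here is a minimal sketch (again toy data, not the Kaggle problem) using sklearn's make_circles: a linear kernel struggles with concentric circles, while the RBF kernel separates them easily:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points that no straight line can split
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel='linear').fit(X, y)
rbf_clf = SVC(kernel='rbf').fit(X, y)

print(linear_clf.score(X, y))  # poor: a line cannot separate rings
print(rbf_clf.score(X, y))     # near perfect: the kernel bends the boundary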

Sklearn lets us implement an SVM model very easily, and we shall walk through the implementation step by step. For this walkthrough, I solved the Santander Customer Transaction Prediction problem on Kaggle; you can read the question and get the datasets from there.

Let us start by importing the packages.

import pandas as pd
import numpy as np

Next, let's import our training data.

train_data = pd.read_csv('../input/santander-customer-transaction-prediction/train.csv', sep=',')
train_data.head()

Training data

train_data.shape
len_train = train_data.shape[0]  # number of training rows, used later to split the scaled matrix

We can plot a bit of the data using seaborn, just to visualize it and get a feel for it.

import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(x=train_data["target"],y=train_data['var_0'])
Dataset sample plot

As we can see, the target values are discrete (0 or 1), while the var features range roughly from 0 to 20. So we use a MinMaxScaler, which can rescale these values to lie between -1 and 1.

from sklearn.preprocessing import MinMaxScaler

def scale_data(X, scaler=None):
    # Fit a new scaler unless one was passed in, so train and test
    # data can share the same fitted scaler
    if not scaler:
        scaler = MinMaxScaler(feature_range=(-1, 1))
        scaler.fit(X)
    X = scaler.transform(X)
    return X, scaler
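Just to show the contract of this helper (a hypothetical toy call, not part of the Kaggle pipeline): the scaler fitted on one array can be reused on another so that both share the same scale:

X_demo, fitted_scaler = scale_data(np.array([[0.0], [10.0], [20.0]]))
print(X_demo.ravel())  # [-1.  0.  1.]

# Reusing the fitted scaler keeps new data on the same scale
X_more, _ = scale_data(np.array([[5.0]]), scaler=fitted_scaler)
print(X_more.ravel())  # [-0.5]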

We need to separate our training labels from the training features, and identify the feature columns in the dataset; we do both with pandas:

train_ID=train_data["ID_code"]
train_labels=train_data["target"]
features = train_data.columns[train_data.columns.str.startswith('var')].tolist()

Before scaling, we also need to import our testing data, since the train and test features are scaled together:

test_data = pd.read_csv("../input/santander-customer-transaction-prediction/test.csv")
test_data.head()

Testing data

Now we scale the concatenated train and test features in one call, after which all the values lie between -1 and 1:

scaled, scaler = scale_data(np.concatenate((train_data[features].values, test_data[features].values), axis=0))
Testing data

We now write the scaled values back into the corresponding train and test columns:

train_data[features] = scaled[:len_train]  # first len_train rows came from train
test_data[features] = scaled[len_train:]   # the rest came from test

We now drop the columns that are not needed during training:

train = train_data.drop(['target', 'ID_code'], axis=1).values
test=test_data.drop(["ID_code"],axis=1).values

We now create the SVM model using Sklearn and validate it with Stratified K-Fold cross-validation, which I shall explain in another blog.

from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
for i, (train_index, val_index) in enumerate(skf.split(train, train_labels)):
    # split this fold into training and held-out validation parts
    Xtrain, Xval = train[train_index], train[val_index]
    ytrain, yval = train_labels.iloc[train_index], train_labels.iloc[val_index]
    model = LinearSVC(C=0.01, tol=0.0001, verbose=1, random_state=1001, max_iter=2000, dual=False)

We now fit the model, still inside the K-Fold loop, on this fold's training data and labels.

    # still inside the K-Fold loop
    model.fit(Xtrain, ytrain)

We now measure the accuracy of the model on each fold's held-out validation split using sklearn's metrics library, and after the loop we generate predictions for the test set. Note that we cannot score the test predictions directly, since the Kaggle test set has no labels:

    # still inside the K-Fold loop: score this fold's validation split
    yval_pred = model.predict(Xval)
    print(accuracy_score(yval, yval_pred))

# after the loop: predictions for the Kaggle test set
y_pred = model.predict(test)
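For completeness, the predictions can then be written out for Kaggle roughly like this (a sketch assuming the competition's usual ID_code/target submission format):

# Sketch of a submission file, assuming the standard ID_code / target columns
submission = pd.DataFrame({'ID_code': test_data['ID_code'], 'target': y_pred})
submission.to_csv('submission.csv', index=False)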

I obtained an overall accuracy of 88.7595% for the model, which is fairly good, and it can be tuned further using the various hyperparameters sklearn exposes. The link to my Kaggle model is given below:

Keep Learning.

Cheers.
