What is Support Vector Machine (SVM)? (with Python)

Caner Dabakoglu
4 min read · Dec 23, 2018

SUPPORT VECTOR MACHINE

CONTENT

  1. What is Support Vector Machine (SVM)
  2. SVM Parameters
  3. Import Libraries and Read Data
  4. Visualize Data
  5. Create and Evaluate Model

What is Support Vector Machine (SVM)

SVM, or Support Vector Machine, is an algorithm that tries to draw a hyperplane between two classes to separate them. There can be many such hyperplanes, but SVM’s purpose is to find the one with the maximum margin, that is, the maximum distance between the hyperplane and the nearest data points from each class. These nearest points are called support vectors. Let’s look at the example below; it explains this better.
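(As a quick illustration, here is a minimal sketch of my own, not part of the original walkthrough: fitting a linear SVM on a toy 2-D dataset and printing its support vectors.)

import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two well-separated classes
X_toy = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear')
clf.fit(X_toy, y_toy)

# The support vectors are the training points closest to the hyperplane
print(clf.support_vectors_)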

SVM also uses a technique called the kernel trick to transform the data. If the data points live in a low-dimensional space where no separating hyperplane can be drawn, the kernel trick implicitly adds new dimensions in which the classes become separable.
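(To make this concrete, here is a small sketch of my own using scikit-learn’s make_circles: a dataset no straight line can separate in two dimensions, but that the RBF kernel handles easily.)

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in 2-D
X_c, y_c = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)

# The linear kernel struggles; the RBF kernel implicitly maps the data
# into a higher-dimensional space where it becomes separable
print(SVC(kernel='linear').fit(X_c, y_c).score(X_c, y_c))
print(SVC(kernel='rbf').fit(X_c, y_c).score(X_c, y_c))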

Now I’ll explain some of SVM’s parameters, and then we’ll try to use SVM to classify voices according to their features.

SVM — Parameters

C Parameter
The C parameter controls the trade-off between a smooth, wide margin and classifying every training point correctly.

  • Small C: large margin, but some training points may end up inside the margin or misclassified.
  • Large C: small margin; the model tries to classify every training point correctly, so it has the potential to overfit.

If you ask which is better to use, the answer is ‘it depends on your data’. It is best to try different C values and keep the one with the best score, as in the sketch below.
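(Here is a hedged sketch of my own showing the effect of C on toy data: a smaller C gives a softer, wider margin, which usually means more support vectors.)

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_b, y_b = make_blobs(n_samples=100, centers=2, random_state=42)

# n_support_ holds the number of support vectors per class;
# smaller C -> wider margin -> typically more support vectors
for C in [0.01, 1, 100]:
    model = SVC(kernel='linear', C=C).fit(X_b, y_b)
    print(C, model.n_support_)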

Kernel
You can choose the kernel type used by SVM: ‘linear’, ‘rbf’, ‘poly’, ‘sigmoid’ or ‘precomputed’. And yes, the answer is still the same: ‘it depends on your data’.
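(Again a small sketch of my own, not from the article: comparing kernels on a non-linear toy dataset shows why the choice matters.)

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: a non-linear problem
X_m, y_m = make_moons(n_samples=200, noise=0.2, random_state=42)

for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    score = SVC(kernel=kernel).fit(X_m, y_m).score(X_m, y_m)
    print(kernel, score)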

Gamma Parameter
It is the kernel coefficient, used when you choose ‘rbf’, ‘poly’ or ‘sigmoid’ as the kernel. Roughly speaking, a large gamma makes each training point’s influence very local, which can lead to overfitting, while a small gamma gives a smoother decision boundary.
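(A hedged sketch of my own: on the same kind of toy data, a very large gamma makes each point’s influence local and the model starts to memorize the training set.)

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X_m, y_m = make_moons(n_samples=200, noise=0.2, random_state=42)

# Training accuracy climbing with gamma is a warning sign of overfitting
for gamma in [0.01, 1, 100]:
    model = SVC(kernel='rbf', gamma=gamma).fit(X_m, y_m)
    print(gamma, model.score(X_m, y_m))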

There is also a ‘degree’ parameter. It is used by the ‘poly’ kernel to define the degree of the polynomial, and it is 3 by default.

Import Libraries and Read Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read Data
df = pd.read_csv("../input/voice.csv")
# First 5 Rows of Data
df.head()
# Column Names
df.columns
# Column Types and Non-Null Counts
df.info()

Visualize Data

sns.pairplot(df, hue='label', vars=['skew', 'kurt', 'sp.ent', 'sfm',
                                    'mode', 'meanfun', 'meandom', 'dfrange'])
plt.show()
sns.countplot(df.label)
plt.show()
sns.scatterplot(x = 'skew', y = 'kurt', hue = 'label', data = df)
plt.show()
plt.figure(figsize=(20,10))
sns.heatmap(df.corr(), annot=True, linewidths=.5, fmt='.2f', linecolor='grey')
plt.show()

Create and Evaluate Model

X = df.drop(['label'],axis=1)
y = df.label

We’ll use 70% of our data to train our model and we’ll test it with 30% of the data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# Import SVM
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap="Paired_r", linewidths=2, linecolor='w', fmt='.0f')
plt.xlabel('Predicted Value')
plt.ylabel('True Value')
plt.show()
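(The accuracy figure below is not computed by the snippet above; a one-line addition of my own that would produce it is the model’s score method.)

# Fraction of correct predictions on the held-out test set
print("Test Accuracy: {:.2f}%".format(svm.score(X_test, y_test) * 100))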

Test Accuracy: 71.29%

Our accuracy is not good, and as you can see from the confusion matrix above, our predictions are poor. So let’s try to improve our model. First we’ll normalize our data, and after that we’ll apply some parameter optimization.

# Normalization
X = (X - X.min()) / (X.max() - X.min())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
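(As an aside, a sketch of my own: scikit-learn’s MinMaxScaler does the same min-max scaling, and fitting it on the training split only avoids leaking test-set statistics into the model.)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit on the training data, then apply the same scaling to the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)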

Let’s fit our model.

svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='.0f', cmap='brg_r')
plt.xlabel('Predicted Value')
plt.ylabel('True Value')
plt.show()

Test Accuracy: 97.27%

Wow! Our score increased to 97.27%, and all we did was normalize the data! We can see the importance of normalization here. Now let’s try to find the best parameters for our model.

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=4)
grid.fit(X_train, y_train)
print("Best Parameters: ", grid.best_params_)

Best Parameters: {‘C’: 10, ‘gamma’: 1, ‘kernel’: ‘rbf’}

grid_pred = grid.predict(X_test)

cmNew = confusion_matrix(y_test, grid_pred)
sns.heatmap(cmNew, annot=True, fmt='.0f', cmap='gray_r')
plt.xlabel('Predicted Value')
plt.ylabel('True Value')
plt.show()
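(Again a small addition of my own: GridSearchCV refits the best parameter combination on the whole training set, so its test accuracy can be checked directly.)

# score() uses the best estimator found by the grid search
print("Test Accuracy: {:.2f}%".format(grid.score(X_test, y_test) * 100))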

Test Accuracy: 97.79%

print(classification_report(y_test, grid_pred))

Our test score increased a little bit again, and we reached 97.79% accuracy.

Thank you for your time!

LinkedIn: https://www.linkedin.com/in/canerdabakoglu/

GitHub: https://github.com/cdabakoglu

Kaggle: https://www.kaggle.com/cdabakoglu
