HyperParameter tuning an SVM — a Demonstration using HyperParameter tuning

Cross validation on MNIST dataset OR how to improve one vs all strategy for MNIST using SVM

Published in

Analytics Vidhya

5 min readNov 13, 2019

This is a tricky bit of a business because improving an algorithm can not only be tricky and difficult but also sometimes not fruit bearing and it can easily cause frustration (Sorry I was talking to myself after tearing down half my hair).

Wohoo, let’s start!

SVM’s are a great classification tool that are almost a standard on good enough datasets to get high accuracy.

But improving them can be a bit of a trick but today we’ll improve them using some standard techniques.

Let’s pick a good dataset upon which we can classify and let’s use one vs all strategy on it.

What is one vs all strategy you may ask ?

Well, suppose I train a machine to understand apples in a bowl of fruits which also has oranges, bananas and pears.

Now the machine will first learn how to find an apple and then compare that with oranges, bananas and pears declaring them as not apples.

The same algorithm can be used to find just bananas, just oranges and just pears which helps to find or classify all fruits separately.

This technique is one vs all where we calculate probabilities or classification of one class and then put it against rest of classes instead of just finding this is apple, this is orange etc we go with this is not apple, this is apple, this is not apple and so on.

About the Dataset

To demonstrate this technique we will use the MNIST technique which is a dataset containing numerical letters from 0 to 9.

Using one vs all strategy we first find, what is 1 and not 1, what is 2 and not 2 etc. and then use it to guess the letters we provide as a test.

For our purposes we shall keep a training set and a test set.

let us dig deeper in code —

import numpy as npfrom sklearn.datasets import fetch_openml
mnist = fetch_openml(‘mnist_784’, version=1, cache=True)X = mnist[“data”]
y = mnist[“target”].astype(np.uint8)X_train = X[:60000]
y_train = y[:60000]
X_test = X[60000:]
y_test = y[60000:]

#Loading of the dataset into X and y and segregate it into training and test dataset. Note — we can do this using train_test_split as well.

Time to call the classifier and train it on dataset

from sklearn.svm import LinearSVClin_clf = LinearSVC(random_state=42)
lin_clf.fit(X_train, y_train)from sklearn.metrics import accuracy_scorey_pred = lin_clf.predict(X_train)
accuracy_score(y_train, y_pred)

The accuracy score comes out to 89.5 which is pretty bad , let’s try and scale the training dataset to see if any improvements exist -

from sklearn.preprocessing import StandardScalerscaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float32))
X_test_scaled = scaler.transform(X_test.astype(np.float32))lin_clf = LinearSVC(random_state=42)
lin_clf.fit(X_train_scaled, y_train)y_pred = lin_clf.predict(X_train_scaled)
accuracy_score(y_train, y_pred)

The accuracy score comes out to 92.10 which is better than before but still not great.

Can we do more?

YES

We can use kernels

What are Kernels and why do we use them ?

If I have a graph after plotting my model which does not separate my classes it is recommended to add more degree to my model to help it linearly separate the classes but the cost of this exercise is increasing features and reducing performance of the model, hence kernels.

Kernels are a way in ML to add more flexibility to the algorithm by adding the polynomial degree of the dataset without increasing the features OR

Kernel trick (Source Aurelion Geron)

It makes it possible to get the same result as if you added many polynomial features, even with very high degree polynomials, without actually having to add them.

from sklearn.svm import SVCsvm_clf = SVC(gamma=”scale”)
svm_clf.fit(X_train_scaled[:10000], y_train[:10000]) # We use an SVC with an RBF kernely_pred = svm_clf.predict(X_train_scaled)
accuracy_score(y_train, y_pred)

The accuracy score comes out to be 94.5 which is much better now.

Notice how we’ve only train 1/6th of actual dataset thats because the performance cost of this operation is a lot and there are a lot of hyper parameters to tune, since this can work for us let’s do hyperparameter tuning.

What is hyperparameter tuning ?

Hyper parameters are [ SVC(gamma=”scale”) ] the things in brackets when we are defining a classifier or a regressor or any algo.

Hyperparameters are properties of the algorithm that help classify or regress the dataset when you increase of decrease them for ex.

lin_clf = LinearSVC(random_state=42)

here random_state=42 is a hyperparameter that helps keep the seed state set as 42 which helps the algorithm to pick similar random instances which helps in giving accuracy scores for same instances.

Similarly, each hyperparameter is a property and has it’s own function.

Let me show you a trick to find the best combination of hyperparameters by using ML and run on multiple instances to check scores.

There is a technique called cross validation where we use small sets of dataset and check different values of hyperparameters on these small datasets and repeats this exercise for multiple times on multiple small sets. Then you can find the best values of each hyperparameter.

The usage of multiple small sets is called cross val score and the technique of using random hyperparameter values is called randomized search.

Let me demonstrate this using code —

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniformparam_distributions = {"gamma": reciprocal(0.001, 0.1), "C": uniform(1, 10)}#Adding all values of hyperparameters in a list from which the values of hyperparameter will randomly inserted as hyperparameter

rnd_search_cv.best_estimator_> SVC(C=6.7046885962229785, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.004147722987833689,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

This best estimator gives the best hyperparameter values which we can insert in our algo which have been calculated over by performance score on multiple small sets.

Now that we have the best hyperparameter or we have done hyperparameter tuning we can run this on the entire training dataset and then on test dataset.

Shall we ?

rnd_search_cv.best_estimator_.fit(X_train_scaled, y_train)y_pred = rnd_search_cv.best_estimator_.predict(X_train_scaled)
accuracy_score(y_train, y_pred)y_pred = rnd_search_cv.best_estimator_.predict(X_test_scaled)
accuracy_score(y_test, y_pred)

My accuracy score came out to be 97.2 which is not excellent but it’s good enough and the algorithm isn’t overfitting.

Also, note that we increased accuracy score from 89.5 to 97 which is the real victory here.

We first scaled the input’s and then tuned the hyperparameters.We must note that training 60,000 data point’s isn’t easy and might take a lot of time, so be patient.

If you’re looking for the source code for the same.

Source code > https://github.com/Madmanius/HyperParameter_tuning_SVM_MNIST

HyperParameter tuning an SVM — a Demonstration using HyperParameter tuning

Cross validation on MNIST dataset OR how to improve one vs all strategy for MNIST using SVM

Written by Rohit Madan