Chapter 2 : SVM (Support Vector Machine) — Coding

Published in Machine Learning 101 · 7 min read · May 4, 2017

It’s not a bug — it’s an undocumented feature.

How well does a Support Vector Machine perform compared to Naive Bayes? Is it slower to train? Let’s explore these questions in this coding exercise. This is the second part of Chapter 2: Support Vector Machine, or Support Vector Classifier. If you haven’t read the theory (first part), I recommend going through it here first. It is strongly advised that you know the basics behind the SVM classifier.

While you will get a fair idea of the implementation just by reading, I strongly recommend that you open an editor and code along with the tutorial. It will give you better insight and longer-lasting learning.

0. What shall we be doing?

This coding exercise extends the previous Naive Bayes classifier program that classifies emails as spam or non-spam. Don’t worry if you haven’t gone through Naive Bayes (chapter 1), although I would suggest completing it first. The same code snippet is discussed briefly here as well.

We shall try to reduce the training time by cutting the training data set to 10% of its original size. We then vary the tuning parameters to increase the accuracy, and see how changing kernel, C and gamma affects accuracy and timings.

1. Download

I have created a git repository for the data set and the sample code. You can download it from here (use the chapter 2 folder). In case your code fails, you can refer to my version (classifier.py in the chapter 2 folder) to understand how it works. Ignore the plot.py file.

2. A little bit about cleaning

You may skip this part if you have already gone through the coding part of Naive Bayes. (This is for readers who have jumped directly here.)

Before we can apply the sklearn classifiers, we must clean the data. Cleaning involves removing stop words, extracting the most common words from the text, etc. In the code example we perform the following steps:

Build a dictionary of words from the email documents in the training set.
Consider the most common 3000 words.
For each document in the training set, create a frequency matrix for these dictionary words, along with the corresponding labels. (Spam email file names start with the prefix “spmsg”.)

To understand this in detail, please refer to the chapter 1 coding part here.

The code snippet below does this:

import os
import numpy as np
from collections import Counter

def make_Dictionary(root_dir):
    all_words = []
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    for mail in emails:
        with open(mail) as m:
            for line in m:
                words = line.split()
                all_words += words
    dictionary = Counter(all_words)
    # copy the keys into a list so entries can be deleted while looping
    # (works on both Python 2 and Python 3)
    list_to_remove = list(dictionary)
    for item in list_to_remove:
        # remove if not purely alphabetic (e.g. numbers, punctuation)
        if not item.isalpha():
            del dictionary[item]
        # remove single-character words
        elif len(item) == 1:
            del dictionary[item]
    # consider only the 3000 most common words in the dictionary
    dictionary = dictionary.most_common(3000)
    return dictionary

def extract_features(mail_dir):
    # relies on the module-level `dictionary` built by make_Dictionary
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    train_labels = np.zeros(len(files))
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for line_no, line in enumerate(fi):
                # only the third line (index 2) of each file is used
                if line_no == 2:
                    words = line.split()
                    for word in words:
                        for wordID, d in enumerate(dictionary):
                            if d[0] == word:
                                features_matrix[docID, wordID] = words.count(word)
        # label is 1 for spam (file name starts with "spmsg"), 0 otherwise
        train_labels[docID] = 0
        if os.path.basename(fil).startswith("spmsg"):
            train_labels[docID] = 1
        docID = docID + 1
    return features_matrix, train_labels
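
The innermost loop above scans the whole dictionary once for every word in the mail. A minimal sketch of a faster lookup follows, assuming the dictionary is the list of (word, count) pairs returned by most_common. The helper names build_word_index and count_features are hypothetical and not part of the repository, and the toy data is made up for illustration.

```python
from collections import Counter


def build_word_index(dictionary):
    """Map each dictionary word to its column index.

    `dictionary` is assumed to be a list of (word, count) pairs,
    as returned by Counter.most_common().
    """
    return {word: idx for idx, (word, _count) in enumerate(dictionary)}


def count_features(words, word_index):
    """Return one feature row: occurrences of each dictionary word in `words`."""
    row = [0] * len(word_index)
    for word, occurrences in Counter(words).items():
        idx = word_index.get(word)
        if idx is not None:  # ignore words outside the dictionary
            row[idx] = occurrences
    return row


# Toy data for illustration only (not taken from the mail corpus).
toy_dictionary = Counter("free money free offer hello".split()).most_common(3)
word_index = build_word_index(toy_dictionary)
row = count_features("free free hello unknown".split(), word_index)
print(row)
```

Each word is now found with a single dict lookup, so the cost per mail no longer grows with the dictionary size.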

3. Entering into world of SVC

The code for using SVC is similar to that for Naive Bayes. We first import svm from the library. Next, we extract the training features and labels and fit the model. Lastly, we ask the model to predict the labels for the test set. The basic code block looks like this:

from sklearn import svm
from sklearn.metrics import accuracy_score

TRAIN_DIR = "../train-mails"
TEST_DIR = "../test-mails"
dictionary = make_Dictionary(TRAIN_DIR)

print("reading and processing emails from file.")
features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)

model = svm.SVC()
print("Training model.")
# train model
model.fit(features_matrix, labels)
predicted_labels = model.predict(test_feature_matrix)

print("FINISHED classifying. accuracy score : ")
print(accuracy_score(test_labels, predicted_labels))

Putting it all together:

import os
import numpy as np
from collections import Counter
from sklearn import svm
from sklearn.metrics import accuracy_score

def make_Dictionary(root_dir):
    all_words = []
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    for mail in emails:
        with open(mail) as m:
            for line in m:
                words = line.split()
                all_words += words
    dictionary = Counter(all_words)
    # copy the keys so entries can be deleted while looping
    list_to_remove = list(dictionary)
    for item in list_to_remove:
        if not item.isalpha():
            del dictionary[item]
        elif len(item) == 1:
            del dictionary[item]
    dictionary = dictionary.most_common(3000)
    return dictionary

def extract_features(mail_dir):
    # relies on the module-level `dictionary` built by make_Dictionary
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    train_labels = np.zeros(len(files))
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for line_no, line in enumerate(fi):
                # only the third line (index 2) of each file is used
                if line_no == 2:
                    words = line.split()
                    for word in words:
                        for wordID, d in enumerate(dictionary):
                            if d[0] == word:
                                features_matrix[docID, wordID] = words.count(word)
        # label is 1 for spam (file name starts with "spmsg"), 0 otherwise
        train_labels[docID] = 0
        if os.path.basename(fil).startswith("spmsg"):
            train_labels[docID] = 1
        docID = docID + 1
    return features_matrix, train_labels

TRAIN_DIR = "../train-mails"
TEST_DIR = "../test-mails"
dictionary = make_Dictionary(TRAIN_DIR)

print("reading and processing emails from file.")
features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)

model = svm.SVC()
print("Training model.")
# train model
model.fit(features_matrix, labels)
predicted_labels = model.predict(test_feature_matrix)

print("FINISHED classifying. accuracy score : ")
print(accuracy_score(test_labels, predicted_labels))

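This is a very basic implementation. It uses the default values of the tuning parameters. Note that for svm.SVC in scikit-learn the defaults are kernel = rbf and C = 1; the default gamma depends on the library version (1/n_features in older releases, "scale" from version 0.22 onward).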

What accuracy do you get in this case?
What is the training time? Is it faster or slower than Naive Bayes?
How does the accuracy compare to Naive Bayes?
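
To answer the timing questions you need to measure the fit call. Here is a minimal sketch, assuming Python 3 (time.perf_counter is not available on Python 2). The helper name timed is hypothetical, and the sum call is only a stand-in workload so that the snippet runs on its own.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload for illustration; in the tutorial you would call
# timed(model.fit, features_matrix, labels) instead.
total, seconds = timed(sum, range(1000000))
print("took %.4f seconds" % seconds)
```

Timing only the fit call, rather than the whole script, keeps the slow email parsing out of the comparison with Naive Bayes.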

Hmm... how do we decrease the training time?

One way is to reduce the training set size. We reduce it to, say, 1/10th of the original size and then check the accuracy. Of course, we expect the accuracy to decrease.

Here we have 702 emails for training. 1/10th would mean 70 emails, which is very few. (Despite that, check out the wonders we can achieve.)

Add the following lines before training your model. (They reduce features_matrix and labels to 1/10th of their size.)

# // is integer division, so the slice index stays an int on Python 3 as well
features_matrix = features_matrix[:len(features_matrix) // 10]
labels = labels[:len(labels) // 10]

Now what are the training time and accuracy?
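
Taking the first tenth of the rows depends on the order in which os.listdir returned the files, so the subset may not contain a representative mix of spam and non-spam. One alternative is a random sample, sketched below. The helper name subsample_indices is hypothetical, and if you use it your numbers will differ from the ones quoted in this tutorial.

```python
import random


def subsample_indices(n_rows, fraction, seed=0):
    """Pick int(n_rows * fraction) distinct row indices, in sorted order."""
    k = int(n_rows * fraction)
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    return sorted(rng.sample(range(n_rows), k))


idx = subsample_indices(702, 0.1)
# With NumPy arrays the subset would then be:
#   features_matrix = features_matrix[idx]
#   labels = labels[idx]
print(len(idx))
```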

Parameter tuning

You probably got an accuracy of around 56%. That’s too low.

Now, keeping the training set at 1/10th, let’s try to tune three parameters: kernel, C and gamma.

1. Kernel

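Set the kernel to rbf explicitly, i.e. add the kernel parameter to the model = svm.SVC() call. (rbf is also scikit-learn’s default kernel for svm.SVC, so on its own this should not change the result.)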

model = svm.SVC(kernel="rbf", C = 1)

2. C

Next, vary C (the regularization parameter) over 10, 100, 1000 and 10000. Does the accuracy increase or decrease?

You will notice that at C = 100 the accuracy score increases to 85.38% and remains almost the same beyond that.
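
Rather than editing the script for every value of C, the sweep can be written as a loop. The sketch below is model-agnostic: parameter_grid and best_setting are hypothetical helper names, and toy_score is a made-up scorer (not a measured accuracy) that only lets the snippet run without the mail data.

```python
from itertools import product


def parameter_grid(**options):
    """Return every combination of the given options as a list of dicts."""
    names = sorted(options)
    combos = product(*(options[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


def best_setting(grid, score_fn):
    """Score each setting with score_fn and return (best_params, best_score)."""
    scored = [(score_fn(params), params) for params in grid]
    best_score, best_params = max(scored, key=lambda pair: pair[0])
    return best_params, best_score


grid = parameter_grid(C=[1, 10, 100, 1000, 10000])


# Placeholder scorer so the sketch runs on its own. In the tutorial, score_fn
# would fit svm.SVC(kernel="rbf", **params) on the training data and return
# accuracy_score(test_labels, predictions).
def toy_score(params):
    return -abs(params["C"] - 100)


params, score = best_setting(grid, toy_score)
print(params, score)
```

Passing more keyword arguments to parameter_grid (for example a list of gamma values) produces every combination of them.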

3. Gamma

Lastly, let’s play with gamma. Add one more parameter, gamma = 1.0:

model = svm.SVC(kernel="rbf", C=100, gamma=1)

Oops! The accuracy score dropped, right? Try a higher value, gamma = 10. It dropped further, right? Now try decreasing it: use the values 0.1, 0.01 and 0.001. What is the accuracy now? Is it increasing?

You will notice that in this exercise, low gamma values give us strong accuracy. (Intuition: with a low gamma each training point’s influence reaches far, so points far from the decision boundary are also taken into account; this suits data points that are sparse and far from the boundary.)
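
To see why gamma has this effect, recall that the RBF kernel scores the similarity of two points x and y as exp(-gamma * ||x - y||^2). The sketch below evaluates that formula for two made-up 2-D points; rbf_kernel is a hypothetical helper written for illustration, not a call into sklearn.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF similarity of two equal-length vectors: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two made-up points that are 2 units apart (squared distance 4).
p, q = (0.0, 0.0), (2.0, 0.0)
for gamma in (10, 1, 0.1, 0.01, 0.001):
    print("gamma=%-6s similarity=%.4f" % (gamma, rbf_kernel(p, q, gamma)))
```

With a large gamma the similarity falls to almost zero even for nearby points, so each training example influences only its immediate neighbourhood. With a small gamma, far-away examples still contribute.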

In this case, we notice that 85.4% is the best we can achieve with the reduced training set. (P.S. What was the accuracy score with Naive Bayes?)

Running the script faster [optional]

You may have noticed that every run of the script spends a lot of time cleaning and reading the data (features and labels) from the emails. You can speed this up by saving the data once it has been extracted on the first run.

This leaves you more time to focus on learning the tuning parameters.

Add the following snippet to your code to save and load.

import pickle  # on Python 2, cPickle is a faster drop-in replacement
import gzip

def load(file_name):
    # load the pickled object
    with gzip.open(file_name, "rb") as stream:
        return pickle.load(stream)

def save(file_name, model):
    # save the object
    with gzip.open(file_name, "wb") as stream:
        pickle.dump(model, stream)

# To save
save("/tmp/features_matrix", features_matrix)
save("/tmp/labels", labels)
save("/tmp/test_feature_matrix", test_feature_matrix)
save("/tmp/test_labels", test_labels)

# To load
features_matrix = load("/tmp/features_matrix")
labels = load("/tmp/labels")
test_feature_matrix = load("/tmp/test_feature_matrix")
test_labels = load("/tmp/test_labels")

Note: see classifier.py and classifier-fast.py here for reference.
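
Saving and loading can also be folded into one step: load the file if it exists, otherwise compute the data and save it. Below is a minimal sketch of that idea. The helper name load_or_compute is hypothetical, and expensive is a stand-in for a call such as extract_features(TRAIN_DIR).

```python
import gzip
import os
import pickle
import tempfile


def load_or_compute(file_name, compute):
    """Load a pickled object from file_name, or compute and save it on first use."""
    if os.path.exists(file_name):
        with gzip.open(file_name, "rb") as stream:
            return pickle.load(stream)
    value = compute()
    with gzip.open(file_name, "wb") as stream:
        pickle.dump(value, stream)
    return value


# Demo with a throwaway file and a stand-in for the slow extraction step.
cache_path = os.path.join(tempfile.mkdtemp(), "features.pkl.gz")
calls = []


def expensive():
    calls.append(1)
    return [[0, 1], [2, 3]]


first = load_or_compute(cache_path, expensive)   # computes and saves
second = load_or_compute(cache_path, expensive)  # reads the cached copy
print(first == second, len(calls))
```

Remember to delete the cached file whenever the cleaning code or the mail folders change, and only unpickle files that you created yourself.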

Final Thoughts

In general, SVC takes more training time than Naive Bayes, but its prediction is faster. In the coding exercise above, Naive Bayes outperforms SVC. However, which one performs best depends entirely on the scenario and the data set.

A good accuracy score can still be achieved even after reducing the training data set to 1/10th.

But why do we need to reduce the training set?

Training takes longer for SVC, generally about 3x compared to Naive Bayes. There are applications where a fast prediction matters more than accuracy.

Think of credit card transactions. It is far more important to respond quickly with a fraud flag for a transaction than to reach 99% accuracy; 90% accuracy can be tolerable here.
On the other hand, labelling emails as spam or ham can tolerate delays, so there we can strive for greater accuracy.

Do we always need to tune parameters by hand?

Not really. There are inbuilt functions in the sklearn toolkit that do it for us. We shall explore them in a future post.
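
One such inbuilt function is GridSearchCV from sklearn.model_selection, which tries every combination in a parameter grid using cross-validation. The sketch below is a small preview and not the author's code. It assumes a scikit-learn version that provides sklearn.model_selection, and it uses synthetic data from make_classification as a stand-in for the mail features.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the tutorial you would pass features_matrix and labels.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

param_grid = {
    "kernel": ["rbf"],
    "C": [1, 10, 100],
    "gamma": [0.1, 0.01, 0.001],
}

# 3-fold cross-validation over every combination in param_grid.
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because the score is computed by cross-validation on the training data, the test set stays untouched until the final accuracy check.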

I hope this tutorial gave you a basic idea of SVC coding, and of how we can tune parameters to achieve fair accuracy even with a small data set. (We had just 70 emails in the training set and achieved 85% accuracy when testing against 350 emails.) 😊

What next?

In the next chapter, we learn about Decision Trees. (coming soon)

If you liked this post, share it with your interest group, friends and colleagues. Leave your thoughts, opinions and feedback in the comments below. I would love to hear from you. Follow machine-learning-101 for regular updates. Don’t forget to click the heart (❤) icon. You can write to me at savanpatel3@gmail.com. Peace.