In this second part we try to explore sklearn library’s decision tree classifier. We shall tune parameters discussed in theory part and checkout accuracy results.
While you will get fair enough idea about implementation just by reading, I strongly recommend you to open editor and code along with the tutorial. I will give you better insight and long lasting learning.
0. What shall we be doing.
Don’t forget to hit ❤
Coding exercise is the extension of previous Naive Bayes classifier program that classifies the email into spam and non spam. Not to worry, if you haven’t gone through Naive Bayes (chapter 1) (Although I would suggest you to complete it first). The same code snippet shall be discussed in abstract way here as well.
I have created a git repository for the data set and the sample code. You can download it from here (Use chapter 3 folder). In case it fails, you can use/refer my version (classifier.py in chapter 3 folder) to understand working.
2. Little bit about cleaning
You may skip this part if you have already gone through coding part of Naive Bayes.(this is for readers who have directly jumped here).
Before we can apply the sklearn classifiers, we must clean the data. Cleaning involves removal of stop words, extracting most common words from text etc. In the code example concerned we perform following steps:
To understand in detail, once again please refer to chapter 1 coding part here.
- Build dictionary of words from email documents from training set.
- Consider the most common 3000 words.
- For each document in training set, create a frequency matrix for these words in dictionary and corresponding labels. [spam email file names start with prefix “spmsg”.
The code snippet below does this:def make_Dictionary(root_dir):
all_words = 
emails = [os.path.join(root_dir,f) for f in os.listdir(root_dir)]
for mail in emails:
with open(mail) as m:
for line in m:
words = line.split()
all_words += words
dictionary = Counter(all_words)# if you have python version 3.x use commented version.
# list_to_remove = list(dictionary)
list_to_remove = dictionary.keys()for item in list_to_remove:
# remove if numerical.
if item.isalpha() == False:
elif len(item) == 1:
# consider only most 3000 common words in dictionary.dictionary = dictionary.most_common(3000)return dictionarydef extract_features(mail_dir):
files = [os.path.join(mail_dir,fi) for fi in os.listdir(mail_dir)]
features_matrix = np.zeros((len(files),3000))
train_labels = np.zeros(len(files))
count = 0;
docID = 0;
for fil in files:
with open(fil) as fi:
for i,line in enumerate(fi):
if i == 2:
words = line.split()
for word in words:
wordID = 0
for i,d in enumerate(dictionary):
if d == word:
wordID = i
features_matrix[docID,wordID] = words.count(word)
train_labels[docID] = 0;
filepathTokens = fil.split('/')
lastToken = filepathTokens[len(filepathTokens) - 1]
train_labels[docID] = 1;
count = count + 1
docID = docID + 1
return features_matrix, train_labels
Entering the world of Decision Tree Classifier
The code for decision tree classifier is similar to previous two classifiers Naive Bayes and SVM. We import tree library. Next we extract the features and labels. We train them into model. And then predict. Later, we shall check accuracy.
from sklearn import tree
from sklearn.metrics import accuracy_scoreTRAIN_DIR = "../train-mails"
TEST_DIR = "../test-mails"dictionary = make_Dictionary(TRAIN_DIR)print "reading and processing emails from file."features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)model = tree.DecisionTreeClassifier()print "Training model."
model.fit(features_matrix, labels)predicted_labels = model.predict(test_feature_matrix)print "FINISHED classifying. accuracy score : "
print accuracy_score(test_labels, predicted_labels)
What was the accuracy score? You will have received around 91.53%.
Now let’s explore some tuning parameters and try to make training faster.
Minimum sample split
Ideally, decision tree stops splitting the working set based on features either it runs out of features or working set ends up in same class. We can make faster by tolerating some error at minimum split criteria. With this parameter, decision tree classifier stops the splitting if the number of items in working set decreases below specified value.
Following is the diagram where Minimum Sample split is 10.
Default in sklearn library is 2.
Try providing this parameter as 40
model = tree.DecisionTreeClassifier(min_samples_split=40)
What is the accuracy here? You will have got accuracy around 87.3%.
Criteria for split : criterion
In theory part we learned that one of good spliting decision is to take one that provides the best information gain. Criteria for sklearn can be gini or entropy( for information gain). The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
Try these two and checkout what is accuracy.
model = tree.DecisionTreeClassifier(criterion="entropy")
model = tree.DecisionTreeClassifier(criterion="gini")
You can find detailed parameters here : http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Decision tree is classification strategy as opposed to the algorithm for classification. It takes top down approach and uses divide and conquer method to arrive at decision. We can have multiple leaf classes with this approach.
In next part we shall discuss the k-nearest neighbours algorithm and implement a small code again using sklearn library. We shall explore tuning parameters, three different approaches for k-nearest neighbour selection.
If you liked this post, share with your interest group, friends and colleagues. Comment down your thoughts, opinions and feedback below. I would love to hear from you. Follow machine-learning-101 for regular updates. Don’t forget to click the heart(❤) icon. You can write to me at email@example.com . Peace.