Newbie’s Guide to ML — Part 8

Something that I’ve been deferring until now is how to test a classifier. We’ll put the decision tree classifier that we built to the test. We’ll split the lenses data set into a training set and a test set, train the tree on the training set, and calculate its error on the test set.

What is a test set?

The data set we have used so far to train the classifier contains accurately labeled data. To separate it into a training set and a test set, we’ll use the majority of it to train the classifier and keep the rest of it for testing.
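To make the idea of holding data back concrete, here is a minimal sketch of such a split. The helper splitDataSet and its testFraction parameter are made up for illustration and are not part of the DecisionTree.groovy code; in this post the two files are simply prepared by hand.

```groovy
// Hypothetical helper: keep the last part of a data set aside for testing.
// testFraction is an assumed parameter, not something from DecisionTree.groovy.
def splitDataSet(List rows, double testFraction) {
    int testSize = Math.max(1, (int) Math.round(rows.size() * testFraction))
    int trainSize = rows.size() - testSize
    return [training: rows.take(trainSize), test: rows.drop(trainSize)]
}

// Placeholder rows standing in for labeled data points.
def rows = (1..10).collect { "point-${it}".toString() }
def parts = splitDataSet(rows, 0.2)
println "training: ${parts.training.size()}, test: ${parts.test.size()}"
```

This sketch takes the test points from the end of the list as-is. If the rows in a file are sorted in some way, shuffling them before splitting is usually a good idea.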

The test set also contains labeled data points. We’ll feed each point in the test set to the classifier trained on the training set and check whether the label it predicts matches the label in the test set.

The error of the classifier, as we saw in earlier posts, is the probability that the classifier will misclassify a point. We’ll modify the existing code that we have to test our decision tree classifier. Let’s begin by creating the training set and the test set.

Creating the training set

Save this as a file named lenses-training.txt in the same folder as DecisionTree.groovy.

Creating the test set

Save this as a file named lenses-test.txt in the same folder as DecisionTree.groovy. All I have done is take a small part of the original data set and use it as a test set.

Reading the data set

Notice that createDataSet() has changed: it now takes the path of the file to read. We’ll call this function twice, once to read the training set and once to read the test set.
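A function with that shape could look roughly like the sketch below. It assumes one data point per line, tab-separated values, and the label in the last column; the actual file format and function body in DecisionTree.groovy may differ. The temporary file only keeps the example self-contained.

```groovy
// A sketch of a createDataSet() that takes the file path as a parameter.
// Assumption: one data point per line, tab-separated, label in the last column.
def createDataSet(String path) {
    def lines = new File(path).readLines()
    def nonBlank = lines.findAll { it.trim() }              // skip blank lines
    return nonBlank.collect { it.trim().split('\t').toList() }
}

// Write two sample rows in the assumed format to a temporary file.
def sample = File.createTempFile('lenses-sample', '.txt')
sample.deleteOnExit()
sample.text = 'young\tmyope\tno\treduced\tno lenses\n\nyoung\tmyope\tno\tnormal\tsoft\n'

def data = createDataSet(sample.path)
println data
```

With the two files from this post, the same function would be called once with 'lenses-training.txt' and once with 'lenses-test.txt'.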

Testing the classifier

We read both the training and the test sets. We train our tree on the training set and test every point in the test set against the generated tree. We keep track of how many points are misclassified and print the resulting error.
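The counting loop described above can be sketched as follows. Here classify is a hypothetical closure standing in for the trained decision tree, and the label is assumed to be the last value of each point; the toy rule and the four sample points are only there to make the example runnable.

```groovy
// Counts how many test points a classifier gets wrong and returns the error.
// classify is a hypothetical stand-in for the trained tree: it takes the
// feature values of a point and returns a predicted label.
def testClassifier(List testSet, Closure classify) {
    int misclassified = 0
    testSet.each { point ->
        def features = point[0..-2]   // everything except the last column
        def actual = point[-1]        // assumed: the label is the last column
        if (classify(features) != actual) {
            misclassified++
        }
    }
    return misclassified / testSet.size()
}

// Toy rule, not a real tree: 'no lenses' if tear production is 'reduced', else 'soft'.
def classify = { features -> features[3] == 'reduced' ? 'no lenses' : 'soft' }

def testSet = [
    ['young', 'myope', 'no', 'reduced', 'no lenses'],
    ['young', 'myope', 'no', 'normal', 'soft'],
    ['young', 'myope', 'yes', 'normal', 'hard'],
    ['young', 'hypermetrope', 'no', 'reduced', 'no lenses']
]
println "error: ${testClassifier(testSet, classify)}"
```

The toy rule gets the third point wrong, so one of the four points is misclassified. In Groovy, dividing two integers yields a decimal value rather than truncating, which is why no cast is needed.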

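Recall that the error of the classifier is L = ℙ( h(x) ≠ f(x) ), that is, the probability that the classifier’s output and the true label disagree. On the test set, we estimate this probability as the fraction of test points that are misclassified.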

When we run the script we get:

We see that 1 of the 5 test points has been misclassified, which gives an error of 1/5 = 0.2.

That’s it for this post. See you later.