Newbie’s Guide to ML — Part 5

We’ve looked at the k-NN classifier, which we used to classify a point into one of the four quadrants. In the next few posts we’ll look at a new type of classifier: the decision tree classifier.


What is a decision tree?

A decision tree is a flowchart-like tree structure. Every non-leaf node represents some form of test on the data we wish to classify, and every leaf node represents a class label; if you follow the tree from the root down to a leaf, you’ll find the class label. For our spam filter example, the tree could look something like this:
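As a rough, hypothetical sketch (the attribute names below are made up purely for illustration, not taken from an actual spam filter), such a tree could be expressed as nested conditions in Python:

```python
# A hypothetical spam-filter decision tree written as nested conditions.
# Each "if" is a non-leaf node (a test on an attribute of the email),
# and each returned string is a leaf node (the class label).
def classify_email(email):
    if email["sender_in_contacts"]:
        return "not spam"                   # leaf
    else:
        if email["contains_word_free"]:     # another test node
            return "spam"                   # leaf
        else:
            return "not spam"               # leaf

# Example usage:
print(classify_email({"sender_in_contacts": False, "contains_word_free": True}))
# -> spam
```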

Every non-leaf node performs a test on an attribute and partitions the data set. This partitioning is called “splitting”. There are a number of attribute selection measures that can be used to pick the best attribute to split the data on. Different algorithms use different measures and produce either a binary or a non-binary tree.

There are a number of algorithms for building decision trees, such as ID3, C4.5, and CART. ID3 produces non-binary trees, whereas CART produces binary trees. All three build the tree greedily, in a top-down, recursive, divide-and-conquer fashion. Because the approach is greedy, the generated tree is not guaranteed to be optimal.
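To make the top-down, divide-and-conquer idea concrete, here is a rough skeleton of the recursion (this is not the ID3 code we’ll write later; choose_best_attribute is just a placeholder for whatever attribute selection measure the algorithm uses):

```python
def choose_best_attribute(records, labels, attributes):
    # Placeholder: a real implementation would score each attribute with an
    # attribute selection measure (information gain, Gini index, ...) and
    # return the highest-scoring one. Here we simply take the first attribute.
    return attributes[0]

def build_tree(records, labels, attributes):
    # Stop: all records belong to one class -> return that class as a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left to split on -> return the majority class.
    if not attributes:
        return max(set(labels), key=labels.count)

    # Greedy step: pick the "best" attribute to split on.
    best = choose_best_attribute(records, labels, attributes)

    # Split: partition the records by their value for that attribute,
    # then recurse into each partition.
    tree = {best: {}}
    for value in set(r[best] for r in records):
        subset = [(r, y) for r, y in zip(records, labels) if r[best] == value]
        sub_records = [r for r, _ in subset]
        sub_labels = [y for _, y in subset]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(sub_records, sub_labels, remaining)
    return tree
```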

We’ll go over ID3 in detail when we begin writing code.


What is an attribute selection measure?

Just like there are a number of algorithms to make a decision tree, there are a number of attribute selection measures. An attribute selection measure is a heuristic that is used for selecting the best attribute to split the data set on.[1]

To oversimplify, it is a formula we apply to the data set that helps us choose what attribute to split on. The aim of each split is to make the data set purer.

A “pure” data set is one in which all the data belongs to a single class. The attribute chosen for the split is called the “splitting attribute”. Information gain and Gini index are two of the most common attribute selection measures.
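As a quick illustration of purity, here is a minimal sketch of the Gini index mentioned above (1 minus the sum of the squared class proportions): a perfectly pure set scores 0, and a mixed set scores higher.

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    # 0.0 means the set is perfectly pure (every entry has the same class).
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

print(gini(["spam", "spam", "spam", "spam"]))          # 0.0 -> pure
print(gini(["spam", "spam", "not spam", "not spam"]))  # 0.5 -> impure
```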


What is tree pruning?

Most large data sets contain some noise and/or outliers. An outlier is an abnormal data entry; noisy data is data that contains errors. Both can cause overfitting. In the case of decision trees, overfitting shows up as extra branches that reflect the noise rather than the underlying pattern. Tree pruning attempts to identify and remove these branches, which makes the tree much smaller and classification faster.


That’s it for now. We’ll look at information gain in the next post.

Liked the post? Don’t forget to give it a ❤️.


Citations

[1] Han, Jiawei and Kamber, Micheline. Data Mining: Concepts and Techniques.