Machine Learning: Decision Trees

Katie Liao
Published in The Startup
Nov 20, 2020

The decision tree is one of the most widely used machine learning models. It is easy to interpret and requires little data preparation.

A decision tree consists of nodes, branches, and leaves, and is grown using training data. Each node represents a feature and a threshold that split the data into internal nodes or leaves, and each leaf represents a final outcome. The depth of the tree is the number of levels, not counting the root node. Tree building is a top-down approach that groups similar data into the same subset (homogeneity) while keeping different subsets as dissimilar as possible (heterogeneity). Starting at the root node, the data are recursively split into subsets until every node contains a single sample or reaches a certain degree of similarity.
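To make this concrete, here is a minimal sketch of growing a tree and inspecting its structure, assuming scikit-learn and its bundled iris dataset (neither of which is part of this article's example):

```python
# A minimal sketch of growing a decision tree, assuming scikit-learn
# and its bundled iris dataset (not the dog-training example used later).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Grow the tree top-down; splitting stops when a node is pure
# or no further split reduces impurity.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

print("depth:", tree.get_depth())        # number of levels below the root
print("leaves:", tree.get_n_leaves())    # number of terminal nodes
print(export_text(tree, feature_names=load_iris().feature_names))
```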

At each step, the best split is the one that maximizes the information gain (IG):

IG(D_p, f) = I(D_p) − Σ_j (N_j / N_p) · I(D_j)

where f is the feature used for the split, I is the impurity measure, D_p and N_p are the data and sample count of the parent node, and D_j and N_j are those of the j-th child node.

There are two common ways of measuring the impurity I:

1. Gini impurity (Gini index):

Gini = 1 − Σ_i (p_i)²

where p_i is the proportion of samples belonging to class i at the node.

2. Entropy:

H = − Σ_i p_i · log₂(p_i)

where p_i is again the proportion of samples of class i at the node.
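As a rough sketch, both impurity measures and the information gain of a split can be written in a few lines of Python (the function names below are my own, not from any particular library):

```python
# Sketch of the two impurity measures and information gain, using NumPy.
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children, impurity=gini):
    """Parent impurity minus the weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted
```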

That is a lot of formulas… Let's look at an example.

This is my 5-year-old puppy, Oolong. I have been teaching him commands since the day I adopted him. Not long into the process, I realized Oolong is a very selective dog: his success rate is highly dependent on the type of command and reward. Here is a little schematic showing the results of 30 days of training.

If I would like to build a decision tree model to predict the success rate, which of the splits is more effective? Let's do some calculations.

Using Gini impurity as the criterion:

Using entropy as the criterion:

As a result, Gini impurity and entropy both show that splitting on rewards gives the higher information gain, hence it is the more effective split.
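Since the schematic and the intermediate calculations are shown as images, here is a rough sketch of how the same comparison could be run in code, reusing the gini, entropy, and information_gain helpers sketched earlier. The outcome labels and groupings below are made-up placeholders, not Oolong's actual 30-day results:

```python
# Illustrative only: placeholder outcomes, NOT the real counts from the schematic.
import numpy as np

# 1 = command succeeded, 0 = command failed, for 30 training days (made up).
outcomes = np.array([1] * 16 + [0] * 14)

# Hypothetical grouping of the same 30 days by reward type.
with_treat, with_praise = outcomes[:15], outcomes[15:]

# Hypothetical grouping of the same 30 days by command type.
sit_days, down_days = outcomes[::2], outcomes[1::2]

for name, split in [("reward", [with_treat, with_praise]),
                    ("command", [sit_days, down_days])]:
    print(name,
          "IG (Gini):", round(information_gain(outcomes, split, gini), 3),
          "IG (entropy):", round(information_gain(outcomes, split, entropy), 3))
```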

The decision tree has many advantages, but it also has a few disadvantages. First, it produces a biased tree if the data are imbalanced: the top of the tree will tend to separate the majority class onto one side. One common way to work with imbalanced data is to apply a weight to each class. Other options are to resample the data so that the classes are balanced, or to generate synthetic minority-class samples with techniques based on k-nearest neighbors.
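As a small sketch of the class-weighting approach, assuming scikit-learn (the article does not name a specific library), the weights can be applied directly when the tree is built:

```python
# Sketch: handling class imbalance with class weights, assuming scikit-learn.
from sklearn.tree import DecisionTreeClassifier

# 'balanced' reweights classes inversely to their frequency,
# so the minority class counts as much as the majority class.
tree = DecisionTreeClassifier(class_weight="balanced", random_state=0)

# Alternatively, set explicit per-class weights yourself.
tree = DecisionTreeClassifier(class_weight={0: 1.0, 1: 5.0}, random_state=0)
```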

Second, a single decision tree tends to overfit: a fully grown tree has low bias but potentially high variance. Getting the right samples-to-features ratio is important to prevent overfitting. There are a few techniques to reduce overfitting, sketched in code after this list:

1. Set a minimum number of samples per leaf node or a maximum tree depth.
2. Prune the tree to reduce its size by removing sections that provide little predictive power. There are two types of pruning: pre-pruning stops growing the tree when the information becomes unreliable, while post-pruning takes a fully grown tree and removes nodes only when doing so costs little (or even improves) model performance.
3. Use an ensemble method such as bagging or boosting, which combines several weak learners to create a strong learner. Random Forest is an example of bagging: it constructs multiple decision trees, each from a randomly selected subset of the training data and features, and the output is obtained by aggregating the individual trees' predictions. AdaBoost is an example of boosting: rather than fully grown trees, it grows stumps (one-level trees that each use a single feature), generating models sequentially so that each new model learns from the previous model's errors.
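Here is a minimal sketch of these three remedies, again assuming scikit-learn and its bundled iris dataset rather than the dog-training example:

```python
# Sketch of the three overfitting remedies above, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 1. Limit growth: cap the depth and require a minimum leaf size.
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# 2. Post-prune: cost-complexity pruning removes branches whose
#    contribution to accuracy does not justify their size.
pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

# 3a. Bagging: a Random Forest aggregates many trees, each grown on a
#     random subset of the samples and features.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# 3b. Boosting: AdaBoost's default weak learner is a depth-1 stump;
#     each new stump focuses on the errors of the previous ones.
stumps = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
```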

I hope you found this article helpful. We will talk more about Random Forest and AdaBoost in future articles.
