Imagine you are out to buy a cell phone for yourself.
Shopkeeper asks,“How can I help you Ma’am?”
“I am looking for a cell phone”
“You are at the right place, we have over 300 different types of cell phones, what kind of phone would you like to buy today?”
Decision paralysis hits you, totally confused among so many choices of phones you go blank!
“Let me help you choose a phone ma’am. What screen size would you like?”
“Umm… larger than 5.9 inches”
“Perfect, and how about the camera?”
“Definitely more than 14 Megapixels”
“Alright, and any preferences on the processor?”
“I want a quad core processor with at least 1.2 GHz speed”
“Sure ma’am, I’ve got the perfect phone for you I am sure you will like this one” And he hands over a phone to you.
“Oh Thank You so much this one is good”
“Great choice ma’am, congratulations for your new phone”
What the shopkeeper just did was to help you walk through a decision tree to narrow down your choices. Pictorially it would look like —
The “Yes” and “No” arrows are branches and the questions are our nodes with leaves as a buy or Don’t buy decision and hence we call it a decision tree.
Decision trees are everywhere. Knowingly or unknowingly we use them every day. When you say, “If it’s raining, I will bring an umbrella,” you’ve just constructed a simple decision tree like this —
It is a very simplistic decision tree and doesn’t account for too many situations. For instance, what if it is windy and raining outside? You would rather take a rain jacket instead of an umbrella and your decision tree would look like —
What if it is snowing? or the wind is too strong? You can keep adding conditions to this tree and it can keep getting bigger (deeper) with more branches to handle more situations. Simple enough, right? Well, that is the power of decision trees! Decision trees are very simple and understandable. They allow you to see exactly how a particular decision is reached.
Decision trees can be used for both classification and regression tasks. They comprise a supervised learning algorithm like a Neural Network. Let’s see how classification and regression works for decision trees.
Decision Tree Classification
Let’s consider a classification task where you have to classify an Iris flower data set into 3 different categories based on values of some attributes — Iris Setosa, Iris Versicolor and Iris Virginica; Setosa, Versicolor and Virginica are different varieties of Iris flowers. There are some attribute values which differ among these. We will use their petal length and petal width as attributes or features of interest.
We can separate these varieties of iris flowers with two different lines into 3 boxy regions. We can separate Setosa from the other two with a horizontal line which corresponds to length = 2.45 cm. An Iris flower with petal length less than 2.45 cm is a Setosa variety. Above this line would either be a Versicolor or a Virginica.
Let‘s draw another line to separate the other two varieties. Versicolor and Virginica can be separated by a vertical line corresponding to petal width = 1.75 cm. Note that this condition is on top of the earlier condition on petal length.
Now we have our lines which divide the data into 3 different parts. The decision tree for this looks like —
Now when you see a new Iris flower just measure its petal length and width and run it down the decision tree and you can classify which variety of Iris is this. For example, an Iris flower you found while on a hike had petal length 2.48 cm and petal width 1.86 cm, your decision tree tells you that this is Iris Virginica since petal length >2.45 cm and petal width > 1.75 cm.
Decision trees are good at classification but as I said earlier they can also be used for regression tasks. How does a decision tree do regression?
Decision Tree Regression
Regression unlike classification predicts continuous values. You can read about a simple regression algorithm called linear regression for a primer on what is regression.
Regression works similar to classification in decision trees, we choose the values to partition our data set but instead of assigning class to a particular region or a partitioned area, we return the average of all the data points in that region. The average value minimizes the prediction error in a decision tree. An example would make it clearer.
Predicting rainfall for a particular season is a regression problem since rainfall is a continuous quantity. Given rainfall stats like in the figure below how can a decision tree predict rainfall value for a specific season?
A decision tree to predict rainfall for a quarter would just return the average value for that quarter.
A variant of a decision tree can also fit a line to each of the section (quarter) which looks like —
Decision trees which return the linear fit are usually more prone to overfitting specially in regions with less data points.
So far so good. Our decision tree is making predictions of continuous values and classifications as well. But being a supervised learning algorithm how does it learn to do so; in other words how do we build a decision tree? Who tells the tree to pick a particular attribute first and then another attribute and then yet another? How does the decision tree know when to stop branching further? Just like how we train a neural network before using it for making predictions we have to train (build) a decision tree before prediction.
Decision Tree Learning
While building a decision tree it is very important to ask the right questions at the correct stage in a tree. That is what essentially building a decision tree or decision tree learning means.
There are a lot of algorithms which are employed to build a decision tree, ID3, C4.5, C5.0, CART to name a few but at their core all of them tell us what questions to ask and when.
Lets run through an example to understand how a decision tree learns. The below table has color and diameter of a fruit and the label tells the name of the fruit. How do we build a decision tree to classify the fruits?
Here is how we will build the tree. We will start with a node which will ask a true or false question to split the data into two. The two resulting nodes will each ask a true or false question again to split the data further and so on.
There are 2 main things to consider with the above approach -
- Which is the best question to ask at each node
- When do we stop splitting the data further
Lets start building the tree with the first or the top most node. There is a list of possible questions which can be asked. The first node can ask the following questions —
Is the color green?
Is the color yellow?
Is the color red?
Is the diameter ≥ 3?
Is the diameter ≥ 1?
Of these possible set of questions, which one is the best to ask so that our data is split into two sets after the first node? Remember we are trying to split or classify our data into separate classes. Our question should be such that our data is partitioned into as unmixed or pure classes as possible. An impure set or class here refers to one which has many different types of objects for example if we ask the question for the above data, “Is the color green?” our data will be split into two sets one of which will be pure the other will have a mixed set of labels. If we assign a label to a mixed set we have higher chances of being incorrect. But how do we measure this impurity?
Gini impurity is a metric of measuring the mix of a set.The value of Gini Impurity lies between 0 and 1 and it quantifies the uncertainty at a node in a tree. Lets see what that means. Imagine two bowls — one filled with apples and the other filled with labels. If you randomly pick an object from the left bowl and assign a label from the right bowl you have no chances of being incorrect therefore the impurity is 0.
On the other hand if you have a bowl with a mixture of objects and you pick an object randomly and assign a label to it you have a 1 out of 5 chance of being correct and hence 4 out of 5 chances of being incorrect which is also the Gini Impurity value.
So Gini Impurity tells us how mixed up or impure a set is. Now our goal with classification is to split or partition our data into as pure or unmixed sets as possible. If we reach at a 0 Gini Impurity value we stop dividing the tree further. So which question should we ask so that our data splits into as pure sets as possible?
Information gain is another metric which tells us how much a question unmixes the labels at a node. Mathematically it is just a difference between impurity values before splitting the data at a node and the weighted average of the impurity after the split. For instance, if we go back to our data of apples, lemons and grapes and ask the question “Is the color Green?”
The information gain by asking this question is 0.144. Similarly we can ask another question from the set of possible questions split the data and compute information gain.
The question where we have the highest information gain “Is diameter ≥ 3?” is the best question to ask. Note that the information gain is same for the question “Is the color red?” we just picked the first one at random.
Repeating the same method at the child node we can complete the tree. Note that no further questions can be asked which would increase the information gain.
Also note that the rightmost leaf which says 50% Apple & 50% lemon means that this class cannot be divided further and this branch can tell an apple or a lemon with 50% probability. For the grape and apple branches we stop asking further questions since the Gini Impurity is 0 for those.
This algorithm we used above to build the decision tree is called CART (Classification And Regression Trees). There are other algorithms which use different techniques to build a tree but the idea remains the same. What are the best questions to ask and when.
The above introduction hopefully has elicited some curiosity though it just scratches the surface of decision trees. I would encourage you to go and explore further. Read about them and maybe code a decision tree yourself. There are a lot of resources on the internet to help you with that.
There are lot of techniques and other algorithms used to tune decision trees and to avoid overfitting, like pruning. Bagging and boosting are few techniques which when combined with decision trees make them very powerful classification tools. Although, decision trees are usually unstable which means a small change in the data can lead to huge changes in the optimal tree structure yet their simplicity makes them a strong candidate for a wide range of applications. Before neural networks became popular, decision trees were the state of the art algorithm in Machine Learning. Several other ensemble models like Random Forests are much more powerful than vanilla decision tree. We will talk about them in future articles.
Decision trees are powerful because of their simplicity and interpretability. Decision trees and random forests are highly used in user signup modelling, credit scoring, failure prediction, medical diagnostics etc. in the industry and as I said earlier we use it knowingly or unknowingly in our day to day lives as well.
End Credits —
A great blog by Berkeley on Ensemble models https://ml.berkeley.edu/blog/2017/12/26/tutorial-5/
Nicely written introduction on Decision Trees
Great video by google to get started on Decision tree coding
X8 aims to organize and build a community for AI that not only is open source but also looks at the ethical and political aspects of it. More such simplified AI concepts will follow. If you liked this or have some feedback or follow-up questions please comment below.
Thanks for Reading!