A Career in Data Science — Part 3 — Machine Learning — Decision Trees
We make hundreds, maybe thousands of decisions every day. Tiny, insignificant decisions like “Should I carry a red or a blue handkerchief?” to huge decisions like “Should I buy this bike?”. Every time you make a decision, you unintentionally deplete your energy and create stress. This is not the case for machines, as they identify the best possible outcome using probability estimates.
So, if you have been following my blog, this is my third article on Machine Learning, and it covers Decision Trees. If you are here for the first time, below is the link to my introductory post. I hope you enjoy the content on my blog and cherish this journey of learning.
Fun Fact:
44th President of the United States, Barack Obama, said in a Vanity Fair article: “You need to remove from your life the day-to-day problems that absorb most people for meaningful parts of their day… You’ll see I wear only gray or blue suits. I’m trying to pare down decisions. I don’t want to make decisions about what I’m eating or wearing. Because I have too many other decisions to make. You need to focus your decision-making energy. You need to routinize yourself. You can’t be going through the day distracted by trivia.”
Decision Trees are just as easy as they sound. An intuition might really help you understand them better:
Suppose you and your friend decide to create an application (app) to play a small game based on luck, for the social networking website previously built in “Machine Learning — Perceptrons”. You design an app that lets you play a game of Rock, Paper & Scissors. The winner is chosen by the usual rules: Rock beats Scissors, Scissors beats Paper, and Paper beats Rock. If both players choose the same option, the winner cannot be decided and the game restarts.
This simple rule-based game is a classic example of a decision-based system. The questions, as you can see, get more and more educated as the system narrows down the winner. This is exactly what decision trees do: they ask question after question about the data until they have gathered enough information to make a prediction. So, let’s dive deeper and learn how they work and how they are built.
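The rule-based game above can be captured as a chain of simple checks. Here is a minimal Python sketch (the function name and return labels are my own, not part of the app described above):

```python
def rps_winner(p1: str, p2: str) -> str:
    """Decide the winner of Rock, Paper & Scissors with simple rules.

    Returns 'Player 1', 'Player 2', or 'Restart' when both choose the same.
    """
    # Each key beats the value it maps to
    beats = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}
    if p1 == p2:
        return "Restart"  # same choice: the game restarts
    return "Player 1" if beats[p1] == p2 else "Player 2"

print(rps_winner("Rock", "Scissors"))  # → Player 1
print(rps_winner("Paper", "Paper"))    # → Restart
```

Every `if` in this function is one question the system asks on its way to a decision, which is the same shape a decision tree takes.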
Decision Trees can broadly be classified into :
- Classification Trees (Yes/No)
What we’ve seen above is an example of a classification tree, where the outcome is a categorical variable like ‘Win’ or ‘Lose’.
- Regression Trees (Continuous numbers)
Here the decision or outcome is a continuous variable, e.g. a number like 13.
I hope the intuition above gave you a brief idea of what a decision tree is! So, let us see how decision trees work. The following example might help you gain an in-depth understanding, rather than the old-school definitions. If you have a clear idea of Entropy and Information Gain, I can boldly affirm you have understood how this algorithm works.
Now, in order to go further with Decision Trees, we need to learn an important concept called Entropy. Entropy comes from physics, and to explain it, I shall use the three states of water: ice (solid), liquid water, and water vapor (gas).
Look at the particles constituting each of these states of water. Ice is pretty rigid, i.e. the particles do not have much space to move around freely; they mostly stay where they are. Water is a little less rigid than ice; its particles are not as closely packed, so there can be some movement. Interestingly, the particles constituting water vapor have the most freedom of movement of the three.
Entropy measures precisely this: how much freedom does a particle have to move around? Thus the entropy of ice is low, the entropy of liquid water is medium, and the entropy of water vapor is high!
The notion of entropy can also work in Probability. Suppose I give you 3 baskets of fruits.
The first basket has 4 apples. The second basket has 3 apples and 1 orange. While, the third basket has 2 apples and 2 oranges.
Let’s assume that fruits of the same kind are completely indistinguishable. We could then say that entropy is given by how many distinct arrangements the fruits allow if we put them in a line.
We can see that the first basket is very rigid: no matter how we organize the fruits, we always get the same state, i.e. 4 apples. So it has “Low Entropy”.
The fruits from the second basket can be re-organized in 4 different ways, so it has “Medium Entropy”.
Meanwhile, the fruits from the third basket can be re-organized in 6 different ways, so it has “High Entropy”.
This is not the exact definition of Entropy, but it gives us the idea that the more rigid or homogeneous a set is, the less entropy it has, and vice versa.
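The arrangement counts above (1, 4, and 6) can be double-checked with a small multiset-permutation helper, a sketch in which the function name is my own:

```python
from math import factorial

def arrangements(*counts: int) -> int:
    """Distinct orderings of a multiset: n! / (c1! * c2! * ... * ck!)."""
    n = sum(counts)
    total = factorial(n)
    for c in counts:
        total //= factorial(c)  # divide out the indistinguishable orderings
    return total

print(arrangements(4))     # 4 apples            → 1
print(arrangements(3, 1))  # 3 apples, 1 orange  → 4
print(arrangements(2, 2))  # 2 apples, 2 oranges → 6
```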
Another way to see entropy is as a measure of knowledge. If we were to pick a random fruit from each of the baskets, how much would we know about that fruit? The first basket contains only apples, so we know for sure that the fruit is an apple: we have “High Knowledge”. The second basket contains 3 apples and 1 orange, so the fruit picked is very likely to be an apple and not very likely to be an orange. If we bet that it’s an apple, we’ll be right most of the time, so we have “Medium Knowledge” about the fruit that is picked. In the third basket, we know much less, since it’s equally likely to get an apple or an orange. So, here we have “Low Knowledge”.
It turns out that Knowledge and Entropy are opposites. The more Knowledge one has, the Less Entropy and vice-versa. Thus, we conclude that the First Basket has Low Entropy, the Second Basket has Medium Entropy and the Third basket has High Entropy.
Well, I guess now your understanding on Decision Trees might be much better although I haven’t introduced any MATH yet! :)
I believe you still have the 3 baskets of fruits I gave you. Let’s use them to cook up a formula for entropy:
First Basket — Apple, Apple, Apple, Apple
- How likely is it to get the order mentioned above? Well, the probability of getting the first fruit as an apple is actually 1. The same goes for the second fruit, as well as the third and the final fruit, because this basket only contains apples. Since we put the fruit back in the basket after every draw, these events are completely independent, so the probability that they all occur is the product of the four probabilities. This means our probability is 1, which matches our intuition that no matter what we do, we’ll always pick Apple, Apple, Apple & Apple.
Second Basket — Apple, Apple, Apple, Orange
- How likely is it to get the order mentioned above for this basket of fruits? To get the first fruit as an apple, the probability is 0.75 (3 out of the 4 fruits are apples). The same goes for the second fruit as well as the third. To pick an orange as the final fruit, the probability is 0.25 (1 out of the 4 fruits is an orange). Again, since the events are completely independent, the probability of all four happening is the product of the four probabilities: 0.75 × 0.75 × 0.75 × 0.25 ≈ 0.105.
Third Basket — Apple, Apple, Orange, Orange
- Similarly, how likely is it to get this order for this basket of fruits? We know that it is equally likely to pick any fruit from this final basket. To get the first and second fruits as apples, the probability is 0.50 each (2 out of the 4 fruits are apples). Likewise, the probability of getting oranges as the third and fourth fruits is 0.50 each (2 out of the 4 fruits are oranges). Since the events are independent, the probability of all four happening is the product of the four individual probabilities: 0.5 × 0.5 × 0.5 × 0.5 = 0.0625.
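These three products are easy to verify in code (the basket variable names are just illustrative):

```python
from math import prod  # available in Python 3.8+

# Per-draw probabilities of getting the target order from each basket
basket1 = [1.0, 1.0, 1.0, 1.0]      # Apple, Apple, Apple, Apple
basket2 = [0.75, 0.75, 0.75, 0.25]  # Apple, Apple, Apple, Orange
basket3 = [0.5, 0.5, 0.5, 0.5]      # Apple, Apple, Orange, Orange

print(prod(basket1))            # → 1.0
print(round(prod(basket2), 3))  # → 0.105
print(prod(basket3))            # → 0.0625
```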
NOTE #1: Taking the product of individual probabilities may not be practical for a large number of events, because multiplying, say, 1000 numbers between 0 and 1 yields a vanishingly tiny result. Therefore, let’s use logarithms, which turn products into sums:
log(ab) = log(a) + log(b)
NOTE #2: Since the logarithm of a number between 0 and 1 is negative, we take the negative of the whole expression so that we deal with positive numbers.
After applying -log₂() (logarithm base 2) to the probabilities, we get:
#1st Basket : -[log₂(1) + log₂(1) + log₂(1) + log₂(1)] = 0 + 0 + 0 + 0 = 0
#2nd Basket : -[log₂(0.75) + log₂(0.75) + log₂(0.75) + log₂(0.25)] = 0.415 + 0.415 + 0.415 + 2 = 3.245
#3rd Basket : -[log₂(0.5) + log₂(0.5) + log₂(0.5) + log₂(0.5)] = 1 + 1 + 1 + 1 = 4
NOTE #3: We need to divide each result by 4, because what we will use as the definition of Entropy is the average of the negative logarithms of the probabilities of picking the fruits in the required order.
Thus, for the first basket we get an entropy of 0, 0.81 for the second, and 1 for the third. The mathematical approach matches our initial estimation of the entropy of each basket of fruits.
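The averaging step can be written directly as a small helper (a sketch; the function name is mine):

```python
from math import log2

def avg_neg_log2(probs):
    """Entropy of a sequence of draws: the average of -log2 of each probability."""
    return sum(-log2(p) for p in probs) / len(probs)

print(avg_neg_log2([1.0] * 4))                      # 1st basket → 0.0
print(round(avg_neg_log2([0.75] * 3 + [0.25]), 2))  # 2nd basket → 0.81
print(avg_neg_log2([0.5] * 4))                      # 3rd basket → 1.0
```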
In a more generalized way, the entropy for a set of m red balls and n blue balls can be calculated as:
Entropy = -(m/(m+n)) log₂(m/(m+n)) - (n/(m+n)) log₂(n/(m+n))
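As a sketch, the two-class formula translates into code like this (the function name is mine, and the 0·log(0) = 0 convention is handled explicitly):

```python
from math import log2

def two_class_entropy(m: int, n: int) -> float:
    """Entropy of a set with m items of one class and n items of another."""
    total = m + n
    h = 0.0
    for count in (m, n):
        if count:  # skip empty classes: 0 * log2(0) is taken as 0
            p = count / total
            h -= p * log2(p)
    return h

print(two_class_entropy(4, 0))            # all one class → 0.0
print(round(two_class_entropy(3, 1), 2))  # → 0.81
print(two_class_entropy(2, 2))            # → 1.0
```

Note that this matches the basket results: a homogeneous set has zero entropy, and a perfectly mixed set has the maximum entropy of 1.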
Now we know what entropy is and how it can be calculated. Together with the next concept, Information Gain, we should be able to understand Decision Trees in depth and see how this algorithm can be represented as a tree.
Information Gain :
Okay, so now let’s go ahead and build a Decision Tree. Just before that, let me add some papayas to the baskets of fruits I gave you earlier. I guess you realize that it’s a lot of fruit for one person. It never harms to donate excess food to people nearby rather than letting it rot and go to waste. It is estimated that a person dies of hunger or hunger-related causes every ten seconds. Sadly, it is children who die most often.
So, you decide to donate the excess fruits and make a note of it. Using this tabular data, let’s build a Decision Tree. Our algorithm will be very simple: look at the possible splits each column gives, calculate the information gain of each, and pick the split with the largest gain.
Let’s calculate the entropy for this data, where 3 people pick apples, 2 pick oranges, and 1 picks papayas:
-(3/6)log₂(3/6) - (2/6)log₂(2/6) - (1/6)log₂(1/6) = 1.46
# Now, if we split the data based on Gender, we get two sets:
- First, 1 person picking papayas and the other 2 picking apples:
Entropy of this set: -(1/3)log₂(1/3) - (2/3)log₂(2/3) = 0.92
- Second, 1 person picking apples and the other 2 picking oranges:
The same as the other set:
Entropy of this set: -(1/3)log₂(1/3) - (2/3)log₂(2/3) = 0.92
The average entropy of this split is 0.92.
Therefore, Information Gain = 1.46 - 0.92 = 0.54
# If we split the data based on Occupation, we get two sets:
- First, 3 of them picking apples:
Entropy of this set: -(3/3)log₂(3/3) = 0
- Second, 1 person picking papayas and the other 2 picking oranges:
Entropy of this set: -(1/3)log₂(1/3) - (2/3)log₂(2/3) = 0.92
The average entropy of this split is 0.46.
Therefore, Information Gain = 1.46 - 0.46 = 1.00
To summarize, splitting by the Gender column gave us an Information Gain of 0.54, and splitting by Occupation gave us an Information Gain of 1. The algorithm says: pick the column with the highest Information Gain, which is Occupation. We get:
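Putting the whole calculation together, here is a sketch that reproduces both information gains. The split groupings are reconstructed from the worked numbers above, and the function names are my own:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Multi-class entropy: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child sets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# The donated fruits: 3 apples, 2 oranges, 1 papaya
fruits = ["Apple", "Apple", "Apple", "Orange", "Orange", "Papaya"]

# Hypothetical splits matching the worked example above
by_gender = [["Papaya", "Apple", "Apple"], ["Apple", "Orange", "Orange"]]
by_occupation = [["Apple", "Apple", "Apple"], ["Papaya", "Orange", "Orange"]]

print(round(information_gain(fruits, by_gender), 2))      # → 0.54
print(round(information_gain(fruits, by_occupation), 2))  # → 1.0
```

A full decision tree builder simply repeats this: at each node, try every candidate split, keep the one with the highest gain, and recurse into the resulting subsets.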
This is our Decision Tree.
The next post will be on Naive Bayes. You can find the link below :
A Career in Data Science — Part 4 — Machine Learning — Naive Bayes
Naive Bayes is actually naive indeed :)
As always, I’d like to thank my readers so much for their time and interest, as I conclude this post on Decision Trees with one of my favorite quotes:
“I am enough of an artist to draw freely upon my imagination. Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world.” ― Albert Einstein