Splitting Data by Asking Questions: Decision Trees
Recommendation systems are one of the most common and exciting applications in machine learning. Ever wonder how Netflix recommends movies, YouTube guesses which videos you may watch, or Amazon shows you products you might be interested in buying? These are all examples of recommendation systems. One simple and interesting way to see recommendation problems is to consider them classification problems. Let’s start with an easy example: our very own app-recommendation system using decision trees.
Take 40% off Grokking Machine Learning by entering fccserrano into the discount code box at checkout at manning.com.
This blog post is an excerpt from the “Decision Trees” chapter of my book Grokking Machine Learning. A similar example also appears in this video.
The problem: We need to recommend apps to users according to what they’re likely to download
Let’s say we want to build a system that recommends to users which app to download among the following options. We have the following three apps in our store (figure 1):
- Atom Count: an app that counts the number of atoms in your body
- Beehive Finder: an app that maps your location and finds the closest beehives
- Check Mate Mate: an app for finding Australian chess players
The training data is a table with the platform used by the user (iPhone or Android), their age, and the app they have downloaded (in real life there are many more platforms, but for simplicity we’ll assume that these are the only two options). Our table contains six people, as shown in table 1.
Given this table, which app would you recommend to each of the following three customers?
- Customer 1: a 13-year-old iPhone user
- Customer 2: a 28-year-old iPhone user
- Customer 3: a 34-year-old Android user
Here is what we should do:
- Customer 1: a 13-year-old iPhone user. To this customer, we should recommend Atom Count, because (looking at the three customers in their teens) young people tend to download Atom Count.
- Customer 2: a 28-year-old iPhone user. To this customer, we should recommend Check Mate Mate, because the two adult iPhone users in the dataset (aged 25 and 35) both downloaded Check Mate Mate.
- Customer 3: a 34-year-old Android user. To this customer, we should recommend Beehive Finder, because the one adult Android user in the dataset, who is 32 years old, downloaded Beehive Finder.
However, going customer by customer seems like a tedious job. Next, we’ll build a decision tree to take care of all customers at once.
The solution: Building an app recommendation system
In this section, we see how to build an app-recommendation system using decision trees. In a nutshell, the algorithm to build a decision tree follows:

1. Figure out which feature of the data is the most useful to decide which app to recommend.
2. Split the data into two smaller datasets according to that feature.
3. Repeat steps 1 and 2 for each of the smaller datasets.
In other words, we decide which of the two features (platform or age) is more successful at determining which app the users will download, and we pick that one as the root of the decision tree. Then we iterate over the branches, always picking the most determining feature for the data in that branch, thus building our decision tree.
First step to build the model: Asking the best question
The first step to build our model is to figure out the most useful feature: in other words, the most useful question to ask. First, let’s simplify our data a little bit. Let’s call everyone under 20 years old “Young” and everyone 20 or older “Adult”. Our modified dataset is shown in table 2.
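If you’d like to follow along in code, here is one possible way to represent this dataset in Python. Note that the excerpt doesn’t reproduce the exact ages of the three young users, so the teen ages below are placeholders; the adult ages (25, 32, and 35) come from the text.

```python
# The six users from table 1. The three teen ages are placeholders
# (we only know these users are under 20); the adult ages 25, 32,
# and 35 appear in the text.
users = [
    {"platform": "iPhone",  "age": 15, "app": "Atom Count"},
    {"platform": "Android", "age": 12, "app": "Atom Count"},
    {"platform": "Android", "age": 14, "app": "Atom Count"},
    {"platform": "Android", "age": 32, "app": "Beehive Finder"},
    {"platform": "iPhone",  "age": 25, "app": "Check Mate Mate"},
    {"platform": "iPhone",  "age": 35, "app": "Check Mate Mate"},
]

# Simplify age into the two categories used in table 2.
for user in users:
    user["age_group"] = "Young" if user["age"] < 20 else "Adult"
```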
The building blocks of decision trees are questions of the form “Does the user use an iPhone?” or “Is the user young?” We need one of these to use as our root of the tree. Which one should we pick? We should pick the one that best determines the app they downloaded. To decide which question is better at this, let’s compare them.
First question: Does the user use an iPhone or an Android?
This question splits the users into two groups, the iPhone users and Android users. Each group has three users in it. But we need to keep track of which app each user downloaded. A quick look at table 2 helps us notice the following:
- Of the iPhone users, one downloaded Atom Count and two downloaded Check Mate Mate.
- Of the Android users, two downloaded Atom Count and one downloaded Beehive Finder.

The resulting decision stump is shown in figure 2.
Now let’s see what happens if we split them by age.
Second question: Is the user young or adult?
This question splits the users into two groups, the young and the adult. Again, each group has three users in it. A quick look at table 2 helps us notice what each user downloaded, as follows:
- The young users all downloaded Atom Count.
- Of the adult users, two downloaded Atom Count and one downloaded Beehive Finder.
The resulting decision stump is shown in figure 3.
From looking at figures 2 and 3, which one looks like a better split? It seems that the second one (based on age) is better, because it has picked up on the fact that all three young people downloaded Atom Count. But we need the computer to figure out that age is a better feature, so we’ll give it some numbers to compare. In this section, we learn three ways to compare these two splits: accuracy, Gini impurity, and entropy. Let’s start with the first one: accuracy.
Accuracy: How often is our model correct?
Accuracy is the most popular metric for the effectiveness of a model. It is the fraction of correctly classified data points over the total number of data points. If you’d like a small recap, please check this video.
Suppose that we are allowed only one question, and with that one question, we must determine which app to recommend to our users. We have the following two classifiers:
- Classifier 1: asks the question “What platform do you use?” and from there, determines what app to recommend
- Classifier 2: asks the question “What is your age?” and from there, determines what app to recommend
Let’s look more carefully at the classifiers. The key observation follows: if we must recommend an app by asking only one question, our best bet is to look at all the people who answered with the same answer and recommend the most common app among them.
Classifier 1: What platform do you use?
- If the answer is “iPhone,” then we notice that of the iPhone users, the majority downloaded Check Mate Mate. Therefore, we recommend Check Mate Mate to all the iPhone users. We are correct two times out of three.
- If the answer is “Android,” then we notice that of the Android users, the majority downloaded Atom Count, so that is the one we recommend to all the Android users. We are correct two times out of three.
Classifier 2: What is your age?
- If the answer is “young,” then we notice that all the young people downloaded Atom Count, so that is the recommendation we make. We are correct three times out of three.
- If the answer is “adult,” then we notice that of the adults, the majority downloaded Check Mate Mate, so we recommend that one. We are correct two times out of three.
Notice that classifier 1 is correct four times out of six, and classifier 2 is correct five times out of six. Therefore, for this dataset, classifier 2 is better. In figure 4, you can see the two classifiers with their accuracy. Notice that the questions are reworded so that they have yes-or-no answers, which doesn’t change the classifiers or the outcome.
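As a sanity check, here is a minimal sketch (reusing the users list from the earlier snippet) that scores a one-question classifier: for each answer, recommend the most common app among the users who gave that answer, and count how often that recommendation is correct.

```python
from collections import Counter

def accuracy_of_split(users, feature):
    """Accuracy of a one-question classifier that recommends the
    most common app within each group defined by `feature`."""
    correct = 0
    for value in set(user[feature] for user in users):
        apps = [u["app"] for u in users if u[feature] == value]
        # The majority recommendation is right once per majority user.
        correct += Counter(apps).most_common(1)[0][1]
    return correct / len(users)

print(accuracy_of_split(users, "platform"))   # 4/6 ≈ 0.667
print(accuracy_of_split(users, "age_group"))  # 5/6 ≈ 0.833
```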
Gini impurity index: How diverse is my dataset?
The Gini impurity index, or Gini index, is another way we can compare the platform and age splits. If you’d like to learn more about the Gini index, you can continue reading, or check out this video.
The Gini index is a measure of diversity in a dataset. In other words, if we have a set in which all the elements are similar, this set has a low Gini index, and if all the elements are different, it has a large Gini index. For clarity, consider the following two sets of 10 colored balls (where any two balls of the same color are indistinguishable):
- Set 1: eight red balls, two blue balls
- Set 2: four red balls, three blue balls, two yellow balls, one green ball
Set 1 looks more pure than set 2, because set 1 contains mostly red balls and a couple of blue ones, whereas set 2 has many different colors. Next, we devise a measure of impurity that assigns a low value to set 1 and a high value to set 2. This measure of impurity relies on probability. Consider the following question:
If we pick two random elements of the set, what is the probability that they have a different color? The two elements don’t need to be distinct; we are allowed to pick the same element twice.
For set 1, this probability is low, because the balls in the set have similar colors. For set 2, this probability is high, because the set is diverse, and if we pick two balls, they’re likely to be of different colors. Let’s calculate these probabilities. First, notice that by the law of complementary probabilities, the probability that we pick two balls of different colors is 1 minus the probability that we pick two balls of the same color:
P(picking two balls of different colors) = 1 − P(picking two balls of the same color)
Now let’s calculate the probability that we pick two balls of the same color. Consider a general set, where the balls have n colors. Let’s call them color 1, color 2, all the way up to color n. Because the two balls must be of one of the n colors, the probability of picking two balls of the same color is the sum of probabilities of picking two balls of each of the n colors:
P(picking two balls of the same color) = P(both balls are color 1) + P(both balls are color 2) + … + P(both balls are color n)
What we used here is the sum rule for disjoint probabilities, which states the following:
Sum rule for disjoint probabilities: If two events E and F are disjoint, namely, they never occur at the same time, then the probability of either one of them happening (the union of the events) is the sum of the probabilities of each of the events. In other words,
P(E ∪ F) = P(E) + P(F)
Now, let’s calculate the probability that two balls have the same color, for each of the colors. Notice that we’re picking each ball completely independently from the others. Therefore, by the product rule for independent probabilities, the
probability that both balls have color 1 is the square of the probability that we pick one ball and it is of color 1. In general, if p_i is the probability that we pick a random ball and it is of color i, then
Putting all these formulas together, we get

P(picking two balls of different colors) = 1 − p_1² − p_2² − … − p_n²
This last formula is the Gini index of the set.
Finally, the probability that we pick a random ball of color i is the number of balls of color i divided by the total number of balls. This leads to the formal definition of the Gini index.
Gini impurity index: In a set with m elements and n classes, with a_i elements belonging to the i-th class, the Gini impurity index is

Gini = 1 − p_1² − p_2² − … − p_n²

where p_i = a_i/m. This can be interpreted as the probability that if we pick two random elements out of the set, they belong to different classes.
Now we can calculate the Gini index for both of our sets. For clarity, the calculation of the Gini index for set 1 is illustrated in figure 5 (with red and blue replaced by black and white).

Set 1: {red, red, red, red, red, red, red, red, blue, blue} (eight red balls, two blue balls)

Gini index = 1 − (8/10)² − (2/10)² = 0.32

Set 2: {red, red, red, red, blue, blue, blue, yellow, yellow, green} (four red balls, three blue balls, two yellow balls, one green ball)

Gini index = 1 − (4/10)² − (3/10)² − (2/10)² − (1/10)² = 0.70
Notice that, indeed, the Gini index of set 1 is smaller than that of set 2, because set 1 is the purer of the two.
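The formula translates directly into code. Here is a minimal sketch, checked against the two sets of balls:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

set_1 = ["red"] * 8 + ["blue"] * 2
set_2 = ["red"] * 4 + ["blue"] * 3 + ["yellow"] * 2 + ["green"]
print(round(gini(set_1), 2))  # 0.32
print(round(gini(set_2), 2))  # 0.7
```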
How do we use the Gini index to decide which of the two ways to split the data (age or platform) is better? Clearly, if we can split the data into two purer datasets, we have performed a better split. Thus, let’s calculate the Gini index of the set of labels of each of the leaves. Looking at figures 2 and 3, here are the labels of the leaves (where we abbreviate each app by the first letter in its name):
Classifier 1 (by platform):
- Left leaf (iPhone): {A, C, C}
- Right leaf (Android): {A, A, B}
Classifier 2 (by age):
- Left leaf (young): {A, A, A}
- Right leaf (adult): {B, C, C}
The Gini indices of the sets {A, C, C}, {A, A, B}, and {B, C, C} are all the same: 1 − (2/3)² − (1/3)² = 0.444.
The Gini index of the set {A, A, A} is 1 − (3/3)² = 0.
Classifier 1 (by platform):
Average Gini index = ½ (0.444 + 0.444) = 0.444
Classifier 2 (by age):
Average Gini index = ½ (0.444 + 0) = 0.222
We conclude that the second split is better, because it has a lower average Gini index.
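We can verify these numbers with the gini function from the earlier sketch:

```python
# Leaf labels for each split (A = Atom Count, B = Beehive Finder,
# C = Check Mate Mate).
platform_leaves = [["A", "C", "C"], ["A", "A", "B"]]  # classifier 1
age_leaves = [["A", "A", "A"], ["B", "C", "C"]]       # classifier 2

for name, leaves in [("platform", platform_leaves), ("age", age_leaves)]:
    average = sum(gini(leaf) for leaf in leaves) / len(leaves)
    print(name, round(average, 3))  # platform 0.444, age 0.222
```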
Aside: The Gini impurity index should not be confused with the Gini coefficient. The Gini coefficient is used in statistics to calculate the income or wealth inequality in countries. In this post, whenever we talk about the Gini index, we are referring to the Gini impurity index.
Entropy and information gain
Entropy and information gain have beautiful combinatorial properties, and they have useful applications in information theory. If you’d like to learn more about them, I wrote a longer blog post with an accompanying video.
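The details are beyond this excerpt, but for reference, the entropy of a set with class proportions p_1, …, p_n is −p_1 log₂ p_1 − … − p_n log₂ p_n, and it can be used to compare splits in exactly the same way as the Gini index (lower means purer). A minimal sketch:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy: sum of -p_i * log2(p_i) over the class proportions."""
    total = len(labels)
    return sum(-p * log2(p)
               for p in (count / total for count in Counter(labels).values()))

print(entropy(["A", "A", "A"]))  # 0.0 (a pure set)
print(entropy(["A", "C", "C"]))  # ≈ 0.918
```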
Classes of different sizes? No problem: We can take weighted averages
In the previous sections we learned how to perform the best possible split by minimizing average Gini impurity index or entropy. However, imagine that you have a dataset with eight data points (which, when training the decision tree, we also refer to as samples), and you split it into two datasets of sizes six and two. As you may imagine, the larger dataset should count for more in the calculations of Gini impurity index or entropy. Therefore, instead of considering the average, we consider the weighted average, where at each leaf, we assign the proportion of points corresponding to that leaf. Thus, in this case, we would weigh the first Gini impurity index (or entropy) by 6/8, and the second one by 2/8. Figure 6 shows an example of a weighted average Gini impurity index and a weighted average entropy for a sample split.
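Here is a sketch of that weighted average for a hypothetical 6/2 split, reusing the gini function from earlier; because figure 6 isn’t reproduced here, the leaf labels below are made up for illustration.

```python
def weighted_gini(leaves):
    """Weighted average Gini impurity: each leaf is weighted by the
    proportion of samples that landed in it."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * gini(leaf) for leaf in leaves)

# A hypothetical 8-sample split into leaves of sizes 6 and 2.
left = ["A", "A", "A", "A", "B", "B"]  # Gini ≈ 0.444, weighted by 6/8
right = ["B", "B"]                     # Gini = 0, weighted by 2/8
print(round(weighted_gini([left, right]), 3))  # 0.333
```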
Second step to build the model: Iterating
In the previous section, we learned how to split the data in the best possible way using one of the features. That is the bulk of the training process of a decision tree. All that is left to finish building our decision tree is to iterate on this step many times. In this section we learn how to do this.
Using the three methods, accuracy, Gini index, and entropy, we decided that the best split was made using the “age” feature. Once we make this split, our dataset is divided into two datasets. The split into these two datasets, with their accuracy, Gini index, and entropy, is illustrated in figure 7.
Notice that the dataset on the left is pure — all the labels are the same, its accuracy is 100%, and its Gini index and entropy are both 0. There’s nothing more we can do to split this dataset or to improve the classifications. Thus, this node becomes a leaf node, and when we get to that leaf, we return the prediction “Atom Count.”
The dataset on the right can still be divided, because it has two labels: “Beehive Finder” and “Check Mate Mate.” We’ve used the age feature already, so let’s try using the platform feature. It turns out that we’re in luck, because the Android user downloaded Beehive Finder, and the two iPhone users downloaded Check Mate Mate. Therefore, we can split this leaf using the platform feature and obtain the decision node shown in figure 8.
After this split, we are done, because we can’t improve our splits any further. Thus, we obtain the tree in figure 9.
This is the end of our process, and we have built a decision tree that classifies our entire dataset.
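If you’d like to double-check the result with a library, here is a sketch using scikit-learn (which this excerpt doesn’t use). Trained on the encoded toy dataset, it recovers the same tree: a split on age first, followed by a split on platform.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Encode the features: platform (iPhone=0, Android=1) and
# age group (Young=0, Adult=1), one row per user in table 2.
X = [[0, 0], [1, 0], [1, 0], [1, 1], [0, 1], [0, 1]]
y = ["Atom Count", "Atom Count", "Atom Count",
     "Beehive Finder", "Check Mate Mate", "Check Mate Mate"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["platform", "age_group"]))
```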
I hope you enjoyed this blog post! If you want to see more of the book, check it out on Manning’s liveBook platform here.