A Guide to Machine Learning in R for Beginners: Decision Trees

This is part 6 of my beginner’s series on Machine Learning in R

Parul Pandey
Aug 31, 2018 · 16 min read

Introduction

Joel Grus, in his book Data Science from Scratch, uses a very interesting example to help his readers understand the concept of Decision Trees. Since the example is too perfect not to share, I shall quote it here. He says: as children, how many of you remember playing the game of twenty questions? In this game, one child would think of an animal, a place, a famous personality, etc. The others would ask questions to guess it. The game would go something like this:

“I am thinking of an animal”

“Does it have more than five legs?”

“No”

“Is it delicious?”

“No”

“Does it appear on the back of the Australian 5 cent coin?”

“Yes”

“Is it an echidna?”

“Yes, it is!”

Now let’s create a slightly more elaborate graph for the “Guess the Animal” game we just played.

Source: Data Science from Scratch: First Principles with Python

This is exactly how we would create a Decision Tree for any Data Science problem. Now let us study the math behind it in detail.

What is a Decision Tree?

A Decision Tree is a supervised learning predictive model that uses a set of binary rules to calculate a target value. It is used for either classification (categorical target variable) or regression (continuous target variable). Hence, it is also known as CART (Classification & Regression Trees). Some real-life applications of Decision Trees include:

Structure of a Decision Tree

Decision trees have three main parts: a root node, internal (decision) nodes connected by branches, and leaf (terminal) nodes.

The root node is the starting point of the tree. The root and internal nodes contain questions or criteria to be answered, while the terminal (leaf) nodes hold the final outcomes. Each decision node typically has two or more branches extending from it. For example, if the question in the first node requires a “yes” or “no” answer, there will be one branch for a “yes” response and another for “no.”

PC: Analytics Vidhya

The Algorithm behind Decision Trees

The decision tree algorithm works by repeatedly partitioning the data into multiple sub-spaces so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called recursive partitioning. The produced result consists of a set of rules used for predicting the outcome variable, which can be either continuous (regression) or categorical (classification).

The decision rules generated by the CART (Classification & Regression Trees) predictive model are generally visualized as a binary tree.

Let’s look at an example to understand it better. The plot below shows sample data for two independent variables, x and y, and each data point is colored by the outcome variable, red or grey.

CART tries to split this data into subsets so that each subset is as pure or homogeneous as possible. The first three splits that CART would create are shown here.

If a new observation fell into any of these subsets, its prediction would be decided by the majority of the observations in that particular subset.

Let us now see how a Decision Tree algorithm generates a TREE. The tree for the splits we just generated is shown below.

The third split checks whether or not the variable x is less than 85. If yes, then the model says to predict red, and if no, the model says to predict grey.
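To make this concrete, here is a minimal sketch of the same idea in R. The article’s actual x/y data isn’t provided, so the data below is simulated with a made-up labelling rule; rpart then recovers the splits by recursive partitioning.

# A minimal sketch (simulated data, not the article's exact example):
# two independent variables x and y, an outcome coloured red or grey,
# and rpart recovering the splits by recursive partitioning.
library(rpart)

set.seed(1)
x <- runif(200, 0, 100)
y <- runif(200, 0, 100)
outcome <- factor(ifelse(x < 60 & y >= 20, "red", "grey"))  # made-up rule
toy <- data.frame(x, y, outcome)

# method = "class" builds a classification tree
fit <- rpart(outcome ~ x + y, data = toy, method = "class")
print(fit)  # each node shows its split rule and the majority class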

Advantages of Decision Trees

Predictions from Decision Trees

In the above example, we discussed Classification Trees, i.e., trees used when the output is a factor/category: red or grey. Trees can also be used for regression, where the output at each leaf of the tree is no longer a category but a number. These are called Regression Trees.

Classification Trees:

With Classification Trees, we report the majority outcome at each leaf of our tree. However, instead of just taking the majority outcome to be the prediction, we can also compute the percentage of data in that subset belonging to each type of outcome.
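As a quick illustration of both kinds of prediction (using R’s built-in iris data rather than the article’s example), rpart can return either the majority class or the class proportions at a leaf:

# Illustration with the built-in iris data (not the article's dataset):
library(rpart)
iris_tree <- rpart(Species ~ ., data = iris, method = "class")

# Majority outcome at the leaf each observation falls into
head(predict(iris_tree, type = "class"))

# Proportion (percentage) of each class in that leaf
head(predict(iris_tree, type = "prob"))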

Let us understand it through the same example that we used above.

The above dataset has been split into four subsets.

Predictions for Subset 1:

Regression Trees:

To predict the outcome in this case, since we have a continuous output variable, we simply report the average of the values at that leaf. For example, if we had the values 3, 4, and 5 at one of the leaves, we would just take their average, i.e., 4.

Let us see it graphically.

In the above graph:

y = the outcome/target variable, i.e. the variable we are trying to predict

x = the independent variable

First, let’s fit a linear regression to this data set. By doing so, we obtain a line.

As is quite evident, linear regression does not do very well on this data set.

However, we can notice a very interesting feature: the data lies in three different groups. If we draw lines here, we see that x is either less than 10, between 10 and 20, or greater than 20.

We recall that Decision Trees can fit this kind of problem easily. With splits at roughly x = 10 and x = 20, the tree simply predicts the average of y within each of the three groups, as the sketch below illustrates.
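The sketch below simulates data of this shape (three groups, with made-up levels for y, since the graph’s data isn’t available) and fits a regression tree with rpart; each leaf simply predicts the average of y in its group.

# A minimal sketch with simulated data shaped like the graph above:
# y takes three different levels depending on whether x is below 10,
# between 10 and 20, or above 20.
library(rpart)

set.seed(42)
x <- runif(300, 0, 30)
y <- ifelse(x < 10, 5, ifelse(x < 20, 20, 10)) + rnorm(300, sd = 1)
toy <- data.frame(x, y)

# method = "anova" builds a regression tree
reg_tree <- rpart(y ~ x, data = toy, method = "anova")
print(reg_tree)  # yval at each leaf is the mean of y in that region

# Predictions for new points are just the leaf averages
predict(reg_tree, newdata = data.frame(x = c(5, 15, 25)))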

Measures Used for Split

There are different measures used to decide which split is best at each step, and different ways to control how many splits are generated.

1. Gini Index: It is a measure of the inequality (impurity) of a distribution. It is based on the idea that if we select two items from a population at random, they must be of the same class; the probability of this is 1 if the population is pure.

Process to calculate the Gini measure:

Gini = P(1)² + P(2)² + … + P(k)²

where P(j) is the probability of class j and k is the number of classes. This sum equals 1 for a pure node; the corresponding Gini impurity is 1 minus this sum.
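As a small, hypothetical helper (not from the article), the measure above can be computed in R from the class labels of a node:

# Hypothetical helper computing the Gini measure described above:
# the probability that two randomly picked items are of the same class.
gini_measure <- function(classes) {
  p <- table(classes) / length(classes)  # P(j) for each class j
  sum(p^2)                               # equals 1 for a pure node
}

gini_measure(c("red", "red", "red", "red"))   # pure node -> 1
gini_measure(c("red", "grey", "red", "grey")) # 50/50 split -> 0.5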

2. Entropy: Entropy is another way to measure impurity. It is defined as

Entropy = -Σ P(j) log2 P(j)

where the sum runs over the classes j and P(j) is the probability of class j.

Less impure nodes require less information to describe them, and more impure nodes require more information. If the sample is completely homogeneous, the entropy is zero, and if the sample is equally divided between two classes, it has an entropy of one.

3. Information Gain: Information Gain is simply a mathematical way to capture the amount of information one gains (or the reduction in randomness) by splitting on a particular attribute.

In a decision tree algorithm, we start at the tree root and split the data on the feature that results in the largest information gain (IG). In other words, IG tells us how important a given attribute is.

The Information Gain (IG) can be defined as follows:

IG = I(D(p)) - (N(left)/N) * I(D(left)) - (N(right)/N) * I(D(right))

where I can be entropy or the Gini index; D(p), D(left) and D(right) are the datasets of the parent, left and right child nodes; and N, N(left) and N(right) are the corresponding numbers of observations.
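A short sketch of the same definition in R, using entropy as the impurity measure I and a hypothetical split, might look like this (the helper names are made up for illustration):

# Hypothetical sketch of Information Gain with I = entropy.
entropy <- function(classes) {
  p <- table(classes) / length(classes)
  p <- p[p > 0]                     # avoid log2(0)
  -sum(p * log2(p))                 # 0 for a pure node, 1 for a 50/50 split
}

info_gain <- function(parent, left, right) {
  n <- length(parent)
  entropy(parent) -
    (length(left) / n) * entropy(left) -
    (length(right) / n) * entropy(right)
}

parent <- c(rep("red", 5), rep("grey", 5))
left   <- c(rep("red", 4), "grey")   # mostly red after the split
right  <- c(rep("grey", 4), "red")   # mostly grey after the split
info_gain(parent, left, right)       # positive: the split reduces impurity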

In R, a parameter that controls this is minbucket, the minimum number of observations allowed in a terminal node. The smaller it is, the more splits will be generated. However, if it is too small, overfitting will occur; and if it is too large, the model will be too simple and accuracy will be poor.


Decision Trees in R

We will be working on the famous Boston housing dataset. This data comes from the paper “Hedonic Housing Prices and the Demand for Clean Air”, which explored the relationship between house prices and clean air in the late 1970s. We are initially interested in building a model of how prices vary by location across a region.

Dataset

We will explore the boston.csv data set with the aid of trees. Download the file from here to follow along. Each entry of the dataset corresponds to a census tract, and the following variables are recorded for each tract:

LON and LAT are the longitude and latitude of the center of the census tract.
MEDV is the median value of owner-occupied homes, measured in thousands of dollars.
CRIM is the per capita crime rate.
ZN is related to how much of the land is zoned for large residential properties.
INDUS is the proportion of the area used for industry.
CHAS is 1 if a census tract is next to the Charles River and 0 otherwise.
NOX is the concentration of nitrous oxides in the air, a measure of air pollution.
RM is the average number of rooms per dwelling.
AGE is the proportion of owner-occupied units built before 1940.
DIS is a measure of how far the tract is from centres of employment in Boston.
RAD is a measure of closeness to important highways.
TAX is the property tax per $10,000 of value.
PTRATIO is the pupil-to-teacher ratio by town.

Here MEDV is the output/target variable, i.e., the price of the house to be predicted. Since the output variable is continuous, this is an example of a regression tree problem.

Working

1. Analyzing the Data

> boston = read.csv('boston.csv')
> str(boston)
'data.frame': 506 obs. of  16 variables:
$ TOWN : Factor w/ 92 levels "Arlington","Ashland",..: 54 77 77 46 46 46 69 69 69 69 ...
$ TRACT : int 2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
$ LON : num -71 -71 -70.9 -70.9 -70.9 ...
$ LAT : num 42.3 42.3 42.3 42.3 42.3 ...
$ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
$ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ CHAS : int 0 0 0 0 0 0 0 0 0 0 ...
$ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ RM : num 6.58 6.42 7.18 7 7.15 ...
$ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
$ RAD : int 1 2 2 3 3 3 5 5 5 5 ...
$ TAX : int 296 242 242 222 222 222 311 311 311 311 ...
$ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...

There are 506 observations corresponding to 506 census tracts in the Greater Boston area. We are interested in building a model of how prices vary by location across a region. So, let’s first see how the points are laid out by plotting the latitude and longitude of each of our census tracts.

# Plot observations
> plot(boston$LON, boston$LAT)

The dense central core of points corresponds to the city of Boston and other nearby urban areas. Since we also have the Charles River attribute (CHAS), we want to also show all the points that lie along the Charles River in blue.

We can do this with the points command.

# Tracts alongside the Charles River
> points(boston$LON[boston$CHAS==1], boston$LAT[boston$CHAS==1], col="blue", pch=19)

Now we have plotted the tracts in Boston that lie along the Charles River.

What other things can we do?

Well, this data set was originally constructed to investigate questions about how air pollution affects prices.
The air pollution variable in the data is NOX. Let’s have a look at the distribution of NOX.

# Plot pollution/NOX
> summary(boston$NOX)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3850  0.4490  0.5380  0.5547  0.6240  0.8710

The minimum value is 0.385 and the maximum value is 0.871. The median and the mean are about 0.53 and 0.55 respectively. So let’s just use the value of 0.55, as it is roughly the centre of the distribution.

Let’s look at the tracts that have above-average pollution.

> points(boston$LON[boston$NOX>=0.55], boston$LAT[boston$NOX>=0.55], col="green", pch=20)

All the points that have above-average pollution are coloured green. This makes sense, since the most densely polluted area is also the most densely populated one.

Now let us look at how the prices vary over the area as well. We can do this with the help of the MEDV variable, using the same approach we used for plotting the pollution.

# Plot prices
> plot(boston$LON, boston$LAT)
> summary(boston$MEDV)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.00 17.02 21.20 22.53 25.00 50.00
> points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)
above median values graph

So what we see now are all the census tracts with above-median housing prices in red. The census tracts with above-median and below-median prices are mixed in between each other, but there are some patterns. For example, look at that dense black bit in the middle. That corresponds to most of the city of Boston, especially the southern parts of the city. So there’s definitely some structure to it, but it’s certainly not simple in relation to latitude and longitude, at least.

2. Applying Linear Regression to the problem

Since this is a regression problem (the target value to be predicted, the house price, is continuous), it is natural to first try the Linear Regression algorithm. However, we saw in the last graph that house prices were distributed over the area in an interesting, decidedly non-linear way, so we suspect Linear Regression is not going to work very well here. Let’s back up our intuition with facts.

Here we are plotting the relationship between latitude and house prices, and between longitude and house prices, both of which look pretty nonlinear.

# Linear Regression using LAT and LON
> plot(boston$LAT, boston$MEDV)
> plot(boston$LON, boston$MEDV)

Linear Regression Model

> latlonlm = lm(MEDV ~ LAT + LON, data=boston)
> summary(latlonlm)
Call:
lm(formula = MEDV ~ LAT + LON, data = boston)
Residuals:
Min 1Q Median 3Q Max
-16.460 -5.590 -1.299 3.695 28.129
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3178.472 484.937 -6.554 1.39e-10 ***
LAT 8.046 6.327 1.272 0.204
LON -40.268 5.184 -7.768 4.50e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.693 on 503 degrees of freedom
Multiple R-squared: 0.1072, Adjusted R-squared: 0.1036
F-statistic: 30.19 on 2 and 503 DF, p-value: 4.159e-13

Let’s see how this linear regression model looks on a plot. We shall plot the census tracts again and then mark the tracts with above-median house prices with bright red dots. The red dots tell us the actual locations in Boston where houses are costly. We shall then compare this with the Linear Regression predictions, plotted with a blue $ sign.

# Visualize regression output
> plot(boston$LON, boston$LAT)
> points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)
> latlonlm$fitted.values
> points(boston$LON[latlonlm$fitted.values >= 21.2], boston$LAT[latlonlm$fitted.values >= 21.2], col="blue", pch="$")

The linear regression model has plotted a dollar sign every time it thinks the census tract has an above-median value. It is almost a sharp line that the linear regression defines. Also notice the shape is almost vertical, since the latitude variable was not very significant in the regression. The blue $ signs and the red dots do not overlap, especially in the east.

It turns out, the linear regression model isn’t really doing a good job. And it has completely ignored everything to the right side of the picture. So that’s interesting and pretty wrong.

3. Applying Regression Trees to the problem

We’ll first load the rpart library, and also install and load the rpart.plot plotting library.

# Load CART packages
> library(rpart)
# Install the rpart.plot package
> install.packages("rpart.plot")
> library(rpart.plot)

We will build a regression tree in the same way we would build a classification tree, using the rpart command. We will be predicting MEDV as a function of latitude and longitude, using the boston dataset.

# CART model
> latlontree = rpart(MEDV ~ LAT + LON, data=boston)
# Plot the tree using prp command defined in rpart.plot package
> prp(latlontree)
Regression Tree

The leaves of the tree are important: in a regression tree, each leaf reports the average of MEDV for the observations that fall into that bucket, and that average is the prediction for any new observation landing there.
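To see those leaf averages as numbers rather than just the plot, we can print the fitted tree object (standard rpart behaviour, not shown in the original article):

# Printing the tree lists each node with its split rule, the number of
# observations n, the deviance, and yval, the average MEDV at that node.
> print(latlontree)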

Now, let us visualise the output. We’ll again plot the points with above-median prices, just like we did for Linear Regression.

# Visualize output
> plot(boston$LON, boston$LAT)
> points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)

The above plot shows the actual known prices; it is the same plot of red dots that we observed earlier. Now we want to see what the tree predicts to be above the median house price. We’ll call these the fitted values, which can be obtained by using the predict command on the tree we just built.

> fittedvalues = predict(latlontree)
> points(boston$LON[fittedvalues>=21.2], boston$LAT[fittedvalues>=21.2], col="blue", pch="$")

Wherever the fitted value is at least 21.2 (the median), we plot the point in blue with a $ character to signify price.

The Regression Tree has done a much better job: its predictions largely overlap the red dots. It has left the low-value area in Boston out, and has correctly managed to classify some of those points in the bottom right and top right.

But the tree we obtained was very complicated and overfit the data. How do we avoid overfitting? By changing the minbucket size. So let’s build a new tree using the rpart command again.

Simplifying the Tree by increasing minbucket


> latlontree = rpart(MEDV ~ LAT + LON, data=boston, minbucket=50)
> plot(latlontree)
> text(latlontree)
tree with minbucket = 50

Here we have far fewer splits, and it’s far more interpretable.

We have seen that regression trees can do things that we would never expect linear regression to do. Now let’s see how regression trees can help us build a predictive model for house prices.

4. Prediction with Regression Trees

We’re going to try to predict house prices using all the variables we have available to us.

Steps:

# Split the data
> library(caTools)
> set.seed(123)
> split = sample.split(boston$MEDV, SplitRatio = 0.7)
> train = subset(boston, split==TRUE)
> test = subset(boston, split==FALSE)

Our training data is the subset of the boston data where split is TRUE, and the testing data is the subset of the boston data where split is FALSE.

# Create a CART model
> tree = rpart(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data=train)
> prp(tree)

Results:

Regression Tree Predictions

> tree.pred = predict(tree, newdata=test)
> tree.sse = sum((tree.pred - test$MEDV)^2)
> tree.sse
4328.988
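For context (this comparison is not in the original article), we can fit a linear regression on the same training set with the same variables and compute its sum of squared errors on the test set; the exact numbers will depend on the data and the random split, but it lets us judge how much the tree helps.

# For comparison (not from the original article): a linear regression on the
# same training data, evaluated with the same sum-of-squared-errors metric.
# The exact value will depend on your data and on the random split.
> lm.boston = lm(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data=train)
> lm.pred = predict(lm.boston, newdata=test)
> lm.sse = sum((lm.pred - test$MEDV)^2)
> lm.sse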

Conclusion

Even though Decision Trees appear to work very well in certain conditions, they come with their own perils: the model has a very high chance of over-fitting. In fact, this is the key challenge with Decision Trees. If no limit is set, in the worst case the tree will end up putting each observation into its own leaf node.

There are techniques which help in improving the performance of Decision Trees, but we will learn about them in the next article. We will learn to improve the results of our Decision Trees using the “cp” parameter. Also, we will try and implement Cross-Validation, a technique to avoid overfitting in predictive models. Finally, we will dive into Ensembling, i.e., combining the results of multiple models to solve a given prediction or classification problem.

So stay tuned for the next part.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Thanks to Prateek Joshi

Written by Parul Pandey

Data Science+Community+Evangelism @H2O.ai | Linkedin: https://www.linkedin.com/in/parul-pandey-a5498975/

