A primer for…Random Forests

Chapter 2

Isaac Asamoah
Analytics Vidhya
15 min read · Jun 6, 2021


Decisions, decisions. Photo by Jens Lelie on Unsplash

Let’s get coding!

If you want to grow your own decision tree in Python as you read through, jump into this Colab notebook, which has all of the code we’ll walk through today. It also has code to download the sklearn Boston dataset to help you think through a real problem. In the Boston data, this would mean using a random forest to predict the median house price given certain factors. For example, what is the median house price in an area with an average of 4 rooms per dwelling and a per capita crime rate of 4.8? All set? Let’s get into it!

Previously on A primer for…Random Forests

Decision trees

The most important component of a random forest is the decision tree. As we talked about earlier, a random forest is simply a collection of decision trees with some extra rules about how the trees are grown.

Nodes: A decision tree is made up of nodes. Referring back to our 30 second summary, a node is a rule that helps create the regions of the predictor space defined by a decision tree. Let’s explore this more with an example. Remember our toy dataset?

--Code--
display(toy_data)
Our toy data.

Imagine “a” and “c” are predictors of “y”. To create a node, we need to select a predictor and a value to define a rule for this node. An example of such a rule would be “a < 44”. A node with this rule would create two regions in our predictor space:

--Code--
display(toy_data[toy_data['a'] < 44])
display(toy_data[toy_data['a'] >= 44])
Our toy data, split by the root node into two branches.

This is our root node: it splits the training data in two based on its rule. To create a whole decision tree we simply repeat this process on each of the “branches” until we reach a terminal node. The branches are the two subsets of data created by the split. They are created by applying the rule for the node (in this case a < 44) to the data, giving us the two subsets of data above. The terminal nodes are sometimes called leaves. They are the outermost nodes of your tree and they are defined by a set of stopping criteria. These criteria ensure we don’t try to extract too much information from the training data. Remember our definition of a decision tree?

A decision tree involves segmenting the predictor space into simple regions. In order to make a prediction for a given observation, we take the mean (for regression), or the mode (for classification), of the training observations in the region to which it belongs¹.

Imagine splitting your data until you have no data left to split. The region (subset of the data) you use for your prediction would contain at most one data point. Is the mean or mode of one data point useful for prediction? No. No, it’s not. So we set some criteria to make sure we create subsets of the data with minimal variance, but with enough data points for a useful prediction. We’ll go through these criteria in some detail a bit later.
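To make the “mean of the region” idea concrete, here is a minimal sketch using the a < 44 split from above. It assumes, as the rest of the code in this chapter does, that the last column of toy_data is the outcome y.

--Code--
# Minimal sketch: the prediction for any observation falling in the
# region "a < 44" is just the mean of y over that region.
import numpy as np

region = toy_data[toy_data['a'] < 44]
display(np.mean(region.iloc[:, -1]))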

Let’s think about what a node would look like in our sklearn Boston dataset. Open it up by running the “Let’s use some real data” cell in the Colab notebook linked above if you would like to look at the data while we muse on it.

Let’s say we assess the data and find that the best first split is < 3 rooms per dwelling on average. This means that separating data with less than 3 rooms per dwelling on average from data with 3 or more rooms per dwelling on average minimises the variance of the median house price within these subsets. Why is this important?

Because if the variance within each subset is reduced, the variance between the subsets is increased. Put another way, it means the average number of rooms per dwelling may have an impact on median house price. We capture this knowledge in our decision tree by defining the node average rooms per dwelling < 3. Our next node gives us more knowledge about what might impact median house price, and so on, until we have a tree that enables us to predict the median house price!
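If you want to poke at this idea yourself, here is a rough sketch (not the chapter’s code) that compares the spread of median house price within the two subsets created by a candidate rooms-per-dwelling split. It assumes the train DataFrame built in the “Let’s use some real data” cell further below, where RM is the Boston column for average rooms per dwelling; swap in whatever threshold you like.

--Code--
# Rough sketch (assumes the Boston `train` DataFrame defined later in this chapter).
threshold = 6                                    # any candidate split value for RM
left = train[train['RM'] < threshold]
right = train[train['RM'] >= threshold]
display(train['y'].var())                        # spread of median house price before splitting
display(left['y'].var(), right['y'].var())       # spread within each subset created by the split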

Cost Function: But, how do we select the right rules for the nodes? We need some criteria. Let’s think about the purpose of splitting the data. If the prediction we produce with the tree is the mean of the final subset to which the observation of interest belongs, we want the observations in the final split to be as similar as possible. That is, we want low variance.

It follows that the criteria for selecting rules for each node needs to minimise var(y[p < s]) + var(y[p >= s]).

Here y is the outcome variable, p is a predictor and s is the value at which the predictor is split. This gives us the combined variance of the two datasets defined by the node’s splitting rule. For those less familiar with statistics, variance measures the spread of the data around the mean. Greater variance means a bigger range of values. So our cost function helps us determine which split of the data will give us the lowest spread in the subsets created by the split.

Let’s try this with our toy_data. First we define a function to calculate var(y[p < s]) + var(y[p >= s]).

--Code--
import numpy as np

def combined_RSS(y1, y2):

    '''calculate the combined residual sum of squares for y1 and y2'''

    return np.sum((y1 - np.mean(y1))**2) + np.sum((y2 - np.mean(y2))**2)

Pretty simple, right? Just calculate the spread (the residual sum of squares) for each branch and add them together. In our Boston dataset example, this means we are looking at the variance of the median house prices for houses with less than 3 rooms per dwelling on average, plus the variance of the median house prices for houses with 3 or more rooms per dwelling on average. Why do we do this? The sum of the variances gives us a metric to find the best split: the lowest sum of variances. This takes us one step closer to a model that estimates the median house price.
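As a quick sanity check, here is what calling combined_RSS on two small made-up arrays might look like: a tight branch contributes a small number, a spread-out branch a large one.

--Code--
# Illustration with made-up numbers: a tight branch plus a spread-out branch.
display(combined_RSS(np.array([10, 11, 12]), np.array([30, 50, 70])))
# 2.0 (left) + 800.0 (right) = 802.0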

Now we need to generate all the possible splits to assess with the criteria.

--Code--
def get_feature_splits(y, f):

    '''split the y variable by split s in feature f'''

    return ([{'feature' : f.name, 'split' : s, 'left' : y[f < s],
              'right' : y[f >= s]} for s in f])

Here we create a list of dictionaries for each feature-split combination, where each entry contains a feature, a split and the left and right branches created by the split. For example, the list of features would include per capita crime rate and average rooms per dwelling. Possible splits are each value of these features — 3 rooms, 4 rooms, 5.4 crime rate per capita, 7.6 crime rate per capita, etc.
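To see the shape of this output, a call like the one below would list every candidate split of the toy feature a (it assumes, as the rest of the code does, that the last column of toy_data is the outcome y).

--Code--
# Illustrative call: one dictionary per candidate split of feature 'a'.
splits_for_a = get_feature_splits(toy_data.iloc[:, -1], toy_data['a'])
display(splits_for_a[0])   # {'feature': 'a', 'split': ..., 'left': ..., 'right': ...}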

Next, we need to calculate the variance for each possible split.

--Code--
def cost_function(feature_split):

    '''calculate the total RSS for the splits defined by splitting
    feature f at split s'''

    return ({'feature' : feature_split['feature'],
             'split' : feature_split['split'],
             'cost' : combined_RSS(feature_split['left'],
                                   feature_split['right'])
             }
            )

You can see we again created a dictionary for each feature-split combination, but this time we included the cost of the split by calling our cost function combined_RSS. Now that we know the cost of each split, we’re ready to select the best one!
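Chaining it with the previous function, you could score every candidate split of a single feature like this (again assuming the last column of toy_data is the outcome y):

--Code--
# Illustrative: score every candidate split of feature 'a' on the toy data.
costs_for_a = [cost_function(split) for split in
               get_feature_splits(toy_data.iloc[:, -1], toy_data['a'])]
display(costs_for_a[0])    # {'feature': 'a', 'split': ..., 'cost': ...}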

Node Selection: Finally we need to select the feature split with the minimum variance — this is our first node.

--Code--
import math
import random as rand

def flatten_list(l):
    return [item for sublist in l for item in sublist]

def get_split_costs(df):

    '''calculate the cost for each split'''

    rand.seed(0)
    return flatten_list(
        [[cost_function(split) for split in get_feature_splits(
            df.iloc[:, -1], df.iloc[:, :-1].loc[:, feature])] for
         feature in rand.sample(list(df.iloc[:, :-1].columns),
                                int(math.sqrt(len(list(df.iloc[:, :-1].columns)))))])

def select_node(split_costs):

    '''select the lowest cost node'''

    rand.seed(0)
    return (rand.choice(
        [{'feature' : split['feature'],
          'split' : split['split'],
          'cost' : split['cost']} for split in split_costs
         if split['cost'] == np.min([split['cost'] for split in
                                     split_costs])]))

There is a bit more going on here, so let’s break it down. The first function is just a helper to flatten lists. Because we are iterating through a list of lists in get_split_costs - all splits for all features - we end up with a nested list that is awkward to work with, so we flatten it. The second function, get_split_costs, executes cost_function on all the feature splits generated by get_feature_splits. Because this is a random forest tree, this function also limits the eligible features at each split to a random sample. You can see that we use the square root of the number of features as the size of our random feature sample.
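To put a number on that: the Boston data has 13 predictors, so each split only gets to look at int(sqrt(13)) = 3 of them, chosen at random. A small illustration:

--Code--
# Illustration: with 13 Boston predictors, each split samples int(sqrt(13)) = 3 of them.
import math
import random as rand

boston_features = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                   'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
n_sampled = int(math.sqrt(len(boston_features)))
display(n_sampled)                                # 3
display(rand.sample(boston_features, n_sampled))  # a random trio of features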

Limiting the number of features that can be selected is super important for reducing correlation. We talked about selecting a random sample of features earlier as the defining attribute of the random forest. But imagine, for a moment, that you took a random sample (without replacement) where the sample size was equal to the total number of features. You’d end up back where you started, with a bagged tree! We need to limit the number of features that can be randomly selected. The implication of this decision is that the strongest predictor may not even be considered for many of the splits¹. As you might have guessed, growing trees based on different features, split in different ways, leads to less correlation among the trees: the secret weapon of the random forest!

Finally, select_node scans all the split variances and selects the split with the lowest variance. The rand.choice function selects one split at random if two or more splits have the same variance; this is unlikely as our dataset grows, but it happens in our tiny toy dataset. And we have our first node! Let's run it on toy_data and see what we get.

--Code--
root_node = select_node(get_split_costs(toy_data))
display(root_node)
--Output--
{'feature': 'c', 'split': 10, 'cost': 4268.222222222222}
Subsets of the Toy data created by the root node.

For the toy_data, the root node is defined by the rule c < 10, and the combined variance of the branches is 4268.22. Our toy data is split in two according to the rule c < 10. It’s easy to see this split isn’t super useful since our right branch is just one observation; we’ll talk about how to handle this in the next section. Once we increase the size of our sample and use real data, our splits will start to look better. To prove this, let’s find the root node for our sklearn Boston dataset.

--Code--
# Setup our Boston dataset
import pandas as pd
from sklearn import datasets

y = datasets.load_boston()['target']
X = datasets.load_boston()['data']
description = datasets.load_boston()['DESCR']
columns = datasets.load_boston()['feature_names']

data = pd.DataFrame(X)
data.columns = columns
data.loc[:, 'y'] = y
train = data.sample(frac = 0.7, random_state = 0)

# Find our root node!
root_node_boston = select_node(get_split_costs(train))
display(root_node_boston)
--Output--
{'feature': 'LSTAT', 'split': 9.74, 'cost': 15747.269329572293}

The subsets created by the split are too big to show here, but let’s look at how many values fall into each subset.

--Code--
display(len(train[train['LSTAT'] < 9.74]))
display(len(train[train['LSTAT'] >= 9.74]))
--Output--
148
206
Where LSTAT: % lower status of the population

As you can see, this is a much more useful split, with 148 and 206 values. This root node is telling us that the feature and split value that create the lowest variance in median house price within the resulting subsets of the data is LSTAT (% lower status of the population) at 9.74%. And there we have it: our root node!

Terminal nodes: Once we have the root node, we simply repeat the above process on each branch until a terminal node is reached. What makes a node terminal? We need some more criteria!

Stopping criteria: It turns out the criteria for a terminal node are really simple: tree depth and node size. Tree depth refers to the number of times the training data has been split. Node size is the number of observations in a given node; that is, the number of observations that satisfy the rule defining the node.

Maximum Tree Depth: So to define the terminal node, first we set a maximum tree depth, which controls tree complexity. The more complex the tree is, the more it is influenced by the training data. This can lead to overfitting. Overfitting leads to high model variance. Generally we want to keep variance low, while also managing model bias. A biased model isn’t influenced enough by the training data. So we want a balance of the two: a model that is influenced just enough by the training data. But “just enough” is tricky to define! We want to glean as much information as we can from the training data, without compromising the model’s usefulness when applied to data it has never seen before.

Minimum Node Size: Next we set a minimum node size. Because we take the mean (or mode in classification) of the terminal node, we don’t want the node size to be too small, otherwise the mean/mode will not be very useful. Also, if there are only one or two observations in our terminal node, we have probably followed the training data too closely and the model will suffer from overfitting. Let’s avoid that!
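Put together, the stopping check is nothing more than a couple of comparisons. Here is a small hypothetical helper, only to spell the two criteria out; the grow_tree function we write below performs these checks inline rather than calling a helper like this.

--Code--
# Hypothetical helper, only to make the two stopping criteria explicit.
# The grow_tree function below performs these checks inline.
def should_stop(node_y, depth, max_depth = 2, min_size = 3):
    '''Return True when either stopping criterion is met.'''
    return depth >= max_depth or len(node_y) <= min_size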

Recursion: OK, so now we know how to stop. How do we efficiently repeat the branching process? We use recursion. Recursion is just repetition, but it can be tricky because it is a process that repeats itself, so the logic can get slippery. I won’t go into recursion deeply here; suffice to say it’s worth looking into if you want to build a decision tree. Here is a good place to start your recursive journey. Just to make it a little trickier, we will be using tree-based recursion, because we are building a tree. The article above introduces you to general recursion and tree-based recursion, so have a read of that and I will see you when you are done.

So, now that you have crushed recursion (or have decided you don’t care how it works!), let’s grow a decision tree.

A recursive process has two parts, the base case and the recursive case.
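Here is a tiny warm-up example (not part of the decision tree code) showing both parts on a nested dict shaped like the trees we are about to grow: the base case returns a leaf value, the recursive case calls the function on each branch and combines the results.

--Code--
# Warm-up: collect the leaf values of a nested dict tree.
def collect_leaves(tree):
    if not isinstance(tree, dict):            # base case: we've hit a leaf value
        return [tree]
    return (collect_leaves(tree['left']) +    # recursive case: recurse into each
            collect_leaves(tree['right']))    # branch and combine the results

display(collect_leaves({'left': {'left': 69.0, 'right': 47.125}, 'right': 20.0}))
# [69.0, 47.125, 20.0]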

The stopping criteria tell us which to apply: when one of the stopping criteria is met, return the base case; if not, return the recursive case. Let’s do it!

First, let’s write a helper function to handle the base case. It’s really simple.

--Code--
def terminal_node(node_y):

    '''return the prediction for the terminal node'''

    return np.mean(node_y)

Done. Return the mean. Now, the fun part!

--Code--
def grow_tree(df, node, depth, max_depth = 2, min_size = 3):

    '''recursively grow a decision tree by applying the function to
    each node it returns until a stopping criterion is met'''

    # the two subsets of the data created by the node's rule
    left = df.loc[df.loc[:, node['feature']] < node['split']]
    right = df.loc[df.loc[:, node['feature']] >= node['split']]

    # base case 1: no split possible
    if left.empty or right.empty:
        return terminal_node(
            list(left.iloc[:, -1]) + list(right.iloc[:, -1]))

    # base case 2: maximum depth reached
    elif depth >= max_depth:
        return {'node' : node,
                'left' : terminal_node(left.iloc[:, -1]),
                'right' : terminal_node(right.iloc[:, -1])}

    # recursive case (with base case 3, minimum node size, checked per branch)
    else:
        return {'node' : node,

                'left' : (lambda x: terminal_node(list(x.iloc[:, -1]))
                          if len(x.iloc[:, -1]) <= min_size
                          else grow_tree(x, select_node(get_split_costs(x)),
                                         depth + 1, max_depth,
                                         min_size))(left),

                'right' : (lambda x: terminal_node(list(x.iloc[:, -1]))
                           if len(x.iloc[:, -1]) <= min_size
                           else grow_tree(x, select_node(get_split_costs(x)),
                                          depth + 1, max_depth,
                                          min_size))(right)
                }

Base Case: Let’s look at this function step by step. The first statements are just naming the left and right node of the first split (the node input to this function is the result of the select_node function). Next is our first base case. I say first because we have 3 possible stopping conditions, which gives us three scenarios in which we return terminal node, which is our base case output.

So our first base case criterion is no split possible. We haven’t discussed this yet, and it is pretty simple: the split results in one branch rather than two. This means either all the values of the input satisfy the rule or none of them do. If that happens we return the mean of the node. Done. If you are wondering why we apply the terminal node function to the concatenation of left and right, it enables us to complete the calculation in one if statement rather than two.

The second base case criterion is max depth. You can see we have depth and max depth as inputs. If depth is greater than or equal to max depth, return the current left and right nodes as terminal nodes.

The third base case criterion is embedded in the recursive case. This is an artefact of tree-based recursion. The third criterion applies to each of the branches created by the input node separately. We simply check the size of each node; if it’s less than or equal to min_size, we return the terminal node for that branch.

To understand why we apply the criteria separately, think about the case where the left branch is less than the minimum size but the right is not.

So that is all of our base cases, on to the recursive case!

Recursive Case: The recursive case says “If neither of the outer stopping criteria is met and the inner criterion is not met, return yourself with the depth increased by one.” This is done for the left branch and the right branch separately. That’s it.

To understand why this approach works, let’s work through one recursion. If you have read the recursion article referenced above, recall the discussion about the call stack. That is the key to understanding why this function works. It was for me anyway!

Imagine we call grow_tree(df = toy_data, node = root_node, depth = 1, max_depth = 2, min_size = 3) on our toy data (below) with our root node {‘feature’: ‘c’, ‘split’: 10, ‘cost’: 4268.222222222222}.

Root node applied to our toy data.

This function call is the first layer in our stack! Let’s evaluate the stopping criteria: left.empty or right.empty is false, depth >= max_depth is false, and the left node has more than 3 values, so we need to call the function on our left branch. This call is the next layer of our stack. Let’s work through it!

--Code--
left = toy_data.loc[toy_data.loc[:, root_node['feature']] <
                    root_node['split']]
grow_tree(left, select_node(get_split_costs(left)), 2, 2, 3)
--Output--
{'node': {'feature': 'c', 'split': 2, 'cost': 3842.875},
'left': 69.0,
'right': 47.125}

The first base case criterion, left.empty or right.empty, is false, but depth >= max_depth is true! We return the best split for left, i.e. select_node(get_split_costs(left)), with its left and right terminal nodes. That layer of the stack is popped!

Now we move to the next step in our base stack — the right branch. The same thing happens, we add a stack and evaluate the criteria.

--Code--
right = toy_data.loc[toy_data.loc[:, root_node['feature']] >=
                     root_node['split']]
grow_tree(right, select_node(get_split_costs(right)), 2, 2, 3)
--Output--
20.0

The first base case condition, left.empty or right.empty, is true! We pop this layer of the stack and return the mean of the node. There is only one value in the node (because our data is so tiny), so that's really easy!

Now we are back at our base stack! We pop it and return our brand new depth-2 decision tree. Nice work!

Let’s run our grow_tree function all the way through and inspect our first decision tree!!

--Code--
grow_tree(df = toy_data, node = root_node, depth = 1, max_depth = 2, min_size = 3)
--Output--
{'node': {'feature': 'c', 'split': 10, 'cost': 4268.222222222222},
'left': {'node': {'feature': 'c', 'split': 2, 'cost': 3842.875},
'left': 69.0,
'right': 47.125},
'right': 20.0}

Now let’s see how this looks on real data.

--Code--
root_node_boston = select_node(get_split_costs(train))
grow_tree(df = train, node = root_node_boston, depth = 1, max_depth = 2, min_size = 3)
--Output--
{'node': {'feature': 'LSTAT', 'split': 9.74, 'cost': 15747},
'left': {'node': {'feature': 'LSTAT','split': 4.67,'cost': 6432},
'left': 39.27,
'right': 26.19},
'right': {'node': {'feature': 'LSTAT','split': 16.14,'cost': 3082},
'left': 20.20,
'right': 14.35}}
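This chapter stops at growing the tree, but if you are wondering how a nested dict like this would be used to make a prediction (as per the definition at the top of the chapter), here is a small sketch, not part of the article's code: walk the dict, going left when the observation satisfies the node's rule, until you hit a leaf value.

--Code--
# Sketch only (not part of this chapter's code): traverse a grown tree to
# predict for one observation (a pandas Series or dict of feature values).
def predict(tree, observation):
    if not isinstance(tree, dict):                  # leaf: this is the prediction
        return tree
    go_left = observation[tree['node']['feature']] < tree['node']['split']
    return predict(tree['left'] if go_left else tree['right'], observation)

# e.g. boston_tree = grow_tree(df = train, node = root_node_boston, depth = 1)
#      predict(boston_tree, train.iloc[0])   # returns one of the four leaf means above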

And there we have it, a recursive function to grow a decision tree. We have our first building block! Next up, in chapter three we are going to bag this decision tree to give us a forest….and then we’ll make it random… Can’t wait!

In the meantime, try growing your own decision tree. Write your own code or use this notebook (it includes all the code we’ve worked through here) as a starting point. Try it out on some public datasets. Who knows what secrets you will find!

Notes
I have used rand.seed(0) to make sure the decision results are reproducible. Try removing it and you'll see that you may get a different decision tree from time to time. As the number of features increases, this variation will increase.

I have also used a fixed random state (random_state = 0) when selecting my train set from the Boston dataset, for the same reason. If you're not able to reproduce the results, please let me know!

If you use Google Colab, note that it seems to re-order the dict outputs by default; that is, the dicts don't display in the order intended. I haven't figured out how to fix it yet. If you figure it out, or have come across this before, let me know!

References

1. Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. New York: Springer, 2013.
