Random Forests Explained Intuitively

Guru Pradeep Reddy
6 min read · Feb 1, 2019


Random forest is one of the most powerful machine learning techniques because of its ability to work well on most datasets. For this reason it is the go-to model for many Kaggle users, along with gradient boosting machines. This post explains how random forests work and how their parameters can be tuned to make them generalize better.

What is a random forest, and how is it different from classical machine learning models?

Before getting to random forests, we need to understand what a decision tree is, because decision trees form the foundation of random forests.

A Decision Tree, as the name implies, is a tree-like model which allows us to split the dataset into multiple small groups by taking simple decisions at each level. This allows us to arrive at the final decision by walking down the tree. Decision Trees are produced by training algorithms, which identify how the data can be split in the best possible way.

It divides the feature space into a number of small regions, and it does so using a set of splitting rules that can be summarised in the form of a tree. This is why it is called a Decision Tree.

Decision Trees are different from other statistical models like linear regression. They make few, if any, statistical assumptions: they do not assume that your data is normally distributed, that the relationship is linear, or that you have specified interactions, and you don't have to normalize the features for them to work well. This makes them useful for modelling complex relationships.

Random forest is nothing but an ensemble of decision trees. This ensemble of Decision Trees is then used to predict the output value. Random forests can perform both classification and regression tasks. More details about how the ensemble is constructed are given in the Bagging section of this post.

Visualizing a single decision tree

For the purpose of this post, all the visualizations are based on the famous Iris dataset. Here we try to predict the species of a flower using certain properties. The dataset includes three iris species with 50 samples each, as well as some properties of each flower. To make it easier to visualize, only the petal width and petal length features are used.

Given below is a small decision tree which tries to make predictions based on the petal length and width. When we are given a new flower, we start at the root node (depth 0); the first decision we have to make is whether the petal length is smaller than 2.45 cm. If it is, then we move to the root's left child. This is a leaf node, so it doesn't have any more splits, and we can simply assign the flower the value 0 (Class 1).

Now assume that the petal length is greater than 2.45 cm; then we move to the root's right child (depth 1, right), which is not a leaf node. So we need to make one more split, this time based on the petal width. If the petal width is less than 1.75 cm, we move to its left child (depth 2), which is a leaf node, so we assign the flower the value 1 (Class 2). Similarly, if the petal width is greater than 1.75 cm, the flower is assigned the value 2 (Class 3, depth 2, right).

Iris Decision Tree. Darker color indicates a higher value.
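Here is a minimal sketch (assumed code, not the author's original script) of how such a tree can be fit with scikit-learn on the two petal features, and then used to follow the walkthrough above for one flower.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]   # petal length (cm), petal width (cm)
y = iris.target        # 0, 1, 2 for the three species

# A shallow tree, matching the max_depth=2 figure above.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)

# Print the learned splits in text form (petal length <= 2.45 at the root,
# then petal width <= 1.75 on the right branch).
print(export_text(tree, feature_names=["petal length", "petal width"]))

# Follow the walkthrough: petal length 5.0 > 2.45 -> go right,
# petal width 1.5 <= 1.75 -> left leaf -> class 1.
print(tree.predict([[5.0, 1.5]]))
```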

The following figure shows how the feature space is split into multiple regions. The thick vertical line represents the decision boundary of the root node (depth 0): petal length = 2.45 cm. Since the left area is pure (only class 0), it cannot be split any further. However, the right area is impure, so the depth-1 right node splits it at petal width = 1.75 cm (represented by the dashed line). Since max_depth was set to 2, the Decision Tree stops right there. However, if you set max_depth to 3, then the two depth-2 nodes would each add another decision boundary (represented by the dotted lines).

Decision Boundaries.
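A decision-boundary plot like the one above can be reproduced with a short script; the sketch below (assumed code, not taken from the post) predicts the class for every point on a dense grid and colours the regions accordingly.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]   # petal length, petal width
y = iris.target
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Classify every point on a dense grid covering the feature space,
# then colour the grid by predicted class to reveal the decision boundaries.
xx, yy = np.meshgrid(np.linspace(0, 7.5, 300), np.linspace(0, 3, 300))
Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
```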

How do we make splits?

When making a split, a decision tree must decide two things: which feature to split on and at what value. It does this by taking the features one by one, trying out every possible value that feature can take, and checking whether it results in a better split. To compare two splits it uses a score such as information gain. There are several ways such a score can be calculated, for example with Gini impurity, cross-entropy, or root mean squared error.

The main idea behind all these different methods is that they try to split the dataset into two groups such that all the members in one group are very similar to each other (with respect to the target variable) and very different from the members of the other group. For example, in the Iris classification example, after the first decision made at the root node, the data was divided into two groups. One group had points which all belong to class 0 (similar to each other), whereas the other had a mix of class 1 and class 2 (different compared to the first group).
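As a concrete illustration, here is a small hypothetical helper (not from the original post) that scores a candidate split with Gini impurity; a lower weighted impurity after the split means a better split.

```python
import numpy as np
from sklearn.datasets import load_iris

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(feature_values, labels, threshold):
    """Weighted Gini impurity of the two groups produced by a threshold."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Score the root split of the Iris tree: petal length <= 2.45 cm.
iris = load_iris()
petal_length, y = iris.data[:, 2], iris.target
print(split_score(petal_length, y, 2.45))
# ~0.333: the left group is pure (class 0 only), while the right group is a
# 50/50 mix of classes 1 and 2 (impurity 0.5) holding two thirds of the data.
```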

Bagging

There is one more thing we need to know to understand random forests clearly, and that is a technique called bagging. Bagging is a technique where we create multiple models which are not correlated with each other, each of which is capable of making predictions on its own. The final result is an average of all those different models.

Random forest is just a way of bagging decision trees. Here we train many decision trees, giving a different random subset of the data to each tree. After the training phase, each tree is capable of predicting the final output on its own. How does this help improve the model? As we train each tree on different random data, each tree will have a different understanding of the data and each one will make different errors. If we take the average of these trees, each of which has been trained on a different random subset, the errors tend to average out towards zero and what is left is the true relationship. That is the random forest.
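The sketch below (assumed code, not from the post) implements this idea by hand: each tree is trained on a bootstrap sample of the data and their predictions are averaged.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(0)

trees = []
for _ in range(10):
    # Bootstrap sample: draw n rows with replacement, so each tree sees
    # a different random version of the dataset.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Average the class-probability predictions of all trees and take the most
# likely class (assumes every bootstrap sample contains all three classes,
# which is virtually always the case for Iris).
avg_proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print((ensemble_pred == y).mean())   # training accuracy of the bagged ensemble
```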

Important parameters to tune

Following are some of the important parameters that can be tuned to make random forests generalize better and improve overall model performance.

Max features

Random forests rely on the assumption that the average of a bunch of random errors is zero. But this is true only if the errors are not correlated. So the less correlated our individual decision trees are with each other, the better.

Imagine we had a column that is far better than the other columns at predicting the output. In that case, every tree you build will start with that column. But there might be some interaction of variables where the interaction is more important than the individual column. So if every tree always splits on the same thing the first time, you will not get much variation in those trees.

To solve this problem, along with taking different samples to build each tree, we consider only a subset of the features at each individual split. This ensures that there is some variation among the trees.

A value of 0.5 for max_features indicates that only half of the features will be considered at every split point. Some common values that work well are 0.5, sqrt, and log2.
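As a hedged example, here is how different max_features settings might be compared with scikit-learn's RandomForestClassifier (the cross-validation setup is illustrative, not from the post).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

for max_features in (0.5, "sqrt", "log2", None):   # None = consider all features
    rf = RandomForestClassifier(n_estimators=100,
                                max_features=max_features,
                                random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_features={max_features!r}: CV accuracy = {score:.3f}")
```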

Minimum samples for a split

By default, random forests continue training until each leaf contains just one sample. This may lead to overfitting. One way to solve this issue is to stop splitting when a leaf node would have fewer than a certain number of samples. This number is called the minimum samples for a split.

If we set the minimum samples for a split to 2, then each tree makes one less level of decisions than it otherwise would. So, instead of taking the value of a single point at a leaf, each tree takes an average there. This helps the model generalize better, even though it means that each tree is less powerful on its own.

The ideal value depends on the size of the dataset, but numbers like 1, 3, 10, and 25 normally work well.
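Below is a sketch of tuning this threshold (assumed code). In scikit-learn, the behaviour described above, stopping before leaves fall below a certain sample count, corresponds most closely to min_samples_leaf; the related min_samples_split parameter controls how large a node must be before it can be split at all.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

for min_leaf in (1, 3, 10, 25):
    rf = RandomForestClassifier(n_estimators=100,
                                min_samples_leaf=min_leaf,
                                random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"min_samples_leaf={min_leaf}: CV accuracy = {score:.3f}")
```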

Further Reading

  1. http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html (Beautiful visualization of the power of random forests)

All code used to generate plots can be found here. That’s all for this one. Thanks for reading!!!
