The Complete Guide to Random Forests: Part 1

Rowan Curry
Dec 16, 2021



I hear the term “Random Forest” and immediately in my mind’s eye I’m walking through a redwood grove, smelling salty coastal air and inhaling the rich scents of a forest. However, Random Forests themselves are not quite as romantic as skipping through a redwood forest drenched in Bay Area fog.

Random Forest is a supervised, ensemble learning algorithm based on Decision Trees. If any of those terms are unfamiliar to you, don’t panic! This article, the first in a series of two, will take you through a simple, clear, and thorough explanation of all concepts involved in the making of Random Forests.

WHAT IS ENSEMBLE LEARNING?

Ensemble learning is a method in which multiple machine learning models are trained and their predictions are combined. This strategy helps achieve higher predictive performance than using any individual model by itself. The reason this approach works so well is quite intuitive. Consider the following example:

Each student in a standard-size classroom is given the same tricky math problem, and each student is told to solve it individually. Some students finish right away, some take a really long time, and the solutions they turn in vary widely in correctness.

Now, consider the same situation, but the students are required to work together to solve the problem. The combined brainpower of a whole room of people most likely results in a well-thought-out solution with a high degree of correctness.

Ensemble learning works in the same way. By combining the “brainpower” of multiple algorithms, we’re much more likely to build an effective and trustworthy model.
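
If you’d like to see what that looks like in practice, here’s a minimal sketch of the idea, assuming scikit-learn and a toy dataset (this isn’t code from the article, just an illustration of several models “voting” on each prediction):

```python
# Three different models "vote" on each prediction; the majority wins.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(estimators=[
    ("logreg", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))
```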

WHAT ARE THE TREES THAT MAKE UP THE FOREST?

Random Forests consist of a large number of individual decision trees that operate as an ensemble. Let’s dive further into the structure of an individual decision tree, so we can better understand the way Random Forests operate.

In the simplest possible terms: a Decision Tree asks a “question”, and then classifies the “person” based on their answer. Let’s consider the image below. This is an example of a simple Decision Tree that helps you decide what you should do with your day.

[Figure: a simple Decision Tree for deciding what to do with your day (image courtesy of Lorraine Li)]

The very top of the tree, where the question “Work to do?” has been placed, is called the root node. If a person answers yes to this question, the tree takes them to the shaded box that says “Stay in”. Since this box has lines traveling toward it, but not away from it, it’s called a leaf node, and denotes one of the five possible destinations for data traveling through this Decision Tree. All of the shaded boxes in this example are leaf nodes.

If a person answers no (they don’t have any work to do), then the tree directs them to a second question, “Outlook”, which asks about the current weather. This second question is called an internal node. Internal nodes have lines traveling toward them and away from them. All non-shaded boxes in the above example are internal nodes.
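
To make the node terminology concrete, here’s a toy sketch of that tree written as nested conditionals. The root node (“Work to do?”) and the “Outlook” internal node come straight from the example above; the remaining branch labels are hypothetical stand-ins:

```python
def what_to_do_today(work_to_do, outlook, friends_busy):
    """A toy decision tree written as nested conditionals."""
    if work_to_do:                    # root node: "Work to do?"
        return "Stay in"              # leaf node
    # internal node: "Outlook" (the current weather)
    if outlook == "sunny":
        return "Go to the beach"      # hypothetical leaf
    if outlook == "overcast":
        return "Go for a run"         # hypothetical leaf
    # rainy: another (hypothetical) internal node
    if friends_busy:
        return "Stay in"
    return "Go to the movies"

print(what_to_do_today(work_to_do=False, outlook="sunny", friends_busy=False))
```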

Decision trees can also mix binary questions with numerical thresholds at their internal nodes, and we can begin to imagine how large, complex decision trees might look for large, complex data sets.

You might be wondering how each conditional statement is assigned to its corresponding internal node. We’ll start by discussing how to determine which conditional statement should live at the top of the tree (in the root node).

This is determined by looking at how well each possible conditional statement predicts the target variable, and then calculating the impurity of the resulting split. If a conditional statement does not perfectly separate the labels, it’s considered “impure”. We then compare the impurities of the candidates to determine which conditional statement produces the fewest incorrect results.

There are many ways to measure impurity (or, more generally, the quality of a split) for decision trees, such as Gini impurity, information gain, and gain ratio. If you’re interested in learning more about the different ways to calculate impurity scores, check out this article.
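
To make “impurity” a little more concrete, here’s a small sketch of how Gini impurity could be computed for two hypothetical candidate splits (the labels below are made up purely for illustration):

```python
def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in proportions)

def split_impurity(left, right):
    """Weighted average Gini impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Two hypothetical candidate splits of the same eight labels:
split_a = (["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"])   # mostly pure children
split_b = (["yes", "no", "yes", "no"], ["yes", "no", "yes", "no"])   # maximally impure children
print(split_impurity(*split_a))  # lower impurity -> better candidate for the node
print(split_impurity(*split_b))
```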

After the root node has been chosen, the Decision Tree algorithm works its way down the tree, applying the same methodology at each internal node until every branch ends in a leaf node.

And there you have it, Decision Trees!

AND NOW … THE MOMENT YOU’VE BEEN WAITING FOR … THE RANDOM FOREST ALGORITHM!

As stated previously, Random Forest is a supervised, ensemble learning algorithm based on Decision Trees. The Random Forest algorithm consists of many decision trees, and it uses bagging and feature randomness when building each individual tree. This creates an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.

Bagging, which is short for bootstrap aggregating, is the process by which multiple models of the same learning algorithm are trained with bootstrapped samples of the original data set. The bootstrap sampling method is a resampling method that uses random sampling with replacement.
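
Here’s a quick sketch of what a single bootstrapped sample looks like, assuming NumPy (the data here is just a stand-in for a real data set):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(10)  # stand-in for the original data set

# One bootstrapped sample: same size as the original, drawn with replacement,
# so some rows appear more than once and others not at all.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)
```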

In a classification problem where Random Forest is used, each tree “votes” and the most popular class is chosen as the final result. In a regression problem where Random Forest is used, the average of all the tree outputs is considered to be the final result.

The steps of the Random Forest algorithm for classification can be described as follows (a small code sketch of these steps appears after the list).

  1. Select random samples from the dataset using bootstrap sampling (random sampling with replacement).
  2. Construct a Decision Tree for each sample, and save the prediction results from each tree.
  3. Calculate a vote for each predicted result.
  4. Determine which predicted result has the most votes. This becomes your final prediction.
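
Here’s a hand-rolled sketch of those four steps, assuming scikit-learn for the individual trees and a toy dataset. It’s meant to illustrate the idea, not to be a production implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(seed=0)
n_trees = 25
all_votes = []

for i in range(n_trees):
    # Step 1: draw a bootstrapped sample of the training data
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    # Step 2: fit a decision tree on that sample and save its predictions
    # (max_features="sqrt" adds the per-split feature randomness described earlier)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    all_votes.append(tree.predict(X_test))

# Steps 3 and 4: tally the votes and keep the majority class for each test point
votes = np.stack(all_votes)                        # shape: (n_trees, n_test_points)
majority = (votes.mean(axis=0) > 0.5).astype(int)  # works here because labels are 0/1
print("hand-rolled forest accuracy:", (majority == y_test).mean())
```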

While Random Forests are a highly accurate method due to the sheer number of Decision Trees involved in the algorithm, they can be slow in generating results for the exact same reason. Also, the model can be difficult to interpret when compared to a single Decision Tree.

However, Random Forests are still a better choice of algorithm (most of the time) due to their strong predictive power, their adeptness at handling missing values, and their usefulness when it comes to estimating relative feature importance.
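
For example, here’s one quick way to peek at those relative feature importances, assuming scikit-learn’s RandomForestClassifier and the classic iris data set:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit a forest, then read off the relative importance of each feature.
iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```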

And there you have it! Random Forests. Stay tuned for Part 2, where we’ll build our own classification model using the Random Forest algorithm.

