Random Forest 101

by Brooke Kennedy

The Opex Analytics Blog
5 min read · Mar 22, 2019


The Opex 101 series is an introduction to the tools, trends, and techniques that will help you make the most of your organization’s data. Intended for business leaders and practitioners alike, these posts can help guide your analytics journey by demystifying essential topics in data science and operations research.

The random forest is one of the most popular and versatile machine learning algorithms out there today. If you’re a professional who works with or near data science teams, you’ve probably heard of it, but you may not totally understand what it is or why it’s so loved by data scientists. In this blog, I’m going to break down what a random forest is, how it works, and what makes it so great.

The Basics

The random forest is one of many supervised machine learning algorithms that can be used to predict an outcome, whether that outcome is a number (a prediction process we call regression) or represents membership in a group (known as classification). It’s a non-parametric model, meaning it makes no strong assumptions about the underlying data distribution and doesn’t require you to specify a fixed functional form (like the slope of a linear regression line).
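For the hands-on crowd, here’s roughly what fitting a random forest for both tasks looks like in Python with scikit-learn; the synthetic data and parameter choices below are purely illustrative, not part of the original discussion.

```python
# A minimal sketch using scikit-learn; the synthetic data and settings are illustrative only.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: predicting membership in a group
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Regression: predicting a number
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
print("regression R^2:", reg.score(X_test, y_test))
```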

The random forest is also an ensemble learning method, meaning it combines many underlying models into one and uses their individual predictions collectively. The models it combines are called decision trees (hint: that’s where the ‘forest’ name comes from). Consequently, to truly understand a random forest, we must first understand the component models that form its foundation.

At a basic level, we can think about decision trees as flowcharts. Each node in a tree represents a variable in the data, as well as a split in that variable’s values. These splits tell us which downstream branch to choose as we traverse the tree. Following these splits in the data will eventually lead us to the leaf nodes, the bottom-most points in the tree, which represent our predictions. The example below demonstrates how one might choose a transportation method through a simple decision tree.
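To make the flowchart analogy concrete, here’s a toy version of such a tree written as nested if/else statements; the variables and thresholds (trip distance, weather) are made up for illustration.

```python
def choose_transport(distance_km: float, is_raining: bool) -> str:
    """Toy decision tree for picking a transportation method.
    Each if/else is a split node; each return is a leaf (the prediction)."""
    if distance_km < 1:                      # split on distance
        if is_raining:                       # split on weather
            return "drive"
        return "walk"
    else:
        if distance_km < 10:
            return "bus" if is_raining else "bike"
        return "drive"

print(choose_transport(0.5, is_raining=False))  # walk
print(choose_transport(5.0, is_raining=True))   # bus
```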

Each split in the decision tree has the goal of dividing the data into distinct subsets that contain similar data points. A common metric for measuring the effectiveness of these splits is aptly named “information gain,” because good splits help us learn more about our data.

For example, imagine we are trying to predict a binary outcome, where either outcome is equally likely (a 50/50 ratio of the two outcomes overall). A split that leaves the data points on each side at the original 50/50 ratio doesn’t help us gain any information about the outcome. A better split might make one side 70/30 and the other side 30/70, so that the data is meaningfully separated according to the outcome, which ultimately results in better predictions the further down the tree you go.
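To put rough numbers on that intuition, the sketch below computes the entropy before and after each of those two hypothetical splits; the drop in entropy is the information gain.

```python
import math

def entropy(p_positive: float) -> float:
    """Entropy (in bits) of a binary outcome with the given positive-class proportion."""
    if p_positive in (0.0, 1.0):
        return 0.0
    p, q = p_positive, 1.0 - p_positive
    return -(p * math.log2(p) + q * math.log2(q))

parent = entropy(0.5)  # 1.0 bit of uncertainty before the split

# Split A: both children still 50/50 (each holding half the data)
gain_a = parent - (0.5 * entropy(0.5) + 0.5 * entropy(0.5))

# Split B: one child 70/30, the other 30/70
gain_b = parent - (0.5 * entropy(0.7) + 0.5 * entropy(0.3))

print(f"information gain, uninformative split: {gain_a:.3f}")  # 0.000
print(f"information gain, 70/30 split: {gain_b:.3f}")          # ~0.119
```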

Though they are easily visualized and conceptually clear, decision trees are considered ‘weak learners’ because they are relatively simple models that may perform only slightly better than chance.

A Multitude of Trees

Now that we have established the concept of a decision tree, we can examine how multiple decision trees are combined into a random forest through a process called “bootstrap aggregation,” also known as bagging. In this process, the random forest generates new datasets (typically the same size as the original) by uniformly sampling with replacement from the original dataset. Not only does the algorithm train each individual decision tree on one of these bootstrapped samples, it also considers only a random subset of features at each node in each tree when deciding which variable to split on.
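As a rough sketch of those two sources of randomness, here’s how a single bootstrap sample and a random feature subset might be drawn with NumPy; the dataset sizes here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 1000, 12  # illustrative sizes

# Bootstrap sample: draw n_rows row indices uniformly, with replacement.
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)
print("unique rows in bootstrap sample:", np.unique(bootstrap_idx).size)  # roughly 63% of rows

# Feature subsampling: at each split, consider only a random subset of features
# (the square root of the feature count is a common default for classification).
n_candidates = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=n_candidates, replace=False)
print("features considered at this split:", candidate_features)
```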

By using both the subsampled datasets and subsampled features, random forests are able to combat overfitting by making the underlying decision trees less correlated, thereby allowing the model to generalize better to new data and new situations. After training all of these decision trees in parallel, the random forest just averages their results (or takes the majority vote, in the case of classification) to give us the final prediction.
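Putting the pieces together, here’s a simplified, from-scratch sketch of the bagging-plus-voting idea built on scikit-learn decision trees; it’s a teaching toy, not how production random forest implementations work under the hood.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Train each tree on its own bootstrap sample, with feature subsampling at each split.
trees = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))          # bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Final prediction: majority vote across all trees.
all_votes = np.stack([tree.predict(X) for tree in trees])  # shape: (n_trees, n_samples)
majority = (all_votes.mean(axis=0) >= 0.5).astype(int)     # works for 0/1 labels
print("training accuracy of the ensemble:", (majority == y).mean())
```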

The Bottom Line

Random forests retain the appealing properties of decision trees, including robustness to outliers and the ability to handle mixed data types, while improving upon their undesirable aspects. Random forests are less susceptible to overfitting, and therefore they tend to generalize better and predict with greater accuracy.

However, there are a few downsides to this otherwise great algorithm. In regression, random forests work best within the numerical bounds of the data they’ve seen and cannot predict values outside the range of their training data (that is, they can’t extrapolate). In addition, random forests are not as easy to visualize as individual decision trees, since they are composed of many trees (though they do provide variable importance rankings that help illustrate how the model makes decisions).
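For instance, scikit-learn exposes those variable importance rankings through the feature_importances_ attribute; in the sketch below the data is synthetic, so the specific numbers don’t mean anything.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: how much each feature contributed to the splits, on average.
for i, importance in enumerate(clf.feature_importances_):
    print(f"feature_{i}: {importance:.3f}")
```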

Due to its many and varied strengths, however, the random forest algorithm is often a great place to start when working on a prediction problem.

If you enjoyed this Opex 101 entry, check out our recent post on Vehicle Routing Problems for another down-to-earth explanation of a fundamental concept in operations research and data science.

If there’s a topic you’d like us to cover as part of Opex 101, let us know in the comments below!

_________________________________________________________________

If you liked this blog post, check out more of our work, follow us on social media (Twitter, LinkedIn, and Facebook), or join us for our free monthly Academy webinars.
