Technical Deep Dive: Random Forests

Yu Chen
Panoramic
Aug 3, 2019 · 10 min read


Random Forests are among the most popular machine learning models data scientists use today, yet how they are actually implemented, and the variety of use cases they can be applied to, are often overlooked.

While this article will focus on the inner workings of Random Forests, we’ll start off by exploring the main problems this model solves.

The Bias-Variance Tradeoff

One of the central tenets of statistics and machine learning is the Bias-Variance tradeoff: as a machine learning model's complexity increases, its bias (the average difference between its predictions and the true values) tends to decrease, while the variance of its predictions increases. For many models, we can therefore decompose the overall expected error as

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] \;=\; \underbrace{\mathrm{Bias}\big[\hat{f}(x)\big]^2}_{\text{bias}} \;+\; \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{variance}} \;+\; \underbrace{\sigma^2_{\epsilon}}_{\text{irreducible error}}
$$

Decomposition of overall error into three components: 1) bias, 2) variance, and 3) irreducible error.
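To make the tradeoff concrete before we touch the ads data, here is a minimal sketch (not from the original article, and using synthetic data rather than the Kaggle dataset) that fits decision trees of increasing depth: training error keeps falling as complexity grows, while held-out error eventually rises again as variance takes over.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)

# Synthetic data: a noisy sine wave stands in for any regression target.
X = rng.uniform(0, 6, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# max_depth is our proxy for model complexity: deeper trees have
# lower bias but higher variance.
for depth in [1, 2, 4, 8, 16]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    print(f"depth={depth:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```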

To illustrate some of the more technical concepts, we will use a Kaggle sample sales conversion dataset of Facebook ad campaigns contributed by an anonymous organization. Let's use this dataset to see how the breakdown of bias and variance affects the quality of our insights by exploring its relationship with model complexity. We'll perform some initial preprocessing to arrive at our features (X) and target array (y):
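The article's own preprocessing code isn't reproduced here; the following is a minimal sketch. The file name (KAG_conversion_data.csv) and column names are assumptions based on the standard Kaggle release of this dataset, so adjust them to match your local copy.

```python
import pandas as pd

# Assumed file and column names for the Kaggle sales conversion dataset;
# verify against your downloaded copy.
df = pd.read_csv("KAG_conversion_data.csv")

# One-hot encode the categorical ad-targeting columns, keeping the
# numeric engagement/spend columns as-is.
X = pd.get_dummies(
    df[["age", "gender", "interest", "Impressions", "Clicks", "Spent"]],
    columns=["age", "gender", "interest"])

# Target: the number of approved conversions for each ad.
y = df["Approved_Conversion"].values
```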

