Random Forests are one of the most popular machine learning models used by data scientists today, yet how they are actually implemented, and the variety of use cases they can be applied to, are often overlooked.
While this article will focus on the inner workings of Random Forests, we’ll start off by exploring the main problems this model solves.
The Bias-Variance Tradeoff
One of the central tenets of statistics and machine learning is the Bias-Variance tradeoff: as a machine learning model’s complexity increases, its bias (the average difference between its predictions and the true values) tends to decrease, while the variance of its predictions increases. This means that for many models, we can represent the expected overall error as

Total Error = Bias² + Variance + Irreducible Error
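To make the tradeoff concrete, here is a minimal sketch (not from the article itself) that fits polynomials of two different degrees to noisy samples of a smooth function. The low-degree model underfits (high bias), while the high-degree model chases the noise (high variance), which shows up as a wide gap between its training and test error. The data and degrees are illustrative choices, not part of the original.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth underlying function
x_train = np.linspace(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

def errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

simple_train, simple_test = errors(1)     # high bias: underfits
complex_train, complex_test = errors(15)  # high variance: overfits

print(f"degree 1:  train={simple_train:.3f}  test={simple_test:.3f}")
print(f"degree 15: train={complex_train:.3f}  test={complex_test:.3f}")
```

The complex model’s training error drops far below the noise level, but its test error does not follow — exactly the gap the decomposition above describes.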
To illustrate some of the more technical concepts, we will utilize a Kaggle sample sales conversion dataset of Facebook ad campaigns contributed by an anonymous organization. Let’s use this dataset to see how the breakdown of bias/variance affects the quality of our insights by exploring its relationship with model complexity. We’ll perform some initial preprocessing to arrive at our features (X) and target array (y):
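The article’s own preprocessing code is not reproduced here; the sketch below is one plausible version, assuming the column names published with this Kaggle dataset (`age`, `gender`, `interest`, `Impressions`, `Clicks`, `Spent`, `Total_Conversion`, `Approved_Conversion`). A small stand-in frame replaces the CSV read so the snippet runs on its own.

```python
import pandas as pd

# In practice this would load the Kaggle file, e.g.:
#   df = pd.read_csv("KAG_conversion_data.csv")
# Here we build a tiny stand-in frame with the same (assumed) columns
# so the snippet is self-contained.
df = pd.DataFrame({
    "age": ["30-34", "35-39", "40-44", "45-49"],
    "gender": ["M", "F", "M", "F"],
    "interest": [15, 16, 20, 28],
    "Impressions": [7350, 17861, 693, 4259],
    "Clicks": [1, 2, 0, 1],
    "Spent": [1.43, 1.82, 0.0, 1.25],
    "Total_Conversion": [2, 2, 1, 1],
    "Approved_Conversion": [1, 0, 0, 1],
})

# One-hot encode the categorical audience columns and split off the target
X = pd.get_dummies(df.drop(columns=["Approved_Conversion"]),
                   columns=["age", "gender"])
y = df["Approved_Conversion"].values

print(X.shape, y.shape)
```

Choosing `Approved_Conversion` as the target is an assumption for illustration; any of the conversion columns could play that role depending on the question being asked.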