The Complete Guide to Random Forests: Part 2
The following article builds on The Complete Guide to Random Forests: Part 1. If you’re unfamiliar with Ensemble Learning, Decision Trees, and/or Random Forests, I highly recommend clicking the link and reading that article first.
Random Forests are a supervised ensemble learning technique consisting of many Decision Trees. The algorithm uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. In a classification problem, each tree votes and the most popular class is chosen as the final result.
This article will teach you how to build a Random Forest Model with the RandomForestClassifier module from scikit-learn.
The following tutorial only covers simple data preparation (splitting the dataset) before moving on to build the Random Forest Model. If you’re interested in data preprocessing steps, such as imputing missing values and performing encoding, you can access full code documentation and commented explanations through this Kaggle Notebook.
Objective
The dataset used for this example is the Rain in Australia dataset put together by Kaggle user Joe Young. The purpose of this dataset is to train classification models on the target variable RainTomorrow with the goal of determining whether it will rain the next day or not. This variable has a value of Yes if the rainfall for that day exceeded 1 mm and No otherwise.
Plotting the distribution of the target variable shows that there is an imbalance in the data: most days of the year, it does not rain. However, this imbalance is not cause for immediate concern because it accurately reflects the nature of weather. The imbalance was also addressed in preprocessing using the SMOTE technique.
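That oversampling step lives in the preprocessing notebook rather than this tutorial, but for reference, a minimal sketch of applying SMOTE with the imbalanced-learn library might look like this (the variable names X_processed and y are placeholders, not the notebook’s actual names):

```python
from imblearn.over_sampling import SMOTE

# Synthesize new minority-class (rainy-day) samples until the classes are balanced.
# X_processed and y are placeholder names for the encoded features and target.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_processed, y)
```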
Data Preparation
We’ll start by splitting the processed data into training and test datasets. Don’t forget to import “train_test_split” from sklearn.model_selection!
Our proportions are 70% training and 30% test.
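A minimal sketch of that split, assuming X holds the preprocessed features and y holds the RainTomorrow target (the random_state is added here only for reproducibility):

```python
from sklearn.model_selection import train_test_split

# 70% of the data for training, 30% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```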
Building the Model
Next, we’ll build the model. This portion of the code uses the RandomForestClassifier from sklearn.ensemble.
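A sketch of that step, using the 200-tree setting described below (any other hyperparameters from the original notebook are left at their scikit-learn defaults here):

```python
from sklearn.ensemble import RandomForestClassifier

# Random Forest with 200 trees, fit on the training split
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
```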
Here we have the luxury of using 200 trees (as defined by n_estimators), which is one of the reasons the model achieves a relatively high accuracy.
However, the Random Forest Model can be slow in generating predictions due to the sheer number of decision trees: whenever a prediction is made, every tree in the forest has to produce a prediction for the same input, and those predictions are then combined by voting.
For this specific report that’s fine, since our dataset is small. But for large datasets that need to be processed quickly, a different algorithm or a smaller number of trees would be needed.
Next, we’ll make our predictions.
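For example, using the rf model fit above:

```python
# Predict the RainTomorrow class for each row in the test set
y_pred = rf.predict(X_test)
```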
Then, we’ll check our model’s accuracy.
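One way to do that is with accuracy_score from sklearn.metrics:

```python
from sklearn.metrics import accuracy_score

# Fraction of test-set predictions that match the true labels
print(accuracy_score(y_test, y_pred))
```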
As you can see, our model has an accuracy of about 90.5%, which isn’t bad — but let’s see if we can improve it.
Random Forest Feature Importance
The Random Forest module from scikit-learn has a really handy attribute called feature_importances_.
We can use this attribute to investigate which features are most helpful in determining whether it will rain the next day (you’ll need to make sure you’ve imported pandas to display the scores alongside their feature names). The output lists an importance score for each feature in the dataset.
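A sketch of that lookup, pairing each score in feature_importances_ with its column name via a pandas Series (the exact formatting in the original notebook may differ):

```python
import pandas as pd

# Pair each column name with its importance score, most important first
feature_scores = pd.Series(
    rf.feature_importances_, index=X_train.columns
).sort_values(ascending=False)
print(feature_scores)
```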
Feature Importance refers to various techniques used to calculate a score for all the input features of a model. The scores simply represent the “importance” of each feature.
Feature Importance is valuable for a variety of reasons. First, it helps you better understand the data that’s going into your model by showing you the relationship between the features and the target variable.
Second, in some cases it helps you improve your model’s performance through feature selection, which we’ll perform in the next section.
And finally, it helps you interpret your model. By calculating scores for each contributing feature, you’re able to more confidently communicate actionable insights to coworkers and stakeholders.
All of this is great, but it’s not super intuitive to interpret a list of numbers. Let’s look at the plot of feature importance for this first Random Forest Model.
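One way to produce such a plot is a horizontal bar chart with matplotlib, reusing the feature_scores Series from above (a sketch; the original figure may have been styled differently):

```python
import matplotlib.pyplot as plt

# Horizontal bar chart of the sorted importance scores
feature_scores.sort_values().plot(kind="barh", figsize=(8, 6))
plt.xlabel("Feature Importance score")
plt.title("Random Forest Feature Importance")
plt.tight_layout()
plt.show()
```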
Building the Model with Feature Selection
Next, we’ll build a second Random Forest Model that runs on only the features deemed important in the Feature Importance analysis. Since WindDir9am, year, and RainToday all had Feature Importance scores under 0.02, we’ll drop them from the model and see what happens.
Then, we’ll build a second Random Forest Model using the selected features and make predictions.
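A sketch of that step, assuming the training and test features are pandas DataFrames and that the three low-importance columns are named exactly as above:

```python
# Drop the three low-importance features from both splits
low_importance = ["WindDir9am", "year", "RainToday"]
X_train_sel = X_train.drop(columns=low_importance)
X_test_sel = X_test.drop(columns=low_importance)

# Fit a second 200-tree Random Forest on the reduced feature set and predict
rf_selected = RandomForestClassifier(n_estimators=200, random_state=42)
rf_selected.fit(X_train_sel, y_train)
y_pred_sel = rf_selected.predict(X_test_sel)
```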
And our accuracy rate is:
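Mirroring the earlier accuracy check:

```python
print(accuracy_score(y_test, y_pred_sel))
```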
We can see that accuracy goes down slightly. So while each feature differs in how helpful it is for determining whether it will rain tomorrow in Australia, in this specific instance it’s still necessary to include ALL of the features to achieve the highest possible accuracy.
That’s a wrap! Thanks for reading. If you liked my tutorial-writing style, check out these other articles tackling similar lessons: