Random Forest — Master the 5 topics

Sammiti Yadav · Published in Geek Culture · 8 min read · Jan 19, 2023

Random Forest (RF) is one of the most commonly used machine learning algorithms in the data science world, predominantly for classification problems. It is a supervised learning method that works on the ensemble technique.

The ensemble technique generates an output by aggregating the results of many models, for example through voting. More on ensembles later in the article.

But before that, let's see which topics we will cover today and have a look at our case-study dataset:

  1. Bagging
  2. Ensemble
  3. OOB
  4. Hyperparameters & Cross Validation
  5. Pros/Cons of RF
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the heart disease dataset and take a first look at it
heart = pd.read_csv("heart.csv")
print(heart.head())        # first few rows

heart.info()               # column types and non-null counts

print(heart.describe())    # summary statistics

OUTPUT:

(first rows of the heart disease dataset, its column information, and its summary statistics)

It's a heart disease dataset where we try to predict whether someone has heart disease based on information like their age, gender, cholesterol level, resting blood pressure, etc. The 'target' column is our dependent variable; it seems balanced, and the data doesn't have missing values.
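
For completeness, here is a quick check behind those two claims (not part of the original listing), using only the heart dataframe loaded above:

# Class balance of the dependent variable and missing values per column
print(heart['target'].value_counts(normalize=True))
print(heart.isnull().sum())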

Let’s get to the working of the RF model.

Random Forest, as the name suggests, is a group of decision trees built on randomized samples of the data. In other words, the algorithm creates many decision trees. Let's see how these trees are created.

See the image below (image by author), which illustrates how the original dataset is resampled into several bagged subsets:

A dataset is resampled into many subsets by choosing rows at random. If you observe the newly created datasets, some samples are repeated, which means the data is resampled with replacement. Think of a ticket drawn from a box at random and then put back in the box: it can naturally be drawn again. Training a tree on each of these bootstrap samples and aggregating their outputs is called Bootstrap Aggregation, or Bagging.
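
To make the bootstrap idea concrete, here is a minimal sketch (illustrative only, not part of the case-study code) that draws a few bags from the heart dataframe with pandas; the number of bags and the seeds are arbitrary:

# Each bag has as many rows as the original data, drawn with replacement,
# so some rows appear more than once and others not at all
n_bags = 3
bags = [heart.sample(frac=1.0, replace=True, random_state=seed) for seed in range(n_bags)]

for i, bag in enumerate(bags):
    print(f"bag {i}: {len(bag)} rows, {bag.index.nunique()} unique original rows")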

Resampling is done over rows, and a random subset of columns (features) is considered as well, to make sure every tree is different. Once the data is bagged, each decision tree is grown by choosing the best split at every node using either the Gini index or entropy. More about it here → https://www.upgrad.com/blog/gini-index-for-decision-trees/
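
As a rough illustration of the split criterion, the Gini impurity of a node can be computed like this (a toy helper, not scikit-learn's internal implementation):

def gini_impurity(labels):
    # 1 - sum(p_k^2): 0 for a pure node, 0.5 for a 50/50 two-class node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # 0.0  (pure node)
print(gini_impurity([0, 0, 1, 1]))  # 0.5  (maximally mixed, two classes)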

Every tree then produces a result (a classification in our example). The category that receives the maximum number of votes becomes the final output of the random forest model. This technique of creating multiple models and taking the majority vote of their results (or averaging them, for regression problems) is called an ensemble technique. There are other ensemble techniques such as Boosting and Stacking; RF, however, uses only bagging.
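
Here is a tiny sketch of the voting step with made-up tree predictions (purely illustrative numbers):

# Rows are trees, columns are samples; the forest predicts the majority class per column
tree_predictions = np.array([
    [1, 0, 1],   # tree 1
    [1, 1, 0],   # tree 2
    [1, 0, 0],   # tree 3
])

majority_vote = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), axis=0, arr=tree_predictions)
print(majority_vote)  # [1 0 0]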

But why is the ensemble technique better? For that let’s understand the properties of the trees we create in the process.

Each individual tree has "low bias" and "high variance". Low bias, because every tree is grown without pruning, so it fits its training sample very closely and tends to overfit. High variance, because a tree trained on a slightly different sample can produce very different predictions, which shows up as poor accuracy on test or unseen data. When the results of many such trees are aggregated, the variance is averaged out and the combined model no longer overfits, so there is a fair chance it performs well on unseen data.
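
A rough way to see this effect in code (on a synthetic dataset, so the numbers are only illustrative) is to compare a single unpruned tree with a forest of such trees:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, Y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, Y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, Y_tr)

# The single unpruned tree typically fits the training set perfectly but drops on the test set;
# the forest usually holds up better on unseen data
print("tree   train/test:", tree.score(X_tr, Y_tr), tree.score(X_te, Y_te))
print("forest train/test:", forest.score(X_tr, Y_tr), forest.score(X_te, Y_te))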

But how to make sure that the model is not overfitting?

  1. The decision trees should not be correlated. The trees should be very different from each other; if many of them are near-duplicates, we lose much of the benefit of aggregation.
  2. Every tree must be strong in performance by itself so that it’s reflected in the final model output.

The Random Forest model is said not to require a separate split of the data into "train" and "test" sets for assessing accuracy. Why is that?

When the data is bagged, about one third (roughly 36.8%) of the samples are not chosen for building a given tree. These left-out rows form an unseen dataset on which accuracy can be tested; they are called Out Of Bag (OOB) samples.
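
The 36.8% figure is not arbitrary: the chance that a given row is never drawn in n draws with replacement is (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. A quick check:

n = len(heart)                 # number of rows drawn per bootstrap sample
print((1 - 1 / n) ** n)        # fraction of rows never drawn, close to 0.368
print(np.exp(-1))              # the limit 1/e ≈ 0.3679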

In the picture above, samples 1 and 4 are not bagged from the original dataset. These samples would be the OOB samples.

See below the OOB score without splitting the dataset.


from sklearn.ensemble import RandomForestClassifier

# Features and the dependent variable
x = heart.drop('target', axis=1)
y = heart['target']

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=100, max_features=4)
rf.fit(x, y)
rf.oob_score_

#OUTPUT
---->0.8184818481848185

For the sake of the case study, I have split the data into 'train' and 'test' sets because that will help with a comparative study against other models, but ideally this isn't required just to get an OOB score. Code below:

from sklearn.model_selection import train_test_split

x_train, x_test, train_lables, test_lables = train_test_split(x, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators = 100 , oob_score = True, random_state= 100, max_features = 4)
rf.fit(x_train, train_lables)
rf.oob_score_

#OUTPUT
----> 0.8113207547169812

Now let's have a look at some important hyperparameters used to build the RF classifier:

The two most important hyperparameters to focus on in RF are ‘n_estimators’ & ‘max_features’.

The n_estimators parameter controls the number of trees inside the classifier. Too few trees may not be enough for reliable voting, while too many trees take a long time to fit and increase time complexity, so we need just enough trees to do the trick. The default number of estimators is 100 in scikit-learn.

max_features sets how many features are considered when looking for the best split. As discussed earlier, RF resamples columns as well as rows. Too few columns may not give good splits, and thus weaker trees; too many columns make the trees too similar to create meaningful differences in voting. So, again, we need a sensible value for max_features. A common convention is the square root of the total number of features, but other options include 'log2', None, or simply a number of your choice based on your best intuition.

We can try a range of these parameters to see what OOB score we get on them and select a couple.

See the code below:

max_feature = [None, 'sqrt', 6]
n_tree = np.linspace(100, 500, 100)           # 100 tree counts between 100 and 500
df = pd.DataFrame(columns=['OOB_score', 'n_estimators', 'max_feature'])

# Fit a forest for every (max_features, n_estimators) combination and record its OOB score
for n_col in max_feature:
    for i in n_tree:
        rfcl = RandomForestClassifier(n_estimators=int(i), oob_score=True, random_state=100, max_features=n_col)
        rfcl.fit(x_train, train_lables)
        df.loc[len(df)] = [rfcl.oob_score_, i, n_col]

df.fillna('None', inplace=True)               # label the None setting for readability
df.sort_values(by='OOB_score', ascending=False)

OUTPUT:

(df sorted by OOB_score, one row per n_estimators / max_feature combination)

df is sorted by descending OOB score. You will see that n_estimators = 487.87 as well as 104.04 give the same score, and the max_features value that gives the best score is sqrt (~3 in our case), as opposed to 6, which was the hard-coded number of our choice.

This can also be visualized. Please see below the code & graph:

unique = df['max_feature'].unique()
plt.figure(figsize=(7, 7))

# One line per max_features setting: OOB score against the number of trees
for i in unique:
    df_new = df[df.max_feature == i]
    col = df_new['OOB_score']
    row = df_new['n_estimators']
    plt.plot(row, col, label=i)

plt.xlim(100, 400)
plt.xlabel("n_estimators")
plt.ylabel("OOB score")
plt.legend(loc="upper right")
plt.show()

OUTPUT:

(plot of OOB score against n_estimators, one line per max_features setting)

It's clearly visible that forests with roughly 100–110 trees give the best score, with max_features set to the square root of the total number of features.

Other hyper-parameters in the classifier:

max_depth: specifies how deep the trees inside the forest can grow.

min_samples_split: the minimum number of samples an internal node must hold in order to be split into further nodes.

min_samples_leaf: the minimum number of samples a leaf node must hold after a split.

max_leaf_nodes: limits the number of leaf nodes, which restricts how far the trees can split and effectively helps reduce overfitting.

max_samples: the maximum number (or fraction) of samples drawn from the training dataset to train each individual tree.
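
For reference, here is a minimal sketch of how these parameters are passed to the classifier; the values below are arbitrary illustrations, not tuned for the heart dataset:

rf_limited = RandomForestClassifier(
    n_estimators=100,
    max_depth=7,            # each tree can be at most 7 levels deep
    min_samples_split=20,   # a node needs at least 20 samples to be split further
    min_samples_leaf=10,    # every leaf must keep at least 10 samples
    max_leaf_nodes=30,      # hard cap on the number of leaves per tree
    max_samples=0.8,        # each tree is trained on 80% of the training rows
    oob_score=True,
    random_state=100,
)
rf_limited.fit(x_train, train_lables)
print(rf_limited.oob_score_)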

All these parameters can also be tuned to increase the accuracy of the model, but as discussed earlier, RF already mitigates overfitting by resampling the data and aggregating the trees. I have not searched over ranges of these parameters here (but you can surely try!).

We will do some more hyperparameter tuning shortly, but before that let's understand the concept of Cross-Validation, which we will use.

Cross Validation (CV) is a model validation technique that uses different portions of the data to train and test the model over different iterations. Just as with OOB samples, some samples from the same dataset are held out for testing, and these held-out samples change in every iteration based on the k folds of CV.

3-fold CV

More on CV here → https://scikit-learn.org/stable/modules/cross_validation.html
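
As a small sketch of the mechanism (using the x_train created earlier), scikit-learn's KFold shows how the validation rows rotate from fold to fold, much like OOB samples rotate from tree to tree:

from sklearn.model_selection import KFold

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(x_train)):
    print(f"fold {fold}: {len(train_idx)} training rows, {len(val_idx)} validation rows")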

We will use CV along with the GridSearchCV method from scikit-learn.

More on it here → https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Here I have also tried a couple of values for other hyperparameters to see which combination is chosen as the best. One of the arguments GridSearchCV takes is cv; its default value is 5, but I have set it to 3 in the example below.

from sklearn.model_selection import GridSearchCV

# Candidate values for each hyperparameter
param_grid = {
    'max_depth': [7, 10],
    'max_features': ['sqrt'],
    'min_samples_leaf': [20, 50],
    'min_samples_split': [100, 150],
    'n_estimators': [104, 108, 110]
}

rfcl2 = RandomForestClassifier(oob_score=True, random_state=100)

# 3-fold cross-validated grid search over all combinations
grid_search = GridSearchCV(estimator=rfcl2, param_grid=param_grid, cv=3)
grid_search.fit(x_train, train_lables)
best_grid = grid_search.best_estimator_

ytrain_predict = best_grid.predict(x_train)
ytest_predict = best_grid.predict(x_test)

print(grid_search.best_params_)
print(best_grid.oob_score_)

pd.DataFrame({'Feature_Imp.': best_grid.feature_importances_}, index=x_train.columns)

#OUTPUT
----> {'max_depth': 7,
'max_features': 'sqrt',
'min_samples_leaf': 20,
'min_samples_split': 100,
'n_estimators': 104}

----> 0.8207547169811321
(table of feature importances per column)

There is another function in scikit-learn that can be used without a grid search. It shows the score for every fold (as below).

from sklearn.model_selection import cross_val_score

# Passing the GridSearchCV object means the grid search is re-run inside each of
# the 5 folds (nested cross-validation)
clf = grid_search.fit(x_train, train_lables)
scores = cross_val_score(clf, x_train, train_lables, cv=5)
scores

#OUTPUT
---> array([0.76744186, 0.86046512, 0.88095238, 0.73809524, 0.85714286])

Though we have assessed accuracy through the OOB score, let's look at a few more metrics. These help when comparing models.

from sklearn.metrics import classification_report
print(classification_report(test_lables,ytest_predict))

OUTPUT:

(classification report with precision, recall, F1-score, and support on the test set)

The pros and cons were mostly covered along the way, but I am jotting them down here for completeness.

Pros:

  1. RF is relatively robust to outliers and, depending on the implementation, can tolerate missing values through techniques such as imputation or surrogate splits.
  2. RF can handle non-linear relationships well.
  3. RF manages the bias-variance trade-off well by aggregating (ensembling) the results of many trees.

Cons:

  1. RF can be computationally intensive on larger datasets as the number of trees grows.
  2. RF is not as easily interpretable as a single decision tree; although feature importances are available, individual predictions are hard to trace back to simple rules.
  3. It can be difficult to understand exactly what happens inside the ensemble.

We have come to the end of the story! It covered the basic workings of RF with a small case study in Python, touching on the main topics and answering the important questions.

Full code here → https://github.com/sammitiy/RF_heartDisease

Happy reading!
