Random Forest (Easily Explained)

Shubham Gupta
9 min read · Jun 11, 2020


(With Python implementation in depth!)

Random Forest is an ensemble technique that can be used for both regression and classification tasks. An ensemble method combines the predictions from multiple machine learning models to make more accurate predictions than any individual model could.

Random Forests combine the simplicity of decision trees with added flexibility, resulting in a vast improvement in accuracy. The technique builds on “Bagging” (Bootstrap Aggregation), and the main goal of the Random Forest is to reduce the variance of the decision tree.

Low bias and high variance mean over-fitting; this is where we use a Random Forest, which reduces the variance by training each tree on a random chunk of the data and features.

Random Forest is used when our goal is to reduce the variance of a decision tree. The idea here is to create several subsets of the training data, chosen randomly with replacement. Each subset is then used to train its own decision tree. As a result, we end up with an ensemble of different models. The average (or majority vote) of the predictions from the different trees is used, which is more robust than a single decision tree.
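To make the bagging idea concrete, here is a minimal sketch (the scikit-learn built-in breast-cancer dataset and the number of trees are stand-ins, not the tutorial's data) that draws bootstrap samples with replacement, trains a decision tree on each, and combines their votes:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder data for illustration
rng = np.random.RandomState(42)

n_trees = 25
votes = []
for _ in range(n_trees):
    # Bootstrap sample: draw len(X) row indices with replacement
    idx = rng.randint(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=42).fit(X[idx], y[idx])
    votes.append(tree.predict(X))

# Majority vote across the trees (for regression you would average instead)
bagged_prediction = (np.mean(votes, axis=0) >= 0.5).astype(int)

This is plain bagging; a random forest additionally restricts each split to a random subset of the features.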

Why Random Forest?

You might be wondering: why not just a decision tree? It seems like the perfect classifier, since it did not make any mistakes! A critical point to remember is that the tree made no mistakes on the training data. We expect this to be the case, since we gave the tree the answers and didn’t limit the max depth (number of levels). The objective of a machine learning model is to generalize well to new data it has never seen before.

Over-fitting occurs when we have a very flexible model (the model has a high capacity) which essentially memorizes the training data by fitting it closely. The problem is that the model learns not only the actual relationships in the training data but also any noise that is present.

The reason the decision tree is prone to over-fitting when we don’t limit the maximum depth is that it has unlimited flexibility, meaning that it can keep growing until it has exactly one leaf node for every single observation, perfectly classifying all of them.

If you limit the maximum depth of the decision tree (say, to two levels), the classifications are no longer 100% correct. We have reduced the variance of the decision tree, but at the cost of increasing the bias.
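As a rough illustration of that trade-off, the sketch below (the dataset and split are stand-ins, not the tutorial's data) compares an unrestricted tree with a depth-limited one; typically the unrestricted tree scores perfectly on the training data but does worse on held-out data:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

for depth in [None, 2]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    print(depth, 'train:', tree.score(X_tr, y_tr), 'test:', tree.score(X_te, y_te))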

Decision trees are sensitive to the specific data on which they are trained. If the training data is changed (e.g. a tree is trained on a subset of the training data) the resulting decision tree can be quite different and in turn, the predictions can be quite different.

As an alternative to limiting the depth of the tree, which reduces variance (good) and increases the bias (bad), we can combine many decision trees into a single ensemble model known as the random forest.

How does Random Forest work?

Overview of Random Forest

In a random forest, we grow multiple trees in one model. To classify a new object based on its attributes, each tree gives a classification, and we say that the tree “votes” for that class. For classification, the forest chooses the class with the most votes across all the trees; for regression, it takes the average of the outputs of the different trees. In general, a Random Forest builds multiple trees and combines them to get a more accurate result, and it can be used for both classification and regression problems.
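To see this voting in action, here is a small sketch (placeholder data again) that collects each individual tree's prediction from a fitted forest's estimators_ attribute and takes the majority class. Note that scikit-learn's forest actually averages class probabilities rather than counting hard votes, so the two results can occasionally differ:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder data
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Each fitted tree casts one vote per sample (here, for the first five rows)
votes = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
majority = (votes.mean(axis=0) >= 0.5).astype(int)

print(majority)
print(forest.predict(X[:5]))  # the forest's own prediction, for comparison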

How to choose the number of trees to include in the forest?

The only parameter when bagging decision trees is the number of bootstrap samples and hence the number of trees to include.

This can be chosen by increasing the number of trees run after run until the accuracy stops improving (e.g. on a cross-validation test harness). Very large numbers of trees may take a long time to train, but will not over-fit the training data.
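A minimal sketch of that procedure (the candidate forest sizes and the 5-fold setup are arbitrary choices, and the data is a placeholder) using a cross-validation harness:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # placeholder data

for n in [10, 50, 100, 200, 400]:  # arbitrary candidate forest sizes
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print('{0} trees: mean CV accuracy = {1:.4f}'.format(n, score))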

Applications of Random Forest (real-life):

Banking Sector: Banks serve a huge number of customers, among them many loyal customers but also fraudulent ones. Random forest analysis helps determine whether a customer is loyal or fraudulent: with the help of a random forest algorithm in machine learning, the system can identify fraudulent transactions by learning a series of patterns.

Medicine: Medicines need a complex combination of specific chemicals, and Random forest can be used to identify the right combination. With the help of a machine learning algorithm, it has become easier to detect and predict the drug sensitivity of a medicine. It also helps to identify a patient’s disease by analyzing the patient’s medical records.

So let’s make a Random Forest!!!

Step 1) Importing necessary libraries:

%matplotlib inline

import time
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

plt.style.use('ggplot')
pd.set_option('display.max_columns', 500)

Step 2) Data loading: Download the data from below link and use it in the model building.

DataSet Link

breast_cancer = pd.read_csv("data.csv")

Step 3) Data Cleaning:
We do some minor clean-up, like setting the patient id to be the data frame index and converting the diagnosis to the standard binary 1/0 representation using the map() function.

# Use the id column as the index (this also removes it from the feature columns)
breast_cancer.set_index('id', inplace=True)
breast_cancer['diagnosis'] = breast_cancer['diagnosis'].map({'M': 1, 'B': 0})

# Check for missing values and drop the empty 'Unnamed: 32' column
breast_cancer.apply(lambda x: x.isnull().sum())
del breast_cancer['Unnamed: 32']

Step 4) Creating training and test sets:

feature_space = breast_cancer.iloc[:, breast_cancer.columns != 'diagnosis']
feature_class = breast_cancer.iloc[:, breast_cancer.columns == 'diagnosis']

training_set, test_set, class_set, test_class_set = train_test_split(
    feature_space, feature_class, test_size=0.20, random_state=42)

Step 5) Tuning and Fitting Random Forest:

Now, let’s create the model, starting with parameter tuning. Here we are using “GridSearchCV” to tune our model. GridSearchCV exhaustively searches a grid of candidate parameters for the best combination. It gives us the best parameters, which we can then use to build a more accurate model.

Below are the parameters we will be tuning in this tutorial:

i) max_depth: The maximum depth allowed for each tree in the forest.

ii) bootstrap: An indicator of whether or not we want to use bootstrap samples when building trees.

iii) max_features: The maximum number of features considered when splitting a node; this is the main difference I previously mentioned between bagging trees and random forests. Typically, you want a value that is less than p, where p is the number of features in your data set.

iv) criterion: The metric used to assess the quality of the splits in the decision trees.

# Set the random state for reproducibility
fit_rf = RandomForestClassifier(random_state=42)

Algorithm tuning:

np.random.seed(42)
param_dist = {'max_depth': [2, 3, 4],
              'bootstrap': [True, False],
              'max_features': ['auto', 'sqrt', 'log2', None],
              'criterion': ['gini', 'entropy']}

cv_rf = GridSearchCV(fit_rf,
                     cv=10,
                     param_grid=param_dist,
                     n_jobs=3)

cv_rf.fit(training_set, class_set)

print('Best Parameters using grid search: \n', cv_rf.best_params_)

Once we have the best parameter combination from the grid search, we set those parameters on our model.

fit_rf.set_params(criterion='entropy', max_features='log2', max_depth=4)

Step 6) Find the number of decision trees to include in the forest:

Now we will find the “n_estimators” value, which is the number of decision trees in the forest. For this we are going to use the OOB (out-of-bag) error rate. For more details on OOB error please visit “OOB-Error”.

class_set = class_set.values.ravel()
test_class_set = test_class_set.values.ravel()

fit_rf.set_params(warm_start=True,
                  oob_score=True)

min_estimators = 15
max_estimators = 1000

error_rate = {}

for i in range(min_estimators, max_estimators + 1):
    fit_rf.set_params(n_estimators=i)
    fit_rf.fit(training_set, class_set)

    oob_error = 1 - fit_rf.oob_score_
    error_rate[i] = oob_error

# Convert dictionary to a pandas Series for easy plotting
oob_series = pd.Series(error_rate)

Plotting the OOB error rate to find the number of decision trees for our model:

fig, ax = plt.subplots(figsize=(10, 10))

ax.set_facecolor('#fafafa')

oob_series.plot(kind='line',
                color='red')
plt.axhline(0.055,
            color='#875FDB',
            linestyle='--')
plt.axhline(0.05,
            color='#875FDB',
            linestyle='--')
plt.xlabel('n_estimators')
plt.ylabel('OOB Error Rate')
plt.title('OOB Error Rate Across Various Forest Sizes \n(From 15 to 1000 trees)')

The OOB error rate levels off and merely oscillates at around 400 trees, so we will use our judgement and settle on 400 trees in our forest.

# Using the pandas Series object I can easily find the OOB error rate for that estimator count as follows:

print('OOB Error rate for 400 trees is: {0:.5f}'.format(oob_series[400]))

Set the “n_estimators” parameter on our model now:

fit_rf.set_params(n_estimators=400,
                  bootstrap=True,
                  warm_start=False,
                  oob_score=False)

Fitting the model now:

fit_rf.fit(training_set, class_set)

Now, predict the output for the test set and check the model accuracy (we can also create the confusion matrix):

predictions_rf = fit_rf.predict(test_set)
print(accuracy_score(test_class_set, predictions_rf))

Step 7) Evaluate the model using “Cross Validation” and check the accuracy:

Now we will use a technique known as “Cross Validation” to split our data into k folds and then train the model. This gives us a more reliable accuracy estimate and helps us guard against over-fitting as well.

Brief idea on “Cross Validation”:

Cross validation is a powerful tool for estimating the predictive power of your model, and it performs better than a single conventional train/test split. Using cross validation, we can create multiple training and test sets and average the scores to give us a less biased metric.

In this case, we will create 10 folds within our data set, repeat the estimation we have already done on each fold, and then average the prediction error to get a more accurate representation of our model’s predictive power. The model’s performance can vary significantly when different training and test sets are used.

Here we are employing k-fold cross validation; more specifically, 10 folds. We are creating 10 subsets of our data on which to employ the training and test set methodology. We then average the accuracy across all folds to give us our estimate.

Within a random forest context, if your data set is significantly large, you can choose to skip cross validation and instead use the OOB error rate as an unbiased metric to save computational cost. But for the purposes of this tutorial, I included it to show the different accuracy metrics available.

# K-Fold Cross Validation

def cross_val_metrics(fit, training_set, class_set, estimator, print_results=True):
    n = KFold(n_splits=10)
    scores = cross_val_score(fit,
                             training_set,
                             class_set,
                             cv=n)
    if print_results:
        for i in range(len(scores)):
            print("Cross validation run {0}: {1:0.3f}".format(i, scores[i]))
        print("Accuracy: {0:0.3f} (+/- {1:0.3f})".format(scores.mean(), scores.std() / 2))

cross_val_metrics(fit_rf,
                  training_set,
                  class_set,
                  'rf',
                  print_results=True)

Check the prediction and the model accuracy now:

predictions_rf = fit_rf.predict(test_set)
print(accuracy_score(test_class_set, predictions_rf))

conf_mat = confusion_matrix(test_class_set, predictions_rf)
print(conf_mat)

You will notice an improvement in the reported accuracy when we use the “Cross Validation” approach to evaluate our model. To explore k-fold cross validation further, please visit “k-Fold-Cross-Validation”.

Advantages and Disadvantages of Random Forest:

Advantages

The following are the advantages of the Random Forest algorithm:

  • It overcomes the problem of over-fitting by averaging or combining the results of different decision trees.
  • Random forests work well for a larger range of data items than a single decision tree does.
  • A random forest has less variance than a single decision tree.
  • Random forests are very flexible and possess very high accuracy.
  • Random Forest maintains good accuracy even when a large proportion of the data is missing.

Disadvantages

The following are the disadvantages of the Random Forest algorithm:

  • Complexity is the main disadvantage of Random Forest.
  • Constructing a random forest is harder and more time-consuming than building a decision tree.
  • More computational resources are required to implement the Random Forest algorithm.
  • The prediction process using random forests is more time-consuming than with many other algorithms.

Conclusion:

Random forest is one of those algorithms that comes to the mind of every data scientist for almost any problem. It has been around for a long time and has been used successfully for such a wide range of tasks that it has become common to think of it as a basic necessity. It is a versatile algorithm that can be used for both regression and classification.
