Journey through Random Forests

Karthik Sundar
Published in Delta Force
Feb 14, 2022

Random Forest is a machine learning algorithm widely used in classification and regression problems. As the name suggests, a random forest is a collection of decision trees. A single decision tree performs very well on the data it is trained on, but it often performs poorly on test data. To counter this, we introduce some randomness into the way the trees are built, and we build several of them.

Steps to build a Random Forest

  1. From the original dataset, we create a bootstrapped dataset: we randomly select entries from the original dataset, and duplicate entries are allowed. In ML lingo, this sampling with replacement, combined with aggregating the trees’ outputs later on, is called “bagging” (a small sketch of the sampling step follows this list).
  2. Now we build a decision tree on the bootstrapped dataset. Unlike an ordinary decision tree, at each branch we only consider a small random subset of the features (say, two of them) rather than all of them.
  3. We repeat the first two steps several times to build our forest.
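For instance, step 1 can be sketched in a couple of lines of pandas (the toy dataframe below is only a placeholder):

import pandas as pd

# A tiny stand-in for the original dataset
df = pd.DataFrame({"feature_a": [1, 2, 3, 4], "feature_b": [10, 20, 30, 40]})

# Draw the same number of rows with replacement; duplicates are allowed
bootstrapped = df.sample(n=len(df), replace=True, random_state=0)
print(bootstrapped)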

Working of a Random Forest

We run our new sample through every tree and then take the output predicted by the majority of the trees. To evaluate our random forest, we can use the samples which were not selected in step 1. These samples are called out-of-bag samples.
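If you just want to see the out-of-bag idea in action, scikit-learn exposes it directly; a minimal sketch, using synthetic data purely for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# oob_score=True evaluates every tree on the samples left out of its bootstrap
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)   # accuracy estimated from the out-of-bag samples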

How to deal with missing data?

There are two types of missing data,

  1. Missing Data in the original dataset used to create the random forest.
  2. Missing Data in the new sample that you want to categorize.

So when we have missing data in our original dataset, we make an initial guess and then refine it. For a categorical column, we first look at the other data points that have the same target value and pick the most frequent category among them. For a numerical column, we instead take the median. There are also other ways to deal with missing data in the original dataset.
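A rough pandas sketch of this initial guess (the helper name is mine, and grouping the numerical fill by the target value is a choice made here for illustration):

import pandas as pd

def initial_guess(df, target_col):
    # Fill missing values using only the rows that share the same target value
    filled = df.copy()
    for col in df.columns:
        if col == target_col:
            continue
        for target_value, group in df.groupby(target_col):
            mask = (filled[target_col] == target_value) & filled[col].isna()
            if df[col].dtype == object:
                fill_value = group[col].mode().iloc[0]   # most frequent category
            else:
                fill_value = group[col].median()         # median for numerical data
            filled.loc[mask, col] = fill_value
    return filled

Now that we have filled in the data, we refine these guesses using something called a proximity matrix.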

A proximity matrix is a two-dimensional grid describing the similarity of every pair of data points. Whenever two data points end up in the same leaf node of a decision tree, we add 1 to the corresponding cells. In the end, we divide every value in the matrix by the total number of trees.
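Assuming we already have a fitted scikit-learn forest, one way to build this matrix is with the forest’s apply() method, which tells us which leaf every sample lands in for every tree:

import numpy as np

def proximity_matrix(forest, X):
    # leaves[i, t] = index of the leaf that sample i reaches in tree t
    leaves = forest.apply(X)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        # add 1 wherever two samples share a leaf in this tree
        same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
        prox += same_leaf
    return prox / n_trees   # divide by the total number of trees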

Now, based on this matrix, we refine our guesses. For categorical data, we take each category’s proportion of occurrence and weight it by the proximities of the samples in that category; the category with the highest weighted frequency becomes the refined guess.

Let’s say we have already built the random forest and our new sample has some missing values. In this case, we make a copy of the sample for each possible target value.

Then we fill in the missing values for each copy using the proximity matrix. After doing this, we run both copies through all the trees and take the output predicted by the majority of the trees.

Project: Heart Attack Prediction

Now, let’s do a small project using a random forest classifier. First, we need to download the dataset from Kaggle. You can download it here.

After we have our dataset, let’s open a Jupyter notebook and start writing code.

First, let’s import the needed libraries and read the CSV file,
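Something along these lines (the file name assumes you saved the Kaggle CSV as heart.csv; adjust the path if needed):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("heart.csv")   # path to the downloaded dataset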

Let’s start looking into the data frame,
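A quick first look could be:

df.head()       # first few rows
df.info()       # column types and non-null counts
df.describe()   # summary statistics for the numerical columns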

We check for any duplicated values in the dataset.

df[df.duplicated()]

We get the following output, indicating that there is a duplicate row (in this dataset, at index 164).

Now, we have to delete that row,

updated_df = df.drop(164)

Now, our next step is to split the dataset into testing and training datasets. We decide to use 30% of the data as testing data.
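With scikit-learn’s train_test_split this is a one-liner; the snippet below assumes the target column is called output, as in the Kaggle heart attack dataset:

from sklearn.model_selection import train_test_split

X = updated_df.drop("output", axis=1)   # features
y = updated_df["output"]                # target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)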

Now we use the random forest classifier model from scikit-learn,
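A sketch of this step, with the hyperparameters left at their defaults:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

predictions = rf.predict(X_test)
print(accuracy_score(y_test, predictions))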

From the output, we can see that our model performed with an accuracy of about 87%. Let’s try to increase the accuracy by changing the number of trees in our forest.
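One simple way to explore this is to retrain the forest for a range of tree counts and plot the test accuracy (the search range below is arbitrary):

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

tree_counts = range(1, 200)
accuracies = []
for n in tree_counts:
    # no fixed random_state here, so the curve will vary from run to run
    model = RandomForestClassifier(n_estimators=n)
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

plt.plot(tree_counts, accuracies)
plt.xlabel("Number of trees")
plt.ylabel("Test accuracy")
plt.show()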

From the output graph, we can see that our accuracy peaks around 107 trees.

The catch here is that, because the trees are random, the graph tends to look different each time we run the experiment. This makes the “best” number of trees hard to pin down.

Building a Random Forest From Scratch

To build our tree, we will need the mathematical formulas for Entropy and Information Gain,

In the case of Binary Classification, we can say that entropy is as follows
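For a node where one class makes up a proportion p of the samples (and the other class 1 − p):

Entropy = −p·log₂(p) − (1 − p)·log₂(1 − p)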

From Entropy, we calculate another important parameter called information gain
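For a split of a parent node into a left and a right child:

Information Gain = Entropy(parent) − (n_left / n)·Entropy(left) − (n_right / n)·Entropy(right)

where n is the number of samples in the parent node and n_left, n_right are the sizes of the two children.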

Now it’s time to get to code,

First, let's code the functions to calculate Entropy and Information Gain
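A sketch of how these two helpers might look, assuming the labels come in as NumPy arrays:

import numpy as np

def entropy(y):
    # Entropy of an array of class labels
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / counts.sum()
    return -np.sum(probabilities * np.log2(probabilities))

def information_gain(parent, left, right):
    # Drop in entropy when `parent` is split into `left` and `right`
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children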

Now our next step is to generate our bootstrapped dataset from the given dataset.

We have also added a helper function to calculate the out-of-bag score to evaluate our tree.
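Here is one way to write these two helpers; X and y are assumed to be NumPy arrays, and predict_fn stands for whatever function predicts a label for a single row:

def bootstrap_sample(X, y):
    # Draw a bootstrapped dataset and remember the rows that were never picked
    n_samples = len(X)
    indices = np.random.randint(0, n_samples, size=n_samples)   # with replacement
    oob_indices = np.setdiff1d(np.arange(n_samples), indices)
    return X[indices], y[indices], oob_indices

def oob_score(predict_fn, X, y, oob_indices):
    # Fraction of out-of-bag samples that the tree classifies correctly
    predictions = np.array([predict_fn(row) for row in X[oob_indices]])
    return np.mean(predictions == y[oob_indices])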

Next, we need a function to split the tree at specific values of random features.

So we will define a function to select features at random and, for each selected feature, compute the information gain for each of its values. Finally, we return a dictionary, which is basically our node in the tree, consisting of the following (a rough sketch of this function comes right after the list):

  • the feature index
  • the value to split at
  • the left child node
  • the right child node
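One possible shape for that function, reusing the information_gain helper from above (X and y are NumPy arrays, and n_features is how many features we sample at each split):

def best_split(X, y, n_features):
    # Pick `n_features` features at random and keep the split with the highest gain
    feature_indices = np.random.choice(X.shape[1], n_features, replace=False)

    node = {"gain": -1}
    for feature_idx in feature_indices:
        for value in np.unique(X[:, feature_idx]):
            left_mask = X[:, feature_idx] <= value
            gain = information_gain(y, y[left_mask], y[~left_mask])
            if gain > node["gain"]:
                node = {
                    "feature_idx": feature_idx,               # the feature index
                    "split_value": value,                     # the value to split at
                    "left": (X[left_mask], y[left_mask]),     # data for the left child
                    "right": (X[~left_mask], y[~left_mask]),  # data for the right child
                    "gain": gain,
                }
    return node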

We also need a function to decide when to stop splitting a node. If the right child or the left child has 0 observations, we have a terminal node. We also create terminal nodes if we have reached max_depth, or if the number of samples in a child has dropped to min_samples_split or below.
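A sketch of that stopping logic, building on best_split; leaves simply store the most common class label, and anything that is not a dictionary is treated as a leaf later on:

def leaf_label(y):
    # A terminal node predicts the most common class among its samples
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def split_node(node, n_features, max_depth, min_samples_split, depth):
    X_left, y_left = node["left"]
    X_right, y_right = node["right"]

    # Terminal node if either child has 0 observations
    if len(y_left) == 0 or len(y_right) == 0:
        node["left"] = node["right"] = leaf_label(np.concatenate([y_left, y_right]))
        return
    # Terminal nodes if we have reached max_depth
    if depth >= max_depth:
        node["left"], node["right"] = leaf_label(y_left), leaf_label(y_right)
        return
    # Only keep splitting a child if it still has more than min_samples_split samples
    if len(y_left) <= min_samples_split:
        node["left"] = leaf_label(y_left)
    else:
        node["left"] = best_split(X_left, y_left, n_features)
        split_node(node["left"], n_features, max_depth, min_samples_split, depth + 1)
    if len(y_right) <= min_samples_split:
        node["right"] = leaf_label(y_right)
    else:
        node["right"] = best_split(X_right, y_right, n_features)
        split_node(node["right"], n_features, max_depth, min_samples_split, depth + 1)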

Our next step is to build our forest.
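Tying the helpers together, one rough version of the forest itself (default values such as n_features=2 mirror the earlier description but are otherwise arbitrary):

def build_tree(X, y, n_features, max_depth, min_samples_split):
    root = best_split(X, y, n_features)
    split_node(root, n_features, max_depth, min_samples_split, depth=1)
    return root

def predict_tree(node, row):
    # Walk down one tree until we hit a leaf (anything that is not a dict)
    if not isinstance(node, dict):
        return node
    if row[node["feature_idx"]] <= node["split_value"]:
        return predict_tree(node["left"], row)
    return predict_tree(node["right"], row)

def build_forest(X, y, n_trees=100, n_features=2, max_depth=10, min_samples_split=2):
    forest = []
    for _ in range(n_trees):
        X_boot, y_boot, _ = bootstrap_sample(X, y)
        forest.append(build_tree(X_boot, y_boot, n_features, max_depth, min_samples_split))
    return forest

def predict_forest(forest, row):
    # Majority vote across all the trees
    votes = [predict_tree(tree, row) for tree in forest]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

On the heart attack data, this could be used as forest = build_forest(X_train.to_numpy(), y_train.to_numpy()) followed by predict_forest(forest, row) on individual test rows.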

Thus, we have built a random forest from scratch.

Hope you guys enjoyed the article!

Cheers!
