Random Forest with Practical Implementation

Amir Ali
The Art of Data Science
23 min read · Jul 17, 2018

In this chapter, we will discuss the Random Forest algorithm, a supervised machine learning algorithm used for both classification and regression problems.

This chapter spans 3 parts:

  1. What is a Random Forest Algorithm?
  2. How does the Random Forest Algorithm work for Classification and Regression?
  3. Practical Implementation of Random Forest in Scikit-Learn.

1. What is a Random Forest Algorithm?

Random Forest is a supervised algorithm used for both classification and regression problems. As the name suggests, it builds a forest of decision trees and adds randomness to the way each tree is grown. In general, the larger the number of trees, the more accurate and stable the results.

Suppose we have a training set [A, B, C, D] with corresponding labels [R1, R2, R3, R4].

Random Forest creates decision trees from random subsets of this training set, for example three trees:

Tree 1 = [A, B, C]

Tree 2 = [A, B, D]

Tree 3 = [B, C, D]

Finally, it predicts based on the majority of votes from the individual decision trees.

In the Random Forest algorithm we therefore build multiple decision trees rather than a single one.

Each tree is built using measures such as information gain, entropy, and gain.

1.1 Information Gain:

Information Gain = I(p, n) = -(p / (p + n)) · log2(p / (p + n)) - (n / (p + n)) · log2(n / (p + n))

What are p and n here? To find p and n we look at the class attribute (the outcome), which is binary (0, 1). For p we count the positive examples (value 1), and for n we count the negative examples (value 0). We go deeper into the mathematical part later; this is just the introduction.

1.2 Entropy:

Entropy is used to decide where to split when building a tree. We compute the entropy of an attribute A with respect to the class as the weighted average of the information over its values: E(A) = Σ ((pᵢ + nᵢ) / (p + n)) · I(pᵢ, nᵢ).

1.3 Gain:

Gain = Information Gain - Entropy

Gain = I(p, n) - E(A)

The gain is computed in this way for each attribute of the training set, one by one, and the attribute with the highest gain is chosen as the split.
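As a minimal sketch (my own helper names, not code from the original chapter), these three quantities can be computed in Python as follows; they are only meant to make the worked examples below easier to check:

```python
import numpy as np

def information(p, n):
    """I(p, n): information needed to classify a set with p positive and n negative examples."""
    total = p + n
    terms = []
    for count in (p, n):
        if count > 0:
            frac = count / total
            terms.append(-frac * np.log2(frac))
    return sum(terms)

def entropy(branches):
    """E(A): weighted average information over the branches of attribute A.
    `branches` is a list of (p_i, n_i) tuples, one per attribute value."""
    total = sum(p + n for p, n in branches)
    return sum(((p + n) / total) * information(p, n) for p, n in branches)

def gain(p, n, branches):
    """Gain(A) = I(p, n) - E(A)."""
    return information(p, n) - entropy(branches)
```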

1.4 Real-Life Example

Suppose I want to buy a smartwatch and I ask a friend for advice. My friend asks what type of watch I like and what price range I have in mind. Based on my answers, my friend starts to give recommendations. In this analogy, my friend acts as a decision tree.

I then ask more friends for advice, because a single friend might not lead me to the best choice. My other friends also ask me random questions and each finally recommends a watch. I consider the watch with the most votes.

Here note down points:

Each friend created rules based on my answers and used those rules to find a watch that matched them.

So, based on the highest number of votes, I select a smartwatch. This is the typical Random Forest approach.

2. How does the Random Forest Algorithm Work in Classification and Regression?

To classify a new object based on its attributes, we grow multiple trees, as opposed to the single tree of the CART model.

Each tree gives a classification, and we say the tree "votes" for that class.

There are two stages in the Random Forest algorithm: the first is to create the random forest, and the second is to make predictions with the classifier created in the first stage.

To train a random forest with B trees, iterate b = 1, …, B times:

Sample, with replacement, a training sample from the training set.

Train a decision tree on that sampled training set.

Prediction step:

Take the majority vote of all decision trees.

PSEUDO CODE:

Each tree is planted and grown as follows:

  1. If there are M input variables or features, a number m < M is specified such that at each node, m variables are selected at random out of the M. The best split on these m is used to split the node. The value of m is held constant while the forest is grown.
  2. Each tree is grown to the largest extent possible and there is no pruning.
  3. Predict new data by aggregating the predictions of the trees (i.e. majority vote for classification, average for regression); a minimal code sketch of this procedure follows.
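The following is a minimal, illustrative sketch of this pseudocode in Python, using scikit-learn's DecisionTreeClassifier as the base learner. It is not the chapter's original code; the function names are mine, and for simplicity it samples the m features once per tree rather than at every node as the pseudocode describes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=10, m_features=None, seed=0):
    """Grow n_trees trees, each on a bootstrap sample using a random subset of m features."""
    rng = np.random.RandomState(seed)
    n_samples, n_total = X.shape
    m = m_features or max(1, int(np.sqrt(n_total)))
    forest = []
    for _ in range(n_trees):
        rows = rng.randint(0, n_samples, size=n_samples)   # sample with replacement
        cols = rng.choice(n_total, size=m, replace=False)  # m of the M features
        tree = DecisionTreeClassifier()                    # grown fully, no pruning
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def predict_random_forest(forest, X):
    """Aggregate by majority vote (for regression, replace the vote with an average)."""
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    # Assumes integer class labels (e.g. 0/1).
    return np.array([np.bincount(v.astype(int)).argmax() for v in votes.T])
```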

2.1. Random Forest For Classification.

This dataset has 14 instances and five attributes. The first four attributes are predictors and the last attribute is the target.

Solution:

This dataset contains 14 instances and 4 predictor attributes. In this solution we build three decision trees. We divide the dataset into three parts: for the first tree we take the Day 1 to Day 3 instances, for the second tree the Day 3 to Day 6 instances, and for the third tree the Day 7 to Day 9 instances. In line with our pseudocode, we use only a subset of the dataset, roughly two-thirds of it in total.

Note: we study decision trees in detail in Chapter 4. Here we build multiple decision trees with the Random Forest algorithm, so we only work through the dataset and build the trees without a full explanation. If you don't yet know how to build a tree using gain and entropy, please study Chapter 4, which contains a complete explanation of decision trees, and then come back to this chapter.

2.1.1 For the first decision tree, we take the Day 1 to Day 3 instances

Information Gain

p: Yes = 1

n: No = 2
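As a quick check of this value (my calculation, not part of the original article):

```python
from math import log2

# I(p, n) with p = 1 "Yes" and n = 2 "No" examples
i_pn = -(1/3) * log2(1/3) - (2/3) * log2(2/3)
print(round(i_pn, 4))  # 0.9183, matching the 0.9182 used below
```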

Outlook

So Entropy of Outlook:

Gain (Outlook):

Gain (outlook) = Information Gain - Entropy (outlook)

Gain (outlook) = I(p, n) - Entropy (outlook)

Gain (outlook) = 0.9182 - 0

Gain (outlook) = 0.9182

Humidity

So Entropy of Humidity:

Gain (Humidity):

Gain (Humidity) = Information Gain - Entropy (Humidity)

Gain (Humidity) = I(p, n) - Entropy (Humidity)

Gain (Humidity) = 0.9182 - 0.9182

Gain (Humidity) = 0

Wind

So Entropy of Wind:

Gain (Wind):

Gain (Wind) = Information Gain - Entropy (Wind)

Gain (Wind) = I(p, n) - Entropy (Wind)

Gain (Wind) = 0.9182 - 0.66

Gain (Wind) = 0.25829

Now

Gain of Outlook = 0.91829

Gain of Humidity = 0

Gain of Wind = 0.25829

2.1.2 For the second decision tree, we take the Day 3 to Day 6 instances.

Information Gain

p: Yes = 3

n: No = 1

Outlook

Entropy of Outlook:

Gain (Outlook):

Gain (outlook) = Information Gain - Entropy (outlook)

Gain (outlook) = I(p, n) - Entropy (outlook)

Gain (outlook) = 0.81127 - 0.68865

Gain (outlook) = 0.12262

Humidity

So Entropy of Humidity:

Gain (Humidity):

Gain (Humidity) = Information Gain - Entropy (Humidity)

Gain (Humidity) = I(p, n) - Entropy (Humidity)

Gain (Humidity) = 0.81127 - 0.5

Gain (Humidity) = 0.31127

Wind

So Entropy of Wind

Gain (Wind):

Gain (Wind) = Information Gain - Entropy (Wind)

Gain (Wind) = I(p, n) - Entropy (Wind)

Gain (Wind) = 0.81127 - 0

Gain (Wind) = 0.81127

Now

Gain of Outlook = 0.12262

Gain of Humidity = 0.31127

Gain of Wind = 0.81127

2.1.3 For the third decision tree, we take the Day 7 to Day 9 instances

Information Gain

p: Yes = 2

n: No = 1

Outlook

So Entropy of Outlook:

Gain (Outlook):

Gain (outlook) = Information Gain - Entropy (outlook)

Gain (outlook) = I(p, n) - Entropy (outlook)

Gain (outlook) = 0.9182 - 0.66

Gain (outlook) = 0.2582

Humidity

So Entropy of Humidity:

Gain (Humidity):

Gain (Humidity) = Information Gain - Entropy (Humidity)

Gain (Humidity) = I(p, n) - Entropy (Humidity)

Gain (Humidity) = 0.9182 - 0

Gain (Humidity) = 0.9182

Wind

So Entropy of Wind:

Gain (Wind):

Gain (Wind) = Information Gain - Entropy (Wind)

Gain (Wind) = I(p, n) - Entropy (Wind)

Gain (Wind) = 0.9182 - 0.66

Gain (Wind) = 0.2582

Now

Gain of Outlook = 0.2582

Gain of Humidity = 0.9182

Gain of Wind = 0.2582

These are the three decision trees we built above.

In a single decision tree we make the decision from one tree only, but in a random forest we build multiple decision trees and decide by majority vote. If we look at the three trees built above, the "Play" class wins in the first tree, in the second tree, and likewise in the third, so the majority vote is "Play" and the prediction is that we should play.
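As an illustrative sketch (mine, not the article's code), the final vote over the three trees can be computed like this:

```python
from collections import Counter

# Hypothetical predictions of the three trees for a new day
tree_predictions = ["Play", "Play", "Play"]
majority_vote = Counter(tree_predictions).most_common(1)[0][0]
print(majority_vote)  # "Play"
```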

So the final output is obtained from the above three decisions on the basis of the majority vote.

Note: If you want this article, check out my academia.edu profile.

2.2: Random Forest For Regression.

Dataset:

This dataset has 14 instances and five attributes. The first four attributes are predictors and the last attribute (Hours Played) is the target.

2.2.1 For the first tree, we take the first four instances of our dataset.

a) The standard deviation for one attribute:

Standard Deviation (S) is for tree building (branching).

The coefficient of variation (CV) is used to decide when to stop branching. We can use the count (n) as well.

Average (Avg) is the value in the leaf nodes.

b) The standard deviation for two attributes (target and predictor):

Standard Deviation Reduction

The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).
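A minimal sketch of these quantities in Python (my own helper names, not the chapter's code):

```python
import numpy as np

def std_dev(values):
    """Standard deviation S of the target values in a branch."""
    return np.std(values)

def coefficient_of_variation(values):
    """CV in percent, used as the stopping criterion (e.g. stop below 10%)."""
    return np.std(values) / np.mean(values) * 100

def sdr(target, attribute):
    """Standard deviation reduction from splitting `target` on a categorical `attribute`."""
    target = np.asarray(target, dtype=float)
    attribute = np.asarray(attribute)
    weighted = sum(np.mean(attribute == v) * np.std(target[attribute == v])
                   for v in np.unique(attribute))
    return np.std(target) - weighted
```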

Step 1: The standard deviation of the target is calculated.

Standard deviation (Hours Played) = 9.63

Step 2: The dataset is then split on the different attributes. The standard deviation for each branch is calculated. The resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction.

Similarly, find the standard deviation reduction for Temp, Humidity, and Wind, as shown in the table below:

Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.

Step 4a: The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches until all data is processed.

In practice, we need some termination criteria. For example, when the coefficient of deviation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%) and/or when too few instances (n) remain in the branch (e.g., 3).

Step 4b: In Humidity, the “High” branch needs further splitting because its CV is above the 10% threshold.

Step 4c: The “High” branch has a CV (43%) greater than the threshold (10%), so it needs further splitting. We select “Outlook” as the best node after “Humidity” because it has the largest SDR.

Step 4d: In Outlook, all branches have a CV below the 10% threshold, as we see in the table above, so there is no further splitting. Our first tree, built from the first four instances, is complete, as we see below.

2.2.2 For the second tree, we take instances five to eight of our dataset to build the second tree in the random forest.

a) The standard deviation for one attribute:

Standard Deviation (S) is for tree building (branching).

The coefficient of variation (CV) is used to decide when to stop branching. We can use the count (n) as well.

Average (Avg) is the value in the leaf nodes.

b) The standard deviation for two attributes (target and predictor):

Standard Deviation Reduction

The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding an attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).

Step 1: The standard deviation of the target is calculated.

Standard deviation (Hours Played) = 14.08

Step 2: The dataset is then split into different attributes. The standard deviation for each branch is calculated. The resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction.

Similarly, find the standard deviation reduction for Temp, Humidity, and Wind, as shown in the table below:

Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.

Step 4a: The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches until all data is processed.

In practice, we need some termination criteria. For example, when the coefficient of deviation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%) and/or when too few instances (n) remain in the branch (e.g., 3).

Step 4b: In Outlook, the “Overcast” and “Rainy” branches do not need further splitting because their CVs are below the threshold, but “Sunny” has a CV of 45%, so we split “Sunny” further.

Step 4c: The “Sunny” branch has a CV (45%) greater than the threshold (10%), so it needs further splitting. We select “Windy” as the best node after “Outlook” because it has the largest SDR.

Step 4d: In Windy, both the “True” and “False” branches have CVs above the 10% threshold, as we see in the table above, so both need further splitting.

Step 4e: The “False” and “True” branches have CVs (15% and 26%) greater than the threshold (10%), so they need further splitting. We select “Temp” as the best node after “Windy” because it has the largest SDR.

Step 4f: In Temp, the “Mild” branch has a CV below the 10% threshold and does not need further splitting, but as we see in the table above, “Cool” has a CV above 10%, so the “Cool” branch is split further.

Step 4g: The “Cool” branch has a CV (12%) greater than the threshold (10%), so it needs further splitting. We select “Humidity” as the best node after “Temp” because it has the largest SDR.

Step 4h: In Humidity, the “High” branch has a CV below the 10% threshold and does not need further splitting, but as we see in the table above, “Normal” has a CV above 10%, so the “Normal” branch is split further.

Step 4i: The “Normal” branch has a CV (12%) greater than the threshold (10%), which would call for further splitting, but notice that no further attribute is left, so the tree above is final.

2.2.3 For the third tree, we take the last four instances of our dataset to build the third tree in the random forest.

a) The standard deviation for one attribute:

Standard Deviation (S) is for tree building (branching).

The coefficient of variation (CV) is used to decide when to stop branching. We can use the count (n) as well.

Average (Avg) is the value in the leaf nodes.

b) The standard deviation for two attributes (target and predictor):

Standard Deviation Reduction

The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding an attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).

Step 1: The standard deviation of the target is calculated.

Standard deviation (Hours Played) = 11.40

Step 2: The dataset is then split into different attributes. The standard deviation for each branch is calculated. The resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction.

Similarly, find the standard deviation reduction for Temp, Humidity, and Wind, as shown in the table below:

Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.

Step 4a: The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches until all data is processed.

In practice, we need some termination criteria. For example, when the coefficient of deviation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%) and/or when too few instances (n) remain in the branch (e.g., 3).

Step 4b: In Outlook, the “Sunny” and “Rainy” branches do not need further splitting because their CVs are below the threshold, but “Overcast” has a CV of 26%, so we split “Overcast” further.

Step 4c: The “Overcast” branch has a CV (26%) greater than the threshold (10%), so it needs further splitting. We select “Humidity” as the best node after “Outlook” because it has the largest SDR.

Step 4d: In Humidity, the “High” branch has a CV below the 10% threshold, so we do not split it. The “Normal” branch is above the 10% threshold, so we split “Normal” further.

Step 4e: The “Normal” branch has a CV (34%) greater than the threshold (10%), so it needs further splitting. We select “Temp” as the best node after “Humidity” because it has the largest SDR.

Step 4f: In Temp, the “Hot” branch is below the 10% threshold and does not need further splitting, but as we see in the table above, “Mild” has a CV of 28%, which is above 10%, so the “Mild” branch is split further.

Step 4g: The “Mild” branch has a CV (28%) greater than the threshold (10%), so it needs further splitting. We select “Windy” as the best node after “Temp” because it has the largest SDR.

Step 4h: In Windy, the “False” branch has a CV below the 10% threshold and does not need further splitting, but as we see in the table above, “True” has a CV of 28%, which is above 10%, so the “True” branch is split further.

Step 4i: The “True” branch has a CV (28%) greater than the threshold (10%), which would call for further splitting, but notice that no further attribute is left, so the tree above is final.

Summary: That's the end of the Random Forest regression example. We split our dataset into three parts, built a tree from each part, and obtained three trees. To predict the Hours Played for a new instance, we aggregate the predictions of the three trees (for regression the aggregation is the average rather than a majority vote).

Note: If you want this article, check out my academia.edu profile.

3. Practical Implementation of Random Forest in Scikit-Learn.

3.1: Classification Approach

Dataset Description:

This dataset has 400 instances and 5 attributes: User ID, Gender, Age, Estimated Salary, and Purchased. The first four attributes (User ID, Gender, Age, Estimated Salary) are predictors and the last attribute, Purchased, is the target attribute.

Part 1: Data Preprocessing:

1.1 Import the Libraries

In this step, we import three libraries for the data preprocessing part. A library is a tool that you can use to do a specific job. First we import the numpy library, used for multidimensional arrays, then the pandas library, used to import the dataset, and finally the matplotlib library, used for plotting graphs.
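The article's code was shown as screenshots; a minimal reconstruction of this step is:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```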

1.2 Import the dataset

In this step, we import the dataset; to do that we use the pandas library. After importing the dataset we define our predictor and target attributes. The predictor attributes (User ID, Gender, Age, and Estimated Salary) are what we call 'X' here, and Purchased is the target attribute, which we call 'y'.
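A reconstruction of this step. The file name Social_Network_Ads.csv is an assumption on my part, as is keeping only the numeric Age and Estimated Salary columns for X (the two features the later 2-D visualization relies on):

```python
dataset = pd.read_csv('Social_Network_Ads.csv')   # assumed file name
X = dataset.iloc[:, [2, 3]].values                # Age, Estimated Salary (assumed columns)
y = dataset.iloc[:, 4].values                     # Purchased
```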

1.3 Split the dataset for test and train

In this step, we split our dataset into a training set and a test set: 75% of the data is used for training and the remaining 25% for testing.
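A typical version of this step (random_state=0 is an assumed seed for reproducibility):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
```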

1.4 Feature Scaling

Feature Scaling is the most important part of data preprocessing. If we look at our dataset, some attributes contain numeric values on very different scales; for example, Age values are small while Estimated Salary values are very large. This can cause issues in our machine learning model, so we put all values on the same scale. There are two common methods for this: normalization and standardization (Standard Scaler).

Here we use the StandardScaler, imported from the sklearn library.
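A reconstruction of the scaling step:

```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # fit on training data, then transform
X_test = sc.transform(X_test)         # reuse the same scaling for the test set
```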

Part 2: Building the Random Forest classifier model:

In this part, we model our Random Forest Classifier model using Scikit Learn Library.

2.1 Import the Libraries

In this step, we build our Random Forest model; to do this, we first import the Random Forest classifier from the Scikit-Learn library.

2.2 Initialize our Random Forest Classifier model

In this step, we initialize our model. Note the parameters passed here: the first parameter is n_estimators, which is the number of trees in the forest (we chose 10 trees), and the second parameter is criterion, the function used to measure the quality of a split (supported criteria are “gini” for Gini impurity and “entropy” for information gain).
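A reconstruction of this step. The criterion='entropy' choice follows the text's emphasis on information gain, and random_state=0 is an assumed seed:

```python
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
```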

2.3 Fitting the Random Forest Classifier Model

In this step, we fit the training data (X_train, y_train) to our model.
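In code, this step is simply:

```python
classifier.fit(X_train, y_train)
```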

Part 3: Making the Prediction and Visualizing the result:

In this part, we make predictions on our test set and visualize the results using the matplotlib library.

3.1 Predict the test set Result

In this step, we predict our test set result.
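In code, this step is simply:

```python
y_pred = classifier.predict(X_test)
```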

3.2 Confusion Matrix

In this step, we build a confusion matrix of our test set results. To do that, we import confusion_matrix from sklearn.metrics and pass two parameters: the first is y_test, the actual test set labels, and the second is y_pred, the predicted labels.
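In code:

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)
```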

3.3 Accuracy Score

In this step, we calculate the accuracy score from the actual and predicted test set results.
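In code:

```python
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))
```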

3.4 Visualize our Test Set Result

In this step, we visualize our test set results; to do this we use the matplotlib library. We can see that only 8 points are mapped incorrectly and the remaining 92 are mapped correctly in the graph, according to the model's test set predictions.
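A reconstruction of the usual decision-boundary plot for this example (it assumes the two scaled features, Age and Estimated Salary, chosen above):

```python
from matplotlib.colors import ListedColormap

X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))
Z = classifier.predict(np.c_[X1.ravel(), X2.ravel()]).reshape(X1.shape)
plt.contourf(X1, X2, Z, alpha=0.75, cmap=ListedColormap(('red', 'green')))
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=['red', 'green'][i], label=str(j))
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age (scaled)')
plt.ylabel('Estimated Salary (scaled)')
plt.legend()
plt.show()
```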

If you want the dataset and code, you can also check my GitHub profile.

3.2: Regression approach

Dataset Description:

This dataset contains information about position salaries and has three attributes: Position, Level, and Salary. It has 10 instances. Since the target attribute takes continuous values, this is a regression problem, and we use the random forest regression model to predict the salary on the basis of the predictor attributes, Position and Level.

Part 1: Data Preprocessing:

1.1 Import the Libraries

In this step, we import three libraries for the data preprocessing part. A library is a tool that you can use to do a specific job. First we import the numpy library, used for multidimensional arrays, then the pandas library, used to import the dataset, and finally the matplotlib library, used for plotting graphs.

1.2 Import the dataset

In this step, we import the dataset; to do that we use the pandas library. After importing the dataset we define our predictor and target attributes. Our predictor attributes are Position and Level, which we call 'X' here, and Salary is the target attribute, which we call 'y'.

Note: This dataset is very small (only 10 instances), so we do not split the data into training and test sets.
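A reconstruction of the preprocessing for this part. The file name Position_Salaries.csv is an assumption, and only the numeric Level column is kept as X, since Position is just a text label for the same information:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('Position_Salaries.csv')   # assumed file name
X = dataset.iloc[:, 1:2].values                  # Level, kept as a 2-D array
y = dataset.iloc[:, 2].values                    # Salary
```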

Part 2: Building the Random Forest Regression model:

In this part, we build our Random Forest Regression model using the Scikit-Learn library.

2.1 Import the Libraries

In this step, we build our Random Forest regression model; to do this, we first import the Random Forest regressor from the Scikit-Learn library.

2.2 Initialize our Random Forest model

In this step, we initialize our regressor model. Here n_estimators = 10 means we take 10 trees.
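A reconstruction of the import and initialization (random_state=0 is an assumed seed):

```python
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=10, random_state=0)
```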

2.3 Fitting the Random Forest Regressor Model

In this step, we fit the data to our model.
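In code (the whole dataset is used, since we did not split it):

```python
regressor.fit(X, y)
```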

Part 3: Making the Prediction and Visualizing the result:

In this part, we make a prediction and visualize the result using the matplotlib library.

3.1 Predict the Result

Our target is a real value, so for the predict method we enter the value 6.5 (the position level), and our model gives a prediction of about 167,000.
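A reconstruction of this prediction (recent scikit-learn versions expect a 2-D array, hence the double brackets):

```python
y_pred = regressor.predict([[6.5]])
print(y_pred)   # the article reports a prediction of about 167000
```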

3.2 Visualize Result

In the visualization step, we make a graph of position level against salary, which is our predicted output. We plot the actual data points in red and the predicted salary curve as a blue line.
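A reconstruction of the usual high-resolution plot for this step:

```python
X_grid = np.arange(X.min(), X.max(), 0.01).reshape(-1, 1)
plt.scatter(X, y, color='red')                              # actual data points
plt.plot(X_grid, regressor.predict(X_grid), color='blue')   # predicted salaries
plt.title('Random Forest Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
```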

If you want the dataset and code, you can also check my GitHub profile.

End Notes:

If you liked this article, be sure to click ❤ below to recommend it and if you have any questions, leave a comment and I will do my best to answer.

To stay up to date with the world of machine learning, follow me. It's the best way to find out when I write more articles like this.

You can also follow me on GitHub for the code and dataset, follow me on Academia.edu for this article, follow me on Twitter, email me directly, or find me on LinkedIn. I'd love to hear from you.

That’s all folks, Have a nice day :)
