Random Forest in Machine Learning 🌳
By referring to this article, you will get a better idea of the Random Forest algorithm and how to use it with a real data set. The article covers the following topics.
◼ Types of Machine Learning
◼ Decision Tree and important terms
◼ What is the Random Forest?
◼ Why Random Forest?
◼ Difference between Decision Tree and Random Forest
◼ Application with a real-world data set
Types of Machine Learning
Before moving on to the random forest, let's discuss some basics of machine learning. Machine learning is mainly divided into three categories:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
Supervised Learning (Inputs are provided as a labeled dataset)
- Classification problem => problems where the outputs are categorical in nature, such as yes or no, true or false, or 0 or 1.
Algorithms for this kind of problem fall into five main categories:
1. Naïve Bayes classifier
2. Support Vector machine
3. Logistic regression
4. Decision tree
5. K-Nearest Neighbor (KNN)
- Regression problem => problems with continuous outputs, such as predicting the number of rooms or the price of land in a city.
Algorithms used for this kind of problem include:
1. Linear Regression
2. Nonlinear Regression
3. Bayesian Linear Regression
The other two types (unsupervised learning and reinforcement learning) are not discussed here, since this article focuses only on the random forest.
Decision Tree and important terms
Before discussing the random forest, let's look at the decision tree and its important terms.
Decision trees are a type of supervised machine learning (the training data specifies what the input is and what the corresponding output should be) in which the data is continuously split according to certain parameters. The tree can be explained by two entities: decision nodes and leaves. Decision nodes are not 100% pure, whereas leaf nodes are 100% pure. Decision trees also handle non-linear data sets effectively, and the decision tree tool is used in many real-life areas such as engineering, civil planning, law, and business. Decision trees can be divided into two types: categorical variable and continuous variable decision trees.
Entropy => Entropy is the measure of randomness or unpredictability in the data set. The root node of the tree has the highest entropy, and as we move from the root toward the leaf nodes the entropy gradually decreases until it reaches 0. For a two-class node, entropy varies between 0 and 1: 0 means the node contains only one class (100% one outcome), and 1 means the two classes are split 50/50.
Information Gain => Information gain is the measure of the decrease in entropy after the data set is split. Basically, IG = E1 - E2, the difference between the entropy before and after splitting on an attribute. A good split reduces the degree of uncertainty.
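To make these two terms concrete, here is a minimal Python sketch (my own illustration, not code from the notebook) that computes the entropy of a node and the information gain of a simple binary split:

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of an array of class labels:
    # 0 for a pure node, 1 for a 50/50 split of two classes.
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(parent, left, right):
    # IG = entropy(parent) - weighted average entropy of the child nodes.
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = np.array([0, 0, 1, 1])                    # 50/50 node -> entropy = 1.0
left, right = np.array([0, 0]), np.array([1, 1])   # both children pure -> entropy = 0.0
print(entropy(parent))                              # 1.0
print(information_gain(parent, left, right))        # 1.0 (all uncertainty removed by the split)
```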
What is the Random Forest?
Let's discuss what the random forest is. A random forest (or random decision forest) is a method/algorithm that operates by constructing multiple decision trees during the training phase, and it is categorized under the supervised learning techniques. The decision of the majority of the trees is chosen by the random forest as the final decision. The diagram below shows the high-level concept of how the random forest method operates.
So, you can see that decision tree 1 gives apple as the output for my test fruit, decision tree 2 gives orange, and decision tree 3 gives apple again. Taking all the trees together as the random forest, we find the majority output among these results and select it as the final answer.
Let's go a little further to understand how the random forest works.
In a random forest, we load our data set and then choose random features (columns) and random rows from it, and build a decision tree on them using the same procedure as before. We then select a different random set of features and rows and build another decision tree, repeating this procedure as many times as we want. In the decision tree method we consider all the features and rows in the data set, but here it is different: the high variance of a single decision tree's output is converted into a low-variance one, as the simplified sketch below illustrates.
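In this sketch, each tree sees a bootstrap sample of the rows and a random subset of the features, and the forest predicts by majority vote. The function names and parameter choices are my own assumptions for illustration, not the notebook's code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_simple_forest(X, y, n_trees=10, seed=42):
    # Assumes X is a NumPy feature matrix and y holds integer class labels.
    rng = np.random.default_rng(seed)
    n_rows, n_features = X.shape
    n_sub_features = max(1, int(np.sqrt(n_features)))  # common default: sqrt of the feature count
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, n_rows, size=n_rows)                         # bootstrap sample of rows
        cols = rng.choice(n_features, size=n_sub_features, replace=False)   # random subset of features
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        trees.append((tree, cols))
    return trees

def predict_simple_forest(trees, X):
    # Each tree votes using only its own feature subset; the class with the most votes wins.
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in trees])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), axis=0, arr=votes)
```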
This is the high-level diagram of the procedure I have described above; assume it is a binary classification problem.
Why Random Forest?
Let's discuss some reasons why we need the random forest. When training a model in machine learning, overfitting of the data is a common problem. Overfitting is a concept in data science that occurs when a statistical model fits too exactly against its training data. When this happens, the algorithm unfortunately cannot perform accurately against unseen data, defeating its purpose, since the generalization of a model to new data is ultimately what allows us to use machine learning algorithms every day to make predictions and classify data. When we use the random forest as our algorithm, we can avoid or reduce this overfitting problem, because it combines multiple decision trees, which reduces the risk of overfitting.
In addition, the accuracy of this algorithm is high and it runs efficiently on large databases, so it produces highly accurate predictions for large data sets.
Furthermore, even if a large proportion of the data is missing, the random forest algorithm can maintain its accuracy, which is a very valuable feature of this algorithm.
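As a rough way to see the overfitting point in practice, one could compare a single decision tree with a random forest on held-out data, along these lines (an illustrative sketch on a synthetic data set, not part of this article's notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data, chosen here only for illustration.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# A single tree typically scores perfectly on the training set (a sign of overfitting),
# while the forest's averaged votes tend to generalize better to the test set.
print("tree   train/test:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("forest train/test:", forest.score(X_train, y_train), forest.score(X_test, y_test))
```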
Difference between Decision Tree and Random Forest
As is implied by the names "Tree" and "Forest," a random forest is essentially a collection of decision trees. A decision tree is built on an entire data set, using all the features or variables of interest, whereas a random forest randomly selects observations (rows) and specific features to build multiple decision trees and then aggregates their results. After many trees are built using this method, each tree "votes" for a class, and the class receiving the most votes by simple majority is the "winner," or predicted class. There are of course some more detailed differences, but this is the main conceptual difference. Hence the accuracy of the random forest's final output is higher, and its variance lower, than that of a single decision tree.
Application with real-world data set
For my real-world data set, I have chosen the Iris flower data set, because it is a very popular data set.
The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
Reference: https://en.wikipedia.org/wiki/Iris_flower_data_set
So, my application is to classify the flower plants according to their features and, after training on the data set, identify an unknown plant from these 4 features. The code explanation, how I carried out the task, and the code flow are described in the Colab notebook.
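The full code flow is in the Colab notebook linked below; as a rough outline, the approach looks something like this minimal sketch (a simplified version of my own, using scikit-learn's RandomForestClassifier and example feature values for the unknown plant):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris data set: 150 samples, 4 features, 3 species.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Train a random forest of 100 trees and check its accuracy on held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Classify an "unknown" plant from its 4 features:
# [sepal length, sepal width, petal length, petal width] in cm (example values).
unknown = [[5.1, 3.5, 1.4, 0.2]]
print("Predicted species:", iris.target_names[model.predict(unknown)[0]])
```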
Link of Colab Notebook: https://colab.research.google.com/drive/1rmGafkDNw2xVw_Ovf3C8PnoUah3-89lU?usp=sharing