How Random Forest Algorithm Works in Machine Learning
This is one of the best introductions to the Random Forest algorithm. The author introduces the algorithm with a real-life story and then describes applications in four different fields to help beginners learn more about it.
To begin the article, the author highlights one advantage of the Random Forest algorithm that excites him: it can be used for both classification and regression problems. (The Random Forest algorithm also has other advantages, which are listed at the end of the article.) The author chose a classification task for this article because it is easier for a beginner to learn. Regression will be the subject of the author's next, upcoming article.
This article spans six parts:
- What is Random Forest algorithm?
- Why Random Forest algorithm?
- Random Forest algorithm real life example.
- How Random Forest algorithm works?
- Random Forest algorithm Application.
- Advantages of Random Forest algorithm.
What is Random Forest algorithm?
First, the Random Forest algorithm is a supervised classification algorithm. As its name suggests, it creates a forest of decision trees and makes it random in some way. According to the author, there is a direct relationship between the number of trees in the forest and the results it can get: the larger the number of trees, the more accurate the result tends to be. One thing to note is that creating the forest is not the same as constructing a single decision tree with the information gain or Gini index approach.
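One way to see why more trees can help is a simplified voting model. The sketch below assumes that every tree is independently correct with the same probability `p`. Real trees trained on overlapping data do not satisfy this, so it illustrates the trend rather than a guarantee. The function name `majority_vote_accuracy` is hypothetical and does not come from the original article.

```python
from math import comb

def majority_vote_accuracy(p, n_trees):
    """Probability that a strict majority of n_trees independent voters,
    each correct with probability p, gives the right answer.
    Assumes n_trees is odd so that ties cannot occur."""
    need = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
        for k in range(need, n_trees + 1)
    )

# Accuracy of the vote for forests of increasing size, with p = 0.6 per tree.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under this idealized assumption the vote becomes more reliable as trees are added, but only when each tree is better than a coin flip (`p > 0.5`).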
The author gives four links to help people who are working with decision trees for the first time to learn and understand them well. The decision tree is a decision support tool. It uses a tree-like graph to show the possible consequences. If you input a training dataset with targets and features into the decision tree, it will formulate a set of rules. These rules can be used to perform predictions. The author uses one example to illustrate this point: suppose you want to predict whether your daughter will like an animated movie. You can collect the past animated movies she liked and take some of their features as the input. Then, through the decision tree algorithm, you can generate the rules. You can then input the features of the new movie and see whether your daughter is likely to like it. The nodes are calculated and the rules are formed using information gain and Gini index calculations.
The difference between the Random Forest algorithm and the decision tree algorithm is that in Random Forest, the processes of finding the root node and splitting the feature nodes run randomly.
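To make the information gain and Gini index calculations concrete, here is a minimal sketch on a made-up version of the movie example. The feature names (`has_animals`, `is_musical`), the four movies, and the helper `split_quality` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_quality(rows, labels, feature, impurity):
    """Impurity decrease obtained by splitting on a boolean feature."""
    yes = [y for r, y in zip(rows, labels) if r[feature]]
    no = [y for r, y in zip(rows, labels) if not r[feature]]
    weighted = sum(
        len(part) / len(labels) * impurity(part) for part in (yes, no) if part
    )
    return impurity(labels) - weighted

# Hypothetical past movies and whether the daughter liked them.
movies = [
    {"has_animals": True, "is_musical": True},
    {"has_animals": True, "is_musical": False},
    {"has_animals": False, "is_musical": True},
    {"has_animals": False, "is_musical": False},
]
liked = [True, True, False, False]

for feature in ("has_animals", "is_musical"):
    print(
        feature,
        "information gain:", split_quality(movies, liked, feature, entropy),
        "Gini decrease:", split_quality(movies, liked, feature, gini),
    )
```

In this toy data `has_animals` separates liked from disliked movies perfectly, so both measures would pick it as the root node, while `is_musical` gives no improvement.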
Why Random Forest algorithm?
The author gives four advantages to illustrate why we use the Random Forest algorithm. The one the author mentions repeatedly is that it can be used for both classification and regression tasks. Overfitting is a critical problem that can make results worse, but the author states that if there are enough trees in the forest, the Random Forest classifier won’t overfit the model. The third advantage is that the Random Forest classifier can handle missing values, and the last is that it can be modeled for categorical values.
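The article does not spell out how one forest serves both task types. In the usual formulation, the difference lies only in how the individual tree outputs are combined: a majority vote for classification and an average for regression. A minimal sketch, with hypothetical function names and made-up tree outputs:

```python
from collections import Counter
from statistics import mean

def combine_classification(tree_outputs):
    """Majority vote over the class labels predicted by the trees.
    On a tie, the label that was seen first wins."""
    return Counter(tree_outputs).most_common(1)[0][0]

def combine_regression(tree_outputs):
    """Average of the numeric values predicted by the trees."""
    return mean(tree_outputs)

print(combine_classification(["like", "dislike", "like"]))  # like
print(combine_regression([3.0, 4.0, 5.0]))                  # 4.0
```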
Random Forest algorithm real life example.
In this section, the author gives us a real-life example to make the Random Forest algorithm easy to understand. Suppose Mady wants to go to different places that he may like for his two-week vacation, and he asks his friend for advice. His friend will ask where he has been to already, and whether he likes the places that he’s visited. Based on Mady’s answers, his friend starts to give the recommendation. Here, his friend forms the decision tree.
Mady wants to ask more friends for advice because he thinks one friend alone cannot help him make an accurate decision. So his other friends also ask him random questions and finally provide an answer. He takes the place with the most votes as his vacation decision. The author then provides an analysis of this example.
His one friend asked him some questions and gave the recommendation of the best place based on the answers. This is a typical decision tree algorithm approach. The friend created the rules based on the answers and used the rules to find the answer that matched the rules.
Mady’s friends also randomly asked him different questions and gave answers, which for Mady are votes for a place. In the end, the place with the most votes is the one Mady will choose to visit. This is the typical Random Forest algorithm approach.
How Random Forest algorithm works?
There are two stages in the Random Forest algorithm: the first is random forest creation, and the second is making a prediction from the random forest classifier created in the first stage. The whole process is shown below, and it’s easy to understand using the figure.
Here the author first shows the Random Forest creation pseudocode:
- Randomly select “k” features from the total “m” features, where k << m
- Among the “k” features, calculate the node “d” using the best split point
- Split the node into daughter nodes using the best split
- Repeat the first three steps until “l” number of nodes has been reached
- Build the forest by repeating the first four steps “n” times to create “n” trees
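The creation steps can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the author's implementation. It uses the Gini index to choose the best split. A depth limit stands in for the “l” nodes stopping rule. A new random feature subset is drawn at every node. Each tree is trained on a bootstrap sample, which is a standard ingredient of random forests but is not listed in the pseudocode. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Best (feature, threshold) among feature_ids, or None if nothing helps."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if left and right and score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, k, depth, rng):
    # Step 1: randomly select k of the m features.
    feature_ids = rng.sample(range(len(rows[0])), k)
    # Step 2: find the best split point among those k features.
    split = best_split(rows, labels, feature_ids) if depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    f, t = split
    # Step 3: split the node into two daughter nodes.
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    # Step 4: repeat on each daughter node until the depth limit is hit.
    return (
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left], k, depth - 1, rng),
        build_tree([r for r, _ in right], [y for _, y in right], k, depth - 1, rng),
    )

def build_forest(rows, labels, n_trees, k, depth=3, seed=0):
    rng = random.Random(seed)
    forest = []
    # Step 5: repeat the tree construction n_trees times.
    for _ in range(n_trees):
        # Assumption: each tree sees a bootstrap sample of the rows.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(
            build_tree([rows[i] for i in idx], [labels[i] for i in idx], k, depth, rng)
        )
    return forest

# Toy data: three numeric features per row, two classes.
rows = [[1, 5, 0], [2, 4, 1], [3, 7, 0], [6, 1, 1], [7, 3, 0], [8, 2, 1]]
labels = ["no", "no", "no", "yes", "yes", "yes"]
forest = build_forest(rows, labels, n_trees=5, k=2)
print(len(forest), "trees built")
```

Each tree is stored as a nested tuple `(feature, threshold, left, right)` with plain labels as leaves, which keeps the sketch short. A production implementation would normally use a dedicated node type.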
Figure 2 shows the process of randomly selecting features:
In the next stage, with the random forest classifier created, we will make the prediction. The random forest prediction pseudocode is shown below:
- Take the test features, use the rules of each randomly created decision tree to predict the outcome, and store the predicted outcome (target)
- Calculate the votes for each predicted target
- Take the predicted target with the most votes as the final prediction from the random forest algorithm
The process is easy to understand, yet it is effective.
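The three prediction steps can be sketched as follows. The three hand-written trees, their feature names (`age`, `budget`), and the destination labels are hypothetical and loosely borrow from the vacation story. In practice the trees would come from the creation stage.

```python
from collections import Counter

# Each hypothetical tree is a nested tuple (feature, threshold, left, right);
# anything that is not a tuple is a leaf holding the predicted target.
trees = [
    ("age", 30, "beach", "mountains"),
    ("budget", 1000, "beach", ("age", 50, "city", "mountains")),
    ("budget", 500, "city", "beach"),
]

def predict_tree(node, features):
    """Follow one tree's rules until a leaf (the predicted target) is reached."""
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = left if features[feature] <= threshold else right
    return node

def predict_forest(trees, features):
    # Step 1: collect and store each tree's predicted target.
    predictions = [predict_tree(tree, features) for tree in trees]
    # Step 2: count the votes for each predicted target.
    votes = Counter(predictions)
    # Step 3: the target with the most votes is the final prediction.
    return votes.most_common(1)[0][0], votes

winner, votes = predict_forest(trees, {"age": 40, "budget": 800})
print(winner, dict(votes))  # beach {'mountains': 1, 'beach': 2}
```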
Random Forest algorithm Application.
In this article, the author gives four application areas for the Random Forest algorithm: banking, medicine, the stock market, and e-commerce.
- In banking, the Random Forest algorithm is used to find loyal customers, meaning customers who take out plenty of loans and pay interest to the bank properly, and fraudulent customers, meaning customers who have bad records such as failing to pay back a loan on time or taking dangerous actions.
- In medicine, the Random Forest algorithm can be used both to identify the correct combination of components in a medicine and to identify diseases by analyzing the patient’s medical records.
- In the stock market, the Random Forest algorithm can be used to identify a stock’s behavior and the expected loss or profit.
- In e-commerce, the Random Forest algorithm can be used to predict whether a customer will like the recommended products, based on the experience of similar customers.
Advantages of Random Forest algorithm.
Compared with other classification techniques, the author mentions three advantages.
- For classification problems, the Random Forest algorithm will avoid the overfitting problem.
- The same Random Forest algorithm can be used for both classification and regression tasks.
- The Random Forest algorithm can be used to identify the most important features in the training dataset, which the author describes as feature engineering.
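The article does not say how important features are identified. One common idea is to credit each feature with the impurity decrease of the splits that use it. The sketch below is a deliberately simplified version of that idea: every "tree" is a single split (a stump), and the synthetic data is built so that only feature 0 determines the label. It is an illustration under those assumptions, not the exact computation any particular library performs, and all names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_stump(rows, labels, feature_ids):
    """Return (feature, gini_decrease) of the best single split among feature_ids."""
    parent, n = gini(labels), len(labels)
    best = (None, 0.0)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            decrease = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if decrease > best[1]:
                best = (f, decrease)
    return best

def stump_forest_importance(rows, labels, n_trees=50, k=2, seed=0):
    """Normalized total Gini decrease credited to each feature."""
    rng = random.Random(seed)
    m = len(rows[0])
    totals = [0.0] * m
    for _ in range(n_trees):
        # Assumption: bootstrap sample of the rows for every stump.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample = [rows[i] for i in idx]
        sample_y = [labels[i] for i in idx]
        # Random subset of k features, as in the forest creation stage.
        f, decrease = best_stump(sample, sample_y, rng.sample(range(m), k))
        if f is not None:
            totals[f] += decrease
    total = sum(totals)
    return [t / total for t in totals] if total else totals

# Synthetic data: feature 0 decides the label, features 1 and 2 are noise.
data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(3)] for _ in range(40)]
labels = [int(r[0] > 0.5) for r in rows]

importance = stump_forest_importance(rows, labels)
print([round(v, 2) for v in importance])
```

On this data, feature 0 should receive by far the largest share, because it wins the split whenever it is among the randomly selected features.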
Tech Analyst: Shixin Gu | Reviewer: Qingtong Wu