Random Forest

RIO
Nov 18, 2019

Random forest is a method for improving the accuracy of predictions.

Before studying random forest, let’s go over decision trees quickly: you should understand decision trees first, because random forest is built on top of them.

Here is my story about random forest.

What is Random Forest?

Random forest is an ensemble learning method: as the name implies, it is a classifier that combines multiple decision trees.

The fundamental idea behind random forest is really simple, but powerful: the wisdom of crowds. In data science speak, the reason that the random forest model works so well is:

A large number of relatively uncorrelated models(trees) operating as a committee will outperform any of the individual constituent models.

For example, suppose the individual trees vote on an unknown input: if the majority of the independent decision trees predict 1, the forest’s prediction is 1.

As you can see, a random forest can reach higher accuracy than any single tree.
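
To make the committee idea concrete, here is a minimal sketch with scikit-learn; the synthetic dataset and parameters are just for illustration, not from any particular experiment.

```python
# A minimal sketch of a random forest as a committee of trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 decision trees, each trained on a different bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("forest accuracy:", accuracy_score(y_test, forest.predict(X_test)))

# Hard votes of the individual trees for one unknown input.
# (scikit-learn itself averages class probabilities across trees
# rather than counting hard votes, but the committee idea is the same.)
votes = np.array([tree.predict(X_test[:1])[0] for tree in forest.estimators_])
print("share of trees voting 1:", votes.mean())
```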

Low correlation between the decision trees is the crucial point when we apply random forest. The reason this matters is the same as for a portfolio of stocks and shares: the investments in a portfolio should be uncorrelated with each other, so that when some of them fail, the others cover the loss. We can say the same thing about a random forest: if the decision trees are uncorrelated, they can cover each other’s misjudgements. Therefore, there are two prerequisites for random forest (a quick way to check the second is sketched after the list):

  1. There needs to be some actual signal in our features so that models built using those features do better than random guessing.
  2. The predictions (and therefore the errors) made by the individual trees need to have low correlations with each other.
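
Continuing the sketch above, one quick way to eyeball the second prerequisite is to compute the pairwise correlation between the trees’ predictions (this diagnostic is just my illustration):

```python
# Pairwise correlation between the trees' test-set predictions,
# reusing `forest` and `X_test` from the sketch above.
import numpy as np

preds = np.array([tree.predict(X_test) for tree in forest.estimators_])
corr = np.corrcoef(preds)                        # (n_trees, n_trees) matrix
off_diag = corr[~np.eye(len(corr), dtype=bool)]  # drop the diagonal of 1s
print("mean correlation between trees:", off_diag.mean())
```

The lower this number, the more the trees’ errors cancel out in the vote.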

Being uncorrelated is very important

Let’s assume we play a game with the following rules:

・A number is drawn uniformly at random from 1 to 100 (so the chance of drawing a number greater than 40 is exactly 0.6)

・If the drawn number is greater than 40, you win the amount you bet. If it is less than or equal to 40, you lose the amount you bet.

  1. Game 1: play 100 times, betting $1 each time
  2. Game 2: play 10 times, betting $10 each time
  3. Game 3: play once, betting $100

Of course, you want to come away with at least $1. Which game do you choose? The expected value of each game is exactly the same:

Expected value of game 1: (0.6*1 + 0.4*(-1))*100 = 20

Expected value of game 2: (0.6*10 + 0.4*(-10))*10 = 20

Expected value of game 3: (0.6*100 + 0.4*(-100))*1 = 20

First, let’s simulate the games and visualise the results as a distribution.

Simulating each game many times and recording the outcome of every trial gives the distribution of profits.

From those distributions we can read off the probability of making at least $1. Playing 100 times brings you money with a probability of about 97%, while playing 10 times does so at about 63% and playing once at 60%.
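
You can check these percentages yourself with a quick Monte Carlo simulation (a minimal sketch, not the code behind the original plots):

```python
# Simulate the three games and count how often we end up
# with a profit of at least $1.
import random

def play(n_bets, stake, trials=100_000):
    """Fraction of trials ending with a profit of at least $1."""
    wins = 0
    for _ in range(trials):
        profit = 0
        for _ in range(n_bets):
            number = random.randint(1, 100)      # uniform draw from 1..100
            profit += stake if number > 40 else -stake
        if profit >= 1:
            wins += 1
    return wins / trials

print("Game 1 (100 bets of $1):", play(100, 1))  # ~0.97
print("Game 2 (10 bets of $10):", play(10, 10))  # ~0.63
print("Game 3 (1 bet of $100):", play(1, 100))   # ~0.60
```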

This example shows that even when the expected value is the same, the distribution of outcomes can be completely different depending on how you split the trials. The more we split the $100 into smaller bets, the higher the probability of walking away with a profit. And this only works when the draws (in this case, the generated numbers) are completely independent of each other.

Hence, if the decision trees are close to independent and there are many of them, the accuracy of the model will be higher, for exactly the reason illustrated above.

How to keep the models independent

So how can we keep the decision trees independent of each other?

We can use a method called bagging (short for bootstrap aggregating). Bagging builds a new dataset for each tree by sampling with replacement from the parent dataset, then aggregates the trees’ predictions. I already introduced it in a previous story on Medium:

https://medium.com/@ryotennis0503/ensemble-learning-95624d831407
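
In short, bagging looks roughly like this (a hand-rolled sketch for binary labels; the function names are mine):

```python
# Bagging: train each tree on a bootstrap sample (drawn with
# replacement) of the parent dataset, then take a majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    votes = np.stack([t.predict(X) for t in trees])  # (n_trees, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)    # majority vote
```

Each bootstrap sample leaves out roughly a third of the rows, so every tree sees a slightly different dataset, and that sampling noise keeps the trees’ errors from being perfectly correlated. (A full random forest also samples a random subset of features at each split, which decorrelates the trees even further.)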
