Week 5 — Random Forest

Öner İnce
bbm406f19
Dec 29, 2019
Gorkha after the earthquake — AFP PHOTO / SAJJAD HUSSAIN

This week we continue with our next classification algorithms: Decision Tree and Random Forest. First, we will give brief information about these algorithms.

A “Decision Tree” is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label. The paths from the root to the leaves represent classification rules.

While splitting the data, decision trees use “Entropy”, which is a measure of randomness in the data-set. The algorithm calculates the entropy for each candidate feature and chooses the feature whose split gives the minimum entropy (i.e., the maximum information gain). Then it separates the data on that feature.
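As a small illustration (a sketch, not our project code), entropy and the information gain of a candidate split can be computed like this:

```python
import numpy as np

def entropy(labels):
    """Entropy of a label array: -sum(p * log2(p)) over the classes present."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(labels, left_mask):
    """Reduction in entropy after splitting the labels into two groups."""
    left, right = labels[left_mask], labels[~left_mask]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Toy example: a binary split over a small vector of damage grades
y = np.array([1, 1, 2, 2, 2, 5])
mask = np.array([True, True, True, False, False, False])
print(information_gain(y, mask))
```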

Decision tree

Even though it is easy to understand and implement, the decision tree has several disadvantages. The first, and probably the most important one, is that a decision tree can easily over-fit, because the algorithm can create overly complex trees that do not generalize the data well. We can mitigate this problem with “Pruning”, which removes sub-nodes of the decision tree so that it does not simply memorize the training data. Another disadvantage is that, while working with continuous numerical variables, the decision tree loses information when it discretizes them into separate categories.
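For example, with scikit-learn (used here only as an illustration, on synthetic data rather than our earthquake set), limiting the depth and applying cost-complexity pruning is one way to keep the tree from memorizing the training data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, just to compare an unpruned vs. a pruned tree
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree: grows until every leaf is pure, so it can over-fit badly
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Regularized tree: depth limit plus cost-complexity pruning (ccp_alpha)
pruned_tree = DecisionTreeClassifier(max_depth=8, ccp_alpha=0.001,
                                     random_state=0).fit(X_train, y_train)

print("full   train/test:", full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
print("pruned train/test:", pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))
```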

Random Forest

Random Forest is an ensemble learning method that operates by constructing multiple decision trees; the final decision is based on the majority vote of those trees. It first draws random (bootstrap) samples from the dataset. Then it constructs a decision tree for every sample and collects each tree's prediction. Finally, it votes over these predictions and makes the final prediction.
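A minimal sketch of that procedure (bootstrap sampling, one tree per sample, majority vote), assuming non-negative integer class labels like our damage grades; this is an illustration, not the implementation we used:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    """Train each tree on a random bootstrap sample of the data."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))            # sample rows with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")    # random feature subset per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Each tree votes; the most common predicted class wins."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # shape (n_trees, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```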

In our experiment, as a baseline we tried a decision tree and got a 39% accuracy, which is quite impressive considering we did not do any hyper-parameter tuning.

Then we tried to improve our results and started testing Random Forest. Even with a single tree, we got 42% accuracy, and with a larger number of trees the results kept getting better. Adding more trees slowed down training, but compared with kNN it was still very fast and accurate. With 500 trees we got 45% accuracy, and training took only 45 seconds.
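As a rough sketch of that setup (scikit-learn, with synthetic data standing in for our earthquake features, so the numbers will not match ours):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the building-damage features (5 damage grades)
X, y = make_classification(n_samples=20000, n_features=30, n_informative=12,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
start = time.time()
forest.fit(X_train, y_train)
print(f"500 trees trained in {time.time() - start:.1f} s")
print("test accuracy:", forest.score(X_test, y_test))
```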

Data on Random Forest

Lastly, we examined the confusion matrix of the Random Forest and realized that our algorithm gets better results on damage grades 1 and 5. We think this is because our dataset is imbalanced, and we will try to address this problem with methods such as re-sampling, changing the performance metric, trying penalized models, etc.
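As one hedged example of those ideas (assuming scikit-learn; we have not evaluated these yet), class weights can make misclassifying rare damage grades more costly, and naive re-sampling can duplicate rows of an under-represented grade:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

# Penalized model: class weights inversely proportional to class frequency,
# so rare damage grades contribute as much to the splits as common ones.
weighted_forest = RandomForestClassifier(
    n_estimators=500, class_weight="balanced", n_jobs=-1, random_state=0
)

def oversample_grade(X, y, target_grade, n_extra, seed=0):
    """Naive re-sampling: duplicate random rows of one under-represented damage grade."""
    mask = (y == target_grade)
    X_extra, y_extra = resample(X[mask], y[mask], n_samples=n_extra, random_state=seed)
    return np.vstack([X, X_extra]), np.concatenate([y, y_extra])
```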

Confusion Matrix of Random Forest

See you next week!
