Classification and Regression by Random Forest

Introduction:

Random Forest is one of the most popular machine learning algorithms because of its simplicity and its flexibility: it can be used for classification as well as for regression problems. Today, we are going to learn how random forest works.

RANDOM FOREST

  • Random forest is an ensemble classifier (ensemble methods generate many classifiers and aggregate their results) that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
  • The term came from random decision forests, which were first proposed by Tin Kam Ho of Bell Labs in 1995.
  • The method combines Breiman’s “bagging” idea (randomly draw datasets with replacement from the training data, each sample the same size as the original training set) with the random selection of features.
  • It is very user friendly, as it has only two parameters: the number of variables tried at each split and the number of trees.

DECISION TREE

Decision trees are the individual learners that are combined. They are one of the most popular learning methods, commonly used for data exploration. The idea behind a tree is to search for a variable-value pair within the training set and split on it in such a way that the “best” two child nodes are generated. The goal is to create branches and leaves based on an optimal splitting criterion, a process called tree growing. Specifically, at every branch or node, a conditional statement classifies the data point based on a fixed threshold in a specific variable, thereby splitting the data.
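
As a toy illustration of this split search (not from the original article), the sketch below scores every candidate variable/threshold pair on a tiny invented dataset using the children’s misclassification rate; real trees typically use an impurity measure such as the Gini index, discussed later.

```python
import numpy as np

def best_split(X, y):
    """Return the (variable index, threshold, error) of the best single split."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):                     # candidate variable
        for thr in np.unique(X[:, j]):              # candidate threshold
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            if len(left) == 0 or len(right) == 0:
                continue
            # misclassified points if each child predicts its majority class
            err = sum(len(c) - np.bincount(c).max() for c in (left, right)) / len(y)
            if err < best[2]:
                best = (j, thr, err)
    return best

X = np.array([[2.0, 7.0], [3.0, 6.0], [8.0, 1.0], [9.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # -> (0, 3.0, 0.0): split on variable 0 at threshold 3.0
```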

ALGORITHM:

Each tree is constructed using the following algorithm:

* Let the number of training cases be N, and the number of variables in the classifier be M.
* We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M.
* Choose a training set for this tree by sampling N times with replacement from all N available training cases (i.e. take a bootstrap sample).
* For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
* Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
For prediction, a new sample is pushed down each tree and assigned the label of the training samples in the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the majority vote over all trees is reported as the random forest prediction (for regression, the trees’ outputs are averaged).
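
A hedged sketch of this procedure in Python, assuming scikit-learn is available: DecisionTreeClassifier(max_features=m) stands in for the per-node random choice of m variables, and the names n_trees, m, and seed are illustrative rather than prescribed by the article.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, m=None, seed=0):
    rng = np.random.default_rng(seed)
    n, M = X.shape
    m = m if m is not None else max(1, int(np.sqrt(M)))  # m should be much less than M
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier(max_features=m)    # m random variables per node, fully grown
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.array([t.predict(X) for t in trees])      # push each sample down every tree
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])  # majority vote
```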

GINI INDEX

Random Forest uses the gini index taken from the CART learning system to construct decision trees. The gini index of node impurity is the measure most commonly chosen for classification-type problems. 
If a dataset T contains examples from n classes, its Gini index is defined as:

Gini(T) = 1 - (p1² + p2² + … + pn²)

where pj is the relative frequency of class j in T.

If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively (N = N1 + N2), the Gini index of the split data is defined as:

Gini_split(T) = (N1/N)·Gini(T1) + (N2/N)·Gini(T2)

The split that gives the smallest Gini_split(T) is chosen to split the node.
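
For concreteness, here is a small numeric illustration of the two formulas above; the toy labels are invented for the example.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # relative frequency of each class
    return 1.0 - np.sum(p ** 2)

def gini_split(left, right):
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return (n1 / n) * gini(left) + (n2 / n) * gini(right)

labels = np.array([0, 0, 0, 1, 1, 1])
print(gini(labels))                        # 0.5 -> a maximally impure two-class node
print(gini_split(labels[:3], labels[3:]))  # 0.0 -> a perfect split
```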

Out-Of-Bag:

In bootstrap sampling, about one third of the training cases are left out of each bootstrap sample and are therefore not used to grow the corresponding tree; these are called the out-of-bag (OOB) samples, and they can be used to evaluate the model’s performance. The resulting OOB error is very similar to the estimate given by leave-one-out cross-validation.
For any given tree, roughly 33%–36% of the data is out of bag. We check the OOB error rate, adjust the forest (for example, by growing more trees), and repeat until the error rate no longer decreases.
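
A quick way to see the OOB estimate in practice is scikit-learn’s oob_score option; the dataset below is synthetic and the parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
# each case is scored only by the trees whose bootstrap sample left it out
print("OOB error rate:", 1 - rf.oob_score_)
```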

Extra information for Random Forests

Random forest exposes only two parameters for its internal structure: the number of variables tried at each split and the number of trees.
Variable importance
Random forest estimates the importance of a variable by looking at how much the OOB error increases when that variable’s values are permuted while all other variables are left unchanged. Important features also tend to appear near the top of each tree, while unimportant variables are located near the bottom.
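
As a rough sketch of this idea with scikit-learn (not the article’s own code): feature_importances_ reports the impurity-based importance, and permutation_importance on held-out data mimics the permute-one-variable measure described above; the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print(rf.feature_importances_)             # mean decrease in Gini impurity per variable
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)             # accuracy drop when each variable is shuffled
```
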
Proximity measure
The term “proximity” means the “closeness” or “nearness” between pairs of cases. The proximity of two cases is the proportion of trees in which they end up in the same terminal node. Proximities are calculated for each pair of cases/observations/sample points: if two cases occupy the same terminal node of a tree, their proximity is increased by one. After the run over all trees, the proximities are normalized by dividing by the number of trees. Intuitively, similar cases should land in the same terminal node more often than dissimilar ones.
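
A minimal sketch of this calculation, assuming scikit-learn (whose apply() method returns the terminal-node index of each case in each tree); the data here are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

leaves = rf.apply(X)                    # shape (n_samples, n_trees): terminal node per case per tree
prox = np.zeros((len(X), len(X)))
for t in range(leaves.shape[1]):
    prox += leaves[:, t][:, None] == leaves[:, t][None, :]   # +1 when two cases share a terminal node
prox /= leaves.shape[1]                 # normalize by the number of trees
```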

RANDOM FOREST AS REGRESSION

Random forest reduces the instability of single trees through bagging: in regression the trees’ predictions are averaged, whereas in classification the votes are counted. The random forest model is a type of additive model that makes predictions by combining decisions from a sequence of base models. More formally, we can write this class of models as:

g(x)=f0(x)+f1(x)+f2(x)+…

where the final model g is the sum of simple base models fi. Here, each base model is a simple decision tree.
Note that there are some differences between regression and classification.
The default number of variables tried at each split is one third of the total for regression, compared with the square root of the total for classification; the default node size is 5 compared with 1 for classification; and there is only one measure of variable importance instead of four.
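
A brief regression sketch with scikit-learn, reflecting those defaults (max_features set to a third of the variables and a larger leaf size); the dataset and parameter values are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=12, noise=0.3, random_state=0)
rf = RandomForestRegressor(n_estimators=200, max_features=1/3, min_samples_leaf=5,
                           random_state=0).fit(X, y)
print(rf.predict(X[:3]))   # each prediction is the average of the individual trees' outputs
```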

RANDOM FOREST AS UNSUPERVISED LEARNING

A random forest predictor is an ensemble of individual tree predictors. As part of their construction, random forest predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabeled data: the idea is to construct a random forest predictor that distinguishes the “observed” data from suitably generated synthetic data. The observed data are the original unlabeled data, and the synthetic data are drawn from a reference distribution. Suppose the observed data form class “A”; a temporary class “B” is produced by independently bootstrapping each variable of the original data, and a random forest is then trained to separate the two classes.
The basic purpose is to get similar data points into the same terminal node of a tree. Using the proximity option of random forest (for example, proximity=TRUE in the R randomForest package), the resulting proximity matrix can be taken as a similarity matrix. Since we are using multiple decision trees, the bias remains the same as that of a single decision tree, while the variance decreases, which reduces the chance of overfitting.
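
The sketch below illustrates this unsupervised use under the stated assumptions: class “A” is the observed (unlabeled) data, class “B” is built by bootstrapping each variable independently, and the forest’s proximities on the original cases serve as a similarity matrix. The clustered toy data are invented for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)    # "observed" data (class A)
rng = np.random.default_rng(0)
X_synth = np.column_stack([rng.choice(col, size=len(col), replace=True) for col in X.T])  # class B

X_all = np.vstack([X, X_synth])
y_all = np.r_[np.zeros(len(X)), np.ones(len(X_synth))]
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_all, y_all)

leaves = rf.apply(X)                                            # terminal nodes of the original cases
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)  # proximity matrix = similarity matrix
```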

Reference:

http://cogns.northwestern.edu/cbmg/LiawAndWiener2002.pdf

Contact:
Muhammad.saadzaheer991@gmail.com