Understanding the Random Forest algorithm.

Anirudh Palaparthi
5 min read · Jan 28, 2020


It’s not data science anymore; it’s data sense. Making sense out of data is what data science is about.

Machine learning comprises supervised and unsupervised learning, and supervised learning in turn comprises classification and regression. We will focus on classification first, and then move on to ensemble methods.

Supervised learning is where you have input variables (x) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output, Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (x) you can predict the output variable (Y) for that data.

Unsupervised learning is where you have input variables (x) and no output variable (Y). The goal is to cluster the input data so that when you have new input data (x) you can tell which cluster the new x belongs to.

Supervised learning models can be further grouped into classification and regression. Both tasks aim to build a succinct model that can predict the value of the dependent attribute from the other attributes. The difference between the two is that the dependent attribute is numerical for regression and categorical for classification.

A simple breakdown of the machine learning workflow.

A classification problem draws conclusions from observed values. When the target column is categorical, a classification model tries to predict the value of one or more outcomes. In short, classification either predicts categorical class labels or classifies data based on the training set and the values of the classifying attributes, and uses what it learns to classify new data.

Logistic regression, decision trees, random forests and Naive Bayes are some examples of classification models.

Don’t get intimidated by all the “technical” words. Getting to know the literal meaning of the word gets you at least a 20% understanding of the concept.

To dive deeper into random forests and to learn what extrapolation is, please go through this article:

https://neptune.ai/blog/random-forest-regression-when-does-it-fail-and-why

Now, let’s build a simple classification model on the Titanic dataset, and then move on to the Random Forest algorithm and build a model with a RandomForestClassifier on the same data.
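
Here is a minimal sketch of such a decision tree model. It assumes the Titanic data is available locally as a CSV file called titanic.csv with the usual Survived, Pclass, Sex, Age and Fare columns; the file name, feature choice and preprocessing are illustrative assumptions, not taken from the original post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Titanic data (the file name is an assumption for this sketch)
df = pd.read_csv("titanic.csv")

# Keep a few simple features and do minimal cleaning
df = df[["Survived", "Pclass", "Sex", "Age", "Fare"]].dropna()
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})  # encode the categorical column

X = df.drop(columns="Survived")
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a single decision tree and check its accuracy on held-out data
dt_model = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_model.fit(X_train, y_train)
print("Decision tree accuracy:", dt_model.score(X_test, y_test))
```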

Decision Tree: The foundation.

To understand how the RandomForest model works, we first need to learn how a DecisionTree model works. A DecisionTree model is a classification model (refer to the piece of code above for a DT model) that helps in making business decisions. While performing analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name suggests, it uses a tree-like model of decisions. Look at the following image to see how a decision tree is built on the Titanic dataset.

Source: https://tfbarker.wordpress.com/tag/random-forests/

A decision tree has its root at the top and its branches at the bottom; it is upside down compared to a usual tree. The image above depicts a decision tree with the Sex column at the root. The tree first branches on whether the passenger is male or female, then on the pclass the passenger belongs to, and the leaves show whether the passengers in each group survived or not.

Let’s look at the basic terminology used with Decision trees:

  1. Root Node: It represents the entire population or sample, and it further gets divided into two or more homogeneous sets.
  2. Splitting: The process of dividing a node into two or more sub-nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  4. Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.

There are two main ensemble techniques: bagging and boosting. RandomForest is based on the bagging technique. So, to understand what a random forest is, let’s first look at what bagging is.

Bagging is an abbreviation for “bootstrap aggregating”. It draws several subsamples from the initial dataset and trains a predictive model on each of them. The final model is obtained by averaging the “bootstrapped” models and usually yields better results.

The main advantage of this technique is that it has regularization built in; all you need to do is choose good parameters for the base algorithm. Averaging the models eliminates (or at least reduces) the instability of individual models that can arise from biased samples of the data.
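
As a rough illustration of bagging in code (not from the original post), scikit-learn's BaggingClassifier can wrap a decision tree and train one copy per bootstrap sample; this sketch reuses the X_train/X_test split from the decision tree snippet above.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bag 50 decision trees, each trained on a bootstrap sample of the training rows
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,   # sample rows with replacement
    random_state=42,
)
bagged_trees.fit(X_train, y_train)
print("Bagged trees accuracy:", bagged_trees.score(X_test, y_test))
```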

Now, imagine a single tree and a forest (a group of trees). The single tree is what we call a DecisionTree and the forest is what we call a RandomForest (of course, it’s in the name). A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy (the same idea as bagging) and to control over-fitting. By default, the sub-sample size is the same as the original input sample size, but the samples are drawn with replacement.

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, …, xn with responses Y = y1, …, yn, bagging repeatedly (B times) selects a random sample with replacement from the training set and fits a tree to each sample.
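
After training, the B fitted trees are combined to make a prediction for a new point x'. Stated explicitly (this is the standard bagging formula, added here for completeness rather than quoted from the original post), for regression the bagged prediction is the average

\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x')

while for classification the ensemble simply takes the majority vote of the B trees.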

The above procedure describes the original bagging algorithm for trees. Random forests differ from this general process in only one way: they use a modified tree learning algorithm that selects, at each split in the learning process, a random subset of the features. This process is sometimes called “feature bagging”. The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors of the response variable (the target output), these features will be selected in many of the trees, causing the trees to become correlated.
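
To make both ingredients concrete, here is a small hand-rolled sketch (illustrative only, reusing the X_train/X_test split from the first snippet): each tree sees a bootstrap sample of the rows, and max_features="sqrt" tells the tree learner to consider only a random subset of the features at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n_trees = 50
forest = []

for _ in range(n_trees):
    # Bagging: draw a bootstrap sample of the training rows (with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    X_boot, y_boot = X_train.iloc[idx], y_train.iloc[idx]

    # Feature bagging: only a random subset of features is considered at each split
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_boot, y_boot)
    forest.append(tree)

# Majority vote: the labels are 0/1, so a mean vote >= 0.5 means most trees predicted 1
votes = np.array([tree.predict(X_test) for tree in forest])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Hand-rolled forest accuracy:", (y_pred == y_test.to_numpy()).mean())
```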

The code below shows how to use scikit-learn to import a random forest classifier and build a model with it.
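
A minimal sketch of that, again reusing the Titanic split from the first snippet; the hyper-parameter values here are illustrative assumptions, not taken from the original post.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# n_estimators = number of trees; max_features="sqrt" is the feature bagging at each split
rfc = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)
print("Random forest accuracy:", accuracy_score(y_test, y_pred))

# The fitted forest also reports how useful each feature was across all the trees
for name, importance in zip(X_train.columns, rfc.feature_importances_):
    print(f"{name}: {importance:.3f}")
```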

CONCLUSION:

The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction is more accurate than that of any individual tree.

The following is what we need to make our random forest model better and more accurate:

  1. Features with actual predictive power.
  2. Low (or no) correlation between the trees of the forest; the algorithm takes care of this part for us through bagging and feature bagging.
  3. Well-chosen features and hyper-parameters, since both strongly affect the model (see the tuning sketch below).
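
As a rough illustration of the third point (the parameter grid below is an assumption, not something given in the original post), scikit-learn's GridSearchCV can search over a few random forest hyper-parameters with cross-validation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A small, purely illustrative grid of hyper-parameters to search over
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```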

Thanks for reading. I hope you’ve learnt something while reading this, just as I learnt while writing it.

Upcoming posts by me on machine learning, deep learning and NLP:

  1. Information Gain.
  2. Text extraction from pdf and image files.
  3. Building your first neural network.
  4. Topic modelling.

Hope to see you in the above posts too. Have a nice time ahead.
