Machine Learning: Decision Trees and Random Forest Classifiers

Karan Kashyap
Published in Analytics Vidhya · Oct 21, 2019 · 13 min read

Classification (the ability to identify the group to which an object, animal, etc. belongs, given some information about it) is one of the most important and widely used tasks that we carry out using machine learning. For machine learning projects ranging from identifying diseases in scans of different body parts to flagging an email as spam or non-spam, classification is an essential concept that one must understand well. Many of the advancements being made in the field of machine learning are taking place in the area of classification, so it is a fundamental technique that everyone studying and using machine learning should be able to understand and perform.

There are many algorithms that one can learn and use in order to perform classification. In this article, I will focus on a type of machine learning model that I feel is highly underestimated and not used widely enough given its tremendous potential. I am referring to the machine learning constructs of Decision Trees and Random Forest Classifiers.

Some key concepts in Machine Learning

Decision Trees:

Let’s start by understanding what decision trees are because they are the fundamental units of a random forest classifier.

At a high level, decision trees can be viewed as a machine learning construct used to perform either classification or regression on some data in a hierarchical structure. In this article, I will only discuss the use of decision trees for classification.

Decision trees use machine learning to identify key differentiating factors between the different classes of our data. By doing so, decision trees can take some input data and predict a class by running the data through a set of differentiating questions that it forms using machine learning.

The questions that are formulated are yes-no questions, and after each question there are two paths: the “yes” path and the “no” path (much like the ‘if…else’ construct that we commonly use while writing code). Each of these paths either leads to the next question or to a final output.

There is some key terminology that one must be acquainted with before learning about decision trees and random forest classifiers:

  • Classes: When we are performing classification, the different groups that the data can belong to are known as the classes.

E.g.: If we take a dataset consisting of images of handwritten letters and we want to classify those images by identifying which letter is written in each image, the different classes will be:

A, B, C, …, X, Y, Z

  • Nodes: The various questions formed by the decision tree are known as the nodes. These can be viewed as points in the decision tree where we form a split (or branches). Each of these is a ‘yes’ or ‘no’ question and once this is answered, we move one step closer to identifying the class that the data belongs to.
  • Leaves: Due to the presence of nodes, many alternate pathways get created in any decision tree with even a few layers of nodes. The end point of each of these pathways is known as a “leaf”. These leaves represent the final value or class that will be predicted for the given input data.

The image below depicts a highly simplified view of what a portion of a decision tree used to classify an animal as either a bird, a dog or a fish could look like. In this diagram, the white boxes are the nodes and the green boxes are the leaves.

The machine learning algorithm analyses the three different classes and formulates the questions necessary to differentiate between them, in this case:

“Does it swim?”

and

“Does it have four legs?”

Once these yes-or-no questions are answered, we know for sure which class our data will belong to, i.e. which animal it is.
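Conceptually, a trained decision tree behaves like a chain of nested if…else statements. Below is a minimal Python sketch of the animal example above; the function name and boolean inputs are my own illustrative choices, not part of any library.

```python
def classify_animal(swims: bool, has_four_legs: bool) -> str:
    """Mimics the two questions the decision tree above would ask."""
    if swims:                # node 1: "Does it swim?"
        return "fish"        # leaf
    if has_four_legs:        # node 2: "Does it have four legs?"
        return "dog"         # leaf
    return "bird"            # leaf


print(classify_animal(swims=True, has_four_legs=False))   # fish
print(classify_animal(swims=False, has_four_legs=True))   # dog
print(classify_animal(swims=False, has_four_legs=False))  # bird
```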

Decision trees, then, are a fairly simple concept to grasp as they are extremely intuitive, but let us use another example to consolidate this concept.

In this example, our dataset is (the ‘A’s are blue, while the ‘B’s and ‘1’s are purple):

A A A A A A B B B 1 1

The initial question asked is “Is it blue?” because the model understands that colour is the single biggest differentiating factor in our data. Based on the separation formed, we have the blue ‘A’s at one end and the set of purple ‘B’s and ‘1’s at the other. At the purple end, the model asks another question: “Is it a letter?”. Once this has been answered, we split the purple data into ‘B’s and ‘1’s. Now we have obtained all three leaves of our model, each representing data belonging to a separate class. We have successfully classified our data!

Needless to say, in reality, our data is never this simple, but the general concept used to classify the data remains the same.
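To connect the toy example to real tooling, here is a hedged sketch of how the same data could be handed to scikit-learn’s DecisionTreeClassifier. The numeric encoding of “is it blue?” and “is it a letter?” is an assumption I am making purely for illustration; note also that scikit-learn builds CART-style trees rather than ID3, although criterion="entropy" uses the same entropy-based splitting idea described below.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row: [is_blue, is_letter]; labels are the classes 'A', 'B' and '1'.
# (This encoding is my own assumption for the toy example above.)
X = [
    [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1],  # six blue 'A's
    [0, 1], [0, 1], [0, 1],                          # three purple 'B's
    [0, 0], [0, 0],                                  # two purple '1's
]
y = ["A"] * 6 + ["B"] * 3 + ["1"] * 2

tree = DecisionTreeClassifier(criterion="entropy")   # entropy-based splits
tree.fit(X, y)

print(tree.predict([[1, 1], [0, 1], [0, 0]]))        # expected: ['A' 'B' '1']
```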

The work performed by a decision tree algorithm in generating the question at each node can be understood as follows:

What question can I ask so that, on splitting my data based on the answers, the data points within each of the two groups formed are as similar to each other as possible, while the two groups themselves are as different from each other as possible?

The better our model gets at framing and answering this question, the better our predictions are (leading to a higher accuracy).

How do we actually make decision trees?

One of the most widely known algorithms for constructing decision trees is ID3. The ID3 algorithm makes use of two concepts, ‘entropy’ and ‘information gain’, to design the decision tree. I explained earlier that decision trees try to split the dataset in such a manner that the data present within each group is as similar as possible, while the data present in one group is as different as possible from the data present in the other groups. To achieve this, we use the concepts of entropy and information gain.

Entropy: Usually, the term ‘entropy’ refers to a measure of disorder in a given system. With decision trees, entropy can be understood as a measure of the homogeneity (purity) of a group of data. For a two-class problem with a base-2 logarithm, entropy lies on a scale of 0 to 1 (with more classes, it can exceed 1). An entropy of 0 indicates that the data has minimal disorder (it is completely homogeneous/pure), whereas an entropy of 1 (or, more generally, a high value of entropy) indicates that there is a lot of disorder in the data.

The formula for calculating Entropy is:

E(X) = − Σᵢ pᵢ · log₂(pᵢ)

where the sum runs over i = 1, …, n, ‘n’ is the number of classes, and pᵢ is the probability of a data point in our group belonging to the class ‘i’.

Let us assume that we have a dataset containing data points belonging to one of two classes: 0 and 1. We have 100 data points, out of which 40 belong to class 0 and 60 belong to class 1. The entropy of this dataset will then be:

E(X) = − 0.4 · log₂(0.4) − 0.6 · log₂(0.6) ≈ 0.529 + 0.442 ≈ 0.971

This indicates that such a group of data has a very high entropy (close to the maximum value of 1 for a two-class problem).
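As a quick sanity check, here is a small Python helper (my own sketch, not something from a library) that computes the entropy of a group directly from its class counts and reproduces the value above:

```python
import math

def entropy(class_counts):
    """Entropy of a group, given the number of data points in each class."""
    total = sum(class_counts)
    probs = [count / total for count in class_counts]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 40 points in class 0 and 60 points in class 1, as in the example above.
print(round(entropy([40, 60]), 3))  # ~0.971, very close to the maximum of 1
```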

Each time we split our data to form a new group of data, the entropy of that group of data can be calculated. The lower the entropy, the closer we are getting to forming groups with unique data points, i.e. as the entropy reduces, we are getting closer to being able to successfully classify our data.

After understanding the concept of Entropy, it becomes clear that a node with an entropy value of 0 is what is considered to be a leaf (it has data points belonging to only one class).

Information Gain: Now that we have a way to measure the disorder of a dataset in the form of entropy, we also need a way to check that the questions we ask actually reduce that disorder when they split the data. Information gain is what we use for this purpose; it can be understood as a measure of the decrease in the amount of disorder produced by a split.

In the decision tree, at any node, our model tentatively splits the data based on each of the possible attributes/features that can be used to split the data. Each time we split based on an attribute, the entropies of the two created subsets of the data are calculated and combined in a weighted sum (weighted by the fraction of data points falling into each subset) to obtain the total entropy of the split. This total entropy of the split is then subtracted from the entropy of the data present at that node before the split to obtain the information gain for that split.

Once we obtain the information gain after splitting the data according to all possible attributes, we find the attribute that leads to the highest information gain and use that to form the question and split our data.

The information gain for a split is calculated as:

IG(X, A) = E(X) − E(X, A)

Here:

  • X: The target variable (the data points present at that node)
  • A: The attribute on the basis of which this split has been formed
  • E(X): The entropy of the data at the node before the split
  • E(X, A): The weighted sum of the entropies of the two branches formed after the split based on the attribute ‘A’.

Thus, our goal in the ID3 algorithm can be understood as finding the questions to be used to split the data so that we can minimize entropy and maximize information gain. The process described above is repeated until we manage to separate all different kinds of data from each other and only leaves containing one type of data each are left.
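To make the split-selection step concrete, here is a hedged sketch of how the information gain of a single yes/no question could be computed in Python. The helper names are hypothetical; a full ID3 implementation would evaluate this for every candidate attribute at every node and pick the attribute with the highest gain.

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, answers):
    """IG(X, A) = E(X) - E(X, A) for a yes/no attribute.

    labels  -- class label of each data point at this node
    answers -- True/False answer of each data point to the candidate question
    """
    yes = [lab for lab, ans in zip(labels, answers) if ans]
    no = [lab for lab, ans in zip(labels, answers) if not ans]
    weighted = (len(yes) / len(labels)) * entropy(yes) + \
               (len(no) / len(labels)) * entropy(no)
    return entropy(labels) - weighted

# Toy data from earlier: six blue 'A's, three purple 'B's, two purple '1's.
labels = ["A"] * 6 + ["B"] * 3 + ["1"] * 2
is_blue = [True] * 6 + [False] * 5
print(round(information_gain(labels, is_blue), 3))  # splitting on colour removes most of the disorder
```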

Random Forests- An Ensemble of Decision Trees:

Despite all the merits of the decision tree model, there is a limit to how accurately one individual decision tree can perform classification. Thus, rather than using just one decision tree, we very often create a set of decision trees that perform classification together. Such a model is known as a ‘Random Forest Classifier’, and it typically yields a significantly higher accuracy than an individual decision tree.

A simple way to comprehend this model is to interpret its name in a very literal sense. A forest can be viewed as a collection of trees, which suggests that a Random Forest model will consist of various decision trees. The reason these are called “random” is that each decision tree in the forest is trained using a random subset of the training data. This technique of training each tree in the forest using a different, random sample of the data (drawn with replacement) is known as bagging (also known as bootstrap aggregation). Later, when we are performing classification for a data point, each tree constituting the forest gives a predicted class (label). The final prediction is then decided by a majority vote: the class predicted by the largest number of trees in the forest wins.

A Simplified View of a Random Forest Classifier

One of the main reasons Random Forest Classifiers are so effective is that each tree in the forest is, to a great extent, unrelated to the other trees, a condition achieved through the random sampling of data for the training of each individual tree (bagging). As a result, some of the trees might give wrong predictions for the final output class, but since we take the label predicted by the greatest number of trees, we still get a fairly high accuracy, as most of the trees are likely to give the right prediction. The use of a number of unrelated decision trees helps reduce our error rate because the mistakes made by one tree are not usually made by the other trees as well; since we are combining predictions from a number of sources as opposed to just one, our accuracy increases by a significant amount.

This state of largely independent decision trees, attained due to the random sampling of the data, is what helps ensure that, as a whole, the model continues to move in the right direction and maintains a fairly high accuracy.
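Below is a hedged sketch of this bagging and majority-vote idea, assembled by hand from individual scikit-learn decision trees. The iris dataset bundled with scikit-learn is used purely as a stand-in; in practice one would normally reach straight for RandomForestClassifier, shown a little further below.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Bagging: train each tree on a random sample (with replacement) of the data.
forest = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample indices
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    forest.append(tree)

# Prediction: every tree votes, and the majority label wins.
def predict(sample):
    votes = [tree.predict([sample])[0] for tree in forest]
    return Counter(votes).most_common(1)[0][0]

print(predict(X[0]), y[0])  # the ensemble's vote vs. the true label
```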

Another advantage offered by random forest classifiers is that they help reduce problems caused by outliers in our dataset. An outlier is any data point that, when plotted on a graph, lies far away from the rest of the points that are similar to it. With many machine learning models, such outliers have a fairly large impact and reduce the overall accuracy of the model. With a random forest classifier, the problem caused by such outliers is significantly reduced: since we take a random sample of the data for each tree, in any dataset of a fair size, some trees will be trained on the outliers and might consequently produce wrong predictions, but most of the trees will not have been trained on the outliers, and thus the overall prediction we receive is still likely to be correct.

Random forest classifiers act as an ensemble. Since all of the decision trees that constitute the ‘random forest’ are largely independent of each other due to the random sampling of data points (bagging), the predictions we make on the basis of the majority vote obtained from the trees tend to have a higher accuracy than a single prediction obtained from any one tree. This is also one of the main reasons that random forest classifiers often achieve a higher accuracy than many other classification models operating on the same dataset.

Another technique we use with random forest classifiers is known as ‘randomizing attributes/features’. As explained earlier, decision trees have a set of attributes/features on the basis of which they can split the data at a node. In a random forest, rather than letting each tree consider all of these attributes at every split, only a random subset of the attributes is considered (typically a fresh random subset at each split). This leads to an even greater degree of unrelatedness between the individual decision trees, which increases the overall accuracy: because different trees rely on different features, the errors made by one tree are unlikely to be repeated by most of the others.
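In scikit-learn, both ideas (bagging and attribute randomisation) are built into RandomForestClassifier: the bootstrap parameter controls the random sampling of data points, and max_features controls how many randomly chosen attributes are considered at each split. A minimal sketch, again using the bundled iris data purely as a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the forest
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # bagging: each tree sees a bootstrap sample
    random_state=42,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy of the majority-vote ensemble
```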

While I have discussed some of the key advantages of such models, they do have two main drawbacks:

  • Generating lots of trees can be a time-consuming process, since we have to calculate entropies and information gain many times.
  • When the data we have is very noisy, these models often have a tendency to overfit. (Overfitting is when a model is trained in such a manner that it becomes very good at classifying only the data present in its training set but is poor at classifying other, unseen data of the same type, i.e. the model only learns the specific examples that have been used to train it.)

Conclusion:

In conclusion, it is safe to say that Random Forest Classifiers are an extremely powerful and effective classification model. They are underestimated by many, and I hope that after reading this article you understand how random forest classifiers function and agree that they are a reliable model that can be used in a number of scenarios. Some of the key advantages of these models that we have discussed are:

  • They are based on decision trees, which are an extremely easy concept to understand given their highly intuitive nature.
  • These models do not usually need large amounts of training data to reach a stage where they can predict classes with a high accuracy.
  • Outliers (and adversarial examples) have a reduced effect on such models as compared to other machine learning models.

I intend to post more articles on various machine learning concepts in a way that is easy for everyone, even beginners, to understand. Watch out for my next article on the practical usage of random forest classifiers where I will explain the code I have written to perform classification on a dataset using TensorFlow.
