Introduction to 3 useful tree-based models

Gaurang Mehra
4 min read · Mar 3, 2023


The objective of this article is to give a beginner-friendly introduction to tree-based models, starting with Decision Trees and moving on to more complex ensemble algorithms like Random Forests.

All the models we will discuss today are supervised learning models. Supervised learning algorithms are trained on labeled datasets. The labels can be categorical (e.g. cancer/benign) or continuous values (e.g. price). Once the model is trained, it is fed unlabeled data and predicts the labels. Now let's look at Decision Trees.

Decision Trees

A Decision Tree is a supervised learning algorithm that uses a series of if/then questions to arrive at a predicted label. To understand this concept, let's look at an example Decision Tree.

Fig 1.1 Decision Tree example (depth=2)

The figure above shows an example Decision Tree model that predicts whether or not a patient has diabetes given glucose level, BMI, age, etc. For purposes of illustration this is a very simple tree with a depth of 2. We will traverse the tree in the direction shown and, in the process, illustrate a few important concepts:

  • The first node in the tree is the root node. We can think of it as the first in a series of if/then questions that the model uses to arrive at a final prediction. This first split is generally on the most important feature, in this case glucose level. If the glucose level is less than 127.5 the model proceeds in one direction; otherwise it proceeds in the other direction. This is known as a binary split.
  • The nodes after the root node but before the terminal nodes are called decision nodes. We can think of them as a series of intervening questions that help the model progressively refine its prediction. In our example, after checking the glucose level the model checks whether age is below 28.5; if so, it classifies the instance (the record) as no diabetes.
  • The final set of nodes, which contain the decision or predicted label, are called the leaf nodes.
  • We have to specify the depth of the tree before fitting the model to the training data. Trees that are too deep tend to overfit the data, essentially learning the noise in the training data and performing poorly on unseen data. One way to avoid this is to run the model repeatedly with different max_depth parameters and then pick the model instance that gives the best accuracy on unseen test data (a minimal sketch follows this list). I have another article here with details on how to implement this in Python.
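Below is a minimal sketch of this depth-tuning idea using scikit-learn. It assumes a feature matrix X (glucose, BMI, age, etc.) and binary labels y are already loaded; the depth range and split settings are illustrative, not prescriptive.

```python
# Minimal sketch: fit decision trees of increasing depth and keep the one that
# scores best on held-out test data. Assumes X (features) and y (1 = diabetes,
# 0 = no diabetes) are already loaded.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

best_depth, best_acc = None, 0.0
for depth in range(1, 8):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    acc = accuracy_score(y_test, tree.predict(X_test))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

print(f"Best max_depth: {best_depth}, test accuracy: {best_acc:.3f}")
```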

Ensemble Learning Methods

Now that we understand Decision Trees, we can move on to ensemble learning methods. Typically, ensemble learning involves either fitting the same type of model to multiple samples drawn from the training data, or fitting different models to the same training dataset. Let's first look at fitting different models to the same dataset.

Voting Classifier

Fig 1.2 Voting Classifier

In a Voting Classifier the same training set is used to train multiple learners or models. Each model makes its own prediction, and the final prediction is decided by majority voting for classification and by averaging for regression tasks. Typically we would choose models that are not prone to the same kinds of errors. In that case the Voting Classifier's performance (accuracy, recall, etc.) is likely to be better than any individual model's prediction.
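As a rough illustration, here is what a hard-voting ensemble might look like in scikit-learn. The choice of base learners is arbitrary, and the X_train/y_train split is assumed to come from the earlier sketch.

```python
# Minimal sketch of a hard-voting ensemble: three different learners are
# trained on the same data and the majority class wins.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=4, random_state=42)),
    ],
    voting="hard",  # majority vote; voting="soft" would average probabilities
)
voter.fit(X_train, y_train)
print("Voting classifier accuracy:", accuracy_score(y_test, voter.predict(X_test)))
```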

Bagging Classifier/Model

Fig 1.3 Bagging Classifier

A Bagging Classifier takes a different approach. To understand it, we first need to understand the concept of bootstrap (bagging) samples.

A bootstrap sample is drawn from a dataset with replacement. If the dataset is of size n, then the bootstrap sample is also of size n. In the figure above, bootstrap samples 1 and 2 are each the same size as the training data but slightly different from it; since each sample is drawn randomly with replacement, each sample differs from the others.
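A quick sketch of drawing a bootstrap sample with NumPy makes the idea concrete; the tiny array here is just a stand-in for a real dataset.

```python
import numpy as np

# Draw a bootstrap sample: same size as the original data, sampled with
# replacement, so some rows appear more than once and others not at all.
rng = np.random.default_rng(42)
data = np.arange(10)  # stand-in for a dataset of size n = 10
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)  # some values repeat, others are missing
```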

A Bagging Classifier draws k bootstrap samples from the training dataset and fits a model to each of them. The final prediction is made by combining the predictions of the k models. Bootstrap samples are not subsets of the training data; they are exactly the same size, so each model is trained on a full-size but slightly different training dataset.

Bagging works best with high-variance models like Decision Trees. Each tree is fit to a slightly different version of the data (a bootstrap sample), and combining their predictions averages out the noise across samples, reducing variance. This also reduces overfitting to the training data and improves performance on unseen data versus a single model trained on the single training dataset.
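Here is a minimal bagging sketch in scikit-learn, again assuming the X_train/y_train split from the earlier sketch. By default, BaggingClassifier uses decision trees as base learners and draws bootstrap samples the same size as the training set.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# 200 models, each fit to its own bootstrap sample of the training data.
# The default base learner is a decision tree, and bootstrap sampling
# (with replacement, same size as the training set) is on by default.
bagger = BaggingClassifier(n_estimators=200, random_state=42)
bagger.fit(X_train, y_train)
print("Bagging accuracy:", accuracy_score(y_test, bagger.predict(X_test)))
```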

In a similar way to the Bagging Classifier, you can have a Bagging Regressor for regression tasks.

Random Forests

We can extend the Bagging Classifier not only by using k randomly drawn bootstrap samples from the training data but also by randomly sampling a subset of features at each node. This additional level of randomization further reduces variance and overfitting. You can see an implementation of this in Python here.
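For completeness, a minimal random forest sketch with the same assumed train/test split; the max_features parameter controls how many features are randomly considered at each split.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Bagging plus feature randomization: each tree sees its own bootstrap sample,
# and only a random subset of features is considered at each split.
forest = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped trees
    max_features="sqrt",   # features randomly considered at each split
    random_state=42,
)
forest.fit(X_train, y_train)
print("Random forest accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```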

These 3 models are very powerful and can be used for a variety of Machine Learning tasks. Unlike deep learning models, they do not require huge amounts of training data.

