Machine Learning — Beginner’s Guide to Random Forest Classifiers (The Maths)

Tom Clarke · Published in CodeX · 3 min read · Sep 3, 2021

If you’re interested in different machine learning techniques, then you’ve probably come across the random forest algorithm before. In this post I’m going to go over the mathematical principles behind it, and in another post I will go through how to implement it in Python.

Random forests are known to perform incredibly well considering their simplicity. In many coding competitions they hold up very well even against neural networks, which is very impressive! They are definitely a must-know on your journey through machine learning. Random forests are built from ‘decision trees’, so that’s where we will begin.

Decision trees are the basis for random forest classifiers. A decision tree works by taking the n-dimensional predictor space and recursively splitting it into partitions that maximise the ‘purity’ within each partition. Purity describes how homogeneous the class labels within a partition are, and it is measured using the ‘Gini Index’. For the m-th partition, the Gini Index for region R_m is calculated using the formula:
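
G_m = Σ_{k=1}^{K} p_{mk} (1 − p_{mk})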

where K is the total number of prediction classes and p_{mk} is the proportion of points in region R_m that belong to class k. The Gini Index measures purity because it is the probability of misclassifying a point if it were labelled at random according to the class proportions within the chosen split region. Finding the point along one of the predictor axes at which to make a split means finding the split of maximum purity, and this is done using the concept of quality, which has the formula:
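
Q = (N_1 G_1 + N_2 G_2) / (N_1 + N_2)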

where N_1, N_2, G_1, G_2 are the number of points and the corresponding Gini Index for regions 1 and 2 produced by a candidate split; the optimal split point s* is then found using:
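
s* = argmin_s Q(s)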

The quality is evaluated for every candidate split on each predictor, and the split with the minimum quality across all the predictors is the one that is made. This is repeated recursively until a condition to stop splitting is met. Each split point is known as a node, the regions that are no longer split are the leaves, and the number of successive splits is known as the depth; a maximum number of nodes or a maximum depth are both common conditions for stopping the splitting.
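
To make that concrete, here is a rough sketch in Python of how a single split could be found. The function names and the use of NumPy are just for illustration (real implementations are far more optimised), but the logic follows the formulas above:

```python
import numpy as np

def gini(labels, n_classes):
    """Gini Index of one region: sum over classes of p_mk * (1 - p_mk)."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=n_classes) / len(labels)  # class proportions p_mk
    return float(np.sum(p * (1 - p)))

def best_split(X, y, n_classes):
    """Return (predictor index, threshold, quality) with the minimum quality Q."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):                      # try every predictor axis
        for s in np.unique(X[:, j]):                 # try every observed value as a threshold
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # quality: Gini of the two regions, weighted by their sizes
            q = (len(left) * gini(left, n_classes) +
                 len(right) * gini(right, n_classes)) / len(y)
            if q < best[2]:
                best = (j, s, q)
    return best

# tiny example: two predictors, two classes (labels must be integers 0..K-1)
X = np.array([[1.0, 5.0], [2.0, 4.0], [8.0, 1.0], [9.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y, n_classes=2))  # splits on predictor 0 at 2.0 with quality 0.0
```

A full decision tree just applies this search recursively to each resulting region until a stopping condition is met.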

Okay, so those are the mathematical principles of decision trees; there’s really not a lot to it, but it’s important to get to grips with. So with this in mind, let’s use it to understand how random forests work, and why they are superior!

The random forest classifier is an ensemble, supervised learning algorithm. Ensemble methods are often preferable to individual learning methods because they combine many individual learners; the purpose of this is to reduce the variance that comes from relying on a single technique, and it produces a less overfitted model overall. The random forest classifier can achieve better results than a single decision tree because it grows many decision trees, hence being classed as an ensemble method, and takes the majority prediction across those trees as the result. To keep the trees from all being identical, each tree is grown on a bootstrap sample of the training data and only considers a random subset of the predictors at each split, which is where the ‘random’ in the name comes from.

So there’s not really any extra maths needed to go from an individual tree to an ensemble of trees. The only concept you need to be clear on is that the forest takes the majority class vote from multiple decision trees.
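
As a rough sketch of that idea (using scikit-learn’s DecisionTreeClassifier for the individual trees; the forest-building loop, names and parameters here are just for illustration, not how the library builds its own forests):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    """Grow each tree on a bootstrap sample, with a random subset of predictors per split."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))          # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier(max_features="sqrt")  # random predictor subset at each split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Majority class vote across all the trees (class labels assumed to be integers 0..K-1)."""
    votes = np.array([t.predict(X) for t in trees])         # shape: (n_trees, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

# usage: trees = fit_forest(X_train, y_train); y_pred = predict_forest(trees, X_test)
```

In practice you would simply reach for sklearn.ensemble.RandomForestClassifier, which does all of this (and more) for you; the sketch is only there to show that the ensemble step really is just a majority vote over many trees.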

And there you have it! That’s the maths that makes random forest classifiers work. It’s definitely worth going through this before putting them into practice; you’ll gain a much better understanding of the results you get this way.

I hope this helps you learn about the background of this impressive machine learning technique!

I enjoy coming up with fun and interesting Python concepts, as well as useful projects. You can also view my articles on Blogger: https://azcoding.blogspot.com/