Machine Learning — Beginners Guide to Random Forest Classifiers (The Code)

Tom Clarke · Published in CodeX · Sep 3, 2021

So if you haven’t already checked it out, I have posted about the mathematics behind this machine learning technique. If this is your first time coming across the algorithm, I recommend giving that a read before jumping into the code. Otherwise, we’re going to jump right in!

So as always we will import the packages that we need to get started:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

From here we will generate a quick dataset to train our models with. This is going to be a very simple dataset, not really a good reflection of real-world data; it is purely to demonstrate how to set up the models.

Generating our data is very easy:

x, y = make_classification(n_samples=1000, n_features=2,
                           n_informative=2, n_redundant=0)

So from this line of code, we have our input data stored in the ‘x’ variable, which will be a (1000, 2) matrix, and the corresponding class labels held in ‘y’, which is an array of 1000 items. From here we want to split this into training data and testing data.

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25)

This has divided up our data into 75% for training and 25% for testing the model.
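If you want to confirm the split, a quick (and entirely optional) check of the array shapes should show 750 training samples and 250 testing samples:

# Optional sanity check on the split sizes
print(x_train.shape, x_test.shape)   # expected: (750, 2) and (250, 2)
print(y_train.shape, y_test.shape)   # expected: (750,) and (250,)

Before we create the model, let’s display the data on a plot to see more clearly what we are working with.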

plt.scatter(x_train[:,0], x_train[:,1], c=y_train, cmap='bwr')
plt.show()

After running that script you should get an output that looks somewhat similar to this:

Okay, so the data already looks quite easy to separate, so our model should perform well! For a dataset like this a machine learning algorithm really isn’t necessary, but it makes for a clearer demonstration.

Before we implement the random forest classifier, let’s build a decision tree classifier, because it is easier to visualise.

tree = DecisionTreeClassifier(random_state=0,max_depth=5)
tree.fit(x_train,y_train)

That is all the code required to build and train a decision tree! But it doesn’t really help us see how it has performed, or even what it has actually done. So to help with that, let’s create a mesh-grid over the data and use our model to predict every point; the result should be the decision boundaries drawn over our graph.

Min, Max = np.min(x_train), np.max(x_train)
x1 = np.linspace(Min, Max, 100)
x2 = np.linspace(Min, Max, 100)
x1g, x2g = np.meshgrid(x1, x2)
X = np.array([x1g.ravel(), x2g.ravel()]).T
Y = tree.predict(X)
plt.pcolormesh(x1g, x2g, Y.reshape(x1g.shape), cmap='RdYlBu_r', shading='auto')
plt.scatter(x_train[:,0], x_train[:,1], c=y_train, cmap='bwr')
plt.show()

So what we’ve done here is just create an array of points over our data and feed them into the decision tree model, with the predictions as the output. With the results we have plotted a colour mesh on top of the training data plot, and it should look something like this:

You can clearly see where the decision boundaries have been made for the two classes. If you set the ‘max_depth’ parameter to a higher number, then you should get more boundaries, but if it is set too high, then the model will be overfit to the training data. What we should do is calculate the error of the model when it is given the testing data.

y_pred = tree.predict(x_test)
error = (y_test!=y_pred).mean()
print('Error = ',error)

I recommend playing around with the ‘max_depth’ parameter and seeing the effect on this error.
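As a rough sketch of that experiment (this loop is my own addition, not from the original post), you could sweep over a few depths and print the test error for each:

# Sweep a few values of max_depth and compare the test error (illustrative)
for depth in range(1, 11):
    t = DecisionTreeClassifier(random_state=0, max_depth=depth)
    t.fit(x_train, y_train)
    depth_error = (y_test != t.predict(x_test)).mean()
    print('max_depth =', depth, ' error =', depth_error)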

Right, so we have a decision tree, and we can visualise how it categorises the dataset. Let’s move on to the random forest.

The code is almost identical to the decision tree model, except we can’t visualise it in the same way because there are a number of trees all working together.

forest = RandomForestClassifier(n_estimators=100,max_depth=5)
forest.fit(x_train,y_train)
y_pred = forest.predict(x_test)
error = (y_test!=y_pred).mean()
print('Error = ',error)

That is our random forest in a nutshell. The ‘n_estimators’ parameter is how many decision trees we use, and the majority class prediction across those trees is the output of the model. There are a few other parameters you can play around with in the random forest, so I recommend reading the sklearn documentation on them, and you can experiment for yourself.
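As one possible starting point, here is a sketch with a few of those extra parameters filled in; the exact values are only guesses to experiment from, not tuned recommendations:

# A forest with a few extra hyperparameters set explicitly; these values are
# just examples to tweak, not recommendations
forest = RandomForestClassifier(n_estimators=200,     # number of trees
                                max_depth=5,          # depth of each tree
                                max_features='sqrt',  # features considered per split
                                min_samples_leaf=2,   # minimum samples in a leaf
                                random_state=0)
forest.fit(x_train, y_train)
y_pred = forest.predict(x_test)
print('Error = ', (y_test != y_pred).mean())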

You should definitely compare the error differences between the two models we have trained, and most of the time you will see that the forest outperforms the single tree. This is why ensemble methods are generally preferred over single models.
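To make that comparison a little fairer you could average the errors over several random splits; a minimal sketch (again my own addition) might look like this:

# Average the errors of a single tree and a forest over several random splits
tree_errors, forest_errors = [], []
for seed in range(10):
    xtr, xte, ytr, yte = train_test_split(x, y, test_size=0.25, random_state=seed)
    t = DecisionTreeClassifier(max_depth=5, random_state=seed).fit(xtr, ytr)
    f = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=seed).fit(xtr, ytr)
    tree_errors.append((yte != t.predict(xte)).mean())
    forest_errors.append((yte != f.predict(xte)).mean())
print('Mean tree error:  ', np.mean(tree_errors))
print('Mean forest error:', np.mean(forest_errors))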

If you enjoyed training your first basic decision tree or random forest, I recommend finding some other datasets to try them out on. From there you can explore the effects of changing the parameters, or even look into some optimisation algorithms!
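For example, scikit-learn ships a few small datasets you can swap straight in; a rough sketch using the built-in breast cancer dataset (following the same workflow as above) could look like this:

# Try the same workflow on one of sklearn's built-in datasets (sketch)
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
x2, y2 = data.data, data.target
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size=0.25)

clf = RandomForestClassifier(n_estimators=100, max_depth=5)
clf.fit(x2_train, y2_train)
print('Error = ', (y2_test != clf.predict(x2_test)).mean())

From there, the same error comparisons and parameter sweeps apply.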
