I’ve set myself a challenge to do more technical writing this year, in the hope that these articles may help anyone else at a similar stage of learning. At the same time it reinforces my own learning: it’s difficult to write about something you don’t understand. If you disagree with anything I’ve written, or think it’s simply wrong, please do give me feedback, as this is a continual learning process.
I’ve been using various machine learning techniques with Kaggle competitions, both shallow and deep. One technique that I’ve found very effective is Random Forests. This article describes my learning and understanding of this exceptional machine learning technique.
A significant proportion of this learning and this article has come from the excellent fast.ai Machine Learning course originally taught at University of San Francisco as part of the Masters of Science in Data Science curriculum by Jeremy Howard.
Ensembles of decision trees
Random forests belong to a group of machine learning techniques called ensembles of decision trees, in which several decision trees are bagged together and their predictions averaged. The other very effective technique in this group is gradient boosting, which falls into the boosting category rather than the bagging category.
What is known today as Random Forests was first published in a paper by Leo Breiman in 2001:
Why Random Forests are so good
- No statistical assumptions
- Resistant to overfitting
- Works well with high numbers of features
- Works well with almost any size dataset
- Good with limited/small datasets
- Require little feature engineering
- Optimal hyper parameters are easy to find
Random Forests are a supervised machine learning technique that makes accurate predictions across almost any tabular dataset, including categorical or continuous data. Even with huge datasets they work well, regardless of the number of features or rows. Some practitioners go as far as to call Random Forests a universal machine learning technique.
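As a minimal sketch of how little is needed to get started, here is a Random Forest regressor fitted on a small synthetic tabular dataset (the dataset and parameter values are illustrative, not from a real problem):

```python
# Minimal Random Forest sketch on synthetic tabular data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset: 500 rows, 10 features.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
score = model.score(X_valid, y_valid)  # R^2 on held-out data
```

No scaling, normalisation or distributional assumptions were needed before fitting.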
No free lunch theorem
There’s a broadly held view in the data science and machine learning community that there’s no free lunch for machine learning techniques.
This is based on a mathematical theorem stating that no single type of model works well for every type of dataset. Any truly random dataset is, by definition, without structure. However, in the real world the data we attempt to model isn’t random; it is created by causes or relationships that have structure. Hence the datasets we work with are not completely random.
There are machine learning techniques that work better than most others most of the time. Ensembles of decision trees are one of those techniques, and random forests sit at the top of that group.
Features and the Curse of dimensionality
Another commonly held view is the curse of dimensionality. As the dimensionality of the feature space increases, the number of configurations can grow exponentially and thus the number of configurations covered by an observation decreases.
However this only applies to certain techniques such as Support Vector Machines and not other techniques such as Random Forests and K Nearest Neighbours.
With more and more features, the data occupies a space that is increasingly empty. With more dimensions, more of the points sit on the edge of that space. Each dimension you add makes it multiplicatively less likely that a point isn’t on the edge of at least one dimension. In very high dimensions, everything sits on the edge.
The theory, and the concern, is that the distance between points becomes less meaningful because more of the points are on the edge, and that this means training won’t work. However, points still have different distances from each other, even when they are on the edge; they still vary in how far apart they are.
Random Forests work very well in high dimensions, as do other techniques such as K Nearest Neighbours.
With Random Forests there’s almost no harm in keeping columns whose importance is not certain and no harm in adding more columns. For example, if one of the features was a date time then this could be split into multiple features such as: day of week, is a bank holiday, days since last bank holiday, month, days until Christmas and so on.
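The date-time expansion described above can be sketched with pandas; the column names here are made up for illustration:

```python
# Expanding a single datetime column into several derived features.
import pandas as pd

# Illustrative data with a hypothetical "saledate" column.
df = pd.DataFrame({"saledate": pd.to_datetime(["2024-12-20", "2025-01-03"])})

df["year"] = df["saledate"].dt.year
df["month"] = df["saledate"].dt.month
df["day_of_week"] = df["saledate"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_month_end"] = df["saledate"].dt.is_month_end
```

Each new column is another feature the forest can split on, and unhelpful ones simply go unused.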
Resistant to overfitting
Overfitting is a very common problem in machine learning, where a model performs well on the training data but poorly on the test set, the validation set and in production. A model that overfits is said to have high variance.
Random Forests aren’t prone to overfitting and if they do it is generally easy to stop by adjusting the hyper parameters such as minimum samples per leaf and maximum depth. There is more detail about the hyper parameters further on.
Good with limited data
If you have limited rows in your dataset, you can use Random Forests without a separate validation set, because you can see how they generalise using only the training data. This is done with the out of bag score; there are more details about this further on.
No statistical assumptions
Random Forests also have the advantage you only need to make few, if any, statistical assumptions. Random Forests don’t assume your data is normally distributed, or that relationships are linear, nor that you’ve specified the interactions between the features.
Random Forests require only very small amounts of feature engineering, beyond the usual numericalisation of categorical data and augmentation of time series data. There’s no need to apply functions such as log transforms to the data.
Optimal hyper parameters are easy to find
As the number of trees increase it’s easy to see the accuracy improvement levelling off.
A Grid search can also be used to find the optimal hyper parameters.
There are many ways the predictions from a Random Forest model can be explained. For example, it is very easy to inspect the feature importances within a fitted Random Forest.
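Reading feature importances out of a fitted forest can be sketched as follows (the dataset is illustrative; scikit-learn exposes the importances as an attribute):

```python
# Inspecting feature importances from a fitted Random Forest.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes()
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(data.data, data.target)

# Pair each feature name with its importance and sort, largest first.
importances = sorted(zip(data.feature_names, model.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
```

The importances sum to one, so they can be read as each feature’s share of the model’s splitting decisions.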
There are many other ways to interpret the Random Forest model such as partial dependence between features.
How Random Forests learn, growing a decision tree
Creating the first decision tree
To find the first decision split, the tree needs to find the single most effective split. Every feature is tested, and every possible value of each feature is tested, to determine which feature and which value provide the best score to split on first in the tree. This might seem computationally expensive, but see the performance section further on.
The score is the weighted average of the mean squared error of the prediction for each of the two groups the split creates.
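This split score can be sketched in a simplified, illustrative form (this is not how scikit-learn implements it internally):

```python
# Simplified split scoring: weighted average MSE of the two groups.
import numpy as np

def split_score(y_left, y_right):
    """Weighted average of each side's MSE around its own mean."""
    def mse(y):
        return np.mean((y - y.mean()) ** 2) if len(y) else 0.0
    n = len(y_left) + len(y_right)
    return (len(y_left) * mse(y_left) + len(y_right) * mse(y_right)) / n

def best_split(x, y):
    """Try every candidate threshold for one feature; lower score is better."""
    best_threshold, best_score = None, np.inf
    for t in np.unique(x)[:-1]:          # last value would leave the right side empty
        mask = x <= t
        score = split_score(y[mask], y[~mask])
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score
```

Run over every feature, the (feature, threshold) pair with the lowest score becomes the root split.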
The tree keeps splitting until every leaf node contains only one sample, or until a hyper parameter limit such as minimum samples per leaf or maximum depth is reached.
Creating a forest of trees (Bagging)
Bagging is the process of creating a forest of decision trees resulting in a random forest that is more accurate than any individual tree.
Within a random forest, many deep trees are created, each trained on a random subset of the data. Each tree will overfit its random subset in different random ways: it may fit that subset perfectly yet have a large error on the remainder of the data. This means each tree overfits heavily, but each one differently.
For a dataset with n rows, if the random subset is n rows taken with replacement, each tree sees approximately 63.2% of the distinct rows, with many rows represented multiple times. This is the default the Scikit-learn library uses.
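The 63.2% figure (which is 1 − 1/e) is easy to confirm with a quick simulation of sampling n rows with replacement:

```python
# Simulating a bootstrap sample: what fraction of distinct rows does one tree see?
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)             # n draws with replacement
unique_fraction = len(np.unique(sample)) / n    # expected ~ 1 - 1/e = 0.632
```

Each row has probability (1 − 1/n)ⁿ ≈ 1/e of never being drawn, so roughly 63.2% of distinct rows appear in each tree’s sample.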
This results in multiple uncorrelated models being created and each will have found different insights about the features.
To make a prediction, the data is run through each tree until a leaf node is reached. For a continuous target, each tree’s prediction is the average of the dependent variable in that leaf node, and the forest’s prediction is the average of all the trees’ predictions. Taking this average of the trees’ predictions is the ensembling step.
Each tree’s predictions will be better than a random prediction, because each tree was trained on a genuine random subset of the data and so has found some real insight about the features. Although each tree has errors, those errors are random and largely uncorrelated. The average of a group of uncorrelated random errors tends towards zero, so the errors average out and what’s left is the true relationship.
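This averaging-out of uncorrelated errors can be demonstrated with a small simulation (illustrative numbers, not from a real model):

```python
# Averaging many noisy, uncorrelated estimates of a true value
# reduces the error of the ensemble estimate.
import numpy as np

rng = np.random.default_rng(1)
true_value = 5.0
n_estimators = 1000

# Each "estimator" predicts the true value plus independent random noise.
predictions = true_value + rng.normal(0.0, 2.0, size=n_estimators)

single_error = abs(predictions[0] - true_value)       # one noisy estimator
ensemble_error = abs(predictions.mean() - true_value) # the averaged ensemble
```

The ensemble error shrinks roughly with the square root of the number of estimators, provided the errors really are uncorrelated.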
In practice, an effective machine learning model is one that both makes accurate predictions on the training data, finding the real relationships, and also generalises well when shown new data.
With bagging, each individual estimator should be as predictive as possible but also as uncorrelated with the others as possible. Creating uncorrelated trees matters more than making each tree more accurate: trees that are slightly less predictive on their own but less correlated with each other produce a better ensemble.
It’s a good idea to start training with a small number of trees, evaluating how the other hyper parameters and any feature engineering affect the loss. These training cycles should be quick enough to evaluate without leaving them to run for long periods. Feature engineering such as time series expansion or augmentation with other datasets can be evaluated quickly with a small number of trees, for example 20 or 30.
Then the number of trees can be increased to a larger number, for example 1,000 trees, leaving the model to train for several hours.
Note in some packages the trees are called estimators.
Out of bag score
When validating during training, the rows not seen by a given tree can be used as a validation set. The out of bag score is the average prediction loss on those rows: for each row, the prediction is averaged over only the trees that did not use that row for training. As long as there are enough trees, every row will appear at least once in the out of bag evaluation.
If the dataset is small, you may not want to set aside a percentage as a validation set. Using the out of bag score is a solution to this, and it is almost unique to random forests.
The out of bag score is likely to be a slight underestimate of the true accuracy, because each row is seen by fewer trees; the more trees you add, the smaller the underestimate.
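In scikit-learn the out of bag score is a single constructor flag; a sketch on an illustrative dataset:

```python
# Using the out-of-bag score as a built-in validation estimate.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative synthetic dataset.
X, y = make_regression(n_samples=1000, n_features=8, n_informative=5,
                       noise=5.0, random_state=0)

model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
model.fit(X, y)

# R^2 estimated only from the rows each tree did NOT see during training.
oob_r2 = model.oob_score_
```

No rows were held out, yet `oob_r2` behaves like a validation score.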
Training on very large data sets
Random Forests will scale to train on almost any size of dataset. If each tree uses a random subset of the data, then with enough trees the random forest will see everything in the data, even with millions of rows. Because each tree has a different random subset, the subsets can be smaller, which speeds up training the random forest.
Most of the insight in the features can normally be found from a small number of trees trained on a small subset of the data, as long as that subset is a large enough sample to get an accuracy within a reasonable distance of the best possible. In many cases this produces a model close to the accuracy you need that is much faster to train and run.
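Per-tree subsampling is available directly in scikit-learn (version 0.22 onwards) via the `max_samples` parameter; a sketch with illustrative values:

```python
# Training each tree on a smaller random subset of rows with max_samples.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative synthetic dataset.
X, y = make_regression(n_samples=5000, n_features=10, noise=5.0, random_state=0)

# Each tree sees only 20% of the rows, which speeds up training;
# with enough trees the forest still sees all of the data.
model = RandomForestRegressor(n_estimators=100, max_samples=0.2, random_state=0)
model.fit(X, y)
```

The trade-off is per-tree accuracy for training speed, which the ensemble usually recovers.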
Training random forests may seem slow, but modern CPUs are fast, running billions of clock cycles per second across multiple cores, each capable of SIMD (single instruction, multiple data) operations. GPUs take this even further with massive parallelism.
Common Hyper parameters
Number of trees
A key hyper parameter is the number of trees in the forest. Use as many trees as significantly reduce the error. To determine this, the R² score can be plotted against an increasing number of trees. The prediction is never going to get worse with more trees, but a threshold will be reached where adding more doesn’t reduce the error significantly and only slows down training.
The Scikit-learn library’s default is 10 trees, or estimators as it refers to them (raised to 100 in version 0.22).
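The levelling-off can be seen by sweeping the number of trees and scoring on a validation set (an illustrative sweep, not a tuned experiment):

```python
# Watching the validation R^2 level off as trees are added.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset.
X, y = make_regression(n_samples=2000, n_features=10, noise=5.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

scores = {}
for n in [1, 10, 50, 200]:
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    scores[n] = model.score(X_valid, y_valid)  # validation R^2
```

The jump from 1 to 10 trees is typically large; from 50 to 200 it is usually marginal.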
Maximum depth
The maximum depth of the tree is another hyper parameter; it limits how deep each tree in the forest is allowed to grow.
As you increase max depth you increase variance and decrease bias, causing the model to overfit more.
The Scikit-learn library’s default is no maximum.
Minimum samples per leaf
Another hyper parameter is minimum samples per leaf. This allows the tree to stop splitting before every leaf is reduced to a single sample. It specifies the minimum number of samples required to be at a leaf node.
For example, with a minimum of 3 samples per leaf, each tree will generalise better, though each tree will be slightly less powerful on its own.
Values such as 1, 3, 5, 10 and 25 tend to work well with most datasets. With a very big dataset, the minimum samples per leaf may be in the thousands. To decide, look at how big the subsamples in use are and choose the number of samples per leaf based on that.
As you increase min samples leaf you decrease variance and increase bias, causing the model to underfit.
The Scikit-learn library’s default is 1 sample per leaf.
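The bias/variance effect of this parameter can be sketched by sweeping the values suggested above and comparing training scores (illustrative dataset; training R² shown only to make the effect visible):

```python
# Sweeping min_samples_leaf: larger leaves give smoother, less overfit trees.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative synthetic dataset.
X, y = make_regression(n_samples=1000, n_features=8, n_informative=5,
                       noise=5.0, random_state=0)

models = {leaf: RandomForestRegressor(n_estimators=30, min_samples_leaf=leaf,
                                      random_state=0).fit(X, y)
          for leaf in [1, 3, 5, 10, 25]}

# Training R^2 falls as the leaves get bigger: less variance, more bias.
train_scores = {leaf: m.score(X, y) for leaf, m in models.items()}
```

A validation score, not the training score, should drive the final choice.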
Maximum features
The number of features to consider when looking for the best split is another hyper parameter. The more features considered at each split, the more correlated the trees become, and Random Forests work best when the trees are less correlated.
With all features considered at every split, one dominant feature would always be the most predictive and would always be the first split, so all the trees would be similar. A combination or interaction of features might be more predictive, and that insight would never be found.
At each split a different random subset of features is considered; features are never removed from future consideration (they are sampled with replacement).
The Scikit-learn library’s default is all of the features. Using half of the features works well most of the time; other good values to try are log2 and the square root of the number of features.
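In scikit-learn this is the `max_features` parameter, which accepts the fractions and named values mentioned above (values here are illustrative):

```python
# Decorrelating trees by limiting the features considered at each split.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative synthetic dataset.
X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

# max_features=0.5 -> half the features per split; "sqrt" and "log2" also work.
model = RandomForestRegressor(n_estimators=50, max_features=0.5, random_state=0)
model.fit(X, y)
```

Smaller values make individual trees weaker but the ensemble more diverse.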
Extremely randomised model
There is another technique called the Extremely randomised trees model, in which, rather than trying every split of every feature, a few random splits of a few random features are tried. The extra randomness makes training much quicker and tends to make the resulting model generalise better.
This model relies on each tree being less correlated with the others, which matters more than the accuracy of each individual tree.
If the model isn’t performing well then more uncorrelated trees can be added.
With the Extremely randomised trees model, it is important not to use the whole dataset, or a high percentage of the rows and features, in each tree, as this won’t generalise well.
Choosing between a random forest and an extremely randomised forest is itself just a hyper parameter: which tree structure to use.
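In scikit-learn the extremely randomised variant is a drop-in replacement whose interface mirrors the ordinary forest (dataset illustrative):

```python
# Extremely randomised trees: same interface as RandomForestRegressor.
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Illustrative synthetic dataset.
X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
r2 = model.score(X, y)
```

Swapping between the two classes is all it takes to treat the tree structure as a hyper parameter.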
Grid search — Finding the best hyper parameters
A grid search can be used to find the best hyper parameters. The candidate hyper parameters are passed in with the possible values to try for each, and the grid search runs through every combination to determine which is best.
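A sketch of such a search with scikit-learn’s GridSearchCV (the grid values are illustrative, not a recommendation):

```python
# Grid search over a small hyper parameter grid with cross-validation.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative synthetic dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)

param_grid = {
    "min_samples_leaf": [1, 5, 25],
    "max_features": [0.5, "sqrt"],
}
search = GridSearchCV(RandomForestRegressor(n_estimators=30, random_state=0),
                      param_grid, cv=3)   # every combination, 3-fold CV
search.fit(X, y)
best = search.best_params_
```

Grid search cost grows multiplicatively with each added hyper parameter, so keep the grid small and refine it iteratively.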
Random Forests are important as a first technique to apply and can often end up being the only technique you need to apply.
Most linear models have a lot of statistical assumptions where a lot has to be correct before the model starts to work at all. This isn’t the case with Random Forests.
Random Forests work on most sets of data, most of the time with most sets of hyper parameters.
In the PUBG tabular data competition on Kaggle, Random Forests got within 1% of the error of my best deep learning attempts, with considerably less feature engineering and training time.
I want to say a big thank you to Jeremy Howard and the fast.ai courses as these have helped teach me and give me a better understanding of these techniques.