Problems with Random Forests
Even though random forests are very robust, they have some problems. This post covers some of those problems and includes a few tips to mitigate them to some extent.
Before going into the details, if you are interested in understanding the random forest model and how to interpret its results, check out these two posts, 1 and 2. A random forest's decision surface consists of lines parallel to the axes. Because of this, it needs many splits to model even a simple straight line. This can often be solved by simply rotating the axes. The right degree of rotation may not always be clear, but simple trial and error works in most cases. After rotating, the decision boundary becomes parallel to the axes, which makes it much easier for a random forest to model. This is illustrated in the figure below.
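The rotation trick can be applied as a simple preprocessing step before training. A minimal sketch with NumPy is shown below; the `rotate_features` helper and the 45° angle are illustrative assumptions, not part of the original post:

```python
import numpy as np

def rotate_features(X, degrees):
    """Rotate a 2-column feature matrix by the given angle.

    `degrees` is found by trial and error in practice, e.g. by
    inspecting the decision surface after each attempt.
    """
    theta = np.deg2rad(degrees)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return X @ R.T

# Example: points separated by the diagonal boundary y = x become
# separable by an axis-parallel split after a 45-degree rotation.
X = np.array([[0.2, 0.8],
              [0.9, 0.1]])
X_rot = rotate_features(X, 45)
```

The rotated features are then fed to the forest in place of the originals; only the representation changes, not the labels.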
In the above case, even if we don't rotate anything manually, the model will still be able to fit, though it may use more splits. However, the main problem with random forests is their inability to extrapolate: they cannot predict properly when the feature range differs from the range they were trained on. To understand this problem, we can create a simple dataset with a linear relationship, using the code below.
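A minimal reconstruction of that dataset might look like this (the random seed is an assumption; the original post does not state one):

```python
import numpy as np

np.random.seed(42)  # assumed seed, for reproducibility

# one feature: 100 evenly spaced values between 0 and 1
x = np.linspace(0, 1, 100)

# linear target with a little uniform noise in roughly [-0.1, 0.1]
y = x + np.random.uniform(-0.1, 0.1, size=100)
```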
Here we have only one feature x, created by choosing 100 evenly spaced values between 0 and 1. Since we want a linear relationship, the target variable y is created by adding some noise to x, in this case roughly between -0.1 and 0.1. The following picture is a scatter plot of x and y.
Now we split the data into two parts, train and validation: the first 80 instances are used for training and the last 20 for validation (the standard way of splitting for a time series problem). A random forest is used to model this relationship, and as expected it works really well on the training data, since the relationship is simple. The scatter plot below illustrates this.
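The split and fit can be sketched as follows; the dataset is rebuilt inline so the snippet is self-contained, and the seed and hyperparameters are assumptions rather than the post's exact settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)  # assumed seed
x = np.linspace(0, 1, 100)
y = x + np.random.uniform(-0.1, 0.1, size=100)

# time-series style split: first 80 points for training, last 20 for validation
x_trn, x_val = x[:80].reshape(-1, 1), x[80:].reshape(-1, 1)
y_trn, y_val = y[:80], y[80:]

# hyperparameters are assumptions
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(x_trn, y_trn)
print(model.score(x_trn, y_trn))  # close to 1 on the training range
```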
But it performs horribly on the validation data: it assigns the same value to every instance in the validation set. This shouldn't come as a surprise if you understand how random forests work. A random forest merely returns the average of nearby observations, where "nearby" observations, in this case, are the ones that fall into the same final leaf node of each tree.
Since the model was trained on data in a different range, every validation instance falls past the largest split threshold, into the same leaf that holds the largest training values. With nothing else to average, the model ends up assigning the value predicted for the maximum x in the training data to every validation instance. That's why we see a horizontal line when we plot x_val against the model's predictions.
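This collapse is easy to verify directly. Continuing the same toy setup (seed and hyperparameters again assumed), every validation point lies beyond the largest training split, so all predictions are identical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)  # assumed seed
x = np.linspace(0, 1, 100)
y = x + np.random.uniform(-0.1, 0.1, size=100)
x_trn, x_val = x[:80].reshape(-1, 1), x[80:].reshape(-1, 1)
y_trn = y[:80]

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(x_trn, y_trn)

preds = model.predict(x_val)
# every validation x exceeds every split learned from training data,
# so each tree routes all 20 points to its rightmost leaf
print(preds.max() - preds.min())  # 0.0
```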
Ways to solve the extrapolation problem
- One way to avoid such a situation is to remove unnecessary time-dependent variables and train the model on the remaining features.
- But when there is a genuine time series trend, as in the example above, there isn't much we can do within the forest itself. We can try time series techniques such as detrending the data, but even then the model may not perform very well.
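One common detrending approach is to fit a simple trend model, train the forest on the residuals, and add the extrapolated trend back at prediction time. A sketch on the same toy dataset (model choices and seed are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

np.random.seed(42)  # assumed seed
x = np.linspace(0, 1, 100)
y = x + np.random.uniform(-0.1, 0.1, size=100)
x_trn, x_val = x[:80].reshape(-1, 1), x[80:].reshape(-1, 1)
y_trn, y_val = y[:80], y[80:]

# 1. fit a linear trend on the training window
trend = LinearRegression().fit(x_trn, y_trn)

# 2. let the forest model only the detrended residuals
resid = y_trn - trend.predict(x_trn)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(x_trn, resid)

# 3. add the extrapolated trend back at prediction time
pred_val = trend.predict(x_val) + rf.predict(x_val)
# predictions now rise with x instead of flat-lining
```

The forest still predicts a constant residual out of range, but the linear component carries the trend forward, which is why this works on purely linear trends and degrades when the trend itself changes shape.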
Though random forests are very robust and, like any other model, work well on most datasets, they have some problems and will struggle on datasets with time series trends, because of extrapolation. It's important to identify this quickly and, in such cases, use other models such as neural nets, which can capture these trends more easily.
Thanks for reading! All the code used for this post is available here.