Extrapolation is tough for trees (tree-based learners); combining learners of different types makes it less tough
Prepared by: Tom Hengl (OpenGeoHub)
Some popular tree-based Machine Learning (ML) algorithms such as Random Forest (RF) and Gradient Boosting have been criticized for over-fitting and for prediction / extrapolation in feature space that can lead to serious blunders and artifacts. Extrapolation seems to be especially cumbersome for regression problems, where models an order of magnitude less complex often outperform RF. Serious artifacts due to extrapolation and over-fitting decrease confidence in RF, especially if the prediction intervals are also unrealistic. Here we demonstrate that an Ensemble approach, which combines diverse learners (complex and simple; tree-based, linear and polynomial models) with robust cross-validation, can be used to find a working compromise between goodness-of-fit and structure in the data. Furthermore, in the Ensemble ML framework, prediction errors are model-free and hence less prone to the limitations of any individual learner. Prediction uncertainty estimated using Ensemble ML is more realistic, especially in the extrapolation space. The disadvantage of the ensemble approach is the computational load and higher complexity: more effort to understand models, more parameters, more results, and more steps needed to sort, fine-tune and interpret results.
Random Forest in a nutshell
Random Forest (RF) was one of the most used Machine Learning algorithms in 2020. It can be applied to classification, regression and survival problems, and many implementations exist, of which ranger (R) and scikit-learn (Python) are also suitable for large datasets, i.e. can be easily parallelized.
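As a minimal illustration of this ease of parallelization (sketched in Python with scikit-learn rather than the R packages named above; all data and settings here are made up for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
# Illustrative data: 500 points, 3 covariates, a linear signal plus noise
X = rng.uniform(0, 10, size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=500)

# n_jobs=-1 grows the trees in parallel across all available cores
rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X, y)
r2_in_sample = rf.score(X, y)  # in-sample R^2, optimistic by construction
```

The equivalent in ranger is the `num.threads` argument; the point is only that tree growing is embarrassingly parallel.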
There are many excellent introductory books and manuals explaining RF. The main advantages of ML techniques such as Random Forest are:
- It has been shown, in dozens of case studies and Kaggle competitions, to outperform linear and non-linear statistical models;
- It can be universally applied to classification, regression and survival problems;
- It is highly suited to multidimensional, multivariate data with complex structures;
- It can be fine-tuned, which often leads to increases in accuracy.
To learn more about Random Forest we advise following some illustrated R tutorials or reading Biau and Scornet (2016). To learn how RF can be used to generate spatial and spatiotemporal predictions in R, refer to Hengl et al. (2018).
Extrapolation and over-fitting problems of Random Forest
Much of the literature also discusses disadvantages of RF. Four commonly identified disadvantages are:
- Depending on the data and the assumptions about the data, it can over-fit without the analyst noticing it.
- It predicts well only within the feature space covered by enough training data. Extrapolation, i.e. prediction outside the training space, can lead to poor performance.
- It can be computationally expensive, with the computational load growing steeply with the number of covariates.
- It requires quality training data and is sensitive to blunders and typos in the data.
In this article we address only the first two limitations of RF: the extrapolation and over-fitting problems.
Probably the best way to illustrate the over-fitting and extrapolation problems of RF is to use synthetic data and then test the sensitivity of RF to different levels of (pure) noise and different structures (distributions) in the data. This type of analysis is common in statistics and is referred to as “sensitivity analysis”. One well-known example of the over-fitting and extrapolation problems of RF is “Extrapolation is tough for trees!”, posted a few years ago by Peter Ellis and further discussed in a Twitter thread by Dylan Beaudette.
In this experiment RF is used to fit a model to data that has a simple (linear) structure, for which a statistician would have no trouble choosing a distribution and a model and estimating prediction intervals. The synthetic data set is a simple linear trend with added noise.
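A comparable sketch in Python (the original post uses R; the slope, noise level and ranges below are illustrative) generates such a linear trend with noise and contrasts how RF and a linear model behave outside the training range:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Training data: simple linear structure y = 2x + noise, with x in [0, 10]
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1.0, 200)
X = x.reshape(-1, 1)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
lm = LinearRegression().fit(X, y)

# Predict well outside the training range (the extrapolation space)
x_new = np.array([[20.0]])
rf_pred = rf.predict(x_new)[0]   # flattens out near the largest training y
lm_pred = lm.predict(x_new)[0]   # continues the linear trend (around 40)
```

The tree-based model cannot predict beyond the range of values seen in training, so its extrapolated prediction is roughly constant while the linear model follows the trend.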
We can then fit an RF model to these data and plot the differences between the linear model and RF. This produces the following plot:
The prediction intervals, i.e. the upper and lower prediction bounds shown above, were derived as the predictions ± one standard prediction error. For Random Forest, prediction intervals can be derived using, for example, the forestError package (Lu & Hardin, 2021), which reports the standard prediction error at each new prediction location, the upper and lower bounds, and even an estimate of potential bias.
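forestError is an R package; as a rough stand-in, a naive interval can be sketched in Python from the spread of individual tree predictions (this is NOT the forestError method, which models the conditional error distribution and bias; it only illustrates the idea of a per-location spread):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1.0, 200)
X = x.reshape(-1, 1)

rf = RandomForestRegressor(n_estimators=500, random_state=1).fit(X, y)

# Collect each individual tree's prediction at the new locations
x_new = np.array([[5.0], [20.0]])  # one interpolation, one extrapolation point
per_tree = np.stack([t.predict(x_new) for t in rf.estimators_])

mean = per_tree.mean(axis=0)
s = per_tree.std(axis=0)           # naive spread across trees
lower, upper = mean - s, mean + s
```

Note that the spread across trees typically stays narrow even at the extrapolation point, since all trees flatten out near the same training extremes; this is exactly the over-optimism discussed above.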
Because these are synthetic data, we know that the fitted line only fits noise and that the predictions in the extrapolation space are not credible. Note, however, that extrapolation would perhaps not be such a problem if the prediction intervals from the forestError package expressed more realistically how far the predictions deviate from the “linear structure” in the data. If, after prediction, one eventually collected ground-truth data for the RF model above, these would probably show that the prediction errors / prediction intervals are completely off. Most traditional statisticians would consider the intervals too narrow and over-optimistic and the fitted line over-fit. Further down the pipeline, over-optimistic prediction uncertainty can make decision makers over-confident, leading to wrong decisions and, consequently, to users losing confidence in RF.
Meyer & Pebesma (2020) discuss in detail the consequences of ignoring prediction in extrapolation space: typically, the accuracy assessment derived using cross-validation within the feature space covered by training data will not apply to the extrapolation space. A practical solution to the problem is to limit predictions ONLY to the feature space with enough training data, the so-called “Area of Applicability” (watch the lecture by Hanna Meyer). This might be somewhat disappointing, because we would then not be able to produce complete, consistent predictions for the whole area of interest.
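Meyer & Pebesma define a dissimilarity index for this purpose; a much-simplified sketch in Python (the threshold and the nearest-neighbour distance used here are illustrative, not their exact index) flags prediction points that lie too far from any training point in standardized feature space:

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 10, size=(200, 2))  # illustrative training covariates

def applicability_mask(X_train, X_new, threshold):
    """Flag rows of X_new whose nearest training point (in standardized
    feature space) lies within `threshold`. A crude stand-in for the
    dissimilarity index of Meyer & Pebesma (2020)."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    A, B = (X_train - mu) / sd, (X_new - mu) / sd
    d = np.sqrt(((B[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)).min(axis=1)
    return d <= threshold

X_new = np.array([[5.0, 5.0],     # inside the training cloud
                  [50.0, 50.0]])  # far outside it
mask = applicability_mask(X_train, X_new, threshold=1.0)
```

Predictions where `mask` is False would be masked out as outside the area of applicability.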
Regression using diverse Ensemble ML algorithms
In the previous example we have shown how RF has problems modeling relationships with a visible structure, e.g. a linear regression. It tends to somewhat over-fit, as it curves around the points and hence partly fits noise, and the default settings of the forestError package do not help produce a more realistic estimate of the extrapolation uncertainty. RF is of course not to blame for this (it is a known characteristic of RF that it predicts reliably only within the feature space determined by the training data); the responsibility lies with the data scientists / statisticians who ignore some properties of the data and the connected assumptions.
How can we help ML obtain a more accurate estimate of the regression, and how can we prevent such problems? One solution is to consider that any dataset could in essence be a combination of linear and non-linear structure, and that different types of learning algorithms could perform better or worse on different parts of the data. This is the so-called Ensemble Machine Learning approach (Zhang & Ma, 2012).
For the synthetic data above, for example, instead of using only RF, we can fit a combination of 5 learners with very different properties:
- regr.glm: GLM, i.e. a linear model;
- regr.cvglmnet: GLM with Lasso or Elastic-Net regularization (cross-validated lambda);
- regr.gamboost: Gradient Boosting with smooth components;
- regr.ranger: Random Forest;
- regr.ksvm: Support Vector Machines.
By default we recommend the so-called stacking approach to Ensemble ML, in which an additional (meta-)learner is used to learn from the base learners. The state-of-the-art super-learner approach to ensemble learning is described in Polley & van der Laan (2010).
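The original post implements stacking via mlr in R; the same idea can be sketched in Python with scikit-learn's StackingRegressor (the learner choices and settings below are illustrative):

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor, StackingRegressor,
                              GradientBoostingRegressor)
from sklearn.linear_model import LinearRegression, ElasticNetCV
from sklearn.svm import SVR

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 300)
y = 2.0 * x + rng.normal(0, 1.0, 300)
X = x.reshape(-1, 1)

# Diverse base learners: linear, regularized linear, boosting, RF, SVM
base = [
    ("lm", LinearRegression()),
    ("enet", ElasticNetCV(cv=5)),
    ("gb", GradientBoostingRegressor(random_state=3)),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=3)),
    ("svm", SVR()),
]
# The meta-learner (here a simple linear regression) learns from
# out-of-fold predictions of the base learners (cv=5)
eml = StackingRegressor(estimators=base, final_estimator=LinearRegression(), cv=5)
eml.fit(X, y)
r2 = eml.score(X, y)
```

Because the meta-learner is trained on cross-validated base predictions, a heavily over-fitting base learner tends to receive a small weight.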
The resulting model and predictions (the meta-learner used here is a simple linear regression) show that the model is similar to the simple linear model: it explains 87% of the variation, i.e. it is as good as the linear model.
Next we need to estimate the prediction errors. This is often not trivial: for Ensemble ML we cannot use a parametric technique but need some form of bootstrapping, i.e. repeated re-fitting of models and then determining the errors at new locations. Here we apply a simple, computationally efficient procedure that consists of three steps:
- Determine the global prediction error / MSE using repeated Cross-validation.
- Determine variance of the multiple learners at new prediction locations.
- Adjust the variance of multiple learners at new prediction locations using a mass-preserving correction factor.
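The three steps above can be sketched as follows (a simplified Python version of the procedure; the variable names and the exact form of the correction factor are ours, not the landmap implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 300)
y = 2.0 * x + rng.normal(0, 1.0, 300)
X = x.reshape(-1, 1)

learners = [LinearRegression(),
            RandomForestRegressor(n_estimators=100, random_state=4)]

# Step 1: global prediction error (MSE) from cross-validation
cv_pred = np.mean([cross_val_predict(m, X, y, cv=5) for m in learners], axis=0)
mse_cv = np.mean((y - cv_pred) ** 2)

# Step 2: variance of the individual learners at the new prediction locations
fits = [m.fit(X, y) for m in learners]
X_new = np.linspace(-5, 15, 50).reshape(-1, 1)  # includes extrapolation space
var_new = np.stack([m.predict(X_new) for m in fits]).var(axis=0)

# Step 3: a mass-preserving correction factor rescales the learner variance
# so that, averaged over the training space, it matches the CV-based MSE
var_train = np.stack([m.predict(X) for m in fits]).var(axis=0)
f = mse_cv / var_train.mean()
pred_error = np.sqrt(f * var_new)   # adjusted prediction error
```

Where the learners disagree (typically in the extrapolation space), `var_new` and hence `pred_error` grow, which is the behaviour we want.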
The correction factor is usually above 1, which means that most learners over-fit the data and the actual prediction errors are somewhat higher. You should always inspect the correction factor: if it is much higher than 1, the over-fitting is significant and you might want to check and adjust the cross-validation settings.
The final plot using the synthetic data, this time fitted using Ensemble ML, is shown below:
Note from the plot above that the prediction error intervals in the extrapolation space can be quite wide, which reflects much better what we would expect. As a rule of thumb, once the prediction errors exceed the global variance of the training data, the predictions become so uncertain that we would not recommend basing any decision on them in the extrapolation space (such predictions could also be masked out or excluded from decision making).
In summary: it appears that combining linear and non-linear tree-based models both helps decrease over-fitting and produces more realistic uncertainty estimates / prediction intervals. The Ensemble ML framework correctly identifies the linear models as more important than Random Forest or similar learners. The uncertainty assessment based on cross-validation and/or bootstrapping produces a model-free estimate of the prediction errors.
Applying Ensemble ML to a real-life dataset
In the next example we look at the performance of Ensemble ML on a real-life data set: the meuse data set from the sp package. This data set has been used intensively to demonstrate various geostatistical and ML spatial prediction techniques (see e.g. Bivand et al., 2013).
We again fit an Ensemble ML model, this time using four of the learners from above, aiming to explain the spatial distribution of zinc concentration in the soil as a function of distance from the river. We can fit the Ensemble ML model using the landmap package in a single line:
SL.library = c("regr.ranger", "regr.glm", "regr.gamboost", "regr.ksvm")
eml.m <- landmap::train.spLearner(meuse["zinc"], covariates=meuse.grid[,c("dist")], spc = FALSE, SL.library = SL.library, oblique.coords = FALSE)
# Residuals:
#     Min      1Q  Median      3Q     Max
# -632.90 -101.75  -38.21   64.08  898.94
# Coefficients:
#                Estimate Std. Error t value Pr(>|t|)
# (Intercept)   11.291829  43.162115   0.262    0.794
# regr.ranger   -0.284509   0.171958  -1.655    0.100
# regr.glm      -0.006867   0.143319  -0.048    0.962
# regr.gamboost  1.255944   0.306575   4.097 6.83e-05 ***
# regr.ksvm      0.012866   0.334695   0.038    0.969
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 236.3 on 150 degrees of freedom
# Multiple R-squared: 0.5964, Adjusted R-squared: 0.5857
# F-statistic: 55.42 on 4 and 150 DF, p-value: < 2.2e-16
This time we use so-called spatial cross-validation, i.e. CV that prevents spatial autocorrelation in the values from affecting the estimation of model parameters (Lovelace et al., 2019). For this we determine the block size by fitting a variogram of the target variable using the geoR package (Brown, 2015); the block size is then set to correspond to the range of spatial dependence. Points that fall in the same spatial block are used either for training or for validation, but never for both.
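The block-based splitting can be sketched in Python with GroupKFold (the coordinates and block size below are illustrative; the post derives the block size from the fitted variogram range instead):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(5)
coords = rng.uniform(0, 1000, size=(155, 2))  # illustrative point coordinates (m)

block_size = 250.0  # in the real workflow, taken from the variogram range
# Assign each point to a rectangular spatial block (unique id per grid cell)
blocks = ((coords[:, 0] // block_size).astype(int) * 100
          + (coords[:, 1] // block_size).astype(int))

# GroupKFold keeps all points of one block on the same side of each split
gkf = GroupKFold(n_splits=4)
ok_disjoint = all(
    set(blocks[tr]).isdisjoint(blocks[te])
    for tr, te in gkf.split(coords, groups=blocks)
)
```

Because whole blocks are held out, the validation points are spatially separated from the training points, giving a less optimistic accuracy estimate.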
The results of fitting the Ensemble ML model show that regr.gamboost is now the best performing learner, while RF and the others are only marginally useful. The model explains 59% of the variance in the data, which is much better than a simple linear regression alone (41%). The prediction intervals, derived using the same procedure explained above, show:
Again, the extrapolation space receives the highest prediction uncertainty. Note that Ensemble ML again has no knowledge of the nature of the target variable (in this case, that negative zinc concentrations are not possible), so the estimates could be further adjusted by using a log-transformation or similar, or simply by replacing negative predictions with 0.
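Both adjustments are one-liners; a trivial sketch in Python (the prediction values are hypothetical):

```python
import numpy as np

pred = np.array([120.5, -15.2, 430.0])  # hypothetical zinc predictions (ppm)

# Option 1: replace negative predictions by 0
pred_clipped = np.clip(pred, 0, None)

# Option 2 (sketched): fit the model on log1p(zinc) instead, so the
# back-transformed predictions np.expm1(pred_log) are non-negative
# by construction.
```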
So, in summary, we hope we have demonstrated that even though extrapolation is tough for trees (tree-based models), by combining diverse learners and running cross-validation with re-fitting we can reduce over-fitting effects and produce more realistic estimates of prediction errors, including in the extrapolation space. The disadvantage of the ensemble approach, however, is the computational load, including the higher effort needed to understand the models: there are many more parameters, more results, and more steps needed to sort, fine-tune and interpret the results. To learn more about Ensemble ML, please also refer to this tutorial and the landmap package in R. The code used to generate the examples in this post is available here.
A word of caution: even though the landmap package appears extremely easy to use, as predictions can be produced with two lines of code, it can still produce unrealistic estimates and serious bias, so extensive model diagnostics and post-modeling interpretation are highly recommended even if all steps are fully automated. Also, the package is not yet suitable for large datasets. We recommend reading, for example, the “Interpretable Machine Learning” book by Christoph Molnar to learn more about post-modeling diagnostics and understanding why models predict the way they do.
Important note: the examples in this article are based on the mlr package. This package has unfortunately been discontinued and users are advised to migrate to the mlr3 package. If you manage to fit the same type of models listed here using mlr3, please share and we will update the code.
- Biau, G., & Scornet, E. (2016). A random forest guided tour. TEST, 25(2), 197–227. https://doi.org/10.1007/s11749-016-0481-7
- Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., … Jones, Z. M. (2016). mlr: Machine Learning in R. The Journal of Machine Learning Research, 17(1), 5938–5942.
- Bivand, R. S., Pebesma, E. J., & Gómez-Rubio, V. (2013). Applied spatial data analysis with R (Vol. 747248717). New York: Springer. https://asdar-book.org/
- Brown, P. E. (2015). Model-based geostatistics the easy way. Journal of Statistical Software, 63(12), 1–24. https://www.jstatsoft.org/article/view/v063i12
- Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B., & Gräler, B. (2018). Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ, 6, e5518.
- Lovelace, R., Nowosad, J., & Muenchow, J. (2019). Geocomputation with R. CRC Press.
- Lu, B., & Hardin, J. (2021). A unified framework for random forest prediction error estimation. Journal of Machine Learning Research, 22(8), 1–41.
- Meyer, H., & Pebesma, E. (2020). Predicting into unknown space? Estimating the area of applicability of spatial prediction models. Methods in Ecology and Evolution. https://doi.org/10.1111/2041-210X.13650
- Polley, E. C., & van der Laan, M. J. (2010). Super learner in prediction. U.C. Berkeley Division of Biostatistics. Retrieved from https://biostats.bepress.com/ucbbiostat/paper266
- Seni, G., & Elder, J. F. (2010). Ensemble methods in data mining: Improving accuracy through combining predictions. Morgan & Claypool Publishers. https://doi.org/10.2200/S00240ED1V01Y200912DMK002
- Wright, M. N., & Ziegler, A. (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(i01). https://www.jstatsoft.org/article/view/v077i01
- Zhang, C., & Ma, Y. (2012). Ensemble machine learning: Methods and applications. Springer New York.