Regression prediction intervals with XGBoost
(Update 2019–04–12: I cannot believe it has been 2 years already. As I have been receiving various requests for updating the code, I took some time to refactor, update the gists, and even create a Colaboratory notebook. I apologize to the people who had difficulties running the past version. If you have any questions, please ask; I'll gladly make some time to answer!)
(Update 2020–08–19: Thank you for helping me refine the split-gain argument)
Knowing the uncertainty in an algorithm's predictions is paramount for anyone who wants to do serious predictive analytics for their business. Predictions are never absolute, and it is imperative to know how much they may vary. If you wish to know the passenger volume for each flight, you also need to know by how many passengers the prediction may be off. Someone else might want to predict disembarking times. There is, of course, a difference between a prediction on a scale of a few hours with a 95% chance of being correct to within half an hour, and a potential error of 10 hours!
Here, I present a customized cost function for applying the well-known XGBoost regressor to quantile regression. XGBoost, or Extreme Gradient Boosting, is a very successful and powerful tree-based algorithm. Because of the nature of the gradient and Hessian of the quantile regression cost function, XGBoost is known to heavily underperform on this task. I show that by adding a randomized component to a smoothed gradient, quantile regression can be applied successfully. I also show that this method can outperform the GradientBoostingRegressor algorithm from the popular scikit-learn package.
Why prediction intervals?
While models output (hopefully accurate) predictions, these predictions are themselves random variables, i.e. they have a distribution. Prediction intervals are necessary to get an idea of how likely our results are to be correct. This likelihood is expressed as an interval of possible values.
Let us say we wish to know the range of possible predictions with a certainty of 90%. Then we need to provide two values Q_α and Q_{1−α} such that the probability of the true value y lying within the interval,

P( Q_α ≤ y ≤ Q_{1−α} ),

is 90%.
The basic statement of linear regression is

y_i = Σ_j β_j x_{ij} + ε_i,

with the assumption that the observations x_i and the error terms ε_i are independent and normally distributed.
Under the assumption that the error terms are identically normally distributed, the maximum likelihood estimate of the coefficients β_j minimizes the variance of these error terms. This variance is, of course, the maximum likelihood (ML) estimate of the variance of the prediction. Under our assumption of normality of both the features and the errors, the prediction is also normally distributed.
The prediction intervals for normal distributions are easily calculated from the ML estimates μ and σ² of the expectation and the variance: the 68% prediction interval is between μ − σ and μ + σ, the 95% prediction interval is between μ − 2σ and μ + 2σ, and the 99.7% prediction interval is between μ − 3σ and μ + 3σ.
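To make this concrete, here is a small illustration (my own sketch, not part of the original post) of computing these normal-theory intervals from the residuals of a fitted linear model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.5 * X.ravel() + rng.normal(0, 1.5, size=200)   # linear signal + Gaussian noise

model = LinearRegression().fit(X, y)

# ML estimate of the noise standard deviation from the residuals
sigma = np.sqrt(np.mean((y - model.predict(X)) ** 2))

# 68% / 95% / 99.7% prediction intervals around a new prediction
mu = model.predict([[5.0]])[0]
for k, level in zip((1, 2, 3), ("68%", "95%", "99.7%")):
    print(f"{level} interval: [{mu - k * sigma:.2f}, {mu + k * sigma:.2f}]")
```

Note that the interval has the same width for every input, which is exactly the limitation discussed next.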
This method has a number of limitations. First of all, we depend heavily on the basic assumption of normality, which is of course not always true. Secondly, the simplicity of linear regression limits us to a very broad prediction interval. Ideally, we wish our prediction boundaries to also depend on more features. In the next section, we discuss such an implementation using tree models.
Tree models
Regression trees are a very powerful prediction tool. The general idea of trees is to partition the feature space and assign a value to each of these partitions. This value is assigned in such a way that a cost function is minimized within the partition.
Quantiles can be found through the following optimization problem: the α-quantile q_α of the observations x_i within a partition minimizes the cost function

L_α(q) = Σ_{x_i ≥ q} α (x_i − q) + Σ_{x_i < q} (1 − α) (q − x_i).
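As a quick sanity check (a sketch of mine, not from the original post), one can verify numerically that minimizing this cost function indeed recovers the empirical quantile:

```python
import numpy as np

def quantile_loss(q, x, alpha):
    # quantile (pinball) cost of a candidate value q for the alpha-quantile of x
    err = x - q
    return np.sum(np.where(err >= 0, alpha * err, (alpha - 1) * err))

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)

alpha = 0.95
candidates = np.linspace(x.min(), x.max(), 2_000)
losses = [quantile_loss(q, x, alpha) for q in candidates]

print("argmin of the quantile cost:", candidates[np.argmin(losses)])
print("empirical 95% quantile:    ", np.quantile(x, alpha))
```

Both values agree up to the resolution of the candidate grid.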
Quantile regression for xgboost
I have not yet explained how values are assigned to each partition. XGBoost is a powerful algorithm that does this for us. For further reference, the reader should check the original paper:
Chen, Tianqi, and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
The main ingredient to further partition a volume I into two smaller subvolumes I_L and I_R is the score function

L_split = ½ [ G_L² / (H_L + λ) + G_R² / (H_R + λ) − (G_L + G_R)² / (H_L + H_R + λ) ],

where G_L, H_L and G_R, H_R are the sums of the gradients and Hessians of the cost function over the observations in I_L and I_R.
If a split occurs, the new value q_new of the quantile within each subvolume is obtained by adding the following value to the old value q_old of the quantile of the larger volume:

−η G / (H + λ),

where G and H are the sums of the gradients and Hessians over the observations in that subvolume.
The parameters λ and η are respectively the regularization and shrinkage parameters, which I do not further discuss here.
What matters here are the values of the gradient and the Hessian of the quantile cost function. Per observation, these are, respectively:

gradient: −α if x_i > q, and 1 − α if x_i < q;
Hessian: 0 for all x_i ≠ q (with a spike at x_i = q).
In the figure below, we see plots of the gradient and the Hessian as functions of x_i − q. As we can see, the gradient is a step function, while the Hessian is zero everywhere except at the origin, where it is infinite.
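If you want to reproduce these plots yourself, here is a minimal sketch (mine, not the original plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

alpha = 0.95
diff = np.linspace(-2, 2, 1000)                  # x_i - q

grad = np.where(diff > 0, -alpha, 1 - alpha)     # step function
hess = np.zeros_like(diff)                       # zero everywhere except the origin

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(diff, grad); ax1.set_title("Gradient"); ax1.set_xlabel("x_i - q")
ax2.plot(diff, hess); ax2.set_title("Hessian"); ax2.set_xlabel("x_i - q")
plt.tight_layout()
plt.show()
```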
Looking back at the score function L_split for splitting, we see a few problems.
Let us look at the case where the quantile value q_old is relatively far from the observed values within the partition. Unless we are lucky, this is the case at the start of boosting. The gradient and Hessian are then both constant over a large range of differences x_i − q. This implies that the split gain is independent of any feature. Even worse, the optimal split gain tells us that any split is sufficient, as long as it splits the data volume in half! When looking for a split, XGBoost will attempt to find the feature with the optimal split. As all features yield the same split gain, i.e. half of the data in the node goes left and the other half right, no split will occur.
If we use the gradient and Hessian shown above, or even use smoothed variants, very few splits will occur and the model will underperform.
An interesting solution is to force a split by adding randomization to the gradient. When the differences between the observations x_i and the old quantile estimate q_old within a partition are large, this randomization will force a random split of this volume.
This being said, I propose the following smoothed variants of the gradient and Hessian,
with random variable ρ. For this experiment, I chose ρ to simply take one of the two values s or −s with equal probability, for some parameter s.
The Hessian:
In code:
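A minimal sketch of such a custom objective is shown below. The parameter names delta (the half-width of the smoothing band) and s (the magnitude of the randomization) are illustrative, the small constant used for the Hessian outside the band is an assumption, and the hyperparameters are placeholders; the notebook linked at the end has the full version.

```python
import numpy as np
from xgboost import XGBRegressor

def make_quantile_objective(alpha, delta=1.0, s=1.0, seed=0):
    # Custom objective for the alpha-quantile: a smoothed gradient plus a
    # randomized +/- s component that forces splits when the gradient
    # would otherwise be constant. delta, s and the Hessian constant are
    # illustrative choices, not the values from the original gist.
    rng = np.random.default_rng(seed)

    def objective(y_true, y_pred):
        # sklearn-API signature: (y_true, y_pred) -> (grad, hess)
        err = y_true - y_pred                      # x_i - q
        outside = np.abs(err) >= delta

        # smoothed gradient: linear between (1 - alpha) and -alpha inside the band
        grad = (0.5 - alpha) - err / (2.0 * delta)
        grad[outside & (err > 0)] = -alpha
        grad[outside & (err <= 0)] = 1.0 - alpha

        # randomization rho = +/- s with equal probability, applied far from the estimate
        rho = s * rng.choice([-1.0, 1.0], size=err.shape)
        grad = grad + rho * outside

        # smoothed Hessian: positive inside the band, a small constant outside
        hess = np.where(outside, 1e-6, 1.0 / (2.0 * delta))
        return grad, hess

    return objective

# usage: a 95%-quantile regressor (hyperparameters are placeholders)
model_95 = XGBRegressor(objective=make_quantile_objective(alpha=0.95),
                        n_estimators=250, max_depth=3, learning_rate=0.05)
```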
We apply this code to the data generated with the code below, also used in the sklearn example: http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_quantile.html
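That example generates noisy observations of f(x) = x·sin(x); a data-generation snippet along those lines (paraphrased from the linked example, not the original gist) looks like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def f(x):
    # the function to predict, as in the sklearn example
    return x * np.sin(x)

# noisy, heteroscedastic observations of f on [0, 10]
X = np.sort(rng.uniform(0, 10.0, size=1000)).reshape(-1, 1)
y = f(X).ravel() + rng.normal(0.0, 1.5 + 1.0 * rng.random(X.shape[0]))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```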
A comparative result for the 90%-prediction interval, calculated from the 95%- and 5%- quantiles, between sklearn’s GradientBoostingRegressor and our customized XGBRegressor is shown in the figure below.
For the 95%-quantile I used the parameter values
and for the 5%-quantile, I used
which were found by grid search. When scoring on the test set using the quantile cost function, it was found that both the 95%- and 5%-quantile models of the customized XGBoost were superior.
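To reproduce such a comparison, here is a sketch that reuses make_quantile_objective and quantile_loss from the snippets above; the hyperparameters are placeholders rather than the grid-searched values:

```python
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

results = {}
for alpha in (0.05, 0.95):
    # scikit-learn's built-in quantile loss
    gbr = GradientBoostingRegressor(loss="quantile", alpha=alpha, n_estimators=250,
                                    max_depth=3, learning_rate=0.05).fit(X_train, y_train)
    # the customized XGBoost objective from the sketch above
    xgb = XGBRegressor(objective=make_quantile_objective(alpha=alpha), n_estimators=250,
                       max_depth=3, learning_rate=0.05).fit(X_train, y_train)

    # lower is better: average quantile (pinball) cost on the held-out test set
    results[alpha] = {
        "sklearn": quantile_loss(gbr.predict(X_test), y_test, alpha) / len(y_test),
        "xgboost": quantile_loss(xgb.predict(X_test), y_test, alpha) / len(y_test),
    }

print(results)
```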
Check out the Colaboratory notebook HERE!
