Evaluating the Error of Regression Models: Some Real-Life Challenges and Practical Tips

Shahar Cohen
Published in YellowBlog
Oct 2, 2017

In regression, we predict (or estimate) the numerical value of an unknown quantity, denoted by y, as a function of some explanatory features. The difference between the prediction and the actual value is the error. The error is a numeric random variable that may depend on the explanatory features.

There are a few common ways to estimate the performance of, and compare, several competing regression models, which differ in how they measure error. Probably the most common measures are Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). All three measures are straightforward to compute and understand. However, applying the right measure in real-life use cases is far from trivial, and often none of these measures can tell the entire story. In this post I describe some of the difficulties of evaluating the error of regression models, and suggest some practical tips on how to tackle them.

The main limitations of RMSE, MAE and MAPE

Here are two important limitations of RMSE, MAE and MAPE:

  1. Each of these measures is merely an average (or, in the case of RMSE, the square root of an average) of realizations of the test errors. The error is a numeric random variable, and you cannot grasp the entire behavior of a random variable from a single aggregation of observations. Moreover, the error is often highly skewed. When we predict skewed outcomes (like prices, incomes, user engagement, item sales and many more), the error will most probably be skewed too, meaning that in the majority of cases the error is very small, but there are relatively rare examples with extremely large errors. When the error is highly skewed, the average often says nothing.
  2. This might be the other side of the same coin, but it provides a different perspective. On average, all observations contribute equally to the total error, regardless of the actual value of y. But an error means something different when associated with different values of y. For small values of y (e.g., near zero), we can accept a relatively high percentage error, as long as the absolute error is small. For large values of y, we may accept larger absolute errors, but not high percentage errors.

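To make the three measures concrete, here is a minimal sketch of how they are computed with NumPy; the toy data below is invented for illustration and is not the post's dataset:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Square Error: square root of the average squared error.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    # Mean Absolute Error: average of the absolute errors.
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error (in %); undefined when y_true == 0.
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

# Toy skewed target: most values are small, a few are very large.
rng = np.random.default_rng(0)
y_true = rng.lognormal(mean=0.0, sigma=2.0, size=1000)
y_pred = y_true * rng.normal(1.0, 0.3, size=1000)  # ~30% multiplicative noise
```

Because all three measures are averages over the same error sample, a handful of extreme errors can dominate any of them, RMSE most of all.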
An Illustrative Example

The following plot describes the predictions of a simple regression tree, on a common regression task (in which the objective is to predict sales prices).

Each sale in the plot is a single point, where the x-axis describes the actual price and the y-axis describes the prediction of the regression tree. The tree has 6 leaves that separate the sales into price ranges almost perfectly. Since most of the sales are associated with very low prices, and due to the scale of the plot, the first 3 ranges cannot be distinguished. Such sale distributions may be found, for example, in online advertising applications.

Since the model is a regression tree, a single value (the leaf average) is associated with each price level, but the range of actual prices in each level differs. While in the low-price sales the range is limited, the highest leaf covers sales with actual prices between roughly 1,000 and 4,000. This means that most of the errors of the regression tree are expected to be quite small, with a few extreme outliers.
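A setup of this kind can be reproduced in spirit with scikit-learn; the data below is synthetic and only mimics the skewed-price scenario, not the author's actual dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
# Synthetic skewed "sale price" target: most sales cheap, a few expensive.
prices = rng.lognormal(mean=0.0, sigma=2.5, size=5000)
# A noisy feature that is informative about the price.
X = np.log(prices).reshape(-1, 1) + rng.normal(0.0, 0.1, size=(5000, 1))

# A tree limited to 6 leaves predicts one constant (the leaf average) per range.
tree = DecisionTreeRegressor(max_leaf_nodes=6).fit(X, prices)
preds = tree.predict(X)  # at most 6 distinct predicted values
```

Because each leaf predicts a single constant for a whole price range, the widest (highest-price) leaf is guaranteed to produce a few very large errors.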

The following plot describes the distribution of the error.

Most of the errors have a relatively small absolute value (near zero in most cases), but there are a few extreme errors. The MAE is 3.633, whereas the median absolute error is only 0.83. The very few large absolute errors significantly affect the overall measure. The RMSE in this case is 32.7(!!!), since it gives quadratic weight to the few extremely big errors. Clearly, this number says nothing about the typical expected error.
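The effect is easy to reproduce with hypothetical numbers chosen to loosely mimic those reported above:

```python
import numpy as np

# Hypothetical error sample: 990 small errors and 10 huge outliers.
errors = np.concatenate([np.full(990, 0.8), np.full(10, 300.0)])

mean_ae = np.mean(np.abs(errors))      # ~3.79, pulled up by the 10 outliers
median_ae = np.median(np.abs(errors))  # 0.8, the typical case
rmse = np.sqrt(np.mean(errors ** 2))   # ~30, since the outliers get squared
```

One percent of the observations is enough to push the mean far from the median, and to make the RMSE an order of magnitude larger still.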

Although there are some big errors, the model clearly differentiates well between low and high prices. The large errors are associated with large actual values of y. Predicting a price of 2,300 instead of an actual 1,800 may be acceptable, whereas predicting a price of 501 instead of an actual 1 is not, yet both cases contribute an absolute error of 500 to the aggregated measure. This suggests that measuring the percentage error might be a good idea.

The MAPE here, however, is 1,129%, which seems very bad. The maximal percentage error is over 3,000,000%, and it is obtained when y = 3.106549e-05 and the prediction is 1.003 (an absolute error of ~1). The median percentage error is 51%, and the average percentage error when y > 1 is only 32%.
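The sensitivity of MAPE to near-zero targets is easy to demonstrate with made-up values (these are not the post's actual data):

```python
import numpy as np

y_true = np.array([3e-5, 0.5, 1.0, 200.0, 1800.0])
y_pred = np.array([1.0,  0.6, 1.2, 230.0, 2300.0])

ape = np.abs((y_true - y_pred) / y_true) * 100.0  # percentage error per sample
mape_all = ape.mean()                  # dominated by the near-zero target
mape_y_gt_1 = ape[y_true > 1].mean()   # restricted to y > 1, far smaller
```

A single near-zero target with a modest absolute error of ~1 is enough to produce a percentage error in the millions and to render the overall MAPE meaningless.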

No single number can tell the entire story

Communicating performance measures to the business peers

While all the above peculiarities might be perfectly clear to data scientists (although we have met many data scientists falling into these pitfalls), it is extremely hard to communicate this complex reality to business peers. No one will accept an average error of 1,000%. No one will accept a mean absolute error of 3.5, knowing that most of the sales are closed at prices of less than 1. Failing to communicate the real meaning of a model's performance to the business peers is the reason POCs that could have been a huge success end with a no-go decision.

Practical tips on measuring the performance of regression models

Here are a few practical tips on measuring the performance of regression models.

  1. Really early in the process (before modeling the data, and even before preparing it for modeling), start a discussion with the business peers about what would make the project a business success. Then translate this business understanding into performance KPIs, and approve these KPIs with the customer. In many cases we find ourselves explaining to the customer the limitations of some suggested KPIs. This way, there are no surprises in the validation step, and you can use the right KPIs in the model training step.
  2. Never use a single static aggregation as the sole performance measure.
  3. Never use KPIs as black boxes. Remember that the error is a numeric random variable, and strive to learn as much as possible about its distribution (is it symmetric? Is it correlated with y? Does it have extreme values? Is it correlated with any of the explanatory features? Etc.). We often use the findings from this error analysis to improve the model (moving from linear to non-linear, building an ensemble or a composition of models, and so on).
  4. Always communicate the meaning of the performance measures to the business users, and help them use this information to tailor a better operational process (e.g., use the model only on specific pre-defined cases).
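The distribution checks suggested in tip 3 can be sketched as a small diagnostic helper; the statistics chosen here are one reasonable selection, not a standard:

```python
import numpy as np

def error_report(y_true, y_pred):
    # Summarize the error distribution beyond a single average.
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    abs_err = np.abs(err)
    return {
        "mean_abs_error": float(abs_err.mean()),
        "median_abs_error": float(np.median(abs_err)),      # the typical case
        "p95_abs_error": float(np.percentile(abs_err, 95)), # extreme values?
        "skewness": float(((err - err.mean()) ** 3).mean() / err.std() ** 3),
        "corr_with_y": float(np.corrcoef(err, y_true)[0, 1]),  # error vs. y
    }
```

Comparing the mean to the median, checking the tail percentile, and correlating the error with y would have flagged every pitfall in the example above before the model reached the business peers.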

Shahar Cohen
YellowBlog

Data science researcher and entrepreneur, helping companies to start up with AI.