Crossing the Ts in our Data Set:

How Cross-Validation allowed us to measure the accuracy of our forecasting model

Marija Miljkovic
tymeshift
Nov 26, 2021


Imagine a tool that can predict the future, act upon those predictions, and then adjust itself to the unexpected. It sounds like a unicorn: something impossible, too good to be true.

We have been building that unicorn. We have a tool that is getting closer to it every day. We have been working on it for more than a year, and it is one of the most interesting and important features in our suite. We are constantly working on improvements by listening to our customers.

We call it Forecast.

It’s on the right track. Let me tell you how we know that.

Idea

I joined tymeshift as a Data Scientist recently (in September this year), and from the start I knew that one of my main tasks would be to improve Forecast. Although I have experience with forecasting models, I needed time to understand how the current model works, and even before that, to deeply understand the WFM (workforce management) domain. This is crucial in the process of upgrading the existing model. In other words, you should understand WHAT data is used for creating the model and HOW the end-user uses the model’s results.

In this case, the input data is a time series: the number of tickets coming in every 15 minutes. The output is a prediction of the number of incoming tickets for every 15-minute interval over some period in the future, for example a month. The end-user can use this information to manually create schedules for agents or to run automated scheduling.

An important thing to note here is that we should expect a lot of zeros (both in the input and in the predictions), given that there can be periods without any tickets for various reasons: non-working time, holidays, failures, lunch time, etc. Also, we anticipate that the numbers of incoming tickets won’t be large; we are talking about 15-minute periods, after all. That means our forecast should be very sensitive: even the smallest difference between the original and predicted value can be a big problem in percentage terms.

Keep this in mind; we’ll see later why it is so important.

Considering all of this, I asked myself: okay, I understand how this current forecast model works, but how good is it? If I change something, how would I know if the results are better (or worse)?

And that was the moment when the journey to find the most appropriate way of measuring forecast accuracy started!

Choosing the “right” metric

In pursuit of the most appropriate metric for measuring forecast accuracy, I came across many scientific papers and, in the end, came to the conclusion that there are several commonly used metrics, but also many variations of those main metrics. Knowing that there is no such thing as a “perfect metric”, I tried to find a couple of metrics most suitable for this particular case, based on a few criteria:

  1. Insensitivity to zeros
    One criterion is to have a metric that is not sensitive to zero values. If you remember, we mentioned that the data we used for forecasting could have lots of zeros. In that case, it might happen that some metrics could not be calculated or could give us infinite (very large) or undefined values. This limits us to using more robust metrics.
  2. Scale-free
    What is also important is to have a scale-independent metric. Since we want to improve our forecasting, we should have a metric that allows us to compare different hyperparameters of the model or even different models. That way, we can be sure that the results of our experiments will be reliable and will help us to find the best parameters/models.
  3. Interpretability
    Finally, we need a metric that can be easily interpreted and understood by our customers. At first, we thought it was okay to have something like “internal metrics”, great on the first two criteria but not so easy to interpret. These would be used just for comparing different experiments. Later, for customers, we would have other, more explainable metrics.

But that would obviously lead us to a mistake: we would optimize our models by one group of metrics and present results with completely different metrics, which might not be as representative. So, we decided to find a compromise between all three criteria and use a group of metrics that satisfies all of them to some extent.

Now, let’s see what we have chosen!

Standard Deviation

Standard deviation is a metric that shows how dispersed the data (in our case, the errors between original and predicted values) is in relation to the mean.

The lower the standard deviation, the closer the fit of the forecasted values to the original ones. The smallest possible deviation is 0, and that happens when every value in the data is the same. Another use of this metric is to apply it to the original values to see how volatile the original data is. If that number is high, we can expect large errors when forecasting.

Additionally, you can compare these two standard deviations. The smaller the standard deviation of errors is compared to the sample standard deviation, the more predictive, or useful, the model is.

It can be easily calculated in Python using the NumPy library:
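A minimal sketch of the calculation, assuming the actual and forecasted ticket counts are NumPy arrays (the function and variable names here are illustrative, not taken from our codebase):

    import numpy as np

    def error_std(actual: np.ndarray, forecast: np.ndarray) -> float:
        """Standard deviation of the forecast errors."""
        return float(np.std(actual - forecast))

    def sample_std(actual: np.ndarray) -> float:
        """Standard deviation (volatility) of the original values."""
        return float(np.std(actual))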

To show this in an example:
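As a rough illustration, the arrays below are hypothetical, so the printed numbers won’t exactly match the 3.56 and 2.94 discussed next (those come from our own data, which isn’t reproduced here):

    import numpy as np

    # Hypothetical 15-minute ticket counts and the corresponding forecasts.
    actual = np.array([0, 2, 5, 7, 4, 0, 3, 9, 6, 1])
    forecast = np.array([1, 4, 2, 10, 3, 0, 7, 5, 8, 2])

    errors = actual - forecast
    print("Standard deviation of errors:", round(float(np.std(errors)), 2))
    print("Sample standard deviation:   ", round(float(np.std(actual)), 2))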

Since standard deviation is in the same unit as the original data, this result means that the forecast errors are on average 3.56 tickets away from the mean of all errors. Whether that’s good enough depends on the objectives you want to achieve. Compared to the sample standard deviation (2.94), we can say that our model could be better, because the standard deviation of errors should be lower than the sample standard deviation.

The main disadvantage of this metric is that it is scale-dependent: we can’t use it to compare results across different datasets.

Mean Absolute Error (MAE)

Mean Absolute Error, as the name suggests, shows us how big an error we can expect from the forecast on average. It is one of the most commonly used forecast metrics, mostly because it’s very easy to interpret and it can be easily implemented in Python:
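A sketch of the calculation under the same assumptions as above (NumPy arrays of actual and forecasted counts; the names are illustrative):

    import numpy as np

    def mean_absolute_error(actual: np.ndarray, forecast: np.ndarray) -> float:
        """Average absolute difference between actual and forecasted values."""
        return float(np.mean(np.abs(actual - forecast)))

The same result can also be obtained with mean_absolute_error from sklearn.metrics.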

Of course, the lower the value of MAE, the better, because that means we are closer to the original values with our predictions.

Using the same data as for standard deviation, we can see the result for MAE:

This simply means that the absolute error between original and forecasted values is 3 tickets on average.

As far as shortcomings are concerned, again, this metric is scale-dependent. So if the MAE for another dataset is 2 tickets, for example, we cannot say that that model is better. Also, we don’t know the direction of the errors: we have no information about whether those 3 tickets are under- or over-forecasting.

And finally, we should be aware that MAE may understate big but infrequent errors: there may be a really big difference between original and predicted values at some point, but on average we will not see it.

Mean Absolute Percentage Error Stable (MAPE Stable)

Plain MAPE is another example of a well-known forecasting metric. Unlike the previous metrics, MAPE is not scale-dependent, so it’s very good for comparing the results of different models or datasets. Since it’s a percentage metric, it’s also very easy to interpret (the best possible value is 0, and the lower, the better).

Sounds perfect, right? Well, almost.

If you have data with lots of zeros, this metric is practically useless. And as we mentioned, that is exactly our case. We didn’t want to reject MAPE completely, but we thought about adjusting it. That’s how MAPE Stable was created.

The MAPE Stable implementation is the same as for MAPE, except that only intervals with non-zero actual values are taken into account:
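One way to implement it is to compute plain MAPE only over the intervals where the actual value is non-zero; a sketch along those lines:

    import numpy as np

    def mape_stable(actual: np.ndarray, forecast: np.ndarray) -> float:
        """MAPE computed only over intervals with a non-zero actual value."""
        mask = actual != 0
        return float(np.mean(np.abs((actual[mask] - forecast[mask]) / actual[mask])))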

Applied to our example:

This tells us that the average difference between the forecasted values and the actual values is 7.3%. Nice and simple.

One disadvantage of MAPE Stable is that we still don’t know the direction of the errors.

Forecast Bias Percentage

We decided to try this metric as well because none of the previous ones carries information about the direction of errors. Forecast Bias Percentage shows us just that: results above 100% mean over-forecasting, and results below 100% mean under-forecasting.

In our case, that would be:
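One common way to express forecast bias as a percentage is the total forecasted volume divided by the total actual volume, which matches the interpretation above; the sketch below assumes that definition:

    import numpy as np

    def forecast_bias_percentage(actual: np.ndarray, forecast: np.ndarray) -> float:
        """Ratio of total forecasted volume to total actual volume.
        Above 1 (100%) means over-forecasting; below 1 means under-forecasting."""
        return float(np.sum(forecast) / np.sum(actual))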

Obviously, since the result is more than 1 (or 100%), we are forecasting more than the original data.

What forecast bias percentage does not show is how much we are over- or under-forecasting, but that is something we can figure out with the rest of the chosen metrics.

NOTE: Be sure to remove outliers from the data before training the model! They can hurt not only the model quality but also the evaluation metrics, giving you misleading results. There are many techniques for doing that, which could be material for another blog post, but for starters, use the simplest one: the Z-score.
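As a rough starting point, a Z-score filter could look like the sketch below (the threshold of 3 is a common rule of thumb, not a prescribed value):

    import numpy as np

    def remove_outliers_zscore(values: np.ndarray, threshold: float = 3.0) -> np.ndarray:
        """Keep only the values whose Z-score is within the threshold."""
        mean, std = np.mean(values), np.std(values)
        if std == 0:
            return values  # no variation, nothing to remove
        z_scores = np.abs((values - mean) / std)
        return values[z_scores <= threshold]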

Cross-validation

After choosing a group of metrics that will give us a complete picture of how good our forecast is, we need to make this evaluation somehow consistent and independent of testing data.

What does that mean?

Let’s say we have a year of data for training, from 1st January to 31st December. Since we want to evaluate our model, we need to have a smaller set of data for testing, for example a month of data.

When working with time series, the order of the data is important. That means our data should be sorted in ascending order by time, and the last month of data should be used for testing. We can’t use data from the future to predict the past, right?

You may have noticed that, in our case, the last month of data is December. But, as you know, December is very specific in many industries because of holidays, so we can’t rely on results based on December alone. And if we always optimize our model using the same training data, we could end up overfitting to that particular split. That’s why we should use cross-validation.

“Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations.” (source: https://en.wikipedia.org/wiki/Cross-validation_(statistics))

There are different ways to perform cross-validation: Leave One Out (LOO), Leave P Out (LPO), ShuffleSplit, KFold, StratifiedKFold, Stratified Shuffle Split, GroupKFold, etc. But, since we have a time series, we should use something more tailored for time series.

Time Series cross-validation

In the case of time series, the test set should follow the training set because of the ordering of observations. That’s why we used the TimeSeriesSplit technique (from the scikit-learn Python library).

As you can see from the picture, the idea of TimeSeriesSplit is to always split the training data into two parts: one for training and the other for testing. One condition that must be satisfied is that the testing data always follows the training data.

We can add one additional useful condition: the size of the testing data. In our case, this will be a month of data. With this technique, we take the first split, train on data from 1st January to 31st January, and test on data from 1st February to 28th February. In split 2, the training data will be from 1st January to 28th February and the testing data from 1st March to 31st March, and so on, until we cover the whole dataset.

This approach has some flaws: across the splits, the model observes the patterns it will later be asked to forecast and can, in effect, memorize them. Still, it is far better than having no cross-validation at all, and the model will be less prone to overfitting.

Using TimeSeriesSplit in Python will look something like this:
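A minimal sketch of how the splitting and per-fold evaluation could be wired together; the data, the one-month test size, and the placeholder “model” are illustrative assumptions, not our production setup:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    # Hypothetical series: one ticket count per 15-minute interval, ordered by time.
    rng = np.random.default_rng(42)
    y = rng.poisson(lam=10, size=365 * 96).astype(float)

    # Each test fold covers roughly one month of 15-minute intervals (30 days * 96).
    tscv = TimeSeriesSplit(n_splits=5, test_size=30 * 96)

    fold_mae = []
    for train_index, test_index in tscv.split(y):
        y_train, y_test = y[train_index], y[test_index]

        # Placeholder "model": forecast every interval as the training mean.
        # In reality, the actual forecasting model would be fit on y_train here.
        forecast = np.full_like(y_test, y_train.mean())

        fold_mae.append(float(np.mean(np.abs(y_test - forecast))))

    print("Average MAE across folds:", round(float(np.mean(fold_mae)), 2))

Averaging each metric across the folds is what gives the consolidated numbers discussed below.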

If we use all the metrics mentioned in this article as our evaluation method, the result will look something like this:

So, these are the average evaluation results for all of the cross-validation iterations. This tells us what we can expect from our model.

You should be able to interpret these results by now, too. 😊

The first value is the standard deviation, or volatility, of the original values. It’s not a very high number, which tells us it is possible to have a reliable forecast. We can also see that it is higher than the standard deviation of the errors, which is a good sign. If you remember, the smaller the standard deviation of errors is compared to the sample standard deviation, the more useful the model is.

An MAE of 4.19 means that, on average, the absolute error between original and predicted values is 4.19 tickets. How good that is depends on industry standards and on the data, but knowing that the mean of the original data is 13 tickets, this is a pretty good result.

MAPE Stable is 25%, which tells us that the average difference between original and forecasted values is 25%. Hmm, this could be better, but still, we are not so far from the best possible MAPE Stable value, which is 0.

And finally, since the forecast bias percentage is 81%, which is below 100%, we know that we have a situation of under-forecasting.

Wrap up

The accuracy of a forecasting model really depends on multiple factors, and measuring it is not so straightforward. You can choose some metrics, but if you can’t interpret them easily and customers don’t understand them, you have nothing. Because of that, the process of choosing metrics and evaluating the model should be very carefully planned and implemented.

We need some time for that. But for now, we have a group of metrics that, combined with cross-validation, gives us reliable information about how good our forecasting model is.

Nevertheless, we can’t emphasize enough that perfect metrics don’t exist, and these metrics are not going to be the cat’s whiskers for everyone.

But at least they will give you a good starting point! 😊

About us:

We’re tymeshift, an effortless WFM solution made exclusively for Zendesk. We make managers’ lives easier with scheduling and forecasting tools, and agents’ lives easier with a perfectly intuitive Zendesk integration. Learn all there is about us and our product on tymeshift.com

🚨 By the way, we’re hiring! 🚨 So if you loved the article and would dig the chance of working alongside Marija and the team developing or putting to the test groundbreaking products like our Forecast Model: check out our openings and apply here.
