Building a time series forecasting model to predict future annual leave

Tomonori Masui
Published in Sage AI · Jan 11, 2023

Sage offers an HR management and automation solution called SageHR for small business owners and companies. On that platform, users can easily submit and approve time off requests, and HR managers can manage leave policies and PTO accruals with minimal effort.

SageHR UI image

In our customer research, many business owners expressed difficulty in managing their daily business with employees taking unexpected time off. To help them better plan their day-to-day operations, Sage AI developed a machine learning model that forecasts companies’ time off and embedded it in the SageHR product to display the forecasts to users. In this blog post, we will talk about our modeling approach and challenges.

Modeling Approach

SageHR has thousands of customers. Many of them are small companies with around 40 employees on average. Each has a different historical pattern in their time off (some examples are shown in the figure below).

Company time off series examples

As there are thousands of companies, developing one model per company was not a feasible option. Instead, we built a global model that can make a forecast for all of the companies at once. The model is trained on the entire dataset comprising all of these companies, but it is capable of learning individual time series patterns for each company. This global model also cross-learns any shared time off patterns across companies, which gives us better predictive performance.
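To make the global-model idea concrete, here is a minimal sketch. It assumes a tidy weekly frame weekly_time_off with company_id, week, and time_off_hours columns (illustrative names, not our production schema): every company’s history is stacked into one training set, lag features are built per company, and the company identifier itself becomes a feature so a single gradient boosting model can learn company-specific patterns.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def build_global_frame(weekly: pd.DataFrame, n_lags: int = 4) -> pd.DataFrame:
    """weekly: one row per (company_id, week) with a time_off_hours column."""
    weekly = weekly.sort_values(["company_id", "week"]).copy()
    # Lag features are computed within each company's own history.
    for lag in range(1, n_lags + 1):
        weekly[f"lag_{lag}"] = weekly.groupby("company_id")["time_off_hours"].shift(lag)
    # Encode the company identifier so the tree model can split on it.
    weekly["company_code"] = pd.factorize(weekly["company_id"])[0]
    return weekly.dropna()

# weekly_time_off is an assumed input frame for this sketch.
frame = build_global_frame(weekly_time_off)
feature_cols = ["company_code"] + [f"lag_{i}" for i in range(1, 5)]
global_model = GradientBoostingRegressor()
global_model.fit(frame[feature_cols], frame["time_off_hours"])
```

Because tree ensembles can split on the encoded company id, this is one simple way to let a single model behave differently for each company while still sharing patterns across all of them.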

Based on the business requirements, we defined our problem as predicting each company’s weekly time off over the next three months. Forecasting multiple future periods of a time series is known as a multi-step forecasting problem. As regular machine learning models can only predict one step ahead, we needed a strategy to deal with the multi-step problem. As the Machine Learning Group at the Université Libre de Bruxelles described in this paper, there are a number of popular approaches, including:

Recursive approach

The one-step-ahead prediction is fed back as an input feature for the next step’s prediction. This continues recursively until the model predicts the final step, so the same single model is used for every period in the forecasting horizon. This simplifies the modeling process, but it hurts prediction quality, as the errors made in the early steps propagate to the later steps, especially over a longer forecasting horizon. A minimal sketch of this feedback loop is shown after the figure below.

Direct approach

Multiple models are developed, one per forecasting period, and each model is responsible only for its own prediction period. The same features may be shared across all of the models, but the targets differ. Modeling each prediction period separately resolves the drawbacks of the recursive approach. However, it increases the model management cost in production, as you need to train and deploy as many models as there are steps in your forecasting horizon.

Multiple output approach

A single model predicts all of the forecasting periods at once. This helps the model capture stochastic dependencies across target values (i.e. correlations between the targets at different steps) that the direct approach ignores. However, only a limited set of model types supports multiple outputs, such as random forests or neural networks.

Different approaches for multi-step forecasting problem
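As a concrete illustration of the recursive strategy, here is a minimal sketch. The one-step model and the newest-first lag vector are assumptions for illustration: each prediction is pushed back into the lag window and reused as input for the next step, which is exactly where early errors start to compound.

```python
import numpy as np

def recursive_forecast(model, last_lags: np.ndarray, horizon: int) -> np.ndarray:
    """last_lags: the most recent observations, newest first (lag_1, ..., lag_k)."""
    lags = last_lags.astype(float).copy()
    preds = []
    for _ in range(horizon):
        # Predict one step ahead with the single one-step model.
        y_hat = float(model.predict(lags.reshape(1, -1))[0])
        preds.append(y_hat)
        # Feed the prediction back: it becomes lag_1 for the next step.
        lags = np.concatenate(([y_hat], lags[:-1]))
    return np.array(preds)
```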

We experimented with each of these approaches using different types of algorithms (such as ARIMA, Prophet, random forests, gradient boosting trees, neural networks, and other deep learning models), and the direct approach with gradient boosting trees performed best. While the direct approach has some redundancy in its implementation, it captures target patterns better than the other approaches, as each model focuses on only one specific period in the forecasting horizon.
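The following minimal sketch shows what the direct strategy looks like in practice, reusing the illustrative schema from the earlier global-model sketch; the 13-week horizon is simply an assumption standing in for "roughly three months of weekly forecasts". One gradient boosting model is trained per step, each on the same features but with the target shifted that many weeks ahead.

```python
from sklearn.ensemble import GradientBoostingRegressor

HORIZON_WEEKS = 13  # roughly three months of weekly forecasts

def fit_direct_models(frame, feature_cols, target_col="time_off_hours"):
    """Train one gradient boosting model per forecasting step (direct strategy)."""
    models = {}
    for h in range(1, HORIZON_WEEKS + 1):
        step = frame.copy()
        # The step-h target is the value h weeks ahead of each training row.
        step["target_h"] = step.groupby("company_id")[target_col].shift(-h)
        step = step.dropna(subset=["target_h"])
        model = GradientBoostingRegressor()
        model.fit(step[feature_cols], step["target_h"])
        models[h] = model
    return models

def predict_horizon(models, latest_rows, feature_cols):
    # One prediction per step, each coming from its dedicated model.
    return {h: m.predict(latest_rows[feature_cols]) for h, m in models.items()}
```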

Challenges

One of the challenges of time series modeling is training a model without unintentionally using any future information. The predictive time off problem we tackled came with a unique twist: the data has two different time axes, the date the time off is taken and the date the time off is logged in the system. These two time axes complicate the problem. While we are interested in predicting time off in the future, part of it is already logged in the system, so what we really want to predict is future time off that is not yet logged. We also cannot assume that all of the time off taken in the past has already been logged at the moment of prediction, as some of it will only be logged in the future. For example, an employee who suddenly got sick and took sick leave last week may not have logged it in the system yet.

Complications of time off availability and model target

Regular time series algorithms use past data to predict the future, but that does not directly apply to our problem: to the model, “past data” is not the time off taken in the past but the time off that was logged in the system in the past. Our analysis showed that 28% of time off is logged after the fact (see the plot below). It also showed that if we used all of the time off taken before the prediction moment for model training, the model’s predictive performance would degrade by 50% in production.

Days in advance (time off start date - time off created date) distribution

Our strategy to tackle this problem was to define a two-dimensional time series and then design our model target and features along those two dimensions. The picture below shows how we defined them: the left image defines the model target, and the right one defines the model features. Both images have the week the time off is taken on the x-axis and the week the time off is logged on the y-axis.

Model target and features design

Assuming today is the end of week t₀, the red rectangle in the left image is our model’s prediction target for the first week of the forecasting horizon. The gray rectangles are the training targets, defined in the same way as the prediction target. The light-yellow area in the right image is all of the time off logged before today, which is available for the model features. Our experiments showed that the most important time off records for the model features are the ones logged in the same week as the time off itself (the deep-yellow rectangles). This is because most time off is logged around its start date (see the earlier days-in-advance distribution plot). This target and feature design helped us train a leak-free model.
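The sketch below shows how a leak-free training example could be built from the two time axes, assuming a raw requests table with illustrative company_id, time_off_week, logged_week, and hours columns: the feature aggregates only look at rows logged on or before the cutoff, while the target aggregates the time off falling in the target week. The exact rectangle boundaries in our design follow the figure above; the code simplifies them for readability.

```python
import pandas as pd

def make_training_example(records: pd.DataFrame, cutoff_week, target_week):
    """records: one row per time off request with company_id, time_off_week,
    logged_week and hours columns; cutoff_week plays the role of 'today' (t0)."""
    # Feature side: only requests that were visible in the system at the cutoff.
    visible = records[records["logged_week"] <= cutoff_week]
    # The strongest signal in our experiments: time off logged in the same week
    # it was taken (the deep-yellow diagonal in the figure above).
    same_week = visible[visible["logged_week"] == visible["time_off_week"]]
    features = (
        same_week.groupby("company_id")["hours"].sum()
        .rename("same_week_logged_hours")
    )
    # Target side: time off falling in the target week (the red rectangle),
    # shown here as the week's full total for simplicity.
    target = (
        records[records["time_off_week"] == target_week]
        .groupby("company_id")["hours"].sum()
        .rename("target_hours")
    )
    return features.to_frame().join(target, how="outer").fillna(0.0)
```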

Result

The model prediction error (mean absolute error) by company size is summarized in the table below.

Model prediction error by company size

For those who are not familiar with mean absolute error, it is the absolute difference between the predicted value ŷᵢ and the actual value yᵢ, averaged over all n companies.
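Written out, with ŷᵢ the predicted and yᵢ the actual weekly time off for company i:

MAE = (1/n) · Σᵢ₌₁ⁿ |yᵢ − ŷᵢ|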

As you can see in the table, the target value grows as company size increases, and the prediction error grows with it, since bigger companies have more variance in their weekly time off. So far, the prediction error is not negligible compared to the model target values; time off involves a considerable amount of uncertainty that makes it hard for the model to predict accurately.

The image below shows how the time off forecast is displayed in the UI.

Time off forecast displayed in UI

Wrapping up

In this blog post, we walked through how we developed the predictive time off model in SageHR. The project is still in its early stages, and we are currently collecting customer feedback on the product, which will help us improve the model and provide a better customer experience. At Sage AI, we are building a broad spectrum of machine learning models to improve the customer experience across Sage software products, from outlier detection in general ledger journals to machine assistants in intelligent timesheets.
