Improving Forecasting Confidence
How we productionised our Session Forecasting tool
Code exists everywhere in Skyscanner and not always in the most obvious of places. Take our Session Forecasting service, for example: it began life as a small model project on a data scientist’s laptop. This is the story of how we turned that project into a robust and reliable production service.
What is Session Forecasting?
We forecast the number of user sessions on the website based on previous figures and rates of growth. Our financial teams use these forecasts to help shape budgets for the following year. The forecasts are broken down by dimensions such as region, channel, device and platform, so we can drill down and make more targeted analyses and predictions.
Backed by Skyscanner
The numbers these forecasts produce are directly linked to revenue and therefore heavily impact annual budgets throughout the entire business. We wanted the tool responsible for this to be protected with testing, measured with metrics to increase confidence and run centrally for reliability. These benefits would be just the start, as we set ourselves up for future automation and faster, more reliable development.
In order to get off the ground we needed the company to back us so that we had the time and funding to go and achieve our goals. This involved pitching to the board and highlighting the importance of this process. We understood that the new squad would have a finite lifetime (circa 6–12 months), so it was a good opportunity for those of us in the new squad to try something new.
Haruspex: (in ancient Rome) a religious official who interpreted omens by inspecting the entrails of sacrificial animals.
Thankfully, no animals are hurt in the generation of our forecasts. The project had been lovingly called “model” in its old form, so we decided to rebrand it as Haruspex. We quickly set to work breaking Haruspex down into two main areas of concern: data and model.
Our initial reaction was to dive straight in and begin refactoring the code. This involved limiting the number of data sources, consolidating queries so that they did more of the heavy lifting (much of which had been done in Python), and restructuring the code so that we could add tests.
Whilst we had begun to solve a number of the issues, it was clear that the priority lay in getting us to move faster. We therefore decided to target the following areas sequentially:
- Implement CI pipeline — we desperately needed to block merges with automated checks (even with the limited number of tests we had written).
- Achieve an acceptable test coverage — a handful of unit tests had been written as part of the initial cleanup, but not enough to protect us from breaking changes.
- Measuring model accuracy between changes — a manual process that was a bottleneck requiring a specialist in the team.
- Run forecasting in a live environment — kicking off from a local machine was chewing up resource time.
Continuous Integration Pipeline
Skyscanner’s default build tool is Drone. By adding a configuration file to our project we were able to specify the commands we wanted to run before accepting merges to our master branch. The setup was simple; however, there was a clash with the R implementation of the forecasting model, which held up completion because Drone could not complete builds.
We decided to replace our R implementation with Prophet, a third-party tool from Facebook, cutting out a lot of code in the project. Our pre-implementation evaluations had shown that Prophet’s forecasts tracked actual data very closely. This was a vast improvement on the old R (ARIMA) model, as shown below.
With these changes our forecast’s accuracy improved, our CI pipeline became live, and a chunk of tech debt was cleared.
When we inherited the project, there wasn’t a single unit test in place. Even though we created unit tests as we made code changes, we found we weren’t getting enough coverage. We had extremely low confidence in making changes to the project, and relied on a large amount of manual checking of the forecast numbers in what felt like a ‘hit and hope’ approach.
Since our unit test coverage was making slow progress, we decided to go for 100% coverage by writing component tests, with the understanding that they would be faster to write than unit tests but more costly to run. The components covered were the ones that fetched and processed data from external data sources. The tests we wrote for them ensured that the data retrieved was in the shape we expected. This meant that if someone changed the structure or, to an extent, the content of the data we were consuming, we would know right away whether it would impact our forecasts. We achieved this in the space of three weeks.
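As an illustration of what a shape-checking component test could look like, here is a minimal sketch. It is not the Haruspex test suite: the column names `date` and `sessions`, the expected types, the `shape_problems` helper and the `fetch_session_rows` stand-in are all assumptions made for the example (only region, channel, device and platform are dimensions mentioned above).

```python
from datetime import date

# Hypothetical schema: column name -> expected Python type.
# Only region/channel/device/platform come from the article; the rest is assumed.
EXPECTED_COLUMNS = {
    "date": date,
    "region": str,
    "channel": str,
    "device": str,
    "platform": str,
    "sessions": int,
}


def shape_problems(rows):
    """Return a list of problems found; an empty list means the shape is as expected."""
    if not rows:
        return ["no rows returned"]
    problems = []
    for index, row in enumerate(rows):
        missing = EXPECTED_COLUMNS.keys() - row.keys()
        unexpected = row.keys() - EXPECTED_COLUMNS.keys()
        if missing:
            problems.append(f"row {index}: missing columns {sorted(missing)}")
        if unexpected:
            problems.append(f"row {index}: unexpected columns {sorted(unexpected)}")
        for column, expected_type in EXPECTED_COLUMNS.items():
            if column in row and not isinstance(row[column], expected_type):
                problems.append(
                    f"row {index}: {column} is {type(row[column]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


def fetch_session_rows():
    """Stand-in for the real call to an external data source (hypothetical)."""
    return [
        {
            "date": date(2019, 1, 1),
            "region": "UK",
            "channel": "organic",
            "device": "mobile",
            "platform": "web",
            "sessions": 1200,
        }
    ]


def test_session_data_has_expected_shape():
    # A component test would call the real source here instead of the stand-in.
    assert shape_problems(fetch_session_rows()) == []
```

A test like this is slower than a unit test when it talks to a real data source, but it fails as soon as an upstream column is renamed, dropped or changes type.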
We needed the safety net of this coverage to give us the confidence to make changes. However, ideally, we would replace these component tests with unit tests, which is something we fed into smaller portions of work rather than trying to write in one go, as this would take much longer (not to mention the risk that would pose to our sanity!).
Model Evaluation Metrics
To bolster our confidence further in Haruspex, we added functions to compare forecast files from differing versions of the project. The idea here is to spot differing trends in the data, as realistically the numbers should be as close to equal as possible.
The procedure involves running a few commands to output Model Evaluation Metrics (MEM) files on both the current version and the last-known-good version of Haruspex. We then pass the two MEM files, one from each version of the project, into a MEM comparison function, which returns a report on how different the models are and whether we have crossed an unacceptable threshold. For models that fall in an unacceptable range, we know that something isn’t quite right with the new version and can take action.
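The comparison step described above could be sketched roughly as follows. The real MEM file format and threshold are not described in this post, so everything here is an assumption: the files are taken to be JSON mappings from a segment name to a single error metric (for example a MAPE expressed as a fraction), and `load_mem`, `compare_mem`, `crossed_threshold` and the 0.05 threshold are hypothetical.

```python
import json


def load_mem(path):
    """Load a MEM file, assumed here to be JSON of the form {segment: metric}."""
    with open(path) as handle:
        return json.load(handle)


def compare_mem(baseline, candidate, threshold=0.05):
    """Compare two {segment: metric} mappings segment by segment.

    A segment is flagged as unacceptable when the metric moved by more than
    `threshold` in either direction, or when it exists in only one version.
    """
    report = []
    for segment in sorted(baseline.keys() | candidate.keys()):
        old = baseline.get(segment)
        new = candidate.get(segment)
        if old is None or new is None:
            change, acceptable = None, False
        else:
            change = new - old
            acceptable = abs(change) <= threshold
        report.append(
            {
                "segment": segment,
                "baseline": old,
                "candidate": new,
                "change": change,
                "acceptable": acceptable,
            }
        )
    return report


def crossed_threshold(report):
    """True if any segment in the report is outside the acceptable range."""
    return any(not entry["acceptable"] for entry in report)
```

Under these assumptions, a run would load the last-known-good and current MEM files with `load_mem`, pass both to `compare_mem`, and treat `crossed_threshold(report)` returning `True` as the signal to investigate the new version.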
With the origins of the project being an exploratory data science venture, forecasts were traditionally produced on someone’s laptop. There was no production environment in which to run the forecast job. Having no production environment has a number of drawbacks, not least that it ties a local machine up for an hour while a forecast runs. If we want to make comparisons, double that time. That’s only for one forecast as well — if we want to run historical ones, we multiply further.
We decided it was time for us to push forecast runs up to AWS to save resource and time. By integrating a new command into Haruspex, we can now push runs straight up to AWS without tying up our local machines. Additionally, when the job completes we receive a handy Slack message with the completed forecast bundle file which is pushed up to S3.
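The notification at the end of a run could be wired up along these lines. This is only a sketch using the Python standard library and a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the message wording, and the choice to link to the S3 location rather than attach the bundle are assumptions, not a description of how Haruspex does it.

```python
import json
import urllib.request


def build_completion_message(bucket, key):
    """Build the Slack payload pointing at the forecast bundle in S3 (hypothetical wording)."""
    return {"text": f"Forecast run complete. Bundle: s3://{bucket}/{key}"}


def build_slack_request(webhook_url, message):
    """Wrap the payload in a POST request for a Slack incoming webhook."""
    body = json.dumps(message).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(webhook_url, bucket, key):
    """Send the completion message and return the HTTP status code."""
    request = build_slack_request(webhook_url, build_completion_message(bucket, key))
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the payload construction separate from the network call means the message format can be tested without contacting Slack.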
- For a project with no tests, component testing with broader tests was a fast route to coverage that worked well for us. Despite the overhead, it let us gain confidence more quickly than writing refined unit tests would have.
- Measuring makes all the difference. You don’t realise how blind you are until you start seeing metrics. Being able to compare versions of Haruspex meant we would never go backwards in terms of accuracy.
- Experimentation can give jackpot results. As we learned when we experimented with Facebook Prophet, the pay-off was a much more accurate model.
Our next goal is to fully automate our forecasts in Production at a daily level. This would eliminate the need for human interaction, effectively providing a central location for forecasts to be accessed. By automating, we can provide consumers of the forecasts with mechanisms to grab the data in the form that they need.
About the author
My name is David Johnson, a member of the Business Forecasting Squad based in Glasgow. The team focuses on providing Skyscanner with forecasts of app and web user sessions, striving for the accuracy and robustness of its service, Haruspex.
When I’m not forecasting I enjoy photography, football and spending time with my partner and our baby daughter Eilidh.