From Sequential to Parallel, a story about Bayesian Hyperparameter Optimization

Andres Asaravicius
Riskified Tech
7 min read · Apr 26, 2021


In 2017, Riskified created its first training platform. Its goal was to manage the full model life cycle, from dataset creation to model evaluation. My role in this project was to write the training component.

The platform was a great success and helped the research team and Riskified grow. However, the Riskified of 2017 was not the Riskified of 2019: what took us a month to train in 2017, we could train in a single day by 2019.

In 2019, Riskified decided to take its training platform to the next level and created a multidisciplinary team to make it happen. The team included 2 software engineers, a data engineer, and myself as the data scientist.

The task force’s goal was to create a stable and scalable model training platform for research. One of our success criteria was to reduce the full training process to just hours.
This was a big task to take on because it included data selection, preprocessing, model training with hyperparameter tuning, and model validation. We moved the whole process to Spark, using Scala and Python and running on Kubernetes, which dramatically reduced the runtime. However, we still had a significant bottleneck in the Hyperparameter Search (HPS) process.

The Hyperparameter Search (HPS) Challenge

If you are a data scientist, you know what a time-consuming headache HPS can be; but for those readers who don’t know… Just imagine that for each final model you would like to train, you have to train dozens or hundreds of intermediate models.
You also need to be sure that any improvement is not down to chance, so you need cross-validation, which requires training additional models. This step is essential for the accuracy of the final model.
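To make the cost concrete, here is a minimal sketch of what a single HPS trial looks like once cross-validation is involved. The model, dataset, and fold count below are illustrative, not our actual setup:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def objective(learning_rate, max_depth):
    # One HPS trial: with 5-fold cross-validation, evaluating a single
    # candidate already costs 5 model fits
    model = GradientBoostingClassifier(
        learning_rate=learning_rate, max_depth=max_depth, random_state=0
    )
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

print(objective(learning_rate=0.1, max_depth=3))
# 100 candidate parameter sets x 5 folds = 500 model fits for one final model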

The previous version of the HPS could take between 12 hours and 2 days (and some greedy tests could run for more than a week). This may sound reasonable, considering that Riskified analyzes millions of orders per day. However, long model training times affect your work and your ability to get things done.

Long research cycles mean we can test fewer things, and the fewer things we test, the less we innovate, which at the end of the day translates into less accurate models.

Our problem was that we needed to reduce the run time of the HPS without degrading our models’ accuracy. Unlike most of the challenges we faced in the task force, this one couldn’t be solved by writing more efficient code or changing the platform’s infrastructure.

Bayesian Optimization

For the last few years, we have been using Bayesian optimization for our hyperparameter tuning. Adopting Bayesian optimization was a complete game-changer for us. The problem is that the same thing that makes Bayesian optimization efficient is also what makes it slow.

Bayesian hyperparameter optimization is a sequential process that trains and evaluates a model at each step, using a specific set of parameters. The algorithm uses the information from all the previous steps to choose the set of parameters to test in the next step/iteration. By doing this, the algorithm focuses the search on the most promising parameters. To really understand how this algorithm works, you can check out a fantastic paper by Agnihotri & Batra, “Exploring Bayesian Optimization”.

For us, it is enough to think of the Bayesian search as a black box that tells us the next set of parameters to test, given the model’s performance in previous trials.
This is a sequential process that tries to balance the exploration-exploitation tradeoff: sometimes the algorithm focuses the search on high-performing areas, and sometimes it prefers to explore new ones.
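In code, that black box is just an ask/tell loop. Below is a minimal sketch using scikit-optimize, purely for illustration (not necessarily what we run in production); the search space and toy objective are made up:

from skopt import Optimizer
from skopt.space import Integer, Real

def train_and_evaluate(params):
    learning_rate, max_depth = params
    # stand-in for training and cross-validating one model; returns a loss
    return (learning_rate - 0.1) ** 2 + (max_depth - 5) ** 2

opt = Optimizer(
    dimensions=[Real(0.01, 0.3), Integer(2, 10)],
    base_estimator="GP",  # Gaussian process surrogate
    random_state=0,
)

for step in range(30):
    params = opt.ask()                 # the black box suggests a candidate
    loss = train_and_evaluate(params)  # train one model with it
    opt.tell(params, loss)             # report back; the surrogate updates

print(min(opt.yi))  # best loss found so far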

The goal of this process is to find the best set of parameters, and our goal was to reduce the run time the algorithm needs to find them. But how could we reduce the run time without messing up the algorithm?

The intuitive option is to try to improve/optimize the Bayesian model and the parameter selection, but we didn’t have time for in-depth research, and at that moment we thought: what would happen if we trained two models in parallel instead of one per step?

A change of perspective: from Sequential to Parallel

Our idea seemed counterintuitive at first, because by training two models simultaneously we would make the Bayesian search less efficient, which doesn’t sound like a good idea when you are trying to reduce the running time of the process.

Nevertheless, we had hope, because this change would make the process much easier to parallelize; if the benefit of the parallelization outweighed the cost of the inefficiency it introduced, we would achieve our goal.

Our proposal tries to increase the information generated at each step, but before we deep-dive into this idea, we need to talk about how the Bayesian model chooses the next set of parameters to test. For that, we need to discuss the acquisition function.

The Bayesian model uses this function to choose the parameters to test: the maxima of the acquisition function represent the most promising sets of parameters to evaluate. The acquisition function is dynamic; it represents the expected performance of the model for each set of parameters, based on the information we have so far.

The graph shows the relation between the ground truth, the predicted distribution, and the acquisition function. Each new data point is selected based on the maximum of the acquisition function and updates our predictive model, moving it closer to the ground truth.
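As a concrete example, here is expected improvement, one of the most common acquisition functions, sketched by hand. Here mu and sigma are the surrogate model’s posterior mean and standard deviation for each candidate, and we assume we are maximizing the model’s score:

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    # EI(x) = E[max(f(x) - f_best - xi, 0)] under the surrogate's posterior
    sigma = np.maximum(sigma, 1e-12)  # guard against zero uncertainty
    z = (mu - best_so_far - xi) / sigma
    return (mu - best_so_far - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Three hypothetical candidates: the second has a high mean and high
# uncertainty, so the acquisition function peaks there
mu = np.array([0.80, 0.83, 0.78])
sigma = np.array([0.02, 0.05, 0.01])
ei = expected_improvement(mu, sigma, best_so_far=0.82)
print(ei.argmax())  # index of the most promising set of parameters

Note how the sigma term rewards uncertain regions: this is exactly the exploration side of the exploration-exploitation tradeoff.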

Given that the acquisition function is updated after each step, it’s more efficient to train one model, update the function, and pick the maximum of the new function as the parameters for the next model. This sequential process leads to the optimal solution with the fewest models trained. In the parallel process, by contrast, we choose the two most promising sets of parameters at the same moment, from the same acquisition function.

However, our goal was to solve the problem in the shortest time, not to solve it with the fewest models trained.

Let’s imagine that every model takes the same time to train, and that training two models in parallel takes as long as training one. Given these assumptions: do you agree that if I trained two models at each step instead of one, I would reduce the running time of the hyperparameter search?

Our thinking was that, after 4 steps, the parallel search would have trained 8 models instead of the 4 trained by the sequential search. In other words, after the same amount of time/steps, the parallel search would have significantly more information. We would be increasing the amount of information produced at each step, and this would help us find the optimal solution faster.

Quickness vs Efficiency

To test the idea, we had multiple machines/workers running, and we sent each one a set of parameters to train and evaluate. The Bayesian model defined the parameters for each model. When a machine finished training and evaluating its parameters, it reported the model’s performance back to the Bayesian model and, in response, received a new set of parameters to test.
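Here is a hedged sketch of that loop, again with scikit-optimize for illustration. Its ask/tell interface can propose a whole batch of points at once (a “constant liar” strategy keeps the batch diverse), and a local thread pool stands in for the separate machines we actually used:

from concurrent.futures import ThreadPoolExecutor

from skopt import Optimizer
from skopt.space import Integer, Real

def train_and_evaluate(params):
    learning_rate, max_depth = params
    # stand-in for training and evaluating one model on a worker
    return (learning_rate - 0.1) ** 2 + (max_depth - 5) ** 2

opt = Optimizer(
    dimensions=[Real(0.01, 0.3), Integer(2, 10)],
    base_estimator="GP",
    random_state=0,
)

with ThreadPoolExecutor(max_workers=3) as pool:
    for step in range(10):
        # propose 3 candidates from the current acquisition function
        # without waiting for each result in between
        batch = opt.ask(n_points=3, strategy="cl_min")
        losses = list(pool.map(train_and_evaluate, batch))
        opt.tell(batch, losses)  # one surrogate update per batch

print(min(opt.yi))  # best loss after 10 steps x 3 models = 30 models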

How many models you should train in parallel is a big question, and I think the answer depends on your data and type of model. We found that training 3 models in parallel was a good balance between speed and efficiency: the performance improvement from adding another model didn’t justify the rise in the cost of the full process.

Conclusion

By training 3 models in parallel, we reduced the overall running time to less than a third of the sequential approach’s.

Yet nothing in life is free, and running the HPS in parallel costs 50% more than the sequential HPS; this is a result of the inefficiency we introduce and the larger total number of models trained. Nevertheless, it is a cost we are more than happy to pay to reduce the run time.

The plot shows the run time of two different models for our two different use cases, one using the parallel process and the other using the sequential process.

Parallelizing the HPS helps us reduce the running time dramatically, but more importantly, I think it creates a framework that can be scaled and adapted to the needs and urgency of the research. This flexibility is crucial for a training platform that serves multiple products and research types.

I hope this blog post helped you see the benefit of using a parallel Bayesian hyperparameter search and gave you a clearer understanding of the process.
