Bayesian Optimization for Hyperparameter Tuning using Spell

Nikhil Bhatia
8 min read · Feb 1, 2019

Spell has recently gained significant traction as a service that allows anyone to access GPUs and ML tools previously only available to the largest tech companies. Spell’s command line interface (CLI) provides users with a suite of tools to run deep learning models on powerful hardware. While these default tools already make running models as easy as typing spell run python mnist.py, one of Spell’s most versatile offerings is the ability for users to create custom deep learning Workflows. Spell Workflows allow users to fully automate complex machine learning applications that often require multi-stage pipelines (e.g., data refinement, training, testing).

By the end of this blog post, readers will understand what hyperparameter tuning is, how Bayesian optimization can be used to efficiently tune a model, and how to implement Bayesian optimization for hyperparameter tuning using a Spell Workflow (full implementation can be found in the Spell examples repository).

Hyperparameter Tuning

First, let’s understand what hyperparameters are and how they are tuned. In machine learning, the training process is governed by three categories of data.

  1. Input data (or training data) contains the features your model learns from in order to make accurate predictions.
  2. Parameters (or model weights) are the variables your model adjusts during training; they determine the effect each neuron has on the model’s final prediction.
  3. Hyperparameters are the configuration variables that govern the training process itself, such as batch size, number of hidden layers, and nodes per layer. They remain constant for the duration of a single training job (see the short sketch below).
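
To make the distinction concrete, here is a small illustrative sketch; the values are hypothetical, but they mirror the CIFAR example used later in this post.

# hyperparameters: chosen before training starts and held fixed for one
# training job; these are the values we will tune across many runs
hyperparameters = {
    'batch-size': 32,       # examples per gradient update
    'learning-rate': 0.1,   # optimizer step size
}
# the input data is what the model trains on, and the parameters (model
# weights) are learned automatically from that data during training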

Tuning these hyperparameters over the course of many training runs is essential to helping a model reach optimal predictive accuracy. Unfortunately, this process can be notoriously hard to get right given the myriad possible hyperparameter configurations. Common methods for hyperparameter tuning, including Grid Search and Random Search, either exhaustively or randomly test combinations of hyperparameters to find an optimal configuration. While Spell offers Grid and Random Search as part of its suite of ML tools, these methods can be slow and quickly become infeasible at higher dimensions.
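
For intuition about why these methods scale poorly, here is a tiny sketch (with hypothetical candidate values) showing how quickly a grid blows up as you add hyperparameters:

from itertools import product

# hypothetical candidate values for three hyperparameters
batch_sizes = [32, 64, 128, 256]
learning_rates = [0.1, 0.2, 0.3, 0.4, 0.5]
dropout_rates = [0.3, 0.4, 0.5]

# grid search trains one model per combination; random search samples from
# the same space but has no memory of previous results
grid = list(product(batch_sizes, learning_rates, dropout_rates))
print(len(grid))  # 60 full training runs for just three hyperparameters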

Bayesian optimization, a more complex hyperparameter tuning method, has recently gained traction as it can find optimal configurations over continuous hyperparameter ranges in a minimal number of training iterations. Bayesian optimization addresses the pitfalls of the two aforementioned search methods by incorporating a “belief” of what the solution space looks like, and learning from each of the hyperparameter configurations it evaluates. This process reduces the number of times the model needs to be evaluated and only considers the most promising hyperparameters based on prior model runs.

Bayesian Optimization

So how exactly does Bayesian optimization accomplish this uniquely difficult task? Bayesian optimization works by constructing a posterior distribution over functions (a Gaussian process) that describes the objective we care about: the mapping from a hyperparameter configuration to our deep learning model’s performance. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions in the hyperparameter space are worth exploring and which are not.

At each iteration, a Gaussian process is fitted to all known explored points, where an “explored point” is a tested hyperparameter configuration paired with its associated model output (e.g. validation accuracy, training loss, etc.). The optimizer then uses the posterior distribution and an exploration strategy such as Upper Confidence Bound (UCB) to determine the next hyperparameter configuration to explore. Rather than directly optimizing the expensive target function that maps hyperparameters to our output space, the optimizer maximizes an acquisition function: an inexpensive stand-in that is far cheaper to evaluate and maximize than the true target function. Below you can see iterations of this optimization process.

(Figure: successive iterations of the Bayesian optimization process; source)
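
To make the acquisition step more concrete, here is a minimal, self-contained sketch of a single iteration using a Gaussian process and the UCB rule. It uses scikit-learn purely for illustration; the optimizer package we use later in this post handles all of this internally.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# hyperparameter configurations we have already explored (1-D for clarity),
# e.g. learning rates and the validation accuracy each one produced
X_observed = np.array([[0.1], [0.25], [0.4]])
y_observed = np.array([0.62, 0.71, 0.68])

# fit a Gaussian process to the explored points
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_observed, y_observed)

# evaluate the UCB acquisition function, mu(x) + kappa * sigma(x),
# over a grid of candidate configurations
candidates = np.linspace(0.01, 0.5, 200).reshape(-1, 1)
mu, sigma = gp.predict(candidates, return_std=True)
ucb = mu + 2.5 * sigma

# the next configuration to test is the one that maximizes the acquisition
next_learning_rate = candidates[np.argmax(ucb)]
print(next_learning_rate)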

Bayesian Optimization in Spell

One of the many beauties of Spell is the flexibility to implement your own complex tools beyond the default product offerings. Now that we have a better understanding of what hyperparameter optimization is and how Bayesian optimization provides a method to find optimal hyperparameter configurations, I can delve into my implementation of Bayesian optimization for hyperparameter tuning using a Spell Workflow. By the end, you will be able to understand and utilize this workflow to optimize the hyperparameters for any of your own machine learning models!

Our implementation can be broken down into the following four parts.

  1. Using the Spell API
  2. Creating our model’s black box function
  3. Working with a Bayesian optimizer
  4. Parallelizing our implementation

Using the Spell API

Let’s start with one of the important building blocks in this workflow. In order to optimize our model’s hyperparameters we will need to train our model a number of times with a given set of hyperparameters, and Spell’s Python API provides an easy way to do so! For the purposes of this blog post, I will be using a Python CIFAR model that uses convolutional layers to classify images from the CIFAR dataset. To start a training iteration of this model, we just need the following lines of code to launch a run.

# given a set of dummy parameters, let's construct and run the
# following command using Spell:
# 'python cifar.py --batch-size 32 --learning-rate .1'
params = {'batch-size': 32, 'learning-rate': .1}
command = 'python cifar.py '
command += ' '.join(['--{0} {1}'.format(k, v) for (k, v) in params.items()])

# spawn a Spell run using the Spell API
run = client.runs.new(
    machine_type="K80",
    command=command,
    framework="tensorflow",
    # ... any additional run options ...
)

# wait until the run has completed
run.wait_status(client.runs.COMPLETE)
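
A quick aside: the snippet above assumes a client object has already been constructed at the top of the workflow script, along the lines of the sketch below. Treat the exact constructor as an assumption on my part and double-check it against the Spell Python API documentation for your version.

import spell.client

# inside a Spell workflow, a client can typically be created from the
# environment; verify this call against the Spell Python API docs
client = spell.client.from_environment()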

Simple enough! That is how we will run a training iteration of our model given a set of hyperparameters. Now you might be asking how we evaluate the success of our hyperparameters for a given training iteration. To do this, we specify a metric (e.g. validation loss, validation accuracy) that we will track using the Spell API.

# follow a user-specified metric and store the final value for the
# run in 'metric'
for m in run.metrics(metric_name='val_accuracy', follow=True):
    metric = m

We have now started a run using cifar.py with hyperparameters {'batch-size': 32, 'learning-rate': .1}, waited for the run to complete, and stored the corresponding validation accuracy for the run. Now let’s discuss the idea of encapsulating this model in a function that our Bayesian optimizer can use.

Creating a “black box function”

Bayesian optimizers are commonly applied outside of machine learning and thus require us to abstract the model we hope to optimize behind a black box function. Our black box function will accept an arbitrary set of hyperparameter keyword arguments (their chosen values for one specific run), start the run, and return the final metric value. We can then call this function with a chosen hyperparameter configuration whenever we want!

def black_box_function(**params):
    # launch a Spell run with the chosen hyperparameter configuration
    run = client.runs.new(
        machine_type="K80",
        command=command + parse(params),  # we know what this does!
        framework="tensorflow",
        # ... any additional run options ...
    )
    # wait until the run has completed
    run.wait_status(client.runs.COMPLETE)
    # follow the metric and return its final value
    for m in run.metrics(metric_name='val_accuracy', follow=True):
        metric = m
    return metric

As simple as that, our black box function is complete! Now let’s configure the Bayesian Optimizer and set it up to use our black box function.

Using the Bayesian optimizer

We will be using this implementation of a Bayesian optimizer (the bayes_opt package) for this Workflow, but any Bayesian optimizer will do the job! First, we’ll define the three general steps for each optimization iteration.

  1. Ask our optimizer for the next hyperparameter configuration to test
  2. Use our black box function to evaluate our model with this configuration
  3. Register the (configuration, metric result) pair with our optimizer

We then repeat the above three steps until either we are satisfied with our metric output, or until we hit a specifiable maximum number of iterations. Now let’s see how we use this optimizer in implementation.

# import our optimizer package
from bayes_opt import BayesianOptimization, UtilityFunction

# instantiate our optimizer with our black box function and the
# min/max bounds for each hyperparameter
optimizer = BayesianOptimization(
    f=black_box_function,
    pbounds={'batch-size': (32, 256), 'learning-rate': (.1, .5)},
)

# define a utility (acquisition) function for our optimizer to use
utility = UtilityFunction(kind="ucb", kappa=2.5, xi=0.0)

# ask our optimizer for the next configuration to test
next_point_to_probe = optimizer.suggest(utility)

# evaluate our model on the chosen hyperparameter configuration
result = black_box_function(**next_point_to_probe)

# register the results
optimizer.register(
    params=next_point_to_probe,
    target=result,
)

Just like that we’ve completed one iteration of: selecting a configuration to test, testing the chosen hyperparameters on our model, and registering the results with the optimizer. After each result is registered, the optimizer updates its internal posterior distribution so that the next suggested point takes the prior result into account. We can then use a for loop to repeat the above process as many times as we’d like, as in the sketch below. In the final subsection we’ll discuss how to parallelize this process to improve the efficiency of our hyperparameter tuning!
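
Put together, a purely sequential version of the tuning loop might look like the following sketch (the iteration budget here is arbitrary):

n_iterations = 10  # arbitrary budget; stop earlier if the metric is good enough
for _ in range(n_iterations):
    # 1. ask the optimizer for the next configuration to test
    next_point_to_probe = optimizer.suggest(utility)
    # 2. evaluate our model (one Spell run) with this configuration
    result = black_box_function(**next_point_to_probe)
    # 3. register the (configuration, metric result) pair
    optimizer.register(params=next_point_to_probe, target=result)

print(optimizer.max)  # best configuration and target value seen so far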

Parallelizing our Implementation

In the prior implementation we can see that this Bayesian hyperparameter tuning process runs sequentially: we retrieve a set of hyperparameter values to test, we test said hyperparameters, and then we log the result (rinse and repeat). However, if we’re training a more complex model, each testing step could take 12+ hours to fully train and evaluate a set of hyperparameters. Thus, we’d like to parallelize this process to allow us to run multiple instances of our model in parallel with different hyperparameter configurations.

However, before we start naively spinning up parallel runs, it is important to understand how our optimizer works. If we run three parallel runs, register those three results in sequence, and then request three new hyperparameter configurations from the optimizer, all three of the suggested configurations will be essentially identical. This is because the optimizer’s suggestions depend only on its current posterior, and the internally maximized acquisition function receives no new information in between the three requests for new configurations.
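
To see the problem concretely, nothing changes inside the optimizer between two back-to-back suggestions, so they land on (essentially) the same point:

# no result is registered between these two calls, so the Gaussian process
# posterior is unchanged and both calls maximize the same acquisition surface
point_a = optimizer.suggest(utility)
point_b = optimizer.suggest(utility)
print(point_a)
print(point_b)  # effectively identical to point_a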

To properly parallelize, we must maintain the invariant that we can only request a new configuration after a new point has been registered with the optimizer.

Let’s implement a class to maintain this invariant. This class will lock to ensure each parallel thread receives a configuration, tests it, registers the results, and immediately requests the next configuration without allowing other threads to interleave in between the last two steps.

from threading import Thread, Lock

lock = Lock()

class ParallelRun:
    def __init__(self):
        self.last_param = None
        self.last_output = None

    def iterate(self, optimizer, f):
        # hold the lock so register + suggest happen atomically
        lock.acquire()
        try:
            if self.last_output is not None:
                optimizer.register(params=self.last_param, target=self.last_output)
            self.last_param = optimizer.suggest(utility)
        finally:
            lock.release()
        # evaluate the model outside the lock so other threads can proceed
        self.last_output = f(**self.last_param)

Note that each instance of this class stores its last output, and only that same thread will register that output before requesting the next configuration. Furthermore, it is vital that we lock to ensure multiple threads cannot interleave when using a shared optimizer to register a result and request the next configuration.

Now let’s update our workflow with this ParallelRun class to run 10 iterations of hyperparameter tuning, each with 3 parallel runs (a total of 30 runs).

# set up 3 instances of our ParallelRun class
parallel_runs = []
for i in range(3):
    parallel_runs.append(ParallelRun())

for i in range(10):
    # create a thread for each ParallelRun that calls run.iterate()
    parallel_threads = [
        Thread(target=run.iterate, args=(optimizer, black_box_function))
        for run in parallel_runs
    ]
    for thread in parallel_threads:
        thread.start()
    for thread in parallel_threads:
        thread.join()

# register the final batch of results, since each thread only registers its
# previous result at the start of its next iteration
for run in parallel_runs:
    optimizer.register(params=run.last_param, target=run.last_output)

# our optimizer conveniently provides the best hyperparameter
# configuration out of the 30 runs
print(optimizer.max)

That’s it! We’ve successfully created a Spell Workflow that uses Bayesian optimization in a parallel fashion to tune hyperparameters for any deep learning model.

The full Spell Workflow can be found here.
