Improving a neural network’s performance with Bayesian Optimization

1. Introduction

When it comes to training a neural network, finding a good set of hyperparameters is not a trivial task. Since many of our projects at Logivan use neural networks in one way or another, we have tried several strategies to improve the performance of our models. Among those, Bayesian optimization appears to be an efficient choice most of the time, as it not only helps us find the vector of hyperparameters that results in a network with the lowest error but also markedly cuts the time spent on model tuning.

In this blog, we will (I) provide an overview of some popular hyperparameter tuning techniques, (II) go over some high-level mathematical concepts of Bayesian optimization, and (III) compare the performance of different hyperparameter tuning techniques with Bayesian optimization on a toy dataset.

1.1 Naive Grid Search and Randomized Search

Grid search and randomized search play an important role in hyperparameter tuning in the machine learning field. However, both methods have disadvantages that make the tuning process computationally expensive. In grid search, for example, we need to list a set of values that we think might be the right choices for our model and build a rectangular grid in which each point is a combination of the selected parameters. Then, through trial and error, we figure out which combination is the right one. Assume that you list out parameters for your model like this:

learning_rate = [0.1, 0.001, 1]
weight_decay = [0.7, 0.8, 0.9]

If, for example, learning_rate=1 is not a suitable choice, we will still evaluate the model with that learning rate against all three weight-decay values. For big-data and deep learning problems, this quickly adds up to a tremendous amount of time and computation.
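The waste is easy to quantify: with three candidates per hyperparameter, grid search must train nine models, a third of them anchored to the bad learning rate. A minimal sketch:

```python
from itertools import product

# Candidate values we guessed for each hyperparameter
learning_rate = [0.1, 0.001, 1]
weight_decay = [0.7, 0.8, 0.9]

# Grid search evaluates every combination, including the three
# that are stuck with the unsuitable learning_rate = 1
grid = list(product(learning_rate, weight_decay))
print(len(grid))  # 9 combinations to train and evaluate
```

With six hyperparameters at three candidates each, the same grid would already contain 3^6 = 729 combinations.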

Randomized search seems to be a better remedy, as it chooses the candidate points randomly according to each parameter’s distribution rather than from user-specified values. However, these random points may lie in regions of the parameter space that cannot improve the model’s performance, which can mean a long wait to find the best parameters. That’s where Bayesian statistics comes into the game.

1.2 Tuning with Bayesian style

The Bayesian optimization algorithm is an innovative step in hyperparameter tuning since it addresses the drawbacks of grid search and randomized search.

Assume that we have a set of parameters x and an objective function f. This objective function might return the loss value, accuracy, mean squared error, or anything else we attempt to maximize or minimize. f is expensive to evaluate, and we don’t know its closed form, its structure, or properties such as convexity, concavity, linearity, or the existence of first- or second-order derivatives, so f is essentially a black-box function. After we input the range of each parameter that needs to be tuned, the Bayesian algorithm initializes some random points x at which to evaluate f. It then uses a Gaussian process (GP) as a surrogate model (because we know nothing about the black-box function) to mimic the structure of that black-box function. Finally, the algorithm optimizes an acquisition function defined on the surrogate model to choose where to sample next in the parameter space. This process is iterated until a stopping condition or convergence is reached.
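To make these steps concrete, here is a minimal, self-contained sketch of the loop on a toy one-dimensional objective. It uses scikit-learn’s Gaussian process and the Expected Improvement acquisition; the kernel, candidate grid, and iteration counts are illustrative choices, not the blog’s actual settings:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy black-box objective we pretend is expensive (true optimum at x = 0.3)
def f(x):
    return -(x - 0.3) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(3, 1))   # step 1: a few random initial points
y = f(X).ravel()

for _ in range(10):
    # step 2: fit the GP surrogate to the evaluations so far
    gp = GaussianProcessRegressor(kernel=RBF(0.2), alpha=1e-6).fit(X, y)

    # step 3: maximize the EI acquisition over a grid of candidates
    cand = np.linspace(0, 1, 200).reshape(-1, 1)
    mu, sd = gp.predict(cand, return_std=True)
    z = (mu - y.max()) / np.maximum(sd, 1e-12)
    ei = (mu - y.max()) * norm.cdf(z) + sd * norm.pdf(z)
    x_next = cand[np.argmax(ei)]

    # step 4: evaluate the black-box function there and repeat
    X = np.vstack([X, [x_next]])
    y = np.append(y, f(x_next))

print(X[np.argmax(y), 0])  # close to the true optimum 0.3
```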

From the above steps, we first see some advantages of Bayesian Optimization algorithm:

1. The input is a range for each parameter, which is better than having to input specific points that we guess might boost model performance.

2. Candidate points are randomized to make sure our model does not spend too much time on bad parameters.

3. Bayesian optimization can balance exploration and exploitation: after exploring the parameter space, the algorithm can sample points where it believes the optimal value is located.

4. Instead of evaluating the black-box function directly, we work on a surrogate model that makes the optimization process easier and more efficient.

In the next part, we will walk through some of the mathematics behind the algorithm. Because this article is mainly for newcomers to the machine learning field, we will explain some parts of Bayesian inference and introduce the Gaussian process, the surrogate model for the black-box function we need to optimize. Finally, we introduce one of the most common acquisition functions, Expected Improvement (EI), which helps us find the next point to sample and speeds up the tuning process.

2. Mathematical Intuition

2.1 Problem Introduction

Bayesian optimization is a class of machine-learning-based optimization methods focused on solving the problem

max_{x ∈ A} f(x)

Here, we can think of:

  • f as the negative loss value (we want the loss to be as small as possible, so we maximize its negative), the negative mean squared error, the accuracy, etc.
  • x as a combination of the parameters we need to tune, with A as the parameter space.

Usually, f is expensive to evaluate and we lack information about its structure or properties. Therefore, we need a Gaussian process as a surrogate model for f.

Maybe you already know about Maximum Likelihood Estimation (MLE). Consider a data sample X whose distribution we want to identify. In MLE, we assume X follows a certain distribution with parameter θ, e.g. X ~ g(⋅|θ). Next, we compute the likelihood of X:

L(θ|X) = g(X|θ)

The MLE is

θ̂_MLE = argmax_θ L(θ|X)

Since we assume that X ~ g(⋅|θ), we want to estimate the θ that makes the distribution g(⋅|θ) best describe our sample X.
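As a tiny worked example (a hypothetical coin-flip sample, not from the blog), maximizing the Bernoulli likelihood over a grid of θ values recovers the sample mean, exactly as the theory promises:

```python
import numpy as np

# Toy sample: 10 coin flips assumed to follow Bernoulli(theta)
X = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Log-likelihood of the whole sample under Bernoulli(theta)
def log_likelihood(theta, X):
    return np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))

# Evaluate on a grid of candidate thetas and take the argmax: the MLE
thetas = np.linspace(0.01, 0.99, 99)
mle = thetas[np.argmax([log_likelihood(t, X) for t in thetas])]
print(round(mle, 2))  # 0.7, the sample mean, as expected
```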

Bayesian inference is based on Bayes’ formula:

p(θ|X) = p(X|θ) q(θ) / p(X)

In Bayesian inference, we have the data sample X and we want to find the distribution of the parameter θ given X. We assume θ ~ q, meaning that we have a little prior information about θ (like in a slot-machine game, where you can play a few times to gather a small sample). That is why we call the distribution q the prior and the distribution p(θ|X) the posterior.

So the output p(θ|X) is a distribution of θ given X. This output is richer than the point estimate from MLE because it contains much more information about the random variable: from it we can compute the expectation, the variance, or a point estimate by maximizing the posterior (MAP).
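Continuing the hypothetical coin-flip example, a conjugate Beta prior makes the posterior available in closed form, so we can read off the expectation, variance, and MAP estimate directly (the Beta(2, 2) prior is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import beta

X = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # same coin flips as above
a0, b0 = 2.0, 2.0  # Beta(2, 2) prior: a little information about theta

# Conjugate update: posterior is Beta(a0 + #heads, b0 + #tails)
a_post, b_post = a0 + X.sum(), b0 + (1 - X).sum()
posterior = beta(a_post, b_post)

print(posterior.mean())  # expectation of theta given X
print(posterior.var())   # remaining uncertainty about theta
print((a_post - 1) / (a_post + b_post - 2))  # MAP point estimate
```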

2.2 Gaussian Process

A GP is a Bayesian statistical approach to modeling functions. In this section, we give a brief introduction to GPs and use one as a surrogate model to describe the black-box function f.

We first randomly initialize some points x1:n, shorthand for x1, …, xn. Then we compute f at these points and collect the values into a vector

f(x1:n) = [f(x1), …, f(xn)]^T

We assume that this vector was drawn randomly from some prior probability distribution. A Gaussian process takes that prior distribution to be multivariate normal, which has two parameters: a mean vector and a covariance matrix. We construct the mean vector by evaluating a mean function m(x) at each xi, and the covariance matrix by evaluating a covariance function, or kernel, k at each pair of points. There are many ways to choose the mean function and the kernel, but that is another story that we will not discuss here.

So we’ve already built a prior on f(x1:n):

f(x1:n) ~ N(m(x1:n), K)

where

  • m(x1:n) = [m(x1), …, m(xn)]^T
  • K is an n×n matrix with K_ij = k(xi, xj)

From the above, we can see that a GP defines a prior over functions. From that prior, we can update to a posterior distribution according to Bayesian inference.
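A short sketch of such a prior, assuming a zero mean function and an RBF kernel (both common defaults, not choices made in this blog): a single draw from N(m(x1:n), K) is one possible set of function values at the chosen points.

```python
import numpy as np

# n input points at which we look at the prior
x = np.linspace(0, 1, 5)

def m(x):
    # Mean function: identically zero, a common default
    return np.zeros_like(x)

def k(xi, xj, ell=0.2):
    # RBF kernel as the covariance function
    return np.exp(-0.5 * (xi - xj) ** 2 / ell ** 2)

# n x n covariance matrix with K_ij = k(x_i, x_j)
K = k(x[:, None], x[None, :])

# One draw from the multivariate normal prior N(m(x), K) is one
# candidate "function" evaluated at the points x (jitter for stability)
rng = np.random.default_rng(0)
f_sample = rng.multivariate_normal(m(x), K + 1e-8 * np.eye(len(x)))
print(f_sample.shape)  # (5,)
```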

2.3 Acquisition Function

We can think of the acquisition function as an evaluation function that selects where to sample next based on the updated posterior. One of the most common choices is Expected Improvement (EI):

EI(x) = E[max(f(x) − f(x̂), 0)]

where x̂ is the best point observed so far. When f follows the Gaussian model, this expectation has a closed form:

EI(x) = (m(x) − f(x̂)) Φ(Z) + σ(x) φ(Z)

where

Z = (m(x) − f(x̂)) / σ(x)

and Φ, φ are the CDF and PDF of the standard normal distribution. Why does this matter? From the closed form of EI, we can conclude that EI is higher when:

  • (m(x) − f(x̂)) is higher, i.e. the Gaussian model expects that point to have a higher value on average (exploitation);
  • σ(x) is higher, i.e. that point carries a lot of uncertainty and needs to be explored (exploration).

So the tuning process will explore points that might boost the value of f, as well as regions that have not been explored much. This gives Bayesian optimization a huge advantage over other methods: it balances exploitation and exploration, making the search more efficient.
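Both behaviors can be checked numerically. Below is a small sketch of the closed-form EI (the function name and test values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, sigma, f_best):
    """Closed-form EI for a Gaussian posterior with the given mean/std."""
    sigma = np.maximum(sigma, 1e-12)  # guard against division by zero
    z = (mean - f_best) / sigma
    return (mean - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

f_best = 1.0  # best objective value observed so far

# Exploitation: a higher posterior mean gives a higher EI
assert expected_improvement(1.5, 0.1, f_best) > expected_improvement(1.1, 0.1, f_best)

# Exploration: higher posterior uncertainty also gives a higher EI
assert expected_improvement(0.9, 1.0, f_best) > expected_improvement(0.9, 0.1, f_best)
```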

3. Applying hyperparameter tuning techniques to predict taxi fares

To experiment with some hyperparameter tuning techniques, we will use the first 5,000 records of the New York Taxi Fare dataset. Let’s take a quick look at the data.

Our goal is to predict the price (fare_amount) of each taxi trip given the other features. To help our neural network learn a little better, we will extract some date-time and distance features from the data. The notebook that contains code for that task can be found here.

After preprocessing, we will split the data into a training set (90%) and a validation set (10%). The training data will be a 2-D array of shape (4500, 22), as shown below.

Training data after preprocessing

Now, it is time to define and train our model. For this example, we will build a simple neural network with 512 neurons in the first layer and 256 neurons in the second layer, as shown below.

The two hyperparameters we will focus on are the learning rate and the l2 penalty for regularization. Since we do not know their optimal values, we will take a wild guess and assign 0.001 as a baseline for both. In that case, the training loop will be as follows.
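The model and training code are not reproduced in this page. As a rough stand-in, a sketch using scikit-learn’s MLPRegressor instead of the original skorch/PyTorch network might look like this (the training helper’s name and signature are assumptions based on the calls quoted below):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical stand-in for the blog's training() helper: a 512/256
# network whose two tunable knobs are the learning rate and l2 penalty
def training(x_train, y_train, x_test, y_test, lr=0.001, l2=0.001):
    model = MLPRegressor(hidden_layer_sizes=(512, 256),
                         learning_rate_init=lr,  # learning rate
                         alpha=l2,               # l2 penalty
                         max_iter=30,            # rough analogue of epochs
                         random_state=0)
    model.fit(x_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(x_test)) ** 0.5
    print(f"RMSE on test set: {rmse}")
    return rmse
```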

training(x_train, y_train, x_test, y_test)
RMSE on test set: 6.813826084136963

Using root mean squared error (RMSE) as an evaluation metric, our error on the test set after running the above loop is approximately 6.8.

Let’s not discuss whether an RMSE of 6.8 is good or bad, but instead, try to see if we can lower that error with hyperparameter tuning techniques.

3.1. Using grid search to improve our model.

So far, we have trained a neural network and plugged in a value that we guess for our learning rate and l2-penalty. Clearly, if we train our model with a wider range of numbers for those two hyperparameters, we are likely to produce a new model with a lower error. This is also the main idea behind grid search. To illustrate, we will test a lot of parameters in the interval [0.0001, 0.1). Let’s write some code using skorch and sklearn to see if the result is better.
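A sketch of how such a grid search might be wired up with scikit-learn’s GridSearchCV (the exact grid, the 512/256 network, and the scoring choice are illustrative assumptions; the blog’s real grid is finer, which is why tuning takes so long):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Hypothetical grid of candidates inside the interval [0.0001, 0.1)
param_grid = {
    "learning_rate_init": [0.0001, 0.001, 0.01],  # learning rate
    "alpha": [0.0001, 0.001, 0.01],               # l2 penalty
}
net = MLPRegressor(hidden_layer_sizes=(512, 256), max_iter=30, random_state=0)
search = GridSearchCV(net, param_grid,
                      scoring="neg_root_mean_squared_error", cv=3)

# search.fit(x_train, y_train) would then train 3 folds x 9 combinations
```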

Grid search takes 2448.65 seconds to tune
RMSE on test set is: 5.102670669555664

As we can see, the RMSE improves from 6.81 to 5.10, which is quite significant. However, it took about 40 minutes to tune the model. Can we do better than that?

3.2. Using randomized search to improve our model.

Instead of trying all the specified numbers in the search interval, we can sample only some random combinations from the search interval and train the model based on those values. Again, let’s write some code to see if it has any improvement over grid search.
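A matching sketch with RandomizedSearchCV, drawing candidates from log-uniform distributions over the same interval (the distribution choice and n_iter are illustrative assumptions):

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPRegressor

# Instead of a fixed grid, sample candidates from a distribution over
# the search interval; n_iter caps how many combinations we try
param_dist = {
    "learning_rate_init": loguniform(1e-4, 1e-1),  # learning rate
    "alpha": loguniform(1e-4, 1e-1),               # l2 penalty
}
net = MLPRegressor(hidden_layer_sizes=(512, 256), max_iter=30, random_state=0)
search = RandomizedSearchCV(net, param_dist, n_iter=10, random_state=0,
                            scoring="neg_root_mean_squared_error", cv=3)

# search.fit(x_train, y_train) trains only the 10 sampled combinations
```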

Randomized search takes 1310.02 seconds to tune
RMSE on test set is: 5.001118183135986

The time spent on tuning has been cut in half: it took only ~20 minutes to run the randomized search. Not only that, the RMSE is 5.0, slightly better than with grid search. Can we still improve?

3.3. Using Bayesian optimization to improve our model.

Now, we will use Bayesian optimization to determine the values for the learning rate and l2-penalty.

Bayes optimization takes 584.21 seconds to tune
{'target': -20.6834774017334, 'params': {'l2': 0.0018374684275351235, 'lr': 0.003400942482117681}}

From the result, we see that it took Bayesian optimization merely ~10 minutes to find good values for our hyperparameters: 0.001837 for the l2-penalty and 0.0034 for the learning rate. It is indeed very fast, but we should check whether those two values actually result in a better model.

training(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, lr=0.003400942482117681, l2=0.0018374684275351235, batch_size=64, epochs=30)
RMSE on test set: 4.547909259796143

With an RMSE of 4.54, this is the best model we have achieved. Moreover, it is significantly more time-efficient compared to the other two techniques.

4. Conclusion

Bayesian optimization is undeniably a powerful technique to search for a good set of hyperparameters. As shown in the above example, it produces the best model significantly faster compared to using grid search and randomized search. Imagine that instead of only two hyperparameters, we need to tune six or seven of them in a wider range. In that case, performing grid search can become infeasible and searching randomly is not likely to find optimal values.

Can we now guarantee that Bayesian optimization is always the best among the three techniques? Not necessarily. First, it depends a lot on the data and the problem we are trying to solve. Second, if there is no time constraint, applying grid search strategically or repeating randomized search several times can lead to a better result. Yet, in situations where we need to deliver a good model fast, Bayesian optimization can save us a lot of time and effort.

This blog was written by Hiep Nguyen and Man Bui, data scientists at LOGIVAN, under the guidance of Dr. Vu Anh, the lead of the LGV data science team. We hope you will love our AI-based services. See more about us at

or join us at https://www.logivan.com/en/hiring/.

See you in the next posts.

