# An Introduction to Bayesian Optimization for Neural Architecture Search

Deep learning has seen an explosion of interest since 2012, when Krizhevsky, Sutskever, and Hinton entered AlexNet into the yearly ImageNet challenge (ILSVRC) and dramatically outperformed every other entry. Since then, researchers have developed a huge variety of neural networks for different tasks. For example, convolutional neural networks (CNNs) were designed for image classification, and recurrent neural networks (RNNs) for processing natural language. The key ingredients and layout that make up neural networks, such as dense layers, convolutional layers, pooling layers, batch normalization, and the order of these operations, took years of experimentation and domain knowledge to develop. For example, state-of-the-art accuracy on ImageNet has steadily improved since the dataset was released a decade ago. But for a brand new dataset, it would take a lot of effort to develop new task-specific techniques that achieve great performance.

# Neural architecture search

But what if researchers didn’t have to spend years developing neural network architectures for each dataset? This is where *neural architecture search* (NAS) comes in. Neural architecture search is the concept of using an algorithm (sometimes even a neural network) to learn the best neural network architecture for a given dataset. NAS is a subset of *automated machine learning* (AutoML). AutoML is the process of automating all parts of the machine learning pipeline, including data cleaning, featurization, neural architecture search, and hyperparameter optimization.

Neural architecture search has been around since at least the 1980s, but it recently became popular when Zoph and Le released a paper using reinforcement learning to design neural architectures for the popular CIFAR-10 and Penn Treebank datasets. Since that seminal paper in 2016, there have been hundreds of papers on NAS. The paper by Zoph and Le showed a lot of promise; however, its main drawback is that the search required **2000 GPU days** of training. Clearly, this is prohibitively expensive for all but the largest tech companies. Recently, various efforts to cut the training time down to a reasonable order of magnitude have been successful. In addition to reinforcement learning, many different techniques have been successful in NAS, including evolutionary algorithms, Bayesian optimization, gradient descent, and random search. A great survey paper can be found here, and a great survey talk can be found here.

# Bayesian optimization for neural architecture search

In the rest of this blog post, we will give a brief overview of one method, Bayesian optimization (BayesOpt). BayesOpt is a popular choice for NAS (and hyperparameter optimization) since it is well-suited to optimize objective functions that take a long time to evaluate. BayesOpt is used for NAS in auto-keras, and for hyperparameter optimization in Google Vizier. Bayesian optimization is a sequential strategy for global optimization of black-box functions.

To start, we will define a few key ingredients of BayesOpt:

- fix a dataset and a machine learning task (such as image classification on CIFAR-10)
- define a *search space* *A*, a large set of neural architectures. For example, *A* could be the set of all neural net architectures with 20 layers drawn from the set {Conv 3x3, Conv 5x5, Pool, ReLU, Dense}.
- define an *objective function* *f : A→[0,1]*. Given a neural net *a* in *A*, *f(a)* is the validation-set accuracy of *a* after training the architecture. Note that *f(a)* might take hours to evaluate.
- define a *distance function* *d(a₁,a₂)* between two architectures. A good distance function is quick to evaluate and has a strong correlation with the function *f*. Formally, we hope that for most pairs *a₁,a₂* in *A*, if *d(a₁,a₂)* is small, then |*f(a₁)−f(a₂)*| is also small. Examples of distance functions include the OTMANN distance and edit distance.
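To make these ingredients concrete, here is a minimal Python sketch of a toy version of each one. The layer vocabulary matches the example above, but everything else is a stand-in: `objective` is a cheap synthetic score (a real *f* would train the network and return validation accuracy, taking hours per call), and `distance` is a simple per-position mismatch count rather than a full edit or OTMANN distance.

```python
import random

# Toy search space A: architectures are sequences of 20 layer types.
LAYERS = ["Conv3x3", "Conv5x5", "Pool", "ReLU", "Dense"]

def sample_architecture(num_layers=20):
    """Draw a random architecture from A."""
    return [random.choice(LAYERS) for _ in range(num_layers)]

def distance(a1, a2):
    """d(a1, a2): number of positions where two equal-length
    architectures disagree (a crude stand-in for edit distance)."""
    return sum(x != y for x, y in zip(a1, a2))

def objective(arch):
    """Stand-in for f: A -> [0, 1].  A real objective would train
    `arch` on the dataset and return validation accuracy; here we
    just score the fraction of convolutional layers for illustration."""
    return sum(layer.startswith("Conv") for layer in arch) / len(arch)
```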

The goal of BayesOpt is to find the architecture *a ∊ A* which maximizes *f(a)*. BayesOpt is a *zeroth-order optimization method*, which means we do not have information about the first derivative of *f*. (By contrast, gradient descent is a *first-order* optimization method, since the derivative can be computed and used to guide the optimization.)

At a high level, BayesOpt works by first choosing several architectures from *A* at random and evaluating *f(a)* for each. Based on these results, the algorithm iteratively chooses new architectures to evaluate. A good BayesOpt algorithm balances exploration and exploitation: it wants to choose architectures which are close (in terms of the distance *d*) to the best architecture it has seen so far, but it also wants to occasionally try completely different architectures in case they perform even better. The full algorithm consists of *T* rounds of choosing an architecture *aᵢ* and computing *f(aᵢ)*. The output is the architecture *a\** with the largest value of *f(a\*)* among all those tried in the previous rounds.
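The *T*-round loop just described can be sketched generically. The code below is a hypothetical skeleton rather than a real BayesOpt implementation: the `propose` function is a crude epsilon-greedy rule (perturb the incumbent most of the time, sample at random otherwise) standing in for a proper acquisition function, and we maximize a toy one-dimensional function so that the loop actually runs.

```python
import random

def optimization_loop(sample, objective, propose, n_init=5, T=20):
    """Evaluate a few random points, then run T rounds of
    propose-and-evaluate.  Returns the best (point, value) pair seen."""
    history = [(x, objective(x)) for x in (sample() for _ in range(n_init))]
    for _ in range(T):
        x = propose(history)               # exploration/exploitation happens here
        history.append((x, objective(x)))
    return max(history, key=lambda pair: pair[1])

# Toy instantiation: maximize f(x) = 1 - (x - 0.3)^2 over [0, 1].
f = lambda x: 1 - (x - 0.3) ** 2
sample = lambda: random.random()

def propose(history):
    if random.random() < 0.2:              # explore: try a fresh random point
        return sample()
    best_x, _ = max(history, key=lambda p: p[1])
    return min(1.0, max(0.0, best_x + random.gauss(0.0, 0.05)))  # exploit

best_x, best_val = optimization_loop(sample, f, propose)
```

A real NAS loop would swap in random draws from the architecture search space *A*, the expensive training objective *f*, and an acquisition function driven by a surrogate model such as the GP described below.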

Now we give a brief description of how the BayesOpt algorithm chooses the next architecture in round *i+1*, given *f(a₁),…,f(aᵢ)*. Intuitively, the algorithm performs some calculations to determine the most useful architecture to evaluate next.

The most common form of BayesOpt performs these calculations by assuming that the function *f : A→[0,1]* follows a *Gaussian Process* (GP). It starts with a GP prior on *f*, and performs sequential updates based on *f(a₁),…,f(aᵢ)* in the form of a posterior for *f*. In other words, the algorithm makes an assumption about the distribution *f(A)*, namely that it is smooth and that its deviations look like Gaussian noise. The algorithm’s estimates of the mean and variance of these Gaussian deviations are constantly updated as it gathers more data in the form of *f(a₁),…,f(aᵢ)*. Based on the information in the current round, the algorithm chooses the architecture with the greatest chance of giving a large improvement. Specifically, a popular technique is to maximize the *expected improvement* utility function, though other similar functions exist. If the best accuracy observed so far is *f\**, then the expected improvement for choosing a new architecture *a* is

*EIᵢ(a) = 𝔼[ max(f(a) − f\*, 0) ]*,

where the expectation is taken over the GP posterior for *f(a)* after rounds *1,…,i*. The algorithm chooses

*aᵢ₊₁ = argmaxₐ EIᵢ(a)*.

Under the Gaussian posterior, this final expression has a differentiable closed form, so it can be evaluated and maximized using standard optimization methods such as Newton’s method.
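To make this concrete, here is a minimal numpy sketch under simplifying assumptions: inputs are one-dimensional numbers rather than architectures (in real NAS the kernel would instead be built from the distance *d*), the kernel is a plain squared-exponential, and the observed accuracies `y_obs` are made-up numbers. `gp_posterior` computes the GP posterior mean and standard deviation, and `expected_improvement` implements the standard closed form *(μ − f\*)Φ(z) + σφ(z)* with *z = (μ − f\*)/σ*.

```python
import math
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential kernel: k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_query,
    conditioned on the observations (x_obs, y_obs)."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_star = rbf_kernel(x_obs, x_query)
    K_inv = np.linalg.inv(K)
    mu = K_star.T @ K_inv @ y_obs
    var = np.diag(rbf_kernel(x_query, x_query) - K_star.T @ K_inv @ K_star)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2)."""
    if sigma < 1e-12:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # normal pdf
    return (mu - f_best) * Phi + sigma * phi

# Pick the next candidate from a grid of queries by maximizing EI.
x_obs = np.array([0.1, 0.5, 0.9])
y_obs = np.array([0.60, 0.75, 0.55])   # made-up "validation accuracies"
grid = np.linspace(0.0, 1.0, 101)
mu, sigma = gp_posterior(x_obs, y_obs, grid)
ei = [expected_improvement(m, s, y_obs.max()) for m, s in zip(mu, sigma)]
x_next = grid[int(np.argmax(ei))]
```

In practice the grid would be replaced by a continuous optimizer (or, for discrete architecture spaces, by mutation or evolutionary search over candidate architectures).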

For the full details of BayesOpt, see this tutorial. There is also a great paper on BayesOpt tailored to machine learning.

# Final thoughts on neural architecture search

We are entering an AI revolution, and neural architecture search, one of the main pillars of AutoML, is among the most exciting areas of research. Although NAS algorithms are competitive with state-of-the-art human-designed architectures on a few datasets, we are not yet at a point where they consistently beat human-engineered neural architectures across the board (for example, see this post evaluating auto-keras on CIFAR-10 and malaria detection). Still, with the rapid growth of the area, NAS is sure to be a critical component of any AutoML platform in the coming years.