Neural Architecture Search — The Foundations

Daniel Justus
Digital Catapult
Jul 4, 2019

Over the past few years, neural architecture search (NAS) has become an increasingly hot topic in machine learning research, and its first real-world applications are beginning to emerge. Commercial offerings like Google’s AutoML as well as open-source libraries like Auto-Keras [1] make NAS accessible to the wider machine learning community.

In this blog post we explore the ideas and methods behind NAS to give readers a better understanding of the field and to help them identify opportunities for their own applications.

What is neural architecture search?

Modern deep neural networks often have many (even hundreds of) layers of different types [2], and the connections between layers can be varied. For instance, skip connections [2] and sub-modules [3] are used to promote convergence of the models. The space of conceivable model architectures is practically limitless. Currently, most deep neural network architectures are developed by humans, based on their experience, through a long and tedious process of trial and error. NAS aims to automate this process of identifying effective architectures for a given deep learning problem.

Generally speaking, for NAS one has to define a search space, a search strategy and a performance estimation strategy [4] (Figure 1).

Figure 1: The concept of neural architecture search

The search space contains all architectures (often an infinite number) that can in principle be found using the NAS approach. It might contain all combinations of layers stacked upon each other (Figure 2a), or more complex architectures with skip connections and branching (Figure 2b). To limit the dimensionality of the search space, it might contain only sub-modules of the architecture that are combined into the full model in a predefined manner (Figure 2c). If the search space contains only a single architecture with different sets of hyperparameters, NAS crosses over into hyperparameter optimisation, another, albeit more traditional, subfield of automated machine learning (AutoML).

Figure 2: Illustration of architectures that might be contained in the search space. Image adapted from [4]
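To make this more concrete, the snippet below sketches a toy, chain-structured search space in the spirit of Figure 2a: an architecture is simply a list of layer choices. The layer types, width options and maximum depth are made-up illustrations, not taken from any particular NAS system.

```python
import random

# Hypothetical chain-structured search space (cf. Figure 2a): each position in
# the chain picks one layer type and one width. These options are arbitrary
# illustrations, not taken from a specific NAS paper.
LAYER_TYPES = ["conv3x3", "conv5x5", "maxpool", "identity"]
WIDTHS = [16, 32, 64]
MAX_DEPTH = 8

def sample_architecture(rng: random.Random) -> list:
    """Draw one architecture from the search space as a list of (type, width) pairs."""
    depth = rng.randint(1, MAX_DEPTH)
    return [(rng.choice(LAYER_TYPES), rng.choice(WIDTHS)) for _ in range(depth)]

if __name__ == "__main__":
    print(sample_architecture(random.Random(0)))
```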

The performance estimation strategy is a metric on the search space: for every architecture in the search space, it returns a number that corresponds to the performance of that architecture. Typically, this is the accuracy of a model when applied to a test dataset after training it for a given number of epochs. However, other quantities, such as the computational complexity of training or inference, can also be taken into account by the performance estimation strategy. In any case, the performance of an architecture is computationally expensive to evaluate.
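In code, a performance estimation strategy is just a function that maps an architecture to a score. In practice it would build the candidate network, train it for a few epochs and return its validation accuracy; the toy stand-in below merely returns a noisy, depth-dependent score so that the search-strategy sketches further down can be run end to end.

```python
import random

def estimate_performance(architecture, rng: random.Random, epochs: int = 5) -> float:
    """Toy performance estimation strategy.

    A real implementation would build the network described by `architecture`,
    train it for `epochs` epochs and return its validation accuracy. Here we
    return a noisy score that grows with depth, purely for illustration.
    """
    depth = len(architecture)
    return 0.5 + 0.04 * min(depth, 6) + rng.gauss(0.0, 0.02)
```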

This is why the search strategy is at the heart of NAS: it is supposed to identify promising architectures for performance estimation and to avoid wasting compute on testing bad architectures. In the following section we introduce different search strategies, in particular random and grid search, Bayesian optimisation, evolutionary algorithms and strategies based on reinforcement learning.

Search strategies for neural architecture search

The simplest search strategies are grid search, i.e. systematically screening the search space, and random search, i.e. randomly selecting architectures from the search space to be tested by the performance estimation strategy. Both strategies are viable for very small search spaces, in particular if the problem at hand is tuning a small number of hyperparameters (with random search usually outperforming grid search), but they quickly fail as the search space grows.
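As a minimal sketch, random search over the toy search space defined above (reusing the hypothetical sample_architecture and estimate_performance helpers from the earlier snippets) might look like this:

```python
import random

def random_search(n_trials: int = 20, seed: int = 0):
    """Toy random search: sample architectures, score them, keep the best.

    Relies on sample_architecture and estimate_performance from the sketches above.
    """
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        arch = sample_architecture(rng)
        score = estimate_performance(arch, rng)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score
```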

Bayesian optimisation has already proven its usefulness for hyperparameter search [5,6]. This method assumes that the performance of model architectures follows a specific probability distribution, typically a Gaussian distribution or a Gaussian process. Observations from tested architectures are used to constrain this distribution, and the most promising architectures can then be stochastically determined and tested in the subsequent step (Figure 3). However, Bayesian optimisation often does not deal well with the discreteness and the high dimensionality of the typical NAS search space.

Figure 3: Illustration of the iterative Bayesian optimisation process. Image adapted from [6]
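The sketch below illustrates the iterative loop from Figure 3 for a single continuous parameter (here, network depth treated as continuous), using scikit-learn's Gaussian process regressor and an expected improvement acquisition function. A real NAS setting would additionally need a suitable encoding of discrete architectures; this is purely an illustration of the principle, with a made-up objective standing in for the performance estimation strategy.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(x_candidates, gp, best_y, xi=0.01):
    """Expected improvement acquisition function over candidate points."""
    mu, sigma = gp.predict(x_candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    imp = mu - best_y - xi
    z = imp / sigma
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

def bayesian_optimisation(objective, bounds=(1.0, 8.0), n_init=3, n_iter=10, seed=0):
    """Toy 1-D Bayesian optimisation, e.g. over a single architecture parameter."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(*bounds, size=(n_init, 1))          # initial random observations
    y = np.array([objective(x[0]) for x in X])
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)                                     # fit the surrogate model
        candidates = np.linspace(*bounds, 200).reshape(-1, 1)
        ei = expected_improvement(candidates, gp, y.max())
        x_next = candidates[np.argmax(ei)]               # most promising point
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next[0]))
    best = np.argmax(y)
    return X[best, 0], y[best]

if __name__ == "__main__":
    # Toy objective standing in for the performance estimation strategy.
    print(bayesian_optimisation(lambda depth: -(depth - 5.0) ** 2))
```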

Evolutionary algorithms are inspired by biological evolution: model architectures correspond to individuals in a population that can create offspring (new architectures) or die and be removed from the population. The following iterative process can be used to implement an evolutionary algorithm for NAS (see also Figure 4 and the code sketch further below):

  • An initial population of N different model architectures is randomly generated. The performance of each individual (i.e. architecture) is evaluated as defined by the performance estimation strategy.
  • The n highest performers are selected as parents for a new generation. The new architectures might be copies of their respective parents with random alterations (“mutations”), or they might arise from combinations of the parents. Possible mutations include operations like adding or removing a layer, adding or removing a connection, or changing the size of a layer or another hyperparameter. The performance of the offspring is assessed, again using the performance estimation strategy.
  • n architectures are selected for removal from the population. These might be the n worst performers, the n oldest individuals, or a selection based on a combination of these criteria.
  • The offspring replace the removed architectures, and the process is restarted with the new population.
Figure 4: Illustration of the steps of evolutionary algorithms for NAS

Evolutionary algorithms show promising results and have been able to generate state-of-the-art model architectures [7,8].
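The sketch below implements this loop over the toy search space from the earlier snippets (reusing the hypothetical LAYER_TYPES, WIDTHS, sample_architecture and estimate_performance helpers); the removal rule drops the oldest individuals, one of the variants mentioned in the list above.

```python
import random

def mutate(arch, rng: random.Random):
    """Apply one random mutation: add, remove or change a layer.

    Uses LAYER_TYPES and WIDTHS from the search space sketch above.
    """
    arch = list(arch)
    op = rng.choice(["add", "remove", "change"])
    if op == "add" or not arch:
        arch.insert(rng.randint(0, len(arch)), (rng.choice(LAYER_TYPES), rng.choice(WIDTHS)))
    elif op == "remove":
        arch.pop(rng.randrange(len(arch)))
    else:
        arch[rng.randrange(len(arch))] = (rng.choice(LAYER_TYPES), rng.choice(WIDTHS))
    return arch

def evolve(pop_size: int = 20, n_parents: int = 5, n_generations: int = 10, seed: int = 0):
    """Toy evolutionary NAS loop: mutate the best performers, remove the oldest."""
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):  # initial random population
        arch = sample_architecture(rng)
        population.append({"arch": arch, "score": estimate_performance(arch, rng), "age": 0})
    for _ in range(n_generations):
        # select the n highest performers as parents and mutate them
        parents = sorted(population, key=lambda m: m["score"], reverse=True)[:n_parents]
        children = []
        for parent in parents:
            child = mutate(parent["arch"], rng)
            children.append({"arch": child, "score": estimate_performance(child, rng), "age": 0})
        for member in population:
            member["age"] += 1
        # remove the n oldest individuals and add the offspring
        population = sorted(population, key=lambda m: m["age"])[:pop_size - n_parents] + children
    return max(population, key=lambda m: m["score"])
```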

In recent years, reinforcement learning based methods have been developed for NAS. A controller network, usually a recurrent neural network (RNN), is used to sample from the search space with a certain probability distribution. The sampled architecture is trained and evaluated using the performance estimation strategy, and the resulting performance is used to update the parameters of the controller network using gradient-based methods [9] (Figure 5). This process is iterated until convergence or timeout.

Figure 5: Basic layout of a reinforcement learning approach for NAS

Like evolutionary algorithms, reinforcement learning has been shown to be capable of creating network architectures that outperform hand-crafted models on popular benchmarks [9,10].
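For illustration, the sketch below replaces the RNN controller with a simple table of learnable logits (one row per layer position) and updates it with a REINFORCE-style policy gradient. It is a heavily simplified stand-in for the approach of [9], not a reproduction of it, and the reward function is a made-up placeholder for trained-model accuracy.

```python
import numpy as np

# Minimal policy-gradient controller sketch: a table of logits parameterises the
# sampling distribution over LAYER_TYPES for each position in a fixed-depth chain.
LAYER_TYPES = ["conv3x3", "conv5x5", "maxpool", "identity"]
DEPTH = 4

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rl_search(reward_fn, n_steps=200, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    logits = np.zeros((DEPTH, len(LAYER_TYPES)))  # controller parameters
    baseline = 0.0                                # moving-average reward baseline
    for _ in range(n_steps):
        probs = np.array([softmax(row) for row in logits])
        choices = [rng.choice(len(LAYER_TYPES), p=p) for p in probs]
        arch = [LAYER_TYPES[c] for c in choices]
        reward = reward_fn(arch)                  # stands in for the performance estimation strategy
        baseline = 0.9 * baseline + 0.1 * reward
        advantage = reward - baseline
        for pos, c in enumerate(choices):         # REINFORCE update: gradient of log-prob
            grad = -probs[pos]
            grad[c] += 1.0
            logits[pos] += lr * advantage * grad
    return [LAYER_TYPES[int(np.argmax(row))] for row in logits]

if __name__ == "__main__":
    # Toy reward standing in for trained-model accuracy: prefer conv3x3 layers.
    print(rl_search(lambda arch: arch.count("conv3x3") / DEPTH))
```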

Why we don’t all use neural architecture search (yet)

NAS has been highly successful in creating deep neural network architectures that surpass the accuracy of manually designed architectures. In particular, in the popular field of image classification, the state of the art has been set by NAS-created architectures found with evolutionary algorithms and reinforcement learning.

This comes, however, at a cost: the computational complexity of these methods is huge, since hundreds or thousands of different deep neural networks have to be trained and evaluated before NAS arrives at good solutions. For example, for their reinforcement learning based NAS method, Zoph et al. used 800 NVIDIA K40 GPUs for 28 days, i.e. more than half a million GPU hours [9,10], rendering this approach far too expensive for most real-world applications.

Moreover, even with the largest resources, a NAS approach is not guaranteed to find a good, let alone the optimal, solution. Like many other machine learning algorithms, NAS approaches are susceptible to converging to a local, rather than a global, optimum.

An outlook

While, as described above, the excessive computational cost of NAS is still a massive barrier to its adoption, current research is exploring methods to mitigate this. In particular, methods to determine good weights with which to initialise new model architectures have been shown to significantly reduce the computational cost by accelerating the training of new networks [11,12]. While a strongly constrained search space requires human expertise (to identify promising classes of network architectures) and might exclude novel, improved architectures, it can help to find good architectures much faster [8,10]. At the moment, this might therefore be a good compromise between computational cost and accuracy of the resulting network.

With these ideas and developments, cloud-based, automated approaches to NAS might provide an easy way for organisations without a strong focus on machine learning to benefit from novel machine learning capabilities. At the same time, NAS has the potential to give researchers and AI-native companies those few percent of additional accuracy that might be needed for a critical application.

However, as always when trying to improve the performance of a deep learning model, it should first be checked whether using an existing model with additional or higher-quality training data might have the same or an even bigger impact on accuracy, at less cost and effort.

References

[1] H. Jin, Q. Song and X. Hu, Auto-Keras: Efficient Neural Architecture Search with Network Morphism, arXiv, 2018

[2] K. He, X. Zhang, S. Ren and J. Sun, Deep Residual Learning for Image Recognition, arXiv, 2015

[3] C. Szegedy et al., Going Deeper with Convolutions, arXiv, 2014

[4] T. Elsken, J.H. Metzen and F. Hutter, Neural Architecture Search: A Survey, Journal of Machine Learning Research, 2019

[5] J. Snoek, H. Larochelle and R.P. Adams, Practical bayesian optimization of machine learning algorithms, Advances in neural information processing systems, 2012

[6] B. Shahriari et al., Taking the human out of the loop: A review of Bayesian optimization, Proceedings of the IEEE, 2015

[7] E. Real et al., Large-scale evolution of image classifiers, Proceedings of the 34th International Conference on Machine Learning, 2017

[8] E. Real et al., Regularized Evolution for Image Classifier Architecture Search, arXiv, 2018

[9] B. Zoph and Q.V. Le, Neural architecture search with reinforcement learning, arXiv, 2016

[10] B. Zoph et al., Learning transferable architectures for scalable image recognition, Proceedings of the IEEE conference on computer vision and pattern recognition, 2018

[11] H. Pham et al., Efficient neural architecture search via parameter sharing, arXiv, 2018

[12] T. Elsken, J.H. Metzen and F. Hutter, Efficient Multi-objective Neural Architecture Search via Lamarckian Evolution, arXiv, 2018

Daniel Justus is a machine learning researcher and data scientist, working on GNNs, Knowledge Graphs and language models at Graphcore.ai.