Neural Architecture Search: Foundations and Strategies (Part 1)

Aditya Jethani
7 min read · Jan 25, 2024


The successful adoption of deep learning is largely due to its automation of the feature engineering process. Earlier, neural architectures were designed manually by assembling various structural and functional “neural cells”. In this post we’ll discuss the foundations and groundwork of Neural Architecture Search (NAS), along with theoretical as well as practical insights. This part will mostly focus on the theory. Yet, as the saying goes, ‘theory will only take you so far.’ Stay tuned for an upcoming blog where we transition from theory to practice.

1. Introduction

Neural Architecture Search (NAS) is the next level in automating machine learning. It has proven to be more effective than manually designed architectures, especially in tasks like image classification and object detection, as shown by Zoph et al. (2018). NAS falls under AutoML and shares common ground with hyperparameter optimization and meta-learning.

Methods of NAS can be classified along three dimensions:

  • Search Space: defines which architectures can be represented in principle, encoding the typical properties of an architecture.
  • Search Strategy: describes how to explore the search space while balancing the exploration-exploitation trade-off.
  • Performance Estimation: covers how to estimate the performance of candidate architectures, and how to reduce the cost of these estimations.
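To see how the three dimensions fit together, here is a minimal, framework-agnostic Python sketch, assuming a toy chain-structured encoding, a dummy performance estimate and random search as the simplest possible strategy; all names are illustrative placeholders rather than any particular NAS library.

```python
import random

# Search space: each architecture is a list of layer choices (see Section 2).
LAYER_CHOICES = ["conv3x3", "conv5x5", "maxpool3x3", "identity"]

def sample_architecture(num_layers=6):
    """Search space: draw one candidate architecture at random."""
    return [random.choice(LAYER_CHOICES) for _ in range(num_layers)]

def estimate_performance(arch):
    """Performance estimation: stand-in for (cheap) training plus validation.
    This dummy returns a random score; in practice this is the expensive part."""
    return random.random()

def nas_random_search(budget=20):
    """Search strategy: the simplest possible one, random search."""
    best_arch, best_score = None, float("-inf")
    for _ in range(budget):
        arch = sample_architecture()
        score = estimate_performance(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

if __name__ == "__main__":
    print(nas_random_search())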

Let us go a level deeper into each of these dimensions.

2. Search Space

If we know the typical properties of well-performing neural architectures, we can use them to reduce the search space, but this comes with the risk of introducing human bias. Let us look at some basic search spaces:

  • Simple Chain-structured Neural Architecture (SCNA): there are n layers arranged in a sequence, where the output of layer i−1 is the input of layer i. The search space is parametrized by the number of layers n, the type of operation each layer performs (e.g., convolution or pooling), and the operation’s hyperparameters, such as kernel size, number of filters and stride for a convolutional layer.
Schematic Diagram for SCNA.
  • Complex Chain-structured Neural Architecture (CCNA): modern designs, driven by ever-increasing demands on deep learning, gave rise to more complex arrangements of “neural blocks”. A basic CCNA adds skip connections, branches and additional layer types, while the layer-level parameters remain the same as in an SCNA. The main advantage of a CCNA is the modularity it offers in terms of basic structural and functional blocks. A cell-based design typically uses two cell types: a normal cell that preserves dimensionality and a reduction cell that reduces the spatial dimensions. Can you think of more advantages of using a CCNA?
Top Left: a normal cell, Bottom Left: a reduction cell and Right: Combination of multiple such cells
  • More advantages include: a smaller search space, easier adaptation to different problem settings and an overall cleaner architecture design.
  • Now the question arises: how do we choose a sufficient number of cells to build an optimal architecture for most common complex problems? A good example is DenseNet (Huang et al., 2017). In principle, cells can be combined arbitrarily, e.g., within the multi-branch space described above, by simply replacing layers with cells. Joint optimization of both the macro-architecture (the whole structure) and the micro-architecture (the individual cells) is crucial; a toy sketch of such a cell-based encoding follows this list. Another option is a hierarchical search space consisting of several levels of motifs. You can read more about the optimization here: https://ieeexplore.ieee.org/document/7432805 .
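Here is a toy sketch of such a cell-based encoding, assuming a cell is a small DAG of (operation, input) choices and the macro-architecture stacks normal and reduction cells; the operation names and stacking pattern are illustrative, not the exact scheme of any published cell search space.

```python
import random

OPS = ["conv3x3", "sep_conv5x5", "maxpool3x3", "identity"]

def sample_cell(num_nodes=4):
    """A cell: each node picks an operation and one earlier node (or the cell input) to read from."""
    cell = []
    for node in range(num_nodes):
        op = random.choice(OPS)
        source = random.randint(0, node)  # 0 = cell input, 1..node = earlier nodes
        cell.append((op, source))
    return cell

def sample_macro_architecture(num_stacks=3, cells_per_stack=2):
    """Macro-architecture: repeat the normal cell, inserting a reduction cell between stacks."""
    normal, reduction = sample_cell(), sample_cell()
    arch = []
    for stack in range(num_stacks):
        arch.extend([("normal", normal)] * cells_per_stack)
        if stack < num_stacks - 1:
            arch.append(("reduction", reduction))
    return arch

if __name__ == "__main__":
    for kind, cell in sample_macro_architecture():
        print(kind, cell)
```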

3. Search Strategies

Many different search strategies can be used to explore the search space, such as Bayesian optimization, reinforcement learning (including Q-learning), evolutionary algorithms and gradient-based methods.

Bayesian optimization performed well on vision architecture problems early on, after which research on NAS entered the mainstream. However, most BO toolboxes are based on Gaussian processes and focus on low-dimensional continuous optimization problems. Similarly, reinforcement learning can be applied by treating the generation of a neural architecture as the agent’s action, with the action space identical to the search space. The agent’s reward is based on an estimate of the trained architecture’s performance on unseen data.

Flow chart for Bayesian Optimization
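To make the BO loop concrete, here is a minimal sketch assuming architectures can be encoded as fixed-length vectors in [0, 1]^5, with a Gaussian-process surrogate from scikit-learn and expected improvement as the acquisition function; evaluate() is a hypothetical stand-in for training and validating the decoded architecture.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def evaluate(x):
    """Stand-in for training + validating the decoded architecture (expensive in practice)."""
    return -np.sum((x - 0.5) ** 2) + 0.01 * rng.normal()

def expected_improvement(gp, X_cand, best_y, xi=0.01):
    """Acquisition function: expected improvement over the best observed score."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y - xi) / sigma
    return (mu - best_y - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Initial random design over a 5-dimensional architecture encoding in [0, 1]^5.
X = rng.random((5, 5))
y = np.array([evaluate(x) for x in X])

for _ in range(20):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    candidates = rng.random((256, 5))         # sample candidate encodings
    ei = expected_improvement(gp, candidates, y.max())
    x_next = candidates[np.argmax(ei)]        # pick the most promising candidate
    X = np.vstack([X, x_next])
    y = np.append(y, evaluate(x_next))

print("best score:", y.max())
```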
  • Evolutionary algorithms have been employed for nearly three decades to propose and optimize neural network architectures.
  • Recent neuro-evolutionary methods use evolutionary algorithms to optimize the architecture, while gradient-based methods dominate for weight optimization.
  • Different neuro-evolutionary approaches are characterized by how they sample parents, update populations and generate offspring.

Tree-based surrogate models, such as tree Parzen estimators and random forests, have proven effective for the high-dimensional conditional spaces found in NAS.

A complementary avenue involves neuro-evolutionary approaches, in which evolutionary algorithms optimize the neural architecture. First proposed by Miller et al. (1989), these approaches have since been adapted to contemporary neural architectures. A minimal sketch of such a loop is shown below.
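The sketch follows the spirit of tournament-based (regularized) evolution, reusing the toy chain encoding from Section 2 and a dummy fitness function; it is an illustration under those assumptions, not a specific published algorithm.

```python
import random
from collections import deque

LAYER_CHOICES = ["conv3x3", "conv5x5", "maxpool3x3", "identity"]

def estimate_performance(arch):
    """Stand-in for training + validation (the expensive part in real NAS)."""
    return random.random()

def mutate(arch):
    """Offspring generation: change one randomly chosen layer's operation."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(LAYER_CHOICES)
    return child

def evolve(population_size=20, cycles=100, tournament_size=5, num_layers=6):
    population = deque()
    for _ in range(population_size):
        arch = [random.choice(LAYER_CHOICES) for _ in range(num_layers)]
        population.append((arch, estimate_performance(arch)))

    best = max(population, key=lambda p: p[1])
    for _ in range(cycles):
        # Parent sampling: best member of a random tournament.
        parent = max(random.sample(list(population), tournament_size), key=lambda p: p[1])
        child = mutate(parent[0])
        child_fitness = estimate_performance(child)
        # Population update: add the child, discard the oldest member.
        population.append((child, child_fitness))
        population.popleft()
        if child_fitness > best[1]:
            best = (child, child_fitness)
    return best

if __name__ == "__main__":
    print(evolve())
```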

Reinforcement learning (RL) also plays a crucial role in NAS, with the generation of a neural architecture framed as an RL problem. The agent’s action, representing the architecture, is derived from the search space, and the reward is based on the estimated performance on unseen data. Zoph and Le (2017) initially employed the REINFORCE policy gradient algorithm, later transitioning to Proximal Policy Optimization (PPO), exemplifying the dynamic nature of NAS research.

Typical RL diagram implemented for NAS
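Below is a minimal REINFORCE-style controller sketch in PyTorch, assuming a toy chain search space and a random reward as a stand-in for validation accuracy; it illustrates the policy-gradient idea rather than reproducing the exact Zoph and Le setup.

```python
import torch
import torch.nn as nn

LAYER_CHOICES = ["conv3x3", "conv5x5", "maxpool3x3", "identity"]
NUM_LAYERS = 6

class Controller(nn.Module):
    """LSTM controller that samples a sequence of layer choices."""
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(len(LAYER_CHOICES) + 1, hidden)  # +1 for a start token
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.head = nn.Linear(hidden, len(LAYER_CHOICES))

    def sample(self):
        h = c = torch.zeros(1, self.head.in_features)
        token = torch.tensor([len(LAYER_CHOICES)])  # start token
        actions, log_probs = [], []
        for _ in range(NUM_LAYERS):
            h, c = self.lstm(self.embed(token), (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            action = dist.sample()
            actions.append(action.item())
            log_probs.append(dist.log_prob(action))
            token = action
        return actions, torch.stack(log_probs).sum()

def reward(actions):
    """Stand-in for training the child network and measuring validation accuracy."""
    return float(torch.rand(1))

controller = Controller()
optimizer = torch.optim.Adam(controller.parameters(), lr=3e-4)
baseline = 0.0
for step in range(200):
    actions, log_prob = controller.sample()
    r = reward(actions)
    baseline = 0.9 * baseline + 0.1 * r        # moving-average baseline
    loss = -(r - baseline) * log_prob          # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```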

As a last search technique, continuous relaxation introduces a paradigm shift, allowing direct gradient-based optimization by considering convex combinations of candidate operations. Liu et al. (2019b), with DARTS, propose such a continuous relaxation to jointly optimize the network weights and the architecture, presenting a promising avenue for an efficient search process.

Flow chart of gradient-based optimization method
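Here is a minimal sketch of the continuous-relaxation idea for a single edge, assuming a small set of candidate operations and a toy regression loss; real DARTS alternates weight and architecture updates on separate training and validation splits (a bilevel problem), which is simplified away here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Convex combination of candidate operations on one edge."""
    def __init__(self, channels=16):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # One architecture parameter (alpha) per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

mixed = MixedOp()
x = torch.randn(2, 16, 8, 8)
target = torch.randn(2, 16, 8, 8)

# A single toy loss updates both weights and alphas here.
optimizer = torch.optim.Adam(mixed.parameters(), lr=1e-3)
for _ in range(10):
    loss = F.mse_loss(mixed(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After search, the strongest operation on the edge would be selected.
print("chosen op:", int(torch.argmax(mixed.alpha)))
```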

Note: most of these search techniques are implemented and benchmarked on CIFAR-10.

4. Performance Estimation

Evaluating candidate architectures during the search is challenging. Training each architecture from scratch to measure its performance would be prohibitively expensive. Therefore, estimating the potential performance of architectures is crucial to make NAS tractable.

Before we get to the estimation techniques, let us go over some terminology:

  • Bias-Variance Tradeoff: There is a tradeoff between bias and variance in the performance estimates. Simpler models like learning curve extrapolation tend to have higher bias but lower variance. More complex models like weight sharing have lower bias but higher variance. The ideal approach depends on the search space complexity.
  • Transferability: The transferability of features between different architectures impacts techniques like weight sharing. Higher transferability reduces the bias of weight sharing. But extremely different architectures may have low transferability.
  • Meta-learning: Performance predictors can be meta-learned by generating training data through architecture evaluations and learning to predict their performance. The predictor is a meta-model trained on meta-data.
  • Multi-fidelity Methods: Combining low- and high-fidelity estimates can improve accuracy, for example by extrapolating on proxy tasks with a final evaluation on the target task, or by using fast predictors calibrated by periodic full evaluations (a minimal sketch follows this list).
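A minimal sketch of such a multi-fidelity scheme, in the style of successive halving: many candidates receive a cheap low-fidelity evaluation, and only the survivors are re-evaluated with a larger budget. train_and_evaluate() is a hypothetical stand-in for partial training with a given epoch budget.

```python
import random

def train_and_evaluate(arch, epochs):
    """Stand-in: validation accuracy after `epochs` of training (noisier when cheap)."""
    true_quality = sum(hash((arch, i)) % 100 for i in range(3)) / 300.0
    noise = random.gauss(0, 0.2 / epochs)           # low fidelity => more noise
    return true_quality + noise

def successive_halving(candidates, min_epochs=1, rounds=3):
    epochs = min_epochs
    survivors = list(candidates)
    for _ in range(rounds):
        scored = sorted(survivors,
                        key=lambda a: train_and_evaluate(a, epochs),
                        reverse=True)
        survivors = scored[:max(1, len(scored) // 2)]  # keep the top half
        epochs *= 2                                    # double the budget each round
    return survivors[0]

if __name__ == "__main__":
    candidates = [f"arch_{i}" for i in range(16)]
    print("selected:", successive_halving(candidates))
```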

Now that we understand the terminology, let us look at a few of the estimation techniques used to date:

  • Weight Sharing: A single “super-network” containing all possible architectures in the search space is trained. The weights are shared across architectures so that evaluating a new architecture simply involves inheriting the weights without retraining from scratch. This greatly reduces the computational cost. However, it has limitations in accurately assessing vastly different architectures.
  • Learning Curve Extrapolation: Instead of training each architecture to convergence, architectures are partially trained for a short duration. The learning curve of the partial training is then extrapolated to estimate the final converged performance. This requires fitting an appropriate learning curve model and choosing the partial training duration (a sketch follows this list).
  • Proxy Tasks: Candidate architectures are trained and evaluated on smaller proxy datasets first, like CIFAR-10. The architectures that perform well on the proxy task are then transferred and fine-tuned on the actual target dataset. Using a proxy dataset provides a much faster signal for architecture performance.
  • Ensembling Models: Performance is estimated by ensembling the predictions of low-fidelity teacher models. The teacher models are quicker to train so an ensemble can cheaply estimate potential architecture performance.
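To illustrate the learning-curve idea, here is a minimal sketch that fits a simple power-law model to the first few epochs of (toy) validation accuracy and extrapolates to the final epoch; the functional form and the numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(epoch, a, b, c):
    """acc(epoch) ≈ a - b * epoch^(-c): saturating curve with asymptote a."""
    return a - b * np.power(epoch, -c)

# Observed validation accuracy for the first 10 of 100 planned epochs (toy numbers).
epochs_seen = np.arange(1, 11)
acc_seen = np.array([0.42, 0.55, 0.61, 0.66, 0.69, 0.71, 0.73, 0.74, 0.75, 0.76])

params, _ = curve_fit(power_law, epochs_seen, acc_seen,
                      p0=[0.9, 0.5, 0.5], maxfev=10000)
predicted_final = power_law(100, *params)
print(f"predicted accuracy at epoch 100: {predicted_final:.3f}")
```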

Estimating performance is like speed dating for neural networks — rapidly assessing compatibility before full commitment. Just don’t rely on horoscopes to predict ideal partnerships!

Here’s what we learnt in this part of the blog, summarized at your fingertips:

  • NAS automates neural architecture design through efficient search and estimation
  • Key dimensions are search space, search strategy, and performance estimation
  • Common search spaces include chain-structured and modular cell-based architectures
  • Search algorithms include Bayesian optimization, reinforcement learning, evolution, and gradient-based relaxation
  • Estimation techniques involve weight sharing, extrapolation, proxies, and predictive models
  • Tradeoffs exist between search flexibility, estimation accuracy, bias, and computational cost
  • Upcoming post will provide practical insights and recent innovations in implementing NAS systems.

That was about it. Stay tuned for the next part and if you found this one worth your time, do give it a 👏.

Connect with me :

adityajethani11@gmail.com
