Is Neural Architecture Search really worth it?

Antoine Yang
Jan 31, 2020 · 4 min read


In Deep Learning, designing state-of-the-art Neural Networks is a complex process that requires years of engineering and research. Neural Architecture Search (NAS) aims at automating this process. This exciting field has drawn a lot of attention over the last few years.

The absolute test accuracy results on Image Classification tasks indeed look mind-blowing (above 98% for state-of-the-art methods). However, it is essential to keep in mind that these results are the product of three components, sketched in toy code after the list:

  • a search space: which architectures can the NAS method discover?
  • a search strategy: how do we explore the search space? This can be done with reinforcement learning, evolution, or differentiable methods.
  • a performance estimation: how do we evaluate the architectures we find? This is commonly done by training the found Neural Network from scratch.
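
To make this decomposition concrete, here is a toy Python sketch of the three components; all names and the placeholder score are illustrative and do not come from any NAS library.

```python
import random

# Toy sketch of the three NAS components (all names are illustrative).

def sample_from_search_space():
    """Search space: which architectures can be discovered (here, one op per edge)."""
    ops = ["sep_conv_3x3", "sep_conv_5x5", "max_pool_3x3", "skip_connect"]
    return [random.choice(ops) for _ in range(14)]

def search_strategy(num_candidates=10):
    """Search strategy: how candidates are proposed (here, naive random search)."""
    return [sample_from_search_space() for _ in range(num_candidates)]

def estimate_performance(architecture):
    """Performance estimation: in practice, train the candidate from scratch;
    here a random placeholder score stands in for a full training run."""
    return random.random()

best_architecture = max(search_strategy(), key=estimate_performance)
```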

Evaluating search strategies

To fairly evaluate the effectiveness of search strategies, we can compare each of them to its respective random sampling baseline. This baseline consists of randomly sampling an architecture from the method’s search space and training it following the method’s protocol. We can repeat this multiple times, and on different datasets, to obtain consistent results. Doing so for 8 different competitive NAS methods, we find that each method struggles to significantly beat this trivial and computationally free baseline.
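
A minimal sketch of this baseline, assuming hypothetical sample_architecture and train_and_evaluate helpers that wrap a given method’s search space and training protocol:

```python
import statistics

def random_sampling_baseline(sample_architecture, train_and_evaluate, n_runs=8):
    """Train n_runs randomly sampled architectures with the method's own
    protocol and report the mean and standard deviation of their accuracies."""
    accuracies = [train_and_evaluate(sample_architecture()) for _ in range(n_runs)]
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```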

Comparison of search methods and random sampling from their respective search space. Methods lying on the diagonal perform the same as the average architecture, while methods above the diagonal outperform it.
Same study on a more challenging dataset, MIT67 (indoor scene recognition with bigger images).

Outperforming the random sampling baseline should be the goal of search strategies, as they are designed to select the best architectures in the search space and discard the bad ones. However, recent methods seem to reach higher accuracies without significantly outperforming this baseline.

Fancy search spaces

If we want to know how much NAS actually automates Neural Network design, the search space itself should not be heavily hand-engineered. A search space commonly used in NAS is the one of DARTS. In this case, Neural Networks are composed of stacked cells, each defined as a directed acyclic graph. We learn which operations to place on the 14 edges connecting the 4 nodes (plus 2 inputs coming from the previous cells) of these graphs. The 8 operations commonly considered are Dilated Separable Convolution (3x3 and 5x5), Depthwise Separable Convolution (3x3 and 5x5), Max and Average Pooling (3x3), Skip-Connection (identity), and the Zero operation.
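
The sketch below shows what sampling one cell from such a search space can look like; the operation names follow the usual DARTS nomenclature, but the helper itself is purely illustrative and not the code used in the study.

```python
import random

# Candidate operations of a DARTS-like search space (usual DARTS nomenclature).
OPS = [
    "dil_conv_3x3", "dil_conv_5x5",  # dilated separable convolutions
    "sep_conv_3x3", "sep_conv_5x5",  # depthwise separable convolutions
    "max_pool_3x3", "avg_pool_3x3",  # pooling
    "skip_connect", "none",          # identity / zero
]

def sample_cell(num_intermediate_nodes=4):
    """Return a random cell as a list of (operation, source_node) pairs.

    Nodes 0 and 1 are the two inputs coming from the previous cells; each
    intermediate node picks two distinct predecessors and one operation per edge.
    """
    genotype = []
    for node in range(2, num_intermediate_nodes + 2):
        for source in random.sample(range(node), 2):
            genotype.append((random.choice(OPS), source))
    return genotype

print(sample_cell())
```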

Representation of a cell with 4 nodes and 3 possible operations

To get a simple overview of the performance of such a search space, we can randomly sample architectures and train them following the same protocol as DARTS. All sampled and trained architectures fall into a very narrow test accuracy range (mean 97.03 ± 0.23%, min 96.18%, max 97.56%) on the CIFAR10 dataset. Even more interesting, arbitrarily changing the operations (replacing all dilated and depthwise separable convolutions with traditional convolutions) still leads to a similar distribution.

Histograms of the final test accuracies for architectures sampled from the DARTS search space (214 models) and our modified version (56 models) after training

This suggests that the cell structure inherently allows the model to perform well, no matter what the operations are.
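
For concreteness, the operation swap described above amounts to a simple renaming over sampled cells; the conv_3x3 / conv_5x5 names and the helper below mirror the illustrative sketch above rather than the actual benchmark code.

```python
# Replace every dilated or depthwise separable convolution with a plain
# convolution of the same kernel size, leaving the cell structure untouched.
SWAP = {
    "dil_conv_3x3": "conv_3x3", "dil_conv_5x5": "conv_5x5",
    "sep_conv_3x3": "conv_3x3", "sep_conv_5x5": "conv_5x5",
}

def swap_convolutions(genotype):
    return [(SWAP.get(op, op), source) for op, source in genotype]
```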

The importance of tricks in the training protocol

Another key ingredient of the final accuracy is the training protocol used for performance estimation. A more complex training protocol, including tricks such as Drop-Path (randomly dropping operands of the join layers) or Cutout (masking a random square of the input image), certainly helps to reach better performance. To measure this, we randomly sample 8 architectures from the DARTS search space and train them with diverse protocols. We find a considerable difference in accuracy (about 3%) between the basic training protocol (the one of DARTS with all tricks disabled) and one including diverse tricks, similar to the protocols used in most recent state-of-the-art methods.

Comparison of different training protocols for the DARTS search space on CIFAR10. Same colored dots represent minimum and maximum accuracies in the 8 runs.

This suggests that most recent state-of-the-art results, though impressive, cannot always be attributed to superior search strategies and rely significantly on superior training protocols.
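
Cutout, mentioned above, illustrates how simple such tricks can be. Below is a minimal sketch that masks a random square of an input image; the default patch size is an assumption for illustration, not necessarily the value used in the study.

```python
import numpy as np

def cutout(image, length=16):
    """Zero out a random length x length square of an H x W x C image."""
    h, w = image.shape[:2]
    y, x = np.random.randint(h), np.random.randint(w)  # centre of the mask
    y1, y2 = max(0, y - length // 2), min(h, y + length // 2)
    x1, x2 = max(0, x - length // 2), min(w, x + length // 2)
    masked = image.copy()
    masked[y1:y2, x1:x2, :] = 0.0
    return masked
```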

How to overcome these issues?

  • New search strategies should systematically be benchmarked with an already used search space and training protocol; otherwise, they can only be compared to other methods by adapting those methods to the new search space and training protocol.
  • To better evaluate whether a search strategy is able to discard bad architectures, the search space should be designed to contain both good and bad architecture designs.
  • Evaluation on multiple datasets and multiple tasks would help assess how well a given method generalizes. Hyperparameters introduced by NAS methods should either be general enough not to require further tuning, or their tuning cost should be included in the search cost.

For reproducibility (which is also an important concern in NAS), the code used to produce this study is available at https://github.com/antoyang/NAS-Benchmark. Further insights and details can be found in the paper NAS evaluation is frustratingly hard, listed in the references below.

References:

NAS evaluation is frustratingly hard: https://arxiv.org/pdf/1912.12522.pdf

DARTS: https://arxiv.org/pdf/1806.09055.pdf
