A starting point for neural architecture search

Kien Hao Tiet
Aviation Software Innovation
8 min read · Jul 1, 2020

Goal: This blog is a guide for new hobbyists starting to learn how to search for new neural network architectures.

Figure 1: The Inception model (image captured from Google Search)

Yes, the figure above is the Inception network architecture. How long did it take researchers to derive this architecture? And how long do you think it would take a newbie to derive such an architecture, where you need that many blocks and, within each block, have to deal with many other details like in the figure above?

The current trend in the deep learning community is to rely on the computational resources we now have to design architectures that perform better than prior hand-designed models. This trend is called Neural Architecture Search (NAS). Although there are already many blogs about NASNet and ENAS, this one will offer you a general picture of these two techniques, their pros and cons, and some implementation details. So let's begin with the general idea of NAS.

I. Everything Starts with the Definition

The main idea of Neural Architecture Search (NAS) is to use deep learning to create another deep neural network that solves the problem without the involvement of a human expert.

However, the reality is not that easy. In a nutshell, NAS has two components:

  1. The controller
  2. The child model

The controller crawls over all possible ways to build a network. At any given time, the controller samples a potential architecture from the search space. This new architecture is called the child model. Then we evaluate the child model on the dev/test set to see its performance, and we iterate the process until no better model can be found. Only then do we stop.
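The loop described above can be sketched in a few lines. This is a minimal, hypothetical skeleton, not code from [1] or [2]; `controller`, `build_child`, and `evaluate` are placeholder names for the three moving parts.

```python
def neural_architecture_search(controller, build_child, evaluate,
                               n_iterations=100):
    """Iterate: sample architecture -> build and score child -> update controller."""
    best_arch, best_score = None, float("-inf")
    for _ in range(n_iterations):
        arch = controller.sample()        # sample a candidate architecture
        child = build_child(arch)         # instantiate the child model
        score = evaluate(child)           # performance on the dev/validation set
        controller.update(arch, score)    # feed the score back to the controller
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score
```

In practice each step hides a lot of cost: `build_child` trains a full network, and `controller.update` is the reinforcement-learning step discussed in section V.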

Image from [1]

II. What Is the Role of the Controller?

There are two things that can be the result of the sampling from the controller:

  1. The whole new complete neural network, which is often called macro search.
  2. An individual cell, which is often called micro search.

An example that helps to understand the difference between (1) and (2) is ResNet. In macro search, the controller generates a whole new complete network into which we can feed the images, and it makes predictions just as ResNet does. On the other hand, when the controller only samples a block (like ResNet's block), we call it micro search. In the literature, this block is often called a cell.

So, what are the pros and cons of those two types of search?

Let’s start with micro search. When we only search for the architecture of the block, we implicitly assume the overall architecture of the final network. For instance, we usually assume the skeleton of the final network will look like ResNet, and we only need to change the components in each block of ResNet to make the network better. The advantage is that the search space is not as huge as in macro search, while the performance is relatively the same. The drawback is that the controller does not have the freedom to design a new architecture from scratch, because we are tied to the skeleton chosen at the beginning.

On the other hand, in macro search, we give the controller full freedom to design the new architecture from scratch without any constraint. Hence, in this case, the search space is enormous.

Now we can move on to what the controller looks like.

III. The computational graph

The controller is designed differently depending on the model and the type of search. For simplicity, I will only discuss designing the cell. In the first breakthrough paper [1], the controller has two parts:

  1. A recurrent neural network (RNN) (the authors actually used a long short-term memory, LSTM, in their implementation) to sample what the cell looks like.
  2. A directed acyclic graph (DAG) that represents the topology of the cell.

Image from [2]

The image above shows how the cell is constructed. The direction of an edge represents the flow of data, and each edge is labeled with the operation (e.g. conv3x3, conv5x5, identity) applied along that direction. In the example above (from the ENAS paper), the cell is generated by following the red arrows: the inputs enter at node 1 and then follow the red arrow to node 2 through a chosen operation.
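To make the data flow concrete, here is a toy sketch of evaluating a sampled cell as a DAG. The operations are stand-in scalar functions rather than real convolutions, and the decision format (source node, operation name) simply follows the description above; none of the names come from the papers' code.

```python
# Toy stand-ins for the real operations on the edges.
OPS = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,   # stand-in for e.g. conv3x3
    "square":   lambda x: x * x,   # stand-in for e.g. conv5x5
}

def run_cell(inputs, decisions):
    """inputs: values for node 1 and node 2.
    decisions: for each later node, a list of (source_node, op_name)
    edges whose results are summed, as in the figure."""
    nodes = {1: inputs[0], 2: inputs[1]}
    next_id = 3
    for edges in decisions:
        nodes[next_id] = sum(OPS[op](nodes[src]) for src, op in edges)
        next_id += 1
    return nodes[next_id - 1]   # output of the last node
```

For example, a node 3 that takes node 2 through "double" and node 2 through "identity" computes `2*x2 + x2`, mirroring how the red-arrow edges feed a node.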

Note: In the implementation of ENAS, the authors initialize the DAG as the graph above with all possible operations ready on each edge. At any time, when the DAG receives an instruction from the controller, it just needs to switch between operations. That is why ENAS is described as sharing parameters. The code below helps to illustrate the idea:
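Here is a pure-Python sketch of that parameter-sharing idea (a real implementation would hold `torch.nn` modules, as discussed next). The key point: every edge owns one persistent parameter per candidate operation, so sampling a new child model only *selects* among them; nothing is re-initialized. `SharedEdge` is a hypothetical class for illustration, not an API from the ENAS code.

```python
class SharedEdge:
    def __init__(self, op_names):
        # one persistent weight per candidate operation on this edge
        self.weights = {name: 1.0 for name in op_names}

    def forward(self, x, chosen_op):
        # apply only the operation the controller currently selects
        return self.weights[chosen_op] * x   # stand-in for conv/identity/etc.

    def update(self, chosen_op, delta):
        # only the weight of the active operation is trained
        self.weights[chosen_op] += delta

edge = SharedEdge(["identity", "conv3x3", "conv5x5"])
edge.update("conv3x3", 0.5)   # train while "conv3x3" is the active op
# Later the controller switches this edge to "conv5x5", and back again:
# the trained "conv3x3" weight was kept, not reset.
```

This is what makes ENAS cheap: child models are views into one big shared parameter pool, not freshly trained networks.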

In TensorFlow, we do not need to create the graph explicitly like this, because we can create a Session and the computation graph is retained for the whole training process. In PyTorch, however, this is more difficult: we have to explicitly register these operations in an nn.ModuleList, otherwise their parameters will not be retained after the backward pass.

IV. How to sample operations for the graph?

Up to now, you should have a good understanding of the computation graph for NAS. The remaining question is how to decide which edges should be chosen and which operation to use on each chosen edge. In [1] and [2], an LSTM is used to sample all of the above information.

First of all, node 1 and node 2 are always the input nodes, taken from the last two layers. If there are not yet two previous layers, the input images are used instead. Thus, we only need to generate the structure for nodes 3 and 4. As you can see, the LSTM generates 8 hidden states. The first 4 belong to node 3, which receives information from two other nodes (node 2 and node 2 in the example) through one operation each (conv5x5 and identity in the example). Then we add the two results together.
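The decision format per node can be sketched as follows. A real controller is an LSTM whose softmax outputs produce these choices; `random.choice` stands in for that sampling here, and `OP_NAMES` is an illustrative operation set, not the exact one from the papers.

```python
import random

OP_NAMES = ["conv3x3", "conv5x5", "identity", "maxpool3x3"]

def sample_cell(num_nodes, rng=random):
    """For each non-input node, make 4 decisions:
    two source nodes and one operation for each source."""
    decisions = []
    for node in range(3, num_nodes + 1):      # nodes 1 and 2 are inputs
        src_a = rng.choice(range(1, node))    # first source node
        op_a  = rng.choice(OP_NAMES)          # op applied to it
        src_b = rng.choice(range(1, node))    # second source node
        op_b  = rng.choice(OP_NAMES)          # op applied to it
        decisions.append(((src_a, op_a), (src_b, op_b)))
    return decisions

cell = sample_cell(4)   # decisions for nodes 3 and 4 only
```

Note how a source may repeat (as with node 2 appearing twice in the example above), and later nodes can draw from any earlier node, which is what makes the cell a DAG.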

This process is just for one cell; the final architecture is created by stacking many cells. Therefore, the LSTM is used to generate 4 * (number of nodes in each cell - 2) * (number of cells in the final architecture) decisions. In addition, there are two types of cell: the reduction cell (used to reduce the spatial size of the feature maps, as in ResNet) and the normal cell.
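The formula above is easy to sanity-check in code (the function name is mine, not from the papers):

```python
def controller_outputs(nodes_per_cell, num_cells):
    """Total decisions the LSTM emits, per the formula above:
    4 decisions for each non-input node, for every cell in the stack."""
    return 4 * (nodes_per_cell - 2) * num_cells

# e.g. 4-node cells (2 of them non-input), stacked 6 times:
# 4 * 2 * 6 = 48 decisions
```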

Note: There are two points I found interesting in the ENAS code that are not mentioned in the paper:

  1. The authors calibrate the tensor sizes before adding the outputs of the two operations applied to the inputs.
  2. Say we choose conv5x5 to apply to the inputs from node 2, as in the example. The authors actually compute all operations and then pick the output corresponding to conv5x5. I am not sure what the purpose of this is, since it makes the program use more computation.

V. How to Train the LSTM?

The last question is how to train the LSTM to generate better models. The answer is based on the score the child model achieves on the validation set. Thinking of it this way, we can frame this as a reinforcement learning task in which the score on the validation set is the reward for the controller.
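The update direction can be shown with a tiny REINFORCE sketch. This toy replaces the LSTM with a plain probability over two candidate operations and pretends their validation accuracies are fixed; it only illustrates how "reward minus baseline" nudges the controller toward better-scoring choices, and is not the papers' training code.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [0.0, 0.0]             # controller's preference over two ops
rewards = {0: 0.2, 1: 0.9}      # pretend validation accuracy per op
baseline, lr = 0.5, 1.0         # baseline reduces gradient variance

for _ in range(200):
    probs = softmax(logits)
    a = random.choices([0, 1], weights=probs)[0]   # sample an "architecture"
    advantage = rewards[a] - baseline              # reward minus baseline
    # REINFORCE: grad of log pi(a) is (1 - p(a)) for the chosen op
    # and (-p(i)) for the others
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * advantage * grad

# after training, the controller prefers the higher-reward op
```

The real controller applies exactly this policy-gradient idea, only with an LSTM producing the logits and a trained child model producing the reward.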

Note: I think this idea is important, even though it takes enormous computational resources to find the best model in the end. If you grasp the idea, variants of NAS such as DARTS and PNAS will be easy to understand. Moreover, besides finding the best architecture, there are efforts to find the optimal optimizer, batch norm, etc., which are all based on a computational graph like this.

VI. What Else Do We Need to Work on in the NAS Area?

  1. We can work on the assumption about the skeleton of the network. Auto-DeepLab [3] demonstrates an efficient way to generate the residual connections among cells.
  2. The feedback used to sample better models. We have already seen that RL is problematic for this type of problem. There are many efforts to avoid RL, such as DARTS [4], which relaxes the choice of operations on the edges of the computation graph into a probability distribution, so the controller can be optimized with gradient descent alone. Alternatively, we could manipulate the sampling policy of the controller in a completely different way, for example using the uncertainty in the behavior of its child models on the validation set.
  3. Another bottleneck is that we only evaluate one child model at a time. The evolution technique [5] demonstrates an efficient way to evaluate a sample from a population at a time. Another advantage of this approach is that we can use current state-of-the-art models such as ResNet, VGG, etc. as a guide for finding better architectures through the evolution cycle. However, I think this may also be a drawback, since we are biased toward well-established models as the starting point.
  4. Last but not least, we can improve the search space instead of focusing only on the search algorithm. There is also work suggesting interesting ways to assess the search space of the NAS family of models; you can read my blog about it.
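The DARTS relaxation mentioned in point 2 can be sketched in a few lines: instead of sampling one operation per edge, every edge computes a softmax-weighted mixture of all candidate operations, so the architecture parameters (the alphas) become continuous and can be trained by gradient descent. The operations below are toy scalar functions, not real conv layers.

```python
import math

OPS = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,
    "zero":     lambda x: 0.0,
}

def mixed_op(x, alphas):
    """DARTS-style continuous relaxation of one edge:
    softmax(alphas)-weighted sum of every candidate operation."""
    m = max(alphas.values())
    exps = {k: math.exp(v - m) for k, v in alphas.items()}
    z = sum(exps.values())
    return sum((exps[k] / z) * OPS[k](x) for k in OPS)

# with equal alphas the edge averages all operations:
out = mixed_op(3.0, {"identity": 0.0, "double": 0.0, "zero": 0.0})
# (3 + 6 + 0) / 3 = 3.0
```

After training, the operation with the largest alpha on each edge is kept, turning the soft mixture back into a discrete architecture.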

References

[1] Barret Zoph and Quoc V. Le. Neural Architecture Search with Reinforcement Learning. https://arxiv.org/pdf/1611.01578.pdf

[2] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le and Jeff Dean: Efficient Neural Architecture Search via Parameter Sharing. https://arxiv.org/pdf/1802.03268.pdf

[3] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille and Li Fei-Fei: Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation. https://arxiv.org/pdf/1901.02985.pdf

[4] Hanxiao Liu, Karen Simonyan and Yiming Yang: DARTS: Differentiable Architecture Search. https://arxiv.org/pdf/1806.09055.pdf

[5] Esteban Real, Alok Aggarwal, Yanping Huang and Quoc V. Le: Regularized Evolution for Image Classifier Architecture Search. https://arxiv.org/pdf/1802.01548.pdf

