On the Design Space of Deep Architecture Models

Kien Hao Tiet
The Startup · Jul 1, 2020

Goal: This blog introduces readers to a new perspective on finding architectures through statistical analysis, instead of relying on large search algorithms like DARTS (Differentiable Architecture Search), ENAS (Efficient Neural Architecture Search), etc.

Recently, Convolutional Neural Networks (CNNs) have achieved success in many computer vision applications, such as scene segmentation and medical imaging. Although CNN architectures have been well studied, researchers are aware that hand-designed architectures do not produce the best models, and developing new techniques to design better models (for example, moving from VGG to ResNet) is usually time consuming. The current trend is to use Neural Architecture Search (NAS) to find the optimal architecture for a given dataset. Even though NAS promises better-performing models than hand-designed architectures, it takes an enormous amount of computation to produce such an architecture.

A more affordable approach in this spirit, which recently achieved top performance on ImageNet, is EfficientNet [1]. The main idea behind EfficientNet is to fix a skeleton (backbone) architecture and vary its depth, its width, and the resolution of the input image. The difference between NAS and EfficientNet is that EfficientNet does not try to find the best local cells as NAS does, but instead searches over the dimensions of the whole network. In other words, EfficientNet introduces a new aspect of searching: keep the skeleton architecture fixed and vary the network's dimensions (depth, width, and resolution).
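For reference, EfficientNet couples these three dimensions through a single compound coefficient φ (this is the scaling rule from the EfficientNet paper [1]; α, β, γ are constants found by a small grid search):

d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}, \quad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \; \alpha, \beta, \gamma \geq 1

Here d, w, r are the depth, width, and resolution multipliers; increasing φ scales all three together so that the total compute grows by roughly 2^φ.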

The core idea of EfficientNet was actually explored earlier by researchers from Facebook [2]. However, the task they aimed at is to assess how good the design of an architecture is. There are two big components in this task. First, we need to choose the skeleton or backbone architecture whose dimensions we will vary. This backbone architecture defines a space, and all models that share common properties with this backbone are referred to as belonging to the same model family. The second component is to vary the dimensions of the skeleton architecture, which is referred to as searching in the design space.

I. What defines the space?

  1. Model family: a collection of architectures that share “some high-level structures or design principles”, such as residual connections or attention-based convolution blocks. In the paper, the authors only consider the model family with residual connections, because such networks are easy to implement and train.
  2. The design space: There are two main components for a given design space.

A parametrization of a model family such that specifying a set of model hyper-parameters fully defines a network instantiation and a set of allowable values for each hyper-parameter — page 3 of the paper

Note: Please see the appendix of the paper for further information on the dimensions of the network.
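To make this concrete, here is a minimal sketch of what such a parametrization could look like in code. The hyper-parameter names and the sample_config helper below are hypothetical, made up for illustration; they are not from the paper.

```python
import random

# A hypothetical parametrization of a ResNet-like model family:
# each hyper-parameter has a set of allowable values.
DESIGN_SPACE = {
    "depth":      [10, 16, 22, 28, 34, 40],   # number of residual blocks
    "width":      [16, 32, 64, 128, 256],     # channels of the first stage
    "bottleneck": [1, 2, 4],                  # bottleneck ratio
    "groups":     [1, 2, 4, 8],               # grouped-convolution width
}

def sample_config(space, rng=random):
    """Pick one allowable value per hyper-parameter; this fully
    specifies a single network instantiation from the family."""
    return {name: rng.choice(values) for name, values in space.items()}

# Sampling the design space: each config would then be built,
# trained, and evaluated to obtain its error.
configs = [sample_config(DESIGN_SPACE) for _ in range(500)]
```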

II. Tools to assess the space

In statistics, to assess a space we need the distribution over that space, so that we can easily compare between spaces. This is also an advantage of this approach: we can compare two spaces directly and with confidence.

Radosavovic et al. [2] suggest using empirical distribution functions (EDFs), which have the following formula:
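For n models sampled from the design space, with e_i the error of the i-th model, the error EDF is (reproduced from the paper's definition):

F(e) = \frac{1}{n} \sum_{i=1}^{n} I[e_i < e]

i.e. F(e) is the fraction of sampled models whose error is below the threshold e.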

Here I(·) is an indicator function, and e_i is the error of a particular model.

Observation: Although the EDF seems to be a good fit when we consider the distribution of errors a given family produces, it does not account for the complexity or efficiency of the models in the family. To account for these two issues, we can use the normalized comparison EDF.

The number c refers to the complexity of a particular model in the family (for example, its parameter count or FLOPs), and the normalized EDF becomes:
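Following the paper's definitions (the weighting scheme is discussed around equation 3 of the paper), each model gets a weight w_i and the normalized EDF is:

\hat{F}(e) = \sum_{i=1}^{n} w_i \, I[e_i < e], \qquad \sum_{i=1}^{n} w_i = 1

so models are no longer counted equally, but according to weights derived from their complexity c_i.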

Following the paper, the authors recommend binning the models by complexity and then assigning the weights w_i uniformly within each bin (equation 3 in the paper).
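As a rough sketch of how these curves could be computed, here is a small NumPy example (my own illustration, not code from the paper), assuming we already have an error and a complexity value for each sampled model:

```python
import numpy as np

def edf(errors, grid):
    """Plain error EDF: fraction of models with error below each threshold."""
    errors = np.asarray(errors)
    return np.array([(errors < e).mean() for e in grid])

def normalized_edf(errors, complexities, grid, n_bins=10):
    """Normalized EDF: weight models so every complexity bin counts equally."""
    errors = np.asarray(errors)
    complexities = np.asarray(complexities)
    bins = np.linspace(complexities.min(), complexities.max(), n_bins + 1)
    bin_ids = np.clip(np.digitize(complexities, bins) - 1, 0, n_bins - 1)
    counts = np.bincount(bin_ids, minlength=n_bins)
    # Uniform weight per bin, split evenly among the models inside it.
    weights = (1.0 / n_bins) / counts[bin_ids]
    weights /= weights.sum()
    return np.array([weights[errors < e].sum() for e in grid])

# Example: errors and complexities (parameter counts) of 500 sampled models.
rng = np.random.default_rng(0)
errors = rng.uniform(0.05, 0.40, size=500)
complexities = rng.uniform(1e6, 25e6, size=500)
grid = np.linspace(0.0, 0.5, 101)
curve = normalized_edf(errors, complexities, grid)
```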

(Figure from the paper: EDF curves of the sampled models.)

Note:

  1. The ideal sample size to obtain the curve above is around 100 to 1,000 models. The results remain similar above 1,000 models.
  2. To assess the curves above, the authors suggest two metrics: the approximate integral of the EDF and random search efficiency; a rough sketch of the first one follows below. Check out page 6 of the paper for further discussion on how to assess the curves mathematically.
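To give a feel for the first metric, here is a minimal sketch that summarizes an EDF curve by numerically integrating it up to an error threshold e_max, so that design spaces whose curves rise earlier get larger scores. This is my own simplification of the idea, not the paper's exact definition, and the default threshold is an arbitrary choice.

```python
import numpy as np

def edf_area(errors, e_max=0.5, n_points=101):
    """Approximate area under the error EDF on [0, e_max] (larger is better)."""
    errors = np.asarray(errors)
    grid = np.linspace(0.0, e_max, n_points)
    curve = np.array([(errors < e).mean() for e in grid])
    return np.trapz(curve, grid) / e_max  # normalized to [0, 1]
```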

III. Case Study: NAS

Note: In the paper, the authors run experiments on ResNeXt before the case study with NAS. I think it would be redundant to repeat those results here, so I will skip that experiment and go straight to the case study with NAS. However, to understand this part, please make sure you understand what NAS is (read more about NAS here), as well as ResNeXt-A and ResNeXt-B. In the paper, ResNeXt-A and ResNeXt-B both have a ResNet-like skeleton, but their dimensions are sampled differently. Moreover, ResNeXt-A and ResNeXt-B are in the ResNet family because they have residual connections.

As we already know, NAS has recently gained a lot of attention because it can produce architectures that we would not come up with by hand. In addition, NAS usually returns better results on popular datasets than hand-designed models. The downside of NAS is its expensive computation. You can find more details in my other blog.

So, what is interesting about this part?

First of all, the structure of the cells in NAS models does not change much. However, when researchers change the way the algorithms search for the optimal architecture, they are actually changing the design space.

In the graph, the researchers proceed as discussed above: they sample models from each algorithm's design space by varying the internal cells (i.e. operator and node selection), as well as the depth and width. The observation here is that the DARTS design space is the most efficient, and the most reliable, because its curve is the steepest (on the left).

The second interesting point is that, although NAS models have shown their success, the graph below demonstrates that we can actually do a better job of finding a strong architecture simply by assessing the design space.

The observation here is that ResNeXt-B has competitive performance compared to DARTS, even though ResNeXt-B is just a ResNet-based model with variation in depth and width.

The takeaway is that if we want better models, we should first come up with a better design space; only then should we focus on optimizing the search algorithm. This is also the drawback of this paper: the authors report that it takes 250 GPU hours to sample 500 models and train them on the CIFAR dataset. On the other hand, this is still a big step up from the first NAS models, which required on the order of 10⁵ GPU hours.

IV. Idea

There is one thing left I want to share. Although the paper targets designing the space for deep learning models for image classification, it reminds me of the multi-task problem (both for vision and NLP). As we know, another way of reducing memory is to have one model train on several tasks. The drawback of multi-tasking is the difficulty of identifying which tasks should be trained in the same batch so that they do not drag the overall performance down. I believe the technique in this paper could be used to understand the difficulty among tasks based on the design space of the given model. With EDFs, we could assess which tasks should be trained together to improve the final accuracy of the model.

References

[1] Mingxing Tan and Quoc V. Le: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. https://arxiv.org/pdf/1905.11946.pdf

[2] Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo and Piotr Dollár: On Network Design Spaces for Visual Recognition. https://arxiv.org/pdf/1905.13214.pdf

