[Paper] Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours (Image Classification)

Training Time 5,000× Faster Than MnasNet, 25× Faster Than ProxylessNAS, 11× Faster Than FBNet. Outperforms MnasNet, ProxylessNAS, FBNet & MobileNetV2

Sik-Ho Tsang
CARRE4
6 min read · Oct 26, 2020


In this story, “Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours” (Single-Path NAS), by Carnegie Mellon University, Microsoft, and Harbin Institute of Technology, is presented.

  • The NAS (Neural Architecture Search) problem remains challenging due to the combinatorially large design space, which leads to long search times (at least 200 GPU-hours).
  • In this paper, Single-Path NAS is proposed, which drastically decreases the number of trainable parameters and brings the search cost down to a few epochs.
  • Finally, hardware-efficient ConvNets can be searched in less than 4 hours.

This is a paper in 2019 ECML PKDD with over 70 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Multi-Path NAS vs Single-Path NAS
  2. Single-Path NAS: Search Space
  3. Single-Path NAS: Differentiable Runtime Loss
  4. Experimental Results

1. Multi-Path NAS vs Single-Path NAS

1.1. Multi-Path NAS

Multi-Path NAS
  • In multi-path NAS, every layer has a separate path for each candidate operation. For example, a mobile-efficient ConvNet with 22 layers and five candidate operations per layer yields 5²² ≈ 10¹⁵ possible ConvNet architectures.
  • Searching such a space remains considerably costly.
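The size of this design space is simple arithmetic: one independent choice per layer. A minimal sketch, where the variable names are illustrative only:

```python
# Size of a multi-path search space: one independent choice per layer.
num_layers = 22      # layers in the mobile ConvNet example above
num_candidates = 5   # candidate operations per layer

num_architectures = num_candidates ** num_layers
print(f"{num_architectures:,} architectures (about {num_architectures:.1e})")
```

The exact count is roughly 2.4 × 10¹⁵, which is the order of magnitude quoted above.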

1.2. Single-Path NAS

Single-Path NAS
  • Without having to choose among different paths/operations as in multi-path methods, we instead solve the NAS problem as finding which subset of kernel weights to use in each ConvNet layer.

Different candidate convolutional operations in NAS can be viewed as subsets of a single “superkernel”.

2. Single-Path NAS: Search Space

2.1. Single-Path NAS Search Space

Single-path search space
  • 1st row: A fixed macro-architecture is used, consisting of blocks 1 to 7.
  • 2nd row: In each block, up to 4 layers are used. Each layer follows the mobile inverted bottleneck convolution (MBConv) micro-architecture.
  • 3rd row: Each MBConv consists of a point-wise (1×1) convolution, a k×k depthwise convolution, and a linear 1×1 convolution.
  • Each MBConv layer is parameterized by k, i.e., the kernel size of the depthwise convolution, and by expansion ratio e, i.e., the ratio between the output and input channel counts of the first 1×1 convolution.
  • Each MBConv is denoted as MBConv-k×k-e. Mobile-efficient NAS aims to choose each MBConv-k×k-e layer, by selecting among different k and e.

4th row: MBConv layers consider kernel sizes {3, 5} and expansion ratios {3, 6}. NAS also considers a special skip-op “layer”, which “zeroes-out” the kernel and feeds the input directly to the output, i.e., the entire layer is dropped.

2.2. Searching for Kernel Size

Searching for Kernel Size
  • The weights of the two candidate kernels are denoted as w3×3 and w5×5.
  • The weights of the 3×3 kernel can be viewed as the inner core of the weights of the 5×5 kernel, with the weights of the “outer” shell zeroed out.
  • w5×5\3×3 denotes this outer shell, i.e., the weights of the 5×5 kernel that lie outside the 3×3 core.
  • The NAS decision is directly encoded into the superkernel of an MBConv layer as a function of kernel weights:
  • where 1(.) is the indicator function that encodes the architectural NAS choice. With the group Lasso term used:
  • where tk=5 is a latent variable that controls the decision (e.g., a threshold value). Then, the indicator function 1(.) is further relaxed as a sigmoid function, σ(.).
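To make the superkernel idea concrete, here is a minimal pure-Python sketch. It assumes the effective kernel has the form w3×3 + g · w5×5\3×3, where the gate g is the indicator 1(‖w5×5\3×3‖² > tk=5) or its sigmoid relaxation. The function names shell_mask and superkernel, the nested-list kernel layout, and the toy weights are illustrative assumptions, not the authors' code.

```python
import math

def shell_mask(size=5, core=3):
    """Return a size x size mask: 1.0 on the outer shell, 0.0 on the inner core."""
    lo = (size - core) // 2
    hi = lo + core
    return [[0.0 if (lo <= r < hi and lo <= c < hi) else 1.0
             for c in range(size)]
            for r in range(size)]

def superkernel(w5x5, t_k5, relaxed=False):
    """Effective kernel: core weights plus gated shell weights (assumed form)."""
    mask = shell_mask()
    # Squared norm of the shell weights only (the group Lasso quantity).
    shell_norm_sq = sum(
        (w * m) ** 2
        for w_row, m_row in zip(w5x5, mask)
        for w, m in zip(w_row, m_row)
    )
    diff = shell_norm_sq - t_k5
    # Hard indicator for the architecture decision, sigmoid as its relaxation.
    gate = 1.0 / (1.0 + math.exp(-diff)) if relaxed else float(diff > 0)
    # Core weights pass through unchanged; shell weights are scaled by the gate.
    return [
        [w * gate if m else w for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(w5x5, mask)
    ]

w = [[0.1] * 5 for _ in range(5)]    # toy weights; shell norm^2 is about 0.16
print(superkernel(w, t_k5=0.5)[0])   # shell zeroed: behaves like a 3x3 kernel
print(superkernel(w, t_k5=0.1)[0])   # shell kept: behaves like a 5x5 kernel
```

With the threshold above the shell norm the layer degenerates to a 3×3 kernel, and with the threshold below it the full 5×5 kernel is used, so a single set of weights covers both candidates.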

2.3. Searching for Expansion Ratio

Searching for Expansion Ratio
  • Similarly, MBConv-k×k-3 layer with expansion ratio e=3 can be viewed as using one half of the channels of an MBConv-k×k-6 layer with expansion ratio e=6, while “zeroing” out the second half of channels {wk,6\3}.
  • By “zeroing” out the first half of the output filters as well, it becomes the “skip-op” path, i.e. the residual connection.
  • Hence, for input x, the output of the i-th MBConv layer of the network is:
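The kernel-size and expansion-ratio decisions together determine which MBConv-k×k-e variant (or skip-op) a layer becomes. The following sketch decodes one layer from its weight norms, assuming each decision compares a squared group norm against its threshold as described above. The function name decode_layer, its argument names, and the threshold names t_k5, t_e3, t_e6 are hypothetical.

```python
def decode_layer(shell_norm_sq, first_half_norm_sq, second_half_norm_sq,
                 t_k5, t_e3, t_e6):
    """Map one layer's squared weight norms and thresholds to an MBConv choice."""
    if first_half_norm_sq <= t_e3:
        # Both channel halves are zeroed out: only the residual path remains.
        return "skip"
    k = 5 if shell_norm_sq > t_k5 else 3          # keep the outer shell or not
    e = 6 if second_half_norm_sq > t_e6 else 3    # keep the second channel half or not
    return f"MBConv-{k}x{k}-{e}"

print(decode_layer(0.9, 2.0, 1.5, t_k5=0.5, t_e3=1.0, t_e6=1.0))
print(decode_layer(0.1, 2.0, 0.2, t_k5=0.5, t_e3=1.0, t_e6=1.0))
print(decode_layer(0.9, 0.3, 1.5, t_k5=0.5, t_e3=1.0, t_e6=1.0))
```

The three calls cover the largest candidate, the smallest candidate, and a dropped layer, in that order.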

3. Single-Path NAS: Differentiable Runtime Loss

  • To design hardware-efficient ConvNets, the differentiable objective should reflect both the accuracy of the searched architecture and its inference latency on the target hardware. Hence, a latency-aware formulation is used:

The first term CE corresponds to the cross-entropy loss of the single-path model.

The hardware-related term R is the runtime in milliseconds (ms) of the searched NAS model on the target mobile platform.

  • Finally, the coefficient λ modulates the trade-off between cross-entropy and runtime.
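As a rough illustration of the latency-aware objective, the sketch below combines the two terms. It assumes the runtime enters through a logarithm, i.e., loss = CE + coefficient · log(R); the text above only states that a coefficient trades off CE against R, so the log form, the function name, and the parameter name lam are assumptions.

```python
import math

def latency_aware_loss(cross_entropy, runtime_ms, lam):
    """Assumed form: cross-entropy plus lam times the log of the runtime in ms."""
    return cross_entropy + lam * math.log(runtime_ms)

# Toy numbers (not from the paper): same accuracy term, two different runtimes.
print(latency_aware_loss(1.0, 80.0, lam=0.1))
print(latency_aware_loss(1.0, 40.0, lam=0.1))
```

A larger coefficient penalizes slow architectures more strongly, while a coefficient of zero reduces the objective to plain cross-entropy.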

The total network latency of a mobile ConvNet can be modeled as the sum of each i-th layer’s runtime Ri, since the runtime of each operator is independent of other operators:

  • In Single-Path NAS, the target mobile platform (Pixel 1) is used to record the runtime for each candidate kernel operation per layer i, i.e., Ri3×3,3, Ri3×3,6, Ri5×5,3, and Ri5×5,6.
  • Specifically, the runtime of layer i is defined first as a function of the expansion ratio decision:
  • By incorporating the kernel size decision, the total runtime is:
  • Again, the indicator function is relaxed to a sigmoid function σ(.) when computing gradients. The runtime model is shown to be accurate, with an average prediction error of 1.76%.
  • Single-Path NAS follows the MnasNet training schedule. Finally, the hardware-efficient ConvNet found by Single-Path NAS:
Hardware-efficient ConvNet found by Single-Path NAS (Orange Part is not found by NAS)
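Returning to the runtime model of this section, the per-layer runtime can be sketched as a function of the same decisions that select the architecture. This is a sketch under assumptions: the expansion-ratio step interpolates between the profiled 5×5 runtimes, and the kernel-size step rescales by the ratio of the 3×3 and 5×5 runtimes at e=6, which means the 3×3, e=3 case is approximated rather than looked up. The function names, the dictionary layout, and the millisecond values are hypothetical.

```python
def layer_runtime(r, use_layer, use_e6, use_k5):
    """Runtime of one layer in ms.

    r maps (k, e) to a profiled runtime; the three gates are 0/1 indicators
    (or sigmoid values in [0, 1] when the decisions are relaxed).
    """
    # Expansion-ratio decision, expressed on the 5x5 profile.
    r_e = use_layer * (r[(5, 3)] + use_e6 * (r[(5, 6)] - r[(5, 3)]))
    # Kernel-size decision: scale down toward the 3x3 cost if the shell is unused.
    ratio = r[(3, 6)] / r[(5, 6)]
    return ratio * r_e + r_e * (1.0 - ratio) * use_k5

def network_runtime(profiles, decisions):
    """Total latency as the sum of independent per-layer runtimes."""
    return sum(layer_runtime(r, *d) for r, d in zip(profiles, decisions))

# Hypothetical profiled runtimes (ms) for one layer, not measured values.
r = {(3, 3): 1.0, (3, 6): 2.0, (5, 3): 1.5, (5, 6): 3.0}
print(layer_runtime(r, 1, 1, 1))   # MBConv-5x5-6
print(layer_runtime(r, 1, 1, 0))   # MBConv-3x3-6
print(layer_runtime(r, 0, 1, 1))   # skip-op
print(network_runtime([r, r], [(1, 1, 1), (1, 0, 1)]))
```

Because every term is a product of gates and constants, replacing the indicators with sigmoids keeps the whole expression differentiable with respect to the thresholds.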

4. Experimental Results

4.1. ImageNet classification

ImageNet classification accuracy
  • Single-Path NAS achieves better top-1 accuracy than ProxylessNAS by +0.31%, while maintaining a comparable target latency of ≤ 80 ms on the same target mobile phone.
  • Single-Path NAS also outperforms methods in this mobile latency range, i.e., better than MnasNet (+0.35%), FBNet-B (+0.86%), and MobileNetV2 (+1.37%).
  • MnasNet reports that the controller uses 8k sampled models, each trained for 5 epochs, for a total of 40k train epochs.
  • ChamNet trains 240 sampled models for five epochs each, for a total of 1.2k epochs.
  • ProxylessNAS reports 200× search cost improvement over MnasNet, hence the overall cost is the TPU-equivalent of 200 epochs.
  • Finally, FBNet reports 90 epochs of training on a proxy dataset (10% of ImageNet).

In comparison, Single-Path NAS has a total cost of eight epochs, which is 5,000× faster than MnasNet, 25× faster than ProxylessNAS, and 11× faster than FBNet.
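The quoted speedups follow directly from the epoch counts listed above. A small sketch of the arithmetic, where the dictionary and variable names are illustrative:

```python
# Search cost in training epochs, as listed above.
search_epochs = {
    "MnasNet": 40_000,
    "ChamNet": 1_200,
    "ProxylessNAS": 200,
    "FBNet": 90,
}
single_path_epochs = 8

# Speedup of Single-Path NAS relative to each method.
speedup = {name: epochs / single_path_epochs
           for name, epochs in search_epochs.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:g}x")
```

The FBNet ratio is 11.25, which rounds to the 11× quoted, and the same calculation gives 150× relative to ChamNet.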

ImageNet classification accuracy with various channel size scales
  • By rescaling the networks using a width multiplier, Single-Path NAS model consistently outperforms prior methods under varying runtime settings.
  • For instance, Single-Path NAS at 79.48 ms is 1.56× faster than the scaled MobileNetV2 model of similar accuracy.
