Why Neural Architecture Search is becoming far less expensive

Forrest Iandola
5 min read · Aug 31, 2019


Search times have gone down by 100x

1. Overview

Recently, Neural Architecture Search (NAS) has begun to outperform expert engineers at designing fast and highly accurate deep neural networks (DNNs). However, NAS methods have typically required an astronomical compute cost to search the space of DNNs and select the right one. These searches have required so much cloud computing time, in fact, that they are cost-prohibitive for all but the most well-funded corporate R&D laboratories.¹ Fortunately, with a new breed of NAS methods called supernetwork-based search (e.g. FBNet and SqueezeNAS), these costs have fallen dramatically.

2. Related Work and History

2.1. Manually-Designed Deep Neural Networks

In 2012, AlexNet won the ImageNet image-classification competition by a substantial margin. By 2014, deep neural networks were the leading approach for solving many problems in computer vision. From 2012 to 2016, a substantial portion of the computer vision research community focused on designing DNNs that achieved the highest possible accuracy. While accuracies went up, so did the resource requirements to run these deep neural networks. For example, VGGNet, one of the top performers in the 2014 ImageNet competition, required roughly 10x more computation and had 2x more parameters than AlexNet.

By 2016, it was becoming clear that the race to increase accuracy by bloating the DNN’s resource requirements was unsustainable. SqueezeNet, which my colleagues and I published in 2016, demonstrated that reasonable accuracy on ImageNet can be achieved with a minuscule (under 5 megabyte) budget of parameters. In 2017, MobileNetV1 demonstrated how to dramatically reduce the quantity of multiply-accumulate operations (MACs) relative to SqueezeNet and other prior work. Shortly after MobileNetV1 was published, ShuffleNetV1 debuted. ShuffleNetV1 is a DNN optimized for low latency on mobile CPUs, and the ShuffleNetV1 paper also showed that reducing MACs is only loosely correlated with reducing latency. Our recent post, “Not all TOPs are created equal,” provides more context on why neural networks with fewer MACs don’t always achieve lower latency.

2.2. Reinforcement Learning (RL) based NAS

By 2018, Neural Architecture Search had begun to build DNNs that run at lower latency and achieve higher accuracy than previous manually-designed DNNs. For example, Google’s MnasNet outperformed ShuffleNet in both latency and accuracy on ImageNet. MnasNet was generated using a Reinforcement Learning (RL) based system for Neural Architecture Search. The search process operates as follows: an RL controller proposes DNN architectures to train; the DNNs are trained, and the results of these training runs are provided as feedback to the RL controller, which then proposes more DNNs to train.
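To make that loop concrete, here is a minimal, framework-agnostic sketch of RL-based NAS. The controller object, the training and latency-measurement callables, and the exact reward formula are illustrative assumptions (MnasNet uses a similar accuracy-and-latency reward), not the actual MnasNet implementation.

```python
# Hypothetical sketch of an RL-based NAS loop. The controller, training, and
# latency-measurement objects are assumed placeholders, not a real library API.
def rl_nas_search(controller, train_and_evaluate, measure_latency,
                  target_latency_ms=80.0, num_candidates=1000):
    best_arch, best_reward = None, float("-inf")
    for _ in range(num_candidates):
        # 1. The RL controller proposes a candidate DNN architecture.
        arch = controller.sample_architecture()

        # 2. The candidate is trained and evaluated. This is the expensive
        #    step: roughly 1 GPU-day per candidate, repeated ~1000 times.
        accuracy = train_and_evaluate(arch)
        latency_ms = measure_latency(arch)

        # 3. Accuracy and latency are folded into a single reward, which is
        #    fed back to the controller so it proposes better candidates.
        reward = accuracy * (latency_ms / target_latency_ms) ** -0.07
        controller.update(arch, reward)

        if reward > best_reward:
            best_arch, best_reward = arch, reward
    return best_arch
```

The key point is step 2: every candidate the controller proposes must be trained, which is where the enormous compute bill comes from.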

RL-based NAS is highly effective in producing state-of-the-art DNNs that are both fast and accurate. However, RL-based searches require an extremely high computing cost. A realistic scenario is that an RL-based NAS trains 1000 DNNs, each requiring an average of 1 GPU-day of training time, for a total of 1000 GPU-days of training time.¹ At today’s prices on Amazon Web Services, this means that one search using RL-based NAS can cost over $70,000 in cloud computing time.²

2.3. Supernetwork-based NAS

As we saw in the previous section, training 1000 separate DNNs is an expensive proposition. In late 2018, new papers such as DARTS and BASE introduced a supernetwork-based approach to neural architecture search. Rather than training 1000 separate DNNs, we can now “turn the problem inside out” and train one DNN that contains quadrillions of different DNN designs, at the cost of about 10 DNN training runs. Then, FBNet (Facebook Berkeley Network) demonstrated that supernetwork-based NAS can produce DNNs that outperform the latency and accuracy of MnasNet using just 10 GPU-days of search time, a 100x reduction in search time relative to MnasNet.³ That brings the search cost from over $70,000 to around $700 of AWS cloud GPU time. To state that another way, 10 GPU-days is just over 1 day of time on an 8-GPU machine that you could fit in your closet.⁴
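To give a flavor of what “one DNN that contains many designs” means, below is a simplified PyTorch sketch of a single searchable layer, loosely in the spirit of DARTS and FBNet. The candidate op set, the Gumbel-softmax temperature, and the class names are illustrative assumptions, not the exact formulation from any of these papers.

```python
# A simplified sketch of one "searchable" layer in a supernetwork. Candidate
# operations run in parallel and are blended by trainable architecture
# parameters; after training, the highest-weighted op at each layer is kept.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableLayer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # A tiny illustrative set of candidate operations for this layer.
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),  # "skip" candidate
        ])
        # One trainable architecture parameter (logit) per candidate op.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        # Soft op weights; training these logits jointly with the ordinary
        # weights is what lets one supernetwork cover many designs at once.
        weights = F.gumbel_softmax(self.alpha, tau=1.0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def chosen_op(self):
        # After the search, keep only the most-weighted candidate.
        return self.ops[int(self.alpha.argmax())]
```

A full supernetwork stacks many such layers, so a network with, say, 20 layers and 10 candidates per layer implicitly covers 10^20 distinct designs, yet the whole thing is trained at roughly the cost of a handful of ordinary training runs.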

Figure: What’s the difference between RL-based and supernetwork-based NAS? Note that our actual experiments have more layers and module types than are shown in this figure.

3. Supernetwork-based NAS for automated driving applications

Automated Driving (AD) is one of the most demanding applications of deep learning and computer vision, and it requires high accuracy and low latency in real-world conditions. One of the most computationally-intensive computer vision problems in AD is semantic segmentation, which is used to identify the precise outline of the road, the lanes, pedestrians, and more.

Recently, my colleagues and I used supernetwork-based NAS to design a DNN for semantic segmentation (i.e. identifying the precise shape of the road, lanes, cars, and other things). We configured our NAS system to optimize for high accuracy on the Cityscapes semantic segmentation dataset while achieving low latency on a small automotive-grade computing platform. Our supernetwork-based NAS produced a family of DNN models that we call SqueezeNAS. Our SqueezeNAS models achieve a better latency-accuracy tradeoff than Google’s MobileNetV3. This is particularly exciting because MobileNetV3 was designed with the highly expensive (often 1000+ GPU-days) Reinforcement Learning based NAS approach, while our search time was quite low: only 10 GPU-days to search for an optimized SqueezeNAS DNN model.
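As an illustration of how a search can be steered toward a specific hardware target, here is a hedged sketch of a latency-aware objective of the kind used in supernetwork searches. The function name, the log-latency penalty, and the coefficient are assumptions for illustration; the exact loss used in SqueezeNAS and FBNet differs in its details.

```python
# Sketch of a latency-aware search objective. Each candidate op has a
# pre-measured latency on the target hardware; the expected latency of the
# sampled architecture is added to the task loss, steering the search toward
# networks that are both accurate and fast on that platform.
import torch

def search_loss(task_loss, op_weights_per_layer, op_latency_per_layer,
                latency_coeff=0.1):
    """
    task_loss:             segmentation / classification loss (scalar tensor)
    op_weights_per_layer:  list of 1-D tensors of soft op weights, one per layer
    op_latency_per_layer:  list of 1-D tensors of measured per-op latencies (ms)
    """
    expected_latency = sum(
        (w * lat).sum()
        for w, lat in zip(op_weights_per_layer, op_latency_per_layer)
    )
    # Penalize slow architectures; latency_coeff trades accuracy vs. speed.
    return task_loss + latency_coeff * torch.log(expected_latency)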

4. Outlook on the future

With the rise of supernetwork-based NAS such as FBNet and SqueezeNAS, it is now possible to produce state-of-the-art DNNs for many applications with relatively low search time (on the order of 10 GPU-days rather than 1000 GPU-days). Further, I expect that supernetwork-based NAS will be utilized to design neural networks for a vast array of problems and domains:

  • any deep learning task: (semantic segmentation, object detection, depth estimation, natural language processing, speech recognition, and more)
  • any computing platform: (server, mobile, IoT) and (CPU, GPU, TPU)
  • any sensor: (camera, lidar, radar, microphone)
  • any goal(s): (accuracy, parameter size, MACs, latency, energy)

Footnotes

¹ The Google engineers who developed MnasNet reported that their architecture search required 288 days’ worth of runtime on Google TPUv2 Pods. By our estimate, this is equivalent to over 1000 NVIDIA V100 GPU-days of runtime.

² As of August 2019, Amazon Web Services charges $24.48 per hour to use a “P3” server with 8 NVIDIA V100 GPUs. $24.48 / 8 = $3.06 per GPU-hour. 1000 GPU-days would cost $3.06 * (24 hours per day) * (1000 days) = $73,440.
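For readers who want to plug in their own prices, the arithmetic above is easy to reproduce; the snippet below simply restates it (August 2019 prices, subject to change).

```python
# Reproduces the cost estimates above using August 2019 AWS "P3" pricing.
price_per_8gpu_hour = 24.48                      # "P3" server with 8x V100
price_per_gpu_hour = price_per_8gpu_hour / 8     # $3.06 per GPU-hour
rl_nas_cost = price_per_gpu_hour * 24 * 1000     # 1000 GPU-days -> $73,440
supernet_cost = price_per_gpu_hour * 24 * 10     # 10 GPU-days   -> ~$734
print(rl_nas_cost, supernet_cost)
```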

³ After the NAS identifies the right DNN design, some further training is sometimes required.

⁴ Scaling DNN training across many processors has become quite efficient in recent years.
