The second paradigm shift in AI
With the advent of deep learning there has been a major shift in how AI engineers work. Now it looks like there might be another shift on the horizon: Neural Architecture Search. The rest of the post will give you a quick tour of how we got here, what this new shift is all about, and what its potential implications are.
This post assumes that you have a rough idea about neural networks or that you are not scared away by being introduced to a small number of technical terms (most of them are linked to Wikipedia for a more comprehensive explanation). For the technically interested reader with a background in machine learning, I have provided detailed footnotes with explanations and links to the respective academic papers.
In this post, I want to share my thoughts on the recent developments in AI. First, I will guide you through a brief history of AI and how the rise of neural networks is widely seen as a paradigm shift. Second, I will introduce Neural Architecture Search, a new method of designing neural networks that might lead us to a second paradigm shift in AI. Third, I will close with some remarks on what this means for the future of AI software and hardware, as well as for businesses.
The hype around AI
Artificial Intelligence is everywhere. Every major company nowadays claims to be using AI somewhere in its product portfolio. It seems like the quote that once worked for Big Data can be reused perfectly:
AI is like teenage sex: everyone talks about it, nobody knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…
When talking to customers, we have found that it is often very hard for people outside the field to tell buzzword-dropping from real experience. However, unlike during the last two AI winters in the 70s and late 80s, people are starting to realize that we are not steering into a third one: this time, AI is ready to make the transition from research to industry.
This new wave of excitement about AI is strongly connected to the establishment of deep learning, which received renewed interest in 2012 with the contribution of Alex Krizhevsky and his colleagues¹ from the University of Toronto to the famous ImageNet challenge. Their “AlexNet” was able to outperform all other competitors, who used “traditional” machine learning models. For many, this was a paradigm shift in AI.
The first paradigm shift: from feature engineering to architecture design
Let us take a step back and see how we got there. For the longest time, people talking about AI were referring to the subfield of machine learning, or “traditional machine learning” for the sake of this post. Its algorithms and methods are mathematically and statistically validated, and their behavior has been studied in detail. The most powerful ensemble models², which combine multiple learning algorithms to obtain better performance, were developed in the 90s and are still widely used today. Especially when working with tabular data, they are often the first choice. However, one rarely achieves top performance directly on the raw data with only minimal pre-processing. Usually, we have to go through an often painful and time-consuming task: feature engineering. It can be described as the process of using domain knowledge or statistical insights to transform the variables available in the dataset in order to find patterns.
For years, the basis of building a good model was this creation and selection of features. In vision tasks, increasingly powerful feature extractors made the problem more tractable. By devising sophisticated new methods³, people redefined the limits of computer vision. Then there was AlexNet, pushing the error in the ImageNet classification challenge from 25.8% in 2011 down to 15.3%. The academic community witnessed a revolution overnight. One year later, all of the top ten submissions to the challenge were based on deep learning.
Over the following years, neural networks beat state-of-the-art models by a large margin in other domains as well, e.g. face recognition, object detection, machine translation, and speech recognition. Soon “end-to-end training”, where a model learns to predict its output directly from the input without intermediate stages, became the gold standard. This marks the first paradigm shift in AI: from feature engineering to architecture design. The question “which features did you use?” made room for “which architectures did you try?”. The new job of the engineer was to conceptualize powerful and flexible architectures that would perform feature extraction themselves. The engineer’s daily tasks thereby moved up one abstraction layer, turning the engineer into the architect and babysitter of the models. This made the design process the costliest and most thought-intensive task.
A key driver of this first paradigm shift was rapidly increasing computational power and efficiency. Neural networks and their learning procedure, backpropagation, were nothing new⁴, but only when researchers and Nvidia pushed training to GPUs, gaining speed-ups of 100x, did training models with millions of parameters become feasible, dispelling the doubts of the previous AI winters.
The second paradigm shift: from architecture design to search space selection
With the first paradigm shift basically complete, we might ask ourselves: what comes next?
Well, I think it is fair to say that 2017 gave a good answer to that question: Neural Architecture Search, a method that uses a neural network to search for and come up with new architectures.
As we have already seen from the first paradigm shift, the ideas that drive change do not have to be new. There were already attempts at automatically designing neural networks during the second AI winter in the late 80s.⁵ However, these ideas fell victim to the same bad timing that plagued so much of the early research in AI: the computational power was simply not there yet.
Fast forward to early 2017: researchers from Google Brain⁶ publish “Neural Architecture Search (NAS) with reinforcement learning”, reviving the hype around automatic architecture design. Instead of a network architecture being designed by hand, another neural network acts as “the controller” and proposes new architectures. For this task, a Recurrent Neural Network (RNN) is used, which takes the proposals it has made in earlier time steps into account when making new decisions. When an architecture is proposed, it is trained until we have a good estimate of its performance. The resulting accuracy is used as feedback to the controller before it makes the next proposal.⁷ This method of using feedback or reward for training purposes is called reinforcement learning, inspired by behavioral psychology.
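To make the propose, evaluate, feedback loop a bit more tangible, here is a minimal sketch in Python. It is a toy under loud assumptions, not the setup from the paper: the search space and its option names are made up, the controller is a plain table of logits instead of an RNN, and `fake_accuracy` is a synthetic stand-in for actually training each proposed network.

```python
import math
import random

# Hypothetical search space: the option names and values are illustrative only.
SEARCH_SPACE = {
    "filter_size": [1, 3, 5, 7],
    "num_filters": [24, 36, 48, 64],
    "num_layers": [2, 4, 6],
}


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fake_accuracy(arch):
    """Synthetic stand-in for training the proposed network and measuring its
    validation accuracy. It arbitrarily favors 3x3 filters, 48 filters, 4 layers."""
    score = 0.5
    score += 0.20 if arch["filter_size"] == 3 else 0.0
    score += 0.15 if arch["num_filters"] == 48 else 0.0
    score += 0.10 if arch["num_layers"] == 4 else 0.0
    return score


def run_search(steps=2000, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # Controller parameters: one logit per option (a stand-in for the RNN).
    logits = {name: [0.0] * len(options) for name, options in SEARCH_SPACE.items()}
    baseline = None
    for _ in range(steps):
        # 1. The controller proposes an architecture by sampling each decision.
        probs = {name: softmax(values) for name, values in logits.items()}
        picks = {name: rng.choices(range(len(p)), weights=p)[0] for name, p in probs.items()}
        arch = {name: SEARCH_SPACE[name][i] for name, i in picks.items()}
        # 2. The proposal is evaluated; its accuracy is the reward.
        reward = fake_accuracy(arch)
        if baseline is None:
            baseline = reward
        # 3. REINFORCE: move along grad log p(pick), scaled by (reward - baseline).
        advantage = reward - baseline
        for name, picked in picks.items():
            for j, p in enumerate(probs[name]):
                grad_log_p = (1.0 if j == picked else 0.0) - p
                logits[name][j] += learning_rate * advantage * grad_log_p
        # An exponential moving average of past rewards serves as the baseline.
        baseline = 0.9 * baseline + 0.1 * reward
    # Report the option the controller now considers most likely per decision.
    return {
        name: SEARCH_SPACE[name][max(range(len(values)), key=values.__getitem__)]
        for name, values in logits.items()
    }


print(run_search())
```

Even in this stripped-down form, the expensive part is visible: every pass through the loop stands for training one complete network, which in the real setting is what eats the GPU time.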
The first full-fledged neural network designed by NAS (or AutoML, as Google is calling it now), named “NASNet”, was announced in November 2017 on Google’s research blog. It achieved state-of-the-art performance in image classification and surpassed the best human-designed model in object detection.⁸ Even though NAS is a very attractive method, the researchers could not simply unleash it on the large ImageNet dataset, since that would be too computationally expensive. Instead, they used a smaller dataset to find essential building blocks that could be reused when constructing an architecture suited for larger datasets.⁹ The outcome was the two cells shown in the figure below: the normal cell (NC) and the reduction cell (RC). The NC keeps the dimensions of the input, while the RC shrinks the width and height by a factor of 2. To give you an idea of the size of the model: the best-performing NASNet contained four RCs distributed among 18 NCs. By now, we should have a sense of why designing better and better architectures is not exactly a trivial problem.
And there we have it: the role of the engineer has changed once more. Assuming a well-defined search method is in place, the engineer’s new routine is search space selection: choosing operations (the yellow boxes) and combination operations (the green and pink boxes) in an “as-little-as-possible-as-much-as-necessary” manner.¹⁰
But it is not just the architecture design engineer who could face a major transition: the concept of Neural Search can be applied to other domains. Two other important concepts in deep learning are activation functions (which determine how a neuron transmits information) and mathematical optimizers. The current state-of-the-art methods for both of them have also been created by Neural Search.¹¹
If you are now thinking about applying NAS to your own project, do not get over-excited. The creation of the NASNet cells took 4 days on 500 very powerful Nvidia P100 GPUs. The good news: there are a lot of efforts out there to make this search more efficient. Just before the end of 2017, the updated “progressive NAS” was released, which significantly reduced the number of architectures to be searched from 20,000 to 1,280.¹² Nevertheless, the computing power required is still far beyond what is feasible on anything smaller than a Google-scale cluster.
As always, this all sounds super exciting but too research-heavy and too far away from the real world…until it is not. On January 18, 2018, Google released Cloud AutoML into the wild as an alpha version. They claim you will be able to
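To illustrate what “keeps the dimension” versus “shrinks by a factor of 2” means for a stack of cells, here is a small sketch. The ordering of the cells and the 224x224 input size are my own assumptions for illustration, and plain integer halving glosses over the padding and stride details of a real network.

```python
def output_resolution(layout, height, width):
    """Track the spatial size of a feature map through a stack of cells.

    "NC" (normal cell) keeps height and width; "RC" (reduction cell) halves
    both. Integer division is a simplification of real padding/stride rules.
    """
    for cell in layout:
        if cell == "RC":
            height, width = height // 2, width // 2
        elif cell != "NC":
            raise ValueError(f"unknown cell type: {cell}")
    return height, width


# Hypothetical ordering with 4 reduction cells and 18 normal cells.
layout = ["RC", "RC"] + ["NC"] * 6 + ["RC"] + ["NC"] * 6 + ["RC"] + ["NC"] * 6

print(layout.count("RC"), layout.count("NC"))  # 4 18
print(output_resolution(layout, 224, 224))     # (14, 14)
```

However the 18 normal cells are distributed, only the four reduction cells change the resolution, so the feature map ends up 16 times smaller along each side.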
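For a feeling of the orders of magnitude, the numbers quoted in this post can be put side by side with a few lines of arithmetic. The 800 GPUs for 28 days are the figures for the original NAS paper given in the footnotes; “GPU-days” is a crude unit, since it ignores that different GPU models may have been used.

```python
def gpu_days(num_gpus, days):
    """Total compute budget expressed in GPU-days."""
    return num_gpus * days


nasnet_cost = gpu_days(500, 4)     # NASNet cell search
original_cost = gpu_days(800, 28)  # original NAS paper

print(nasnet_cost)                  # 2000
print(original_cost)                # 22400
print(original_cost / nasnet_cost)  # 11.2
print(20_000 / 1_280)               # 15.625, i.e. progressive NAS trains ~16x fewer models
```

Even the cheaper of the two searches corresponds to a single GPU running for more than five years.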
train high-quality custom machine learning models with minimum effort and machine learning expertise
In this way, the experience and skill set needed to design state-of-the-art architectures will get commoditized, as once happened to the art of feature engineering. The second paradigm shift in AI has started: from architecture design to search space selection.
So, what can we expect for the future? If we have learned anything from history, it is that the greatest software cannot fully flourish without hardware that does its part.
Even though we are hitting the limits of Moore’s law, in the context of deep learning we can still expect exponential gains in computational power due to the increasing specialization of hardware. Nvidia had its monopoly for the longest time. As Tim Dettmers postulates in his blog post, we are currently in a deep learning hardware limbo. Yet once we have overcome it, faster chips, optimized for processing neural networks, are waiting on the other side. Veterans like AMD and Intel will claim their share with Vega and Nervana, Google will push forward in commercializing its TPUs, and new players such as Graphcore might also get a shot at introducing their IPUs. With all that extra power, we might actually be able to make NAS feasible for a larger group of people and even design individual architectures for every single dataset.
And on the software side? I am very confident that NAS will continue to attract research in 2018 to further increase its efficiency. But maybe we will also see our “learning Inception” go one step further? If we put it in one sentence, we already have
an RNN, which reinforcement learns to build great architectures that learn relevant features to learn about a task at hand.
Chances are someone will try to train a higher level RNN to teach a lower level RNN how to do NAS more efficiently.¹³
So what does that mean for all of us?
As a spectator, you can keep enjoying the cool products that have been augmented with AI, no matter if the architecture was built by a machine or a human.
As an AI engineer, you should be aware of what Cloud AutoML is capable of and know its limits. However, I seriously doubt that you have to fear for your job anytime soon.
As a business executive, you should not get your hopes up too high. If you want to give your business a kickstart in AI, for the near future there might be no way around hiring experts.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
¹ ^ ImageNet Classification with Deep Convolutional Neural Networks was published by Alex Krizhevsky, Ilya Sutskever, and one of the godfathers of deep learning, Geoffrey E. Hinton. The model was an 8-layer Convolutional Neural Network (CNN), with five convolutional, three pooling and three fully connected layers, followed by a softmax.
² ^ Common examples are random forests and gradient boosting. A newer variant, extreme gradient boosting (XGBoost), is currently the most popular algorithm on the Kaggle platform in “traditional” machine learning tasks.
⁶ ^ In the original paper from Feb 2017, Barret Zoph & Quoc V. Le only performed NAS on the CIFAR10 dataset, using the REINFORCE algorithm by Ronald Williams to train the parameters of the RNN. The RNN represents a policy which generates a sequence of symbols (actions) specifying the structure of the CNN. In the paper, they try 12,800 architectures to find the best model, using 800 GPUs for 28 days.
⁷ ^ The RNN itself is trained with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. The figure below is taken from the original paper. Note that the “computed gradient” described in the figure is meant in the context of a policy gradient, not gradient descent. Why does scaling with the accuracy R make sense? To update the parameters of the controller, we take a step in the positive direction of the gradient to increase the expected reward. If the previous architecture achieved a high accuracy, the gradient step is scaled by a large factor. In fact, the scaling factor of the gradient is the accuracy R minus a baseline b. For more details, I highly recommend reading the paper.
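Written out, the update described in this footnote looks roughly as follows. I am reproducing the approximation from the paper from memory, so treat the notation as a sketch: θ_c are the controller parameters, m is the number of architectures sampled in one batch, T is the number of decisions the controller makes per architecture, a_t is the decision at step t, R_k is the validation accuracy of the k-th architecture, and b is the baseline.

```latex
\nabla_{\theta_c} J(\theta_c) \approx
  \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T}
  \nabla_{\theta_c} \log P\left(a_t \mid a_{(t-1):1};\, \theta_c\right)
  \left(R_k - b\right)
```

The factor (R_k − b) is exactly the scaling mentioned above: decisions that led to an above-baseline accuracy become more likely, those that led to a below-baseline accuracy become less likely.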
⁸ ^ In the follow-up paper, Zoph et al. used “module-stacking”, which was already used in the first Inception paper, to make NAS ready for ImageNet. The best version of NASNet-A achieves 96.2% top-5 accuracy on the ImageNet classification challenge, on par with the SE-Net. Used in combination with a Faster-RCNN, it beat the state-of-the-art model on the COCO object detection task by increasing the mean average precision (mAP) from 39.1% to 43.1%.
⁹ ^ This idea might be the equivalent of transfer learning (TL) brought to the NAS domain. However, the direction is reversed. In TL, we use a network trained on a large dataset (most of the time ImageNet) and retrain it (or parts of it) on a smaller dataset. In NAS, we use a small dataset (e.g. CIFAR10) to design a shallower architecture, consisting of different cell types (here the normal and reduction cell). We can then reuse the cells found, stack them together multiple times to form a larger network, and then train on a large dataset such as ImageNet.
¹⁰ ^ The search space contained the following operations:
• 1x3 then 3x1 convolution
• 1x7 then 7x1 convolution
• 3x3 dilated convolution
• 3x3 average pooling
• 3x3 max pooling
• 5x5 max pooling
• 7x7 max pooling
• 1x1 convolution
• 3x3 convolution
• 3x3 depthwise-separable conv
• 5x5 depthwise-separable conv
• 7x7 depthwise-separable conv
2 combination operations:
• element-wise addition between two hidden states
• concatenation between two hidden states along the filter dimension
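To get a feeling for how quickly such a search space grows, the lists in this footnote can be written down as data and enumerated. The sketch assumes that a single block consists of two operations whose outputs are merged by one combination operation, and it ignores the additional choice of which hidden states feed the block; the identifier strings are simply my labels for the entries listed here.

```python
from itertools import product

# The operations exactly as listed in this footnote.
OPERATIONS = [
    "1x3 then 3x1 convolution", "1x7 then 7x1 convolution",
    "3x3 dilated convolution", "3x3 average pooling",
    "3x3 max pooling", "5x5 max pooling", "7x7 max pooling",
    "1x1 convolution", "3x3 convolution",
    "3x3 depthwise-separable conv", "5x5 depthwise-separable conv",
    "7x7 depthwise-separable conv",
]
COMBINATIONS = ["element-wise addition", "concatenation"]


def block_choices(operations, combinations):
    """Every (operation, operation, combination) triple for one block."""
    return list(product(operations, operations, combinations))


print(len(block_choices(OPERATIONS, COMBINATIONS)))  # 288
```

Every operation added to the list multiplies the number of candidates per block, which is why the “as-little-as-possible-as-much-as-necessary” rule matters so much for search space selection.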
¹² ^ To accelerate the search, Liu et al. used sequential model-based optimization, which progressively increases the model complexity. They combined it with a learned surrogate function, which efficiently identifies the most promising models to explore.
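The idea of growing models step by step and letting a cheap surrogate decide which ones are worth training can be sketched in a few lines. Everything here is a toy assumption of mine: the four operations, their hidden “true” values, the synthetic `fake_accuracy`, and above all the surrogate, which is just a per-operation average rather than the learned predictor used by Liu et al.

```python
from collections import defaultdict

# Hypothetical operations with hidden "true" contributions to accuracy.
OP_VALUE = {"sep3x3": 0.05, "conv3x3": 0.03, "maxpool": 0.01, "identity": 0.0}
OPS = list(OP_VALUE)


def fake_accuracy(arch):
    """Synthetic stand-in for training an architecture (a tuple of ops)."""
    return 0.6 + sum(OP_VALUE[op] for op in arch)


def fit_surrogate(history):
    """Deliberately crude surrogate: score an op by the mean accuracy of the
    evaluated architectures that used it, and an architecture by the mean
    score of its ops."""
    totals, counts = defaultdict(float), defaultdict(int)
    for arch, acc in history.items():
        for op in arch:
            totals[op] += acc
            counts[op] += 1
    op_score = {op: totals[op] / counts[op] for op in totals}
    return lambda arch: sum(op_score[op] for op in arch) / len(arch)


def progressive_search(max_blocks=3, beam_width=2):
    # Level 1: the simplest architectures are cheap enough to evaluate fully.
    history = {(op,): fake_accuracy((op,)) for op in OPS}
    beam = sorted(history, key=history.get, reverse=True)[:beam_width]
    for _ in range(max_blocks - 1):
        predict = fit_surrogate(history)
        # Grow each surviving architecture by one op ...
        candidates = [arch + (op,) for arch in beam for op in OPS]
        # ... but only train the ones the surrogate ranks highest.
        chosen = sorted(candidates, key=predict, reverse=True)[:beam_width]
        scores = {arch: fake_accuracy(arch) for arch in chosen}
        history.update(scores)
        beam = sorted(scores, key=scores.get, reverse=True)
    return beam[0], len(history)


best, evaluated = progressive_search()
print(best)       # ('sep3x3', 'sep3x3', 'sep3x3')
print(evaluated)  # 8
```

In this toy, 8 architectures are trained instead of the 4 + 16 + 64 = 84 an exhaustive sweep up to three operations would need; the saving comes from the same place as in the paper, namely that most candidates are only ever scored by the surrogate.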