AI Research @ Scale

At Re:Invent last week, James gave a great presentation on ‘innovating at scale’, detailing some of the things we do under the hood relating to networking, data center infrastructure and power at Amazon. It’s a fascinating talk which covers everything from private trans-oceanic networks, to custom ASICs, to storage racks which weigh in at 2,778 lbs.

All these areas of innovation are driven by customer workloads running at huge scale on AWS today: web and mobile apps like Airbnb, Pinterest and Slack, large IoT deployments, broad usage of analytics and, an area close to my heart, AI research. I had the good fortune to join James on stage to talk about this, and this post is a summary of my presentation (which you can also watch on YouTube).

The Triforce Driving AI Research

AI research, the development and training of new machine learning models, is enjoying newfound momentum and focus across virtually every industry, driven by the perfect storm of three things: algorithms (some of which are actually pretty old; published in the last decade but enjoying newfound usage with the advent of…), huge data collections (modern apps and businesses are fantastic data generators; where collection, storage and analytics were once boxed in by fixed data center walls that couldn’t move, in the cloud these walls melt away, along with the constraints of fixed capacity or limited capabilities), and the availability of computational power (especially GPUs), at almost unlimited scale, priced as a utility. The intersection of these three areas enables anyone with an AWS account and an idea to build more sophisticated models and train them on larger data sets.

Data scientists, researchers and academics are able to harness algorithms, and then train them at scale.

Deep Learning, Data Science and the ‘Three Ps’

One group of algorithms has really benefited from this perfect storm. Deep learning (a set of statistical machine learning techniques which perform feature detection using a hierarchy of multiple layers, trained with artificial neural networks) has shown great success in solving some of the hardest problems in computer science: natural language processing, image analysis, voice recognition, personalization and autonomy.

However, real-world deep learning algorithms are large and complex, often involving thousands of different layers, with an almost insatiable hunger for computational resources for both training and inference. The challenges in building new models break down into three pieces, the ‘Three Ps’ of data science: programmability (how easy it is to compose and manipulate the network models), portability (how easy it is to move models between platforms, from mobile devices to robotics), and performance (how efficiently models scale in training and inference).

AI Success on AWS

In a way, this isn’t news: customers have been using AWS to build novel AI features (or entirely new products) for years. To name but a few: Pinterest have a really smart visual image search in their mobile app; Netflix make movie and show recommendations; FINRA look for trade anomalies; Hudl improve sports teams’ performance through video analysis; Stanford have trained deep learning models on fundus images to help early detection of diabetic retinopathy; Wolfram have built Alpha, a computational knowledge engine; Amazon created Echo and Alexa; and TuSimple have created some of the world’s best performing autonomous driving algorithms based on deep learning-trained computer vision.

Many of these new applications use deep learning frameworks, such as TensorFlow, Torch, Caffe, Theano or MXNet, running on AWS, to help address the ‘Three Ps’.

While we’ll make sure that all of these frameworks run really well on AWS (take a look at our Deep Learning AMI and CloudFormation template, which come pre-configured to run all of these frameworks), one really stood head and shoulders above the others: MXNet. It’s what powers Wolfram Alpha (and in turn provides factual answers to voice assistants such as Siri, S Voice and Iris) and TuSimple’s autonomous driving systems; it’s the framework of choice at AWS and the foundation of our future AI services. We love it, and our customers love it, precisely because it addresses the ‘Three Ps’ so well.

Programmability and MXNet

MXNet has two really nice programmability features. Firstly, it provides a great breadth of support for different programming models and languages, supporting everything from Python and JavaScript, to R and Matlab. It’s tailor-made to support the programming languages that researchers are already using, but compiles this code down using a high-performance back end to give reliable, high performance in training and inference. Secondly, it provides the benefit of imperative programming (the ability to script and code with loops and conditionals, which are easier to develop with), mixed with the benefits of declarative models (which provide more opportunity for optimization, just like with SQL). In fact, the name ‘MXNet’ is derived from this: it allows for mixed network models.
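To make the imperative/declarative distinction a little more concrete, here is a minimal sketch in plain Python. It is a toy illustration only: the `Node`, `var`, `as_node`, `fold_constants` and `evaluate` names are hypothetical and are not MXNet's API. The imperative function runs line by line with ordinary control flow; the declarative version first describes the computation as a graph, which gives an optimizer the chance to simplify it before anything runs.

```python
# Toy illustration only: every name here is hypothetical, not MXNet's API.

# Imperative style: statements execute immediately, so ordinary Python
# loops and conditionals operate directly on concrete values.
def imperative_sum_of_squares(values):
    total = 0.0
    for v in values:
        if v > 0:
            total += v * v
    return total


# Declarative style: build a description of the computation first.
class Node:
    """One node in a tiny expression graph of binary add/mul operations."""

    def __init__(self, op, args=(), name=None, value=None):
        self.op, self.args, self.name, self.value = op, tuple(args), name, value

    def __add__(self, other):
        return Node("add", (self, as_node(other)))

    def __mul__(self, other):
        return Node("mul", (self, as_node(other)))


def as_node(x):
    """Wrap a plain number as a constant node; pass nodes through."""
    return x if isinstance(x, Node) else Node("const", value=float(x))


def var(name):
    """A placeholder whose value is supplied only at evaluation time."""
    return Node("var", name=name)


def fold_constants(node):
    """Optimization pass: collapse sub-expressions made only of constants."""
    if node.op in ("var", "const"):
        return node
    args = [fold_constants(a) for a in node.args]
    if all(a.op == "const" for a in args):
        a, b = (arg.value for arg in args)
        return Node("const", value=a + b if node.op == "add" else a * b)
    return Node(node.op, args)


def evaluate(node, bindings):
    """Run the graph once concrete values for the variables are known."""
    if node.op == "const":
        return node.value
    if node.op == "var":
        return bindings[node.name]
    a, b = (evaluate(arg, bindings) for arg in node.args)
    return a + b if node.op == "add" else a * b


def count_nodes(node):
    return 1 + sum(count_nodes(a) for a in node.args)


x = var("x")
graph = x * (as_node(2) + 3)        # nothing is computed yet
optimized = fold_constants(graph)   # (2 + 3) is folded into a single constant

print(count_nodes(graph), count_nodes(optimized))
print(evaluate(optimized, {"x": 4.0}))
```

The optimized graph gives the same answer as the original with fewer nodes to execute, which is the same kind of freedom a SQL query planner has.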

Portability and MXNet

MXNet is incredibly efficient when it comes to using and allocating memory: a 1000-layer network can fit inside 4 GB of memory, which means that networks can be trained at scale in the cloud, and then moved from one platform to the next: from mobile and web clients, to robots and drones (TuSimple have computer vision systems which will run on TX1 boards on drones to track and follow objects automatically).

Performance and MXNet

MXNet scales extremely well: to multiple cores on a single GPU, to multiple GPUs in a single instance, and to multiple GPUs on multiple instances. This is driven in part by the compiled back end, the opportunities for optimization in a mixed imperative/declarative model, and efficiency in memory usage, together with automatic parallelization of serial code across CUDA cores on GPUs and clever data parallelization.
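As a rough sketch of what data parallelization means, the plain-Python example below splits one batch across several simulated workers, computes a gradient on each shard, and averages the results. It is a single-process illustration under stated assumptions, not how MXNet implements it: the `gradient`, `shard` and `data_parallel_gradient` helpers, the 1-D model and the learning rate are all hypothetical.

```python
# Single-process simulation of data parallelism. All names are hypothetical;
# in a real system each shard would be processed on its own GPU.

def gradient(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(batch, num_workers):
    """Split a batch into equal shards (assumes len(batch) divides evenly)."""
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_gradient(w, batch, num_workers):
    """Each worker computes a local gradient; the results are then averaged."""
    local = [gradient(w, s) for s in shard(batch, num_workers)]
    return sum(local) / num_workers


# Eight samples drawn from y = 3x, trained with four simulated workers.
batch = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(200):
    w -= 0.01 * data_parallel_gradient(w, batch, num_workers=4)

print(w)
```

With equally sized shards, the averaged gradient matches the gradient of the whole batch, so adding workers changes how fast a step is computed rather than what the step is.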

In our own benchmarking, we saw 91% efficiency in scaling the Inception v3 image analysis algorithm across 16 GPUs on a P2.16xl; and 88% efficiency scaling up to 256 GPUs on 16 P2.16xl.

That’s just a 3 percentage point decrease in efficiency for 16x the scale, across over 600k CUDA cores.
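The arithmetic behind those figures can be sketched in a few lines of Python. Two assumptions here are mine and not stated in the talk: that scaling efficiency means measured speedup divided by the number of GPUs, and that each GPU has 2,496 CUDA cores (the figure for one GPU of an NVIDIA K80 board).

```python
# Assumption: 2,496 CUDA cores per GPU (one GPU of an NVIDIA K80 board).
CUDA_CORES_PER_GPU = 2496


def effective_gpus(num_gpus, efficiency):
    """Speedup over a single GPU, assuming efficiency = speedup / num_gpus."""
    return num_gpus * efficiency


single_instance = effective_gpus(16, 0.91)   # roughly 14.6 GPUs' worth of work
cluster = effective_gpus(256, 0.88)          # roughly 225 GPUs' worth of work
total_cores = 256 * CUDA_CORES_PER_GPU       # well over 600k CUDA cores
drop_in_points = round((0.91 - 0.88) * 100)  # drop in percentage points

print(single_instance, cluster, total_cores, drop_in_points)
```

Under those assumptions, 16x more GPUs delivers about 15.5x more throughput than the 16-GPU run.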

Benchmarking Deep Learning In The Cloud

But don’t take my word for it! One of the benefits of the cloud is that you are able to perform your own due diligence using either your own applications or benchmarks. We want these benchmarks to be reproducible, and to provide a starting point for your own performance evaluations, so we have made available open source deep learning benchmark code. Just spin up a P2 cluster using the CFN template, log into the master node, and start running these benchmarks with a single command.

Next Steps

This is an exciting area, especially as we are still so early in learning how to apply deep learning. I’m looking forward to an exciting year ahead.

Starting to experiment with other platforms; cross-posting new articles to Medium from my blog. The original is here.