The Scaled ML Conference 2018

Pramod Srinivasan
10 min read · Mar 26, 2018


The third Scaled Machine Learning Conference took place at Stanford University on March 25, 2018. Conceived by Stanford professor and Matroid CEO Reza Zadeh with the support of pioneers from academia and industry, the conference aims to foster discussion on scaling up machine learning algorithms on a variety of platforms, and to encourage the practitioners who build for them. This year it brought together researchers and practitioners from Google, OpenAI, Nvidia, Stanford, and others for 11 sessions. I was there!

The ML “Avengers” (Image Source : Twitter)

I have to say that attending was the right decision, as several actionable conversations came out of the conference; my head is absolutely packed with new ideas, methods, and papers to read. Below, I summarize some of the notes I jotted down during the sessions I managed to attend.

Since this report is long, readers who only want the highlights can: (1) look at every figure and caption, and (2) review the relevant slides and videos here. Please feel free to share it, and comment below if you find any typos.

RL Systems at RISE Lab — Ion Stoica [Slides]

Ion Stoica began the day with an interesting talk on the need for Real-time, Intelligent, Secure, and Explainable (RISE) systems. By way of motivation, he gave a compelling account of how next-generation AI applications will be very different: not only will they be deployed in mission-critical scenarios, but they will also be required to continually learn from a rapidly changing environment.

By presenting a three-dimensional perspective on scaling ML, viz. data, model training, and replication, he argued that building the next generation of AI applications requires a broader range of techniques: systems need intrinsic support for parallelism, stochastic optimization, and heterogeneity to empower real-time decisions. Meeting these requirements is not easy.

Ray provides a unified distributed learning platform for implementing emerging AI applications. Developed at the RISE Lab at UC Berkeley, it uses a lightweight interface that lets algorithm designers express a wide range of applications, such as RL.

He proceeded to show a common use case in Ray: adding arrays of data stored in two different files. Ray not only makes it easy to parallelize such tasks but also gives algorithm designers the flexibility to configure a heterogeneous setup, allocating different workers to different applications (e.g., GPUs are well suited to embarrassingly parallel gradient-descent steps) and for specific durations. He concluded the talk with an overview of how Ray has been thoughtfully designed for high performance and scalability.
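To make the pattern concrete, here is a minimal sketch of that use case using Ray's task API (the file names and array format are illustrative):

```python
import numpy as np
import ray

ray.init()

@ray.remote
def load_array(path):
    # Each task may be scheduled on any worker in the cluster.
    return np.load(path)

@ray.remote
def add(a, b):
    return a + b

# Calling .remote() builds a task graph without blocking;
# ray.get() fetches the final result once it is ready.
x = load_array.remote("x.npy")
y = load_array.remote("y.npy")
print(ray.get(add.remote(x, y)))
```

Heterogeneity is expressed the same way: decorating a function with, say, @ray.remote(num_gpus=1) asks the scheduler to place that task on a GPU worker.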

Matroid — Reza Zadeh [Slides]

Reza Zadeh gave several impressive demos of the video and image detection efforts being pursued at Matroid. Matroid is a product for creating, using, and combining state-of-the-art detectors. Besides adding support for new state-of-the-art detectors, Matroid has developed a Studio in which customers can use these detectors, and has now enabled support for live-stream monitoring.

In addition, there were two new product announcements: Camera Partner and Matroid On-Prem. The former, jointly developed with Intel, compresses a customized detector to fit on a camera, while the on-premise solution targets cost-sensitive and privacy-conscious customers.

A Pod, the basic unit of K8s, is a collection of one or more containers [Image Source : Twitter]

While giving us a teaser of Matroid's training, inference, and ingestion infrastructure, Reza spoke about how his team leveraged Kubernetes pods to achieve GPU and CPU autoscaling. K8s is a container orchestration framework that manages scalable and fault-tolerant pods (collections of containers). By treating each video as a Kafka topic (a stream of messages), the team has found creative ways to harvest cheaper AWS spot instances.
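As a rough sketch of what "each video as a Kafka topic" might look like on the consuming side (the topic name, decoder, and detection step are hypothetical, not Matroid's actual code):

```python
from kafka import KafkaConsumer  # kafka-python client

def decode_frame(raw_bytes):
    ...  # hypothetical: turn raw message bytes into an image frame

# Each camera or video stream publishes its frames to its own topic.
consumer = KafkaConsumer("video-stream-42",
                         bootstrap_servers="kafka:9092",
                         group_id="detector-workers")

for message in consumer:
    frame = decode_frame(message.value)
    # ...run a detector on the frame here
```

This pairs naturally with spot instances: if a worker disappears mid-stream, a replacement in the same consumer group resumes from the topic's last committed offset.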

Cat out of the bag [Image Source : Twitter]

That was a lot of information to process in 40 minutes, but I learnt that it is pure coincidence that Matroid is a generalization of Tensor and that both Matroid and Tensorflow debuted in late Fall 2015. Reza also announced that his book, Tensorflow for Deep Learning, came out earlier this month.

Programming the 2.0 stack — Andrej Karpathy

The Director of AI at Tesla is no stranger to speaking to standing-room-only auditoriums at Stanford. His idea of Software 2.0, published in Fall '17, formed the backdrop of this talk, as he was vocal about the challenges encountered by the AI agents deployed in Tesla's flagship cars. It should be noted that Tesla has deployed the highest number of autonomous robots in the world (roughly 250,000) as part of its Autopilot project.

He began by illustrating the fundamental differences between Software 1.0 and Software 2.0. While the former demands domain expertise in languages such as C++ and has enabled engineers to design algorithms and build systems, Software 2.0 is all about massaging data and setting up optimization infrastructure. Notably, Software 2.0 programs are not just computationally homogeneous (think repetitive Conv/ReLU blocks in a CNN); they also have constant running time and memory usage at runtime.

Datasets vs Algorithms (Image Source : Twitter)
Karpathy explaining some speed-limit corner samples encountered by Tesla autopilots on a regular basis (Image source : Twitter)

Karpathy went on to explain how the control and vision stacks of an autopilot need to co-exist for a successful autonomous vehicle. He noted the growing emphasis at Tesla AI on the labeled data inventory, as opposed to a focus on model training. While presenting some hilarious edge cases from real data encountered by AI agents, he stressed that these rare examples are precisely the subset the agents must always get right!

Software 2.0 is real. And it is interesting to note that, unlike its predecessor, the 2.0 stack has no established toolchain: the code (model weights and biases) is implicitly defined by the dataset. Also, while we are at it: any IDE ideas for 2.0?

ML Problems in Biomedicine — Jennifer Chayes [Slides]

The Research Director of Microsoft Research New England presented some fascinating findings on diagnosing and preventing cancer.

Machine Learning Problems in Biomedicine (Source : Twitter)

In a partnership with Adaptive Biotechnologies, researchers have developed a platform for cloud-scale immunomics. An ingestion pipeline governs how immunosequenced blood samples are fed to the ML algorithm, which can generate a map of the immune system by matching millions of T-cells to the diseases they recognize.

Systems and Machine Learning — Jeff Dean [Slides]

Jeff Dean highlighted efforts undertaken by Google Brain toward democratizing Machine Learning. He shared some key post-NIPS '17 results while recapping some of Google's top announcements from earlier this year, including Cloud AutoML and the beta release of Cloud TPU machine learning accelerators.

TPUs available in beta [Image Source : NIPS'17]

Jeff mentioned how several human-designed systems and tools built with simple heuristics can be replaced by ones learnt by machines. A case in point is one of Google Brain's recent papers, The Case for Learned Index Structures. The idea is to replace conventional database index structures (such as B-Trees and Bloom filters) with adaptive structures that learn. On real-world datasets, the authors observed an order-of-magnitude improvement in search speed and memory.
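The core intuition is easy to sketch: a sorted array's index is just a model of its key distribution (its CDF). Here is a toy illustration of that idea; the actual paper uses a recursive hierarchy of models rather than a single linear fit:

```python
import numpy as np

# A sorted column of keys, as a B-Tree would index.
keys = np.sort(np.random.uniform(0, 1e6, size=100_000))

# Learn an approximate CDF: a linear model from key to position.
slope, intercept = np.polyfit(keys, np.arange(len(keys)), deg=1)

def lookup(key, max_err=512):
    # Predict the position, then correct with a bounded local search.
    guess = int(slope * key + intercept)
    lo = max(0, guess - max_err)
    hi = min(len(keys), guess + max_err)
    return lo + int(np.searchsorted(keys[lo:hi], key))
```

The win comes from replacing a cache-unfriendly tree traversal with a couple of multiply-adds plus a small, bounded search.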

ML Arxiv papers are outpacing Moore’s Law [Image source : NIPS’17]

Jeff's suggestion for developing faster, more data-efficient algorithms was meta-learning: let the algorithm teach itself. He concluded that ML hardware is in its infancy and that low-precision linear algebra (a.k.a. faster training and wider deployment) is a safe bet for machine learning accelerators.

Role of Tensors in Large-Scale Training — Anima Anandkumar [Slides]

Anima explaining multi-dimensional tensors [Image source : Twitter]

Anima Anandkumar, Principal Scientist at AWS and Professor at Caltech, gave an enlightening overview of tensor algebra. She argued that tensors as learning representations can encode data dimensions, modalities, and higher-order relationships. Embarrassingly parallel tensor operations such as contractions have inspired several compact and more accurate neural network architectures.

She introduced Tensorly, a framework built to support tensor algebra, which not only has user-friendly APIs but also offers the flexibility of multiple backends, such as Numpy, Tensorflow, and MXNet. It is worth noting that Tensor Regression Networks, built with Tensorly and MXNet, won the Best Poster Award at NIPS '17.
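As a small taste of the API, here is a hedged sketch of a CP (PARAFAC) decomposition; exact function names vary a little across Tensorly versions:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

tl.set_backend('numpy')  # 'mxnet' or 'tensorflow' work the same way

# A random 3-way tensor and its rank-3 CP decomposition.
X = tl.tensor(np.random.rand(10, 20, 30))
factors = parafac(X, rank=3)

# Reconstruct a compact low-rank approximation of X.
X_hat = tl.kruskal_to_tensor(factors)
```

The same script runs unchanged on another backend, which is precisely the flexibility she highlighted.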

Meta Learning and Self Play — Ilya Sutskever [Slides]

As a beginner in Reinforcement Learning, I was super impressed by how Ilya, co-founder of OpenAI, helped me understand the underlying ideas of several RL papers in layman's terms and through powerful demos.

Touted as the missing piece that completes the learning puzzle, Reinforcement Learning is certainly in its infancy. A truly good RL algorithm combines elements of supervised and unsupervised learning. So, here you go:

Try something random; if it works, use it again.

Recently, Yann LeCun stated that RL is the cherry that completes the cake: the missing piece for solving Machine Learning.

RL as cherry on top of the learning cake (Image Source : A New Path to AI)

Ilya concurred that the exploration problem can be hard and is often riddled with failures, and that the best way to learn is to introduce virtual goals so that agents can learn from those failures.
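This is the idea behind hindsight relabeling, as in OpenAI's Hindsight Experience Replay. A minimal sketch, with the transition and buffer structures invented purely for illustration:

```python
def relabel_with_hindsight(trajectory, buffer):
    """Store each failed trajectory twice: once with the original goal,
    once pretending the state actually reached *was* the goal."""
    reached = trajectory[-1].next_state
    for t in trajectory:
        # Original (likely zero-reward) transition.
        buffer.add(t.state, t.action, t.reward, t.next_state, goal=t.goal)
        # Hindsight transition: success relative to the virtual goal.
        r = 1.0 if t.next_state == reached else 0.0
        buffer.add(t.state, t.action, r, t.next_state, goal=reached)
```

Every failure thus becomes a success for *some* goal, giving the agent a dense learning signal.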

The idea of evolved policy gradients was introduced at ScaledML '17; this strategy tries several perturbed policies and keeps whichever runs best. Ilya went on to frame learning hierarchical actions as a form of meta-learning, before discussing Self Play. The idea behind Self Play is to convert compute into data: by competing against an opponent of its own strength, an agent bootstraps "pre-trained dexterity". Take the case of Dota 2 1v1: the OpenAI bot beat top humans by discovering unconventional strategies that traditional supervised learning techniques would never have found.
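For intuition, here is the canonical evolution-strategies update that this family of methods builds on (a simplified sketch; evaluate is a stand-in for running a policy in the environment):

```python
import numpy as np

def es_step(theta, evaluate, pop_size=50, sigma=0.1, lr=0.01):
    # Try several randomly perturbed copies of the policy parameters...
    noise = np.random.randn(pop_size, theta.size)
    rewards = np.array([evaluate(theta + sigma * eps) for eps in noise])
    # ...normalize their returns, and move toward the better performers.
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return theta + lr / (pop_size * sigma) * noise.T @ rewards
```

Each perturbation is evaluated independently, which is why this approach scales so well across machines.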

Large Scale Deep Learning with Keras — Francois Chollet [Slides]

Francois Chollet, the author of Keras, one of the most widely adopted deep learning frameworks, gave a holistic overview of multi-GPU and multi-TPU distributed training. He attributed Keras's success to factors such as developer experience, support for multiple backends (such as Tensorflow and Theano), and ease of model productization. Not surprisingly, Keras is the number-one front-end API of choice for many machine learning practitioners.

Keras usage is most dominant in the industry (both large companies and startups) and in the overall data science community. [Image Source : Twitter]

Francois demonstrated how to build a video question-answering service in minutes. He leveraged transfer learning, encoding the videos with a pretrained CNN and the question-answer pairs with LSTM networks. After packaging and uploading the binaries to the Google Cloud ML Engine, he illustrated how the data-parallelism architecture used by Jeff Dean's team at Google can be realized by distributing the training workload across an arbitrary number of GPUs, with the data hosted in Google Cloud.
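The architecture fits in a handful of Keras lines; here is a condensed sketch (the shapes and vocabulary sizes are illustrative):

```python
from keras import applications, layers
from keras.models import Model

# Encode each frame with a frozen, pretrained CNN, then the frame
# sequence with an LSTM (transfer learning on the video side).
video = layers.Input(shape=(None, 150, 150, 3))
cnn = applications.InceptionV3(weights='imagenet', include_top=False, pooling='avg')
cnn.trainable = False
video_vec = layers.LSTM(256)(layers.TimeDistributed(cnn)(video))

# Encode the question with an embedding plus a second LSTM.
question = layers.Input(shape=(None,), dtype='int32')
question_vec = layers.LSTM(128)(layers.Embedding(10000, 256)(question))

# Combine both modalities and predict an answer word.
x = layers.concatenate([video_vec, question_vec])
x = layers.Dense(128, activation='relu')(x)
answer = layers.Dense(1000, activation='softmax')(x)

model = Model([video, question], answer)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
```

For the data-parallel part, Keras also ships a one-line utility, keras.utils.multi_gpu_model(model, gpus=N), that replicates the model and splits each batch across GPUs.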

Hardware for Deep Learning — Bill Dally [Slides]

Bill Dally, Chief Scientist at Nvidia, gave an overview of hardware advances for ML and made the case that decision-makers should consider cost-effective GPU accelerators as they continue to leverage AI for critical business decisions.

Rise of GPU Computing [Image Source : Nvidia Developer]

Just like in real estate, in (computer) architecture, location is everything

The entire session was laced with witty remarks. There was the customary slide eulogizing Moore's Law. He also claimed that GPUs are better TPUs than TPUs (thankfully Jeff wasn't around), and quipped that the GPU-powered autonomous vehicles developed at Nvidia can actually detect pedestrians!

Bill went on to argue that hardware requirements differ between training and inference: the latter has much smaller memory requirements because it can use different data representations. He remarked that fast memory is expensive for the same reason Palo Alto real estate is expensive: there isn't much space close to where the compute happens. Memory footprint, and in turn FLOPS (floating-point operations per second), depends on model precision and accuracy; an average speedup of 2x is observed when models move from FP32 to FP16.
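The standard recipe for harvesting that speedup without losing accuracy is mixed precision: compute in FP16, keep an FP32 master copy of the weights, and scale the loss so small gradients don't underflow. A toy sketch, where grad_fn is a hypothetical stand-in for a framework's backward pass:

```python
import numpy as np

def mixed_precision_step(master_w, grad_fn, batch, lr=0.01, loss_scale=1024.0):
    w16 = master_w.astype(np.float16)               # FP16 copy for compute
    scaled_g = grad_fn(w16, batch, loss_scale)      # grads of the scaled loss
    g32 = scaled_g.astype(np.float32) / loss_scale  # unscale in full precision
    return master_w - lr * g32                      # FP32 master-weight update
```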

Nvidia currently supports mixed precision via Volta's Tensor Core operations (TensorOps), which enable some aggressive training regimes. It is noteworthy that at GTC 2017, Nvidia announced TensorRT, its Programmable Inference Accelerator, as well as Nvidia GPU Cloud containers optimized for specific operations via exclusive instructions.

Graphcore — Simon Knowles

I had a conflicting meeting, so I had to miss this session.

ML at Facebook: An Infrastructure View — Yangqing Jia [Slides]

The last talk of the day covered the hardware and software infrastructure that powers global-scale machine learning at Facebook. Yangqing noted that ML workloads are extremely diverse: in practice, services require several different types of models.

Infrastructure for an army of models [Source : Twitter]

Naturally, model diversity demands flexibility at every layer of the system stack. Further, a non-trivial portion of the data stored at Facebook flows through these pipelines, presenting significant challenges in delivering it to high-performance distributed training flows. A couple of key takeaways were the emphasis on co-locating data with compute, and the opportunity to put the significant number of CPUs that sit idle during off-peak periods to work on distributed training algorithms.

Thanks for the read! If you found this article interesting and would like to stay in touch, you can find me on Twitter here.
