Scaled ML 2017, Thoughts & Takeaways

I recently had the chance to attend Scaled ML 2017, a machine learning conference held at Stanford University on March 25th. The conference focused mainly on technologies such as TensorFlow and Apache Spark, with prominent figures from industry and academia giving talks on their work in scaling machine learning, deep learning, and reinforcement learning. Below are a few of my takeaways from the event.

Talk by Ion Stoica:

Ion Stoica is a UC Berkeley professor and chairman of Databricks whose main focus is currently the RISELab, a lab dedicated to research on providing real-time, intelligent decisions on live data. This comes with a few challenges, such as clarifying what makes a good decision, dealing with latency, ensuring fault tolerance, and keeping data security a primary concern. In his talk, he discussed these challenges of real-time decision making.

First, Stoica talked about the idea that data is only as good as the decisions it enables. This was a reminder that having a massive amount of data is only a prerequisite, not the end goal: the data is not useful if it does not provide any actionable insight. It made me realize that it is still important to apply traditional data analysis and visualization techniques to a dataset before applying machine learning algorithms, just to figure out whether any interesting insight can even be drawn from it, or whether another dataset would better suit the task at hand.
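To make that concrete, here's the kind of quick exploratory pass I have in mind before reaching for any model. This is just my own sketch; the file name and column names ("transactions.csv", "label", "amount") are hypothetical.

```python
# A quick exploratory look at a dataset before doing any machine learning.
# The file and column names here are made up for illustration.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("transactions.csv")

print(df.describe())                # ranges, means, obvious outliers
print(df.isnull().sum())            # how much data is actually missing
print(df["label"].value_counts())   # is the target hopelessly imbalanced?

df["amount"].hist(bins=50)          # eyeball the distribution of a key feature
plt.show()
```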

Next, Stoica talked about the idea that decision systems must be both intelligent and robust. Intelligence mainly refers to the learning algorithm's ability to generalize and perform well in complex and uncertain environments, including those to which it has not been exposed previously. For example, if you're training a decision system to automatically detect fraudulent bank transactions, your training dataset definitely will not contain all possible instances of fraud, or anything close to it. This relates back to the curse of dimensionality: if your training data consists of feature vectors with only 10 features, each of which can take on 10 possible values, the feature space contains 10^10 (10 billion) possible combinations, so you would need a training dataset of 100 million examples just to cover 1% of it. This is one of the reasons applying machine learning in practice is quite difficult: the challenge of generalizing from training data. Similarly, decision systems must also be robust, meaning they should handle noisy inputs, be resilient to inputs that could cause them to fail, and, when they do fail, fail in a safe manner. This also covers the ability to say "I don't know" when the system is highly unsure of its decision, so that control can be turned over to a human who can assess the situation. Robustness in this context is especially essential in machine-learning-powered driverless cars, where failing to fail safely can be lethal to the car's occupants; the car should come to a stop if it is unsure of what to do, for instance when confronted with an object in its path that it has never seen before.
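Just as a quick sanity check on those numbers (a back-of-the-envelope calculation, not something from the talk's slides):

```python
# Curse-of-dimensionality arithmetic: 10 features, each taking 10 possible values.
num_features = 10
values_per_feature = 10

feature_space_size = values_per_feature ** num_features   # 10**10 = 10 billion combinations
one_percent = feature_space_size // 100                   # 1% of the feature space

print(feature_space_size)  # 10000000000
print(one_percent)         # 100000000 -> 100 million examples just for 1% coverage
```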

Stoica also talked about security in the context of "shared learning", or learning from confidential data. He discussed the common scenario in which an enterprise wants to gain insight from the (confidential) data it has access to, while companies that provide ML as a service (such as Amazon AWS ML and Google Cloud ML) want their MLaaS product to be used and to learn from as much data as possible. However, the enterprise knows its data is confidential, so it can't use the services offered. A goal of these MLaaS companies is to incentivize such enterprises to use their services, as part of their quest to become "AI-first" companies. If the data is anonymized so that it can't be linked back to an individual, then it might be shareable with MLaaS providers, but learning on this kind of shared data is difficult and is an active topic of current research.

Talk by Jeff Dean on Google Brain and TensorFlow

This was probably one of the most interesting talks of the conference, as it offered an insider's perspective on the active research going on at Google Brain, along with a discussion of how tools such as TensorFlow and the latest deep learning technologies can contribute to building artificial general intelligence in the future.

Google Research approaches machine learning both from an academic perspective (doing active research on new machine learning algorithms and introducing novel concepts such as word2vec) and from an applied one, bringing machine learning to many different domains, from language translation, to robotics, to (more recently) healthcare. Today, Google uses some form of machine learning in almost every product, and Jeff noted that many current engineers at Google are taking internal classes to get familiar with machine learning and deep learning, as well as the tools and infrastructure needed to deploy them at scale. He also said that machine learning and deep learning should be essential tools in every software engineer's arsenal.

This is where TensorFlow comes in. TensorFlow is a graph-based numerical computation library that is well suited to deep learning. It enables researchers and engineers alike to quickly run deep learning experiments and tinker with all kinds of different models, while providing a robust backend that can run across several different GPUs. This lets the low-level programming and hardware challenges of deep learning be abstracted away from the engineer, so that they can focus on rapidly iterating on their prototypes.
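To illustrate the graph-then-execute model, here's a minimal sketch of my own using the 1.x-era API (not code from the talk): you first describe the computation as a graph, then run it inside a session, which handles placing the numerical work on CPU or GPU.

```python
# Minimal TensorFlow (1.x-era) example: build a graph, then execute it in a session.
import tensorflow as tf

# Placeholders are fed with data at run time; variables hold learnable parameters.
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
W = tf.Variable(tf.random_normal([3, 1]), name="W")
b = tf.Variable(tf.zeros([1]), name="b")

# Nothing is computed yet: this line only adds nodes to the graph.
y = tf.matmul(x, W) + b

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # The actual numerical work happens inside run(), on whatever device TF chooses.
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```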

The need for TensorFlow, according to Jeff, came from the shortcomings of existing open-source deep learning libraries. He mentioned that flexibility, scalability, and production-readiness are three essential qualities of a deep learning library; before TensorFlow, however, most libraries lacked the flexibility aspect.

Next, Jeff talked about some of the recent deep learning successes that Google has had, and what they’re currently working towards.

Doing machine translation really well has always been a big goal for Google. Before they applied machine learning to the problem, they had done pretty well, but their translations scored below bilingual humans (who, in theory, can translate perfectly). Using recurrent neural networks with LSTM cells and a large corpus of training data (say, English-to-French pairs), Google was able to build a sequence-to-sequence model (a model where the inputs and outputs aren't fixed-size; they're arbitrary-length sequences) that scores pretty close to bilingual humans. What's more interesting is that this model is even capable of translating between language pairs it hasn't been explicitly trained on. This shows some of the progress towards the goal of general intelligence: making machines do well on tasks they haven't explicitly been trained to do.
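For a rough idea of what the encoder-decoder structure looks like, here is a stripped-down sketch in 1.x-era TensorFlow. This is my own illustration, not Google's actual translation model; the vocabulary size, dimensions, and tensor names are invented, and a real system adds attention, beam search, and much more.

```python
# Sketch of a sequence-to-sequence (encoder-decoder) model with LSTM cells.
import tensorflow as tf

vocab_size, embed_dim, hidden_dim = 10000, 256, 512

src_ids = tf.placeholder(tf.int32, [None, None])  # [batch, src_len] source token ids
tgt_ids = tf.placeholder(tf.int32, [None, None])  # [batch, tgt_len] target token ids

embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -0.1, 0.1))

# Encoder: an LSTM reads the source sentence and summarizes it in its final state.
with tf.variable_scope("encoder"):
    enc_cell = tf.contrib.rnn.BasicLSTMCell(hidden_dim)
    _, enc_state = tf.nn.dynamic_rnn(
        enc_cell, tf.nn.embedding_lookup(embeddings, src_ids), dtype=tf.float32)

# Decoder: another LSTM starts from the encoder's state and predicts target tokens.
with tf.variable_scope("decoder"):
    dec_cell = tf.contrib.rnn.BasicLSTMCell(hidden_dim)
    dec_outputs, _ = tf.nn.dynamic_rnn(
        dec_cell, tf.nn.embedding_lookup(embeddings, tgt_ids),
        initial_state=enc_state)

# Project decoder outputs to a distribution over the target vocabulary at each step.
logits = tf.layers.dense(dec_outputs, vocab_size)
```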

Another interesting application of machine learning is in medical imaging. Jeff talked about a deep learning algorithm that has been developed and validated to detect cases of diabetic retinopathy, and a separate deep network for tumor detection. In some cases the tumor detection algorithm even beat out medical pathologists; sometimes a pathologist would notice a tumorous region they had previously missed only after the deep network detected it.

Ever since reading a little of Vinod Khosla's 20% Doctor Included write-up and listening to his Stanford Medicine X talk, I've been intrigued by the possibilities deep learning has in the medical field. On the other hand, I've been really concerned about the kind of ethical protocols that need to be instituted for this to be done safely: doctors have followed strict ethical guidelines on how to practice medicine for centuries, while we don't even know what a deep neural network really learns. Software engineers don't really have a strict ethical guideline governing the code they write (The Code I'm Still Ashamed Of), and there have already been plenty of issues raised by machine learning and artificial intelligence, from the Tay chatbot incident to Facebook's censoring of a symbolic, Pulitzer-prize-winning image. If we're going to implement deep learning technologies in something as critical as saving human lives, it would be wise to do a significant amount of research into the ethics of deploying such a system.

Finally, Jeff talked a little about the future of machine learning and the work towards artificial general intelligence. First, he said that today, success in machine learning generally follows the pattern success = ML expertise + a lot of data + a lot of computation. In the future, machine learning could become advanced enough that the ML expertise part of the equation can be taken out, and we'd get the pattern success = a lot of data + 100x computation. Some of the previous speakers touched on this idea as well: currently, doing machine learning successfully and at scale requires a team of researchers proficient in the most cutting-edge deep learning algorithms, but the eventual goal is to place the power of machine learning in the hands of the domain expert, such as the everyday biologist or real estate agent. One possible way of doing this, currently being studied, is to use reinforcement learning to train a model that generates other models from a probability distribution. For example, a "king" model could generate a bunch of (initially random) models to solve a particular problem, and the performance of these generated models can serve as a reward that is backpropagated through the "king" model's policy network, so that its policy for generating models is updated to favor the generation of better models.
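Here's a toy sketch of that "model that generates models" idea, in the spirit of neural architecture search. A simple controller keeps a distribution over a handful of architecture choices, samples candidate "models", scores them, and nudges its distribution with a REINFORCE-style policy-gradient step. Everything here is invented for illustration; in particular, the reward function is a stand-in for actually training the sampled model and measuring validation accuracy.

```python
# Toy "controller generates models" loop with a REINFORCE-style update.
import numpy as np

choices = [16, 32, 64, 128]      # e.g. candidate hidden-layer widths
logits = np.zeros(len(choices))  # controller parameters: one logit per choice
learning_rate = 0.1
baseline = 0.0                   # running reward baseline to reduce variance

def fake_reward(width):
    # Stand-in for "train the sampled model and measure validation accuracy".
    return 1.0 - abs(width - 64) / 128.0

for step in range(300):
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax over choices
    idx = np.random.choice(len(choices), p=probs)   # sample a candidate model
    reward = fake_reward(choices[idx])

    advantage = reward - baseline
    baseline = 0.9 * baseline + 0.1 * reward

    # REINFORCE: gradient of log prob of the sampled choice, scaled by the advantage.
    grad_log_prob = -probs
    grad_log_prob[idx] += 1.0
    logits += learning_rate * advantage * grad_log_prob

print(choices[int(np.argmax(logits))])  # the controller tends to settle on width 64
```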

Doing this kind of training would take a lot of compute — research has shown that large-scale reinforcement learning with policy gradients is quite expensive in terms of resources required. However, the core idea behind all this is that if advances in compute infrastructure continue, then this idea will become feasible. One can just train all of the models and then pick the best one. This idea seems kind of strange to me though, since I’ve been taught repeatedly in my computer science classes that efficiency — even in the presence of high-powered machines — is essential, and brute force algorithms are not the way to go.

Talk by Ilya Sutskever of OpenAI

This was another really interesting talk, about scaling reinforcement learning using evolution strategies. Evolution strategies are actually a well-studied family of algorithms (with research dating back to 1989), but they have only recently been applied to reinforcement learning problems.

First, Ilya talked about the potential for RL research to solve really interesting problems. While deep neural networks by themselves have driven a lot of progress in areas such as speech recognition, autonomous vehicle navigation, and image classification and captioning, reinforcement learning has the potential to eventually lead to artificial general intelligence. The main idea is that we currently have pretty good reinforcement learning algorithms that can be trained to do individual tasks (mostly playing games) pretty well, but they don't generalize across many different tasks. The eventual goal is to train these RL agents in several different environments so that they don't just perform one task really well, but instead generalize to other tasks. This is a big reason why OpenAI has released open-source environments for reinforcement learning, such as gym and universe: people can submit their own environments (such as tic-tac-toe, or even typical web browsing), and an RL agent can be trained to do the task well by learning only from the pixels in the hundreds of thousands of frames it is given.
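For anyone who hasn't used gym, the interaction loop is simple: the agent observes a state, picks an action, and gets back a reward. Here's a minimal snippet of my own (not from the talk) where the "agent" just acts randomly.

```python
# Minimal gym episode loop with a random agent.
import gym

env = gym.make("CartPole-v0")
observation = env.reset()

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # a real agent would choose actions from a learned policy
    observation, reward, done, info = env.step(action)
    total_reward += reward

print("episode reward:", total_reward)
```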

The evolution strategies algorithm is a pretty simple one. The big picture: add random noise (sampled from a Gaussian) to your parameters, accept the noise if your objective function improves, and reject it if it does not. Surprisingly, this works really well on reinforcement learning tasks, attaining the same level of quality as traditional methods such as policy gradients and DQN (evolution strategies are, however, really slow in the supervised learning context). Evolution strategies also parallelize really well, since the different nodes of a distributed system only need to broadcast scalars to each other, instead of gradient vectors that can get extremely long. Interestingly, it turns out that evolution strategies approximate a gradient much like the one a policy network computes via stochastic gradient descent, so the typical tricks for making deep neural networks and gradient descent work still apply (namely, the scale of your initialization, the learning rate, and other hyperparameters). The main downside to this approach is that it requires 3x-10x more training data than the policy gradient approach, so it may not work as well in environments where training data is limited.
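Below is a toy sketch of one common form of the evolution strategies update: the population-averaged version, which is what yields the gradient estimate mentioned above (the accept/reject description is a simpler hill-climbing variant of the same idea). The objective here is a made-up stand-in for "episode reward as a function of policy parameters"; in a real RL setup each evaluation would be a full environment rollout, farmed out to parallel workers that only need to exchange scalar returns (plus shared random seeds).

```python
# Toy evolution strategies loop: estimate an update direction from noisy perturbations.
import numpy as np

def objective(theta):
    # Stand-in "reward": higher is better, maximized at theta = [3, -2, 1].
    return -np.sum((theta - np.array([3.0, -2.0, 1.0])) ** 2)

theta = np.zeros(3)                       # "policy" parameters
sigma, alpha, pop_size = 0.1, 0.02, 50    # noise scale, step size, population size

for iteration in range(300):
    noise = np.random.randn(pop_size, theta.size)                     # Gaussian perturbations
    rewards = np.array([objective(theta + sigma * eps) for eps in noise])
    # Normalize rewards so the update is a weighted average of the noise directions.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    theta += alpha / (pop_size * sigma) * noise.T.dot(advantages)

print(theta)  # ends up close to [3, -2, 1] without ever computing an analytic gradient
```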

Overall, the conference was really interesting and I definitely learned a lot about the current state of machine learning research in both academia and industry, and where it is heading. I’m excited to see continued ML advancements in the future.