🚀 Watch Me Scale: A Story of Productivity in Deep Learning with Metaflow and Kubernetes

May 15, 2020

Prologue
I am a software engineer who has been building software for over 6 years. I came to Arizona State University in 2019 to study Artificial Intelligence and Machine Learning. In my journey through deep learning and AI, I encountered a life-saving library which greatly improved my research productivity. This blog post describes my experience with Metaflow through the lens of some projects I did to learn and apply deep learning.
This is my first blog post and I am a huge fan of long-form content. This post is meant to be a long-form article and there is no TLDR :) But I have tried to keep the story as simple and fun to read as possible. I have also moved one section to a separate linked page because the content was getting very long :) The code for all of these projects is open source, so if you need a reference from my projects, go ahead!

💡 Introduction

Exponential growth of the compute needed to train smarter AI systems every year.

Deep learning methods have become mainstream in the past 6–8 years because they’ve shown a lot of improvement in accuracy when given lots of data. But these algorithms and methods, in turn, require lots of experimentation with that data to get solutions right. OpenAI released an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase). The image shows the compute for SOTA models on a log scale plot.
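
As a rough sanity check on those figures, the arithmetic can be worked through directly. This is a back-of-the-envelope sketch, not OpenAI's own calculation: the function names are mine, and the time span is inferred from the 300,000x figure rather than taken from the analysis.

```python
import math

def doublings_for_growth(growth_factor: float) -> float:
    """Number of doublings needed to reach a given growth factor."""
    return math.log2(growth_factor)

def growth_over(months: float, doubling_months: float) -> float:
    """Growth factor after `months` with a fixed doubling time."""
    return 2 ** (months / doubling_months)

doublings = doublings_for_growth(300_000)     # how many doublings give 300,000x
span_months = doublings * 3.4                 # time implied by a 3.4-month doubling time
moore_growth = growth_over(span_months, 24)   # same span with a 2-year doubling time

print(f"{doublings:.1f} doublings over about {span_months / 12:.1f} years")
print(f"2-year doubling over the same span: about {moore_growth:.0f}x")
```

The second number comes out in the single digits, the same range as the 7x quoted above. The exact value depends on which start and end dates are used for the comparison.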

This growth in data and compute, along with the subsequent need for experimentation, becomes troublesome for Data Scientists/AI Researchers because their projects now require a broader set of skills than just the ability to understand and infer from data. This can lead to a large loss in productivity because data scientists/researchers individually have to take care of the versioning, scaling and data warehousing needs of a project. I first came to realize this when I created my first trashy (like all first projects) deep learning project on Fake News Detection with Hierarchical Attention Neural Networks. I built it in Keras and, like any noob newly navigating the deep learning space, I set up my own “model-checkpointing” and “hyperparameter versioning” methodologies. Although the words may sound fancy, I was “git committing” machine learning models and maintaining hyperparameters in a text file like a naive fool, even while knowing it’s not a good approach. Thinking about it today makes me cringe! Trust me, NEVER GIT COMMIT YOUR DL MODELS.

As first projects go, it was a great learning experience in deep learning. But it also left the impression that deep learning is an awfully painful process when it comes to managing all the “crap” that gets created along with the model.

Over the course of the last 6 months, I have delved deeper into “deep learning” to understand its capabilities in terms of real-world impact. Initially, I started playing around with different algorithms, network architectures, etc., but the same problem came up. How do I version my models/data/hyperparameters at scale so I can investigate the reasons for different performance outcomes between models? I believe this practice of “logging/versioning everything” is a very important foundation for any type of concrete AI or ML research. Survivorship bias and confirmation bias can skew a person’s perception towards a result that favors their “intuition”. Large volumes of data create a means through which the “truth” of a hypothesis can be measured objectively. So in the search for a nice versioning library, I stumbled across Metaflow. I was looking for versioning but I found a wonderful package of surprises with Metaflow.

❓What is Metaflow?

The Hierarchy of Needs in Data Science/Machine Learning/Artificial Intelligence.

During the last 6 months, I investigated different available options for versioning/scaling. There was a common pattern that emerged in the development of solutions for ML productivity: all solutions created in the ML productivity space leverage abstractions around the layers present in the hierarchy of needs shown in the above figure.

One of the most common approaches to solving such ML pipelining problems is to transform the workflow into a Directed Acyclic Graph (DAG) following the dataflow paradigm. Many currently available workflow engines, such as Airflow, Flyte, etc., are solving productivity problems for a lot of businesses. One of the biggest drawbacks of most of those technologies is their tight coupling with their computational backends, which makes them less useful for prototyping in research.

In December of 2019, Netflix open-sourced its internal Data Science Python library, Metaflow. The library was built to reduce the time-to-market of Data Science initiatives at Netflix by tackling common pain points related to infrastructure and data engineering.

Building a data product requires a broad set of skills and Data Scientists generally do not possess all of these skills. Metaflow provides a human-friendly way of bridging the gap between most of these skills.

An example of how Metaflow represents a given data flow graph as code (Metaflow Website)

Metaflow offers a simple Pythonic way of defining distributed steps of computation as a DAG, with common data being shared across steps. This enables the parallelization of experiments pertaining to the model. Distributed parallelization is enabled by running the individual steps in a Dockerized environment with an easy specification of resources. Metaflow stores the data produced by each step, so the cost of re-computation is reduced. This also enables quick prototyping and analytics. Metaflow also provides a way to quickly access results on completion of every workflow. The most powerful feature of Metaflow is that the library is self-contained, meaning the same source code runs on a local machine or on a Dockerized distributed computation setup. The library’s design choices around packaging isolated computation make it extremely adaptable to new compute and data warehousing abstractions.

After reading the library’s core and understanding its flexibility, I WAS SOLD. Even though Metaflow was open-sourced with AWS support, I realized that the library’s design made it extremely simple to integrate into other computing environments for quick scale-out. The flexible code structure of this library got me thinking about exploring two of my favorite aspects of computer science at the same time: Large Scale Cluster Computing and Artificial Intelligence. Over the last 4 months, I prototyped a plugin on Metaflow to support Kubernetes as a compute engine and built a robotics project using a well-known simulator with the help of this plugin. The robotics project leveraged Metaflow as a means to log metrics, craft different experiments, and scale out computation.

If you don’t know what Kubernetes is, please don’t worry. Kubernetes, in the most simplistic sense, is a platform that helps you deploy/configure/scale different applications across a cluster of servers by leveraging Docker containers. It grew out of Borg, an internal Google cluster manager in use since around 2003. Kubernetes itself was open-sourced in 2014 and reached version 1.0 in 2015.

There were two major reasons which inspired me to build a Kubernetes plugin on Metaflow.

Containerization for Reproducibility

Metaflow leverages the power of Docker containers to run isolated execution steps in a distributed fashion. Before this project, the library only supported the use of Docker containers when running on AWS Batch. Docker support for local runs is really important because it avoids the problem highlighted in the figure shown to the left. The current support for Kubernetes with Metaflow enables reproducing localized tests for production workflows.

With the help of Minikube (a local, single-machine Kubernetes cluster), one can prototype workflows on a local machine that mimic production architectures. Local prototyping can be done even for GPU workloads. Enabling efficient prototyping of flows through Docker containers on local machines makes a huge productivity difference while building large flows that have different dependencies.

Horrors of Vendor Lock-In

One major drawback of Metaflow was its AWS coupling. Because of this, anyone using the library’s capabilities is locked into AWS’s solution. For a business, this is called vendor lock-in, and it is dangerous: the business is restricted in making changes to its system because of tight coupling with certain vendors (e.g. cloud services like AWS, Digital Ocean, etc.). Working for 4 years at a growing startup in India teaches you how nasty the costs can get when you are vendor locked in. (THANKS AWS!) With open source, moving systems around becomes much simpler and, in turn, affects developer productivity less. This plugin will enable Metaflow to operate independently of any single cloud platform because it integrates with an open-source container orchestration framework.

The next section helps explain how Metaflow Works and the means through which I integrated Kubernetes.

🌟 The How: Workings of Metaflow

The Metaflow library is built to encompass three aspects of the development cycle which make use of the architecture shown above:

Development-Time i.e. when the code gets written.
Runtime i.e. when the code gets run.
Result-Time i.e. when the results of the run get used.

Development-Time Abstractions

Metaflow works on the Dataflow Paradigm.

The library uses @step decorators to mark isolated steps of the data flow. Every Flow is a class which inherits from FlowSpec. The methods in the class can have @step decorators to demarcate individual steps of computation.

Hierarchy of setup is: Flow → Step → Data
Decorators can be at a Flow level or at a Step Level.
Every Step can access properties set by previous Steps when executed in a distributed fashion.
Flow can contain parameters. Parameters can be files loaded at the start or constants set for the flow. Parameters can’t change over the course of the flow.
Steps decorated with special decorators like @kube or @batch would be executed on the specified distributed platform like Kubernetes or AWS Batch, respectively.
Environments for Steps are managed by decorators such as @kube, @batch, or @conda. @conda can help manage Python environments, while @kube and @batch can help manage the environment through Docker images.
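
To make the Flow → Step → Data hierarchy concrete, here is a toy imitation of the pattern that uses only the standard library. It is not Metaflow's implementation: `toy_step`, `ToyFlow`, and `TrainFlow` are hypothetical names, the fixed `order` tuple is a simplification, and real Metaflow flows inherit from `FlowSpec` and chain steps with `self.next(...)`.

```python
def toy_step(func):
    """Mark a method as a step (a stand-in for Metaflow's @step)."""
    func.is_step = True
    return func

class ToyFlow:
    """Runs marked steps in order; attributes set on self act as the flow's data."""
    order = ()  # step names in execution order (hypothetical simplification)

    def run(self):
        for name in self.order:
            method = getattr(self, name)
            if not getattr(method, "is_step", False):
                raise ValueError(f"{name} is not marked as a step")
            method()
        return self

class TrainFlow(ToyFlow):
    order = ("start", "train", "end")
    learning_rate = 0.1  # plays the role of a flow parameter that never changes

    @toy_step
    def start(self):
        self.data = [1.0, 2.0, 3.0]

    @toy_step
    def train(self):
        # Reads data set by the previous step and stores a new property.
        self.history = [x * self.learning_rate for x in self.data]

    @toy_step
    def end(self):
        self.summary = sum(self.history)

flow = TrainFlow().run()
print(flow.history, flow.summary)
```

The point of the sketch is only the shape: a class per flow, a decorator per step, and plain attributes as the data that later steps read.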

When should a Step be defined?

When choosing to create a Step for a unit of computation, consider the following:

Is it terminating code? A task is expected to exit after finishing processing its inputs.
Is it repeatable? Under certain circumstances, a task might be retried, rerun… etc. with the same inputs. It’s expected to produce the same outputs every single time.
Is it a pure function — i.e. does it have side-effects that are not known to the system (e.g. calls a web-service)? It’s strongly advisable to avoid side-effects in tasks. When side-effects are required, ensure that those operations are idempotent.
Is the base cost in time/energy, or the cost of re-computation, high enough to justify a separate preprocessing step?
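
The repeatability and idempotency points in the checklist above can be illustrated with a small sketch. The function name, seed, and file name are hypothetical. The idea is only that a retried task with the same inputs should produce the same outputs and leave the same state behind.

```python
import json
import os
import random
import tempfile

def preprocess(values, seed, out_path):
    """Deterministic given (values, seed); overwrites its output instead of appending."""
    rng = random.Random(seed)           # local, seeded RNG: no hidden global state
    shuffled = list(values)
    rng.shuffle(shuffled)
    tmp_path = out_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(shuffled, f)
    os.replace(tmp_path, out_path)      # swap the finished file in; a retry simply overwrites it
    return shuffled

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "shuffled.json")
    first = preprocess(range(10), seed=42, out_path=path)
    second = preprocess(range(10), seed=42, out_path=path)  # simulated retry
    with open(path) as f:
        on_disk = json.load(f)

print(first == second == on_disk)
```

An append-mode write or an unseeded shuffle in the same function would break both properties: the file would grow with every retry and the output would differ between attempts.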

Runtime Abstractions and Kubernetes Integration

When a flow is executed, a runtime (Run) is created which manages the scheduling of Tasks (Steps) during the flow. Tasks are executed either in the local environment as parallel Python processes or on a distributed computation platform as Docker containers. This behavior is influenced by the decorators set during development time. The Runtime manages the scheduling and monitoring of tasks and the entire flow.
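
The scheduling idea described above can be sketched with the standard library. This is a toy, not Metaflow's runtime: the graph, the step names, and `run_step` are hypothetical, and threads stand in for the separate processes or containers a real runtime would launch.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical flow graph: step -> set of steps it depends on.
graph = {
    "start": set(),
    "train_a": {"start"},
    "train_b": {"start"},
    "join": {"train_a", "train_b"},
}

def run_step(name):
    """Stand-in for launching a task; a real runtime would start a process or container."""
    return f"{name} done"

def run_flow(graph, max_workers=2):
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()          # steps whose dependencies are complete
            # Run every ready step concurrently, then mark each one as finished.
            for name, _ in zip(ready, pool.map(run_step, ready)):
                finished.append(name)
                sorter.done(name)               # unlocks downstream steps
    return finished

order = run_flow(graph)
print(order)
```

Here `train_a` and `train_b` become ready together once `start` finishes, which is where the parallelism comes from. `join` only runs after both have completed.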

Every Run is uniquely identified through the metadata store. The metadata store can live on your local file system or behind a REST API server backed by a Postgres database.

Running Metaflow on the local machine with File system based Metadata and Datastore

The artifacts (such as self.data or self.history in the figure above) generated during the execution of each task are stored in the datastore (S3 object store or local file system) in a version-controlled fashion because of the metadata store. The metadata store enables Metaflow to get the references needed to load the correct artifact during and after completion of a Run/Step. The best way to think of the metadata store is as something that holds references to the actual values. The actual values are stored in the datastore for each individual run.
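
The split between references and values can be sketched with two dictionaries. This is a toy under stated assumptions: the content-hash keying, the `(run_id, step, name)` lookup key, and the function names are mine for illustration and are not taken from Metaflow's storage code.

```python
import hashlib
import pickle

datastore = {}   # content hash -> serialized bytes (stands in for S3 or a local directory)
metadata = {}    # (run_id, step, artifact name) -> content hash

def save_artifact(run_id, step, name, value):
    blob = pickle.dumps(value)
    key = hashlib.sha1(blob).hexdigest()
    datastore.setdefault(key, blob)           # identical values are stored only once
    metadata[(run_id, step, name)] = key      # the metadata entry is just a reference
    return key

def load_artifact(run_id, step, name):
    key = metadata[(run_id, step, name)]      # look up the reference
    return pickle.loads(datastore[key])       # fetch the actual value

save_artifact("run-1", "train", "history", [0.9, 0.5, 0.3])
save_artifact("run-2", "train", "history", [0.9, 0.5, 0.3])  # same value, different run
save_artifact("run-2", "train", "data", {"rows": 100})

print(load_artifact("run-1", "train", "history"))
print(len(metadata), "references,", len(datastore), "stored blobs")
```

Every run keeps its own set of references, so results from an old run stay loadable after newer runs finish. That is the behaviour the versioning story relies on.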

Metaflow is able to achieve this with distributed computation, without specialized compute integration, because under the hood it invokes the same file (myflow.py in the above example) with different arguments to isolate individual step execution with the correct metadata. It sets command-line arguments (using the Click Python library) while packaging a Task for container execution. This ensures that the correct metadata is set at any point during the execution of a task.

The plugin I built leverages the library’s core abstractions to package “steps” for Kubernetes by just adding @kube decorators. These steps are packaged as a “BatchJob” on Kubernetes. The Metaflow Runtime monitors the job and schedules new steps when the job is completed. The plugin also enables users to deploy the Runtime process itself as a container job on Kubernetes, and it supports GPU provisioning for individual steps. The cluster setup automation for Kubernetes is built with Kops. More information on cluster setup can be found at the end, where the source code is shared.
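
A rough sketch of that mechanism: build a command that re-invokes the flow file for a single step, wrap it in a Kubernetes Job manifest, and parse the same arguments on the receiving side. Everything here is illustrative. The flag names only mirror the general shape of the idea, `argparse` is used instead of Click to stay within the standard library, and the job name, image, and helper functions are hypothetical rather than taken from the plugin.

```python
import argparse
import shlex

def build_step_command(flow_file, step_name, run_id, task_id):
    """Command a runtime could run inside a container to execute a single step."""
    return ["python", flow_file, "step", step_name,
            "--run-id", str(run_id), "--task-id", str(task_id)]

def build_job_manifest(job_name, image, command, gpus=0):
    """Minimal Kubernetes batch/v1 Job that wraps the command."""
    container = {"name": job_name, "image": image, "command": command}
    if gpus:
        # GPU request via the resource name exposed by the NVIDIA device plugin.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "backoffLimit": 0,
            "template": {"spec": {"containers": [container], "restartPolicy": "Never"}},
        },
    }

def parse_step_args(argv):
    """The other side: the flow file works out which step it was asked to run."""
    parser = argparse.ArgumentParser(prog="myflow.py")
    sub = parser.add_subparsers(dest="command", required=True)
    step = sub.add_parser("step")
    step.add_argument("step_name")
    step.add_argument("--run-id", required=True)
    step.add_argument("--task-id", required=True)
    return parser.parse_args(argv)

command = build_step_command("myflow.py", "train", run_id=12, task_id=3)
manifest = build_job_manifest("myflow-train-12-3", "python:3.8", command, gpus=1)
args = parse_step_args(command[2:])   # drop "python myflow.py"

print(shlex.join(command))
print(manifest["kind"], args.step_name, args.run_id, args.task_id)
```

Because the run and task identifiers travel with the command, the code inside the container knows which metadata and artifacts belong to it without any extra coordination service.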

Result-Time Abstractions

Each executed or executing Run can be accessed using the Metaflow Client. Every Run object contains the properties set by the steps. These can be used for analysis in notebooks. For examples, check out the code linked at the end. For more information regarding Run in Metaflow, see the Metaflow documentation.

The next section explains one of my projects that leveraged Metaflow.

📈 Metaflow-Powered-Project: Deep Learning In Robotics

My robotics project has been an excellent contributor to my knowledge and understanding of deep learning. I wanted to mention this project to showcase how versioned machine learning experimentation can influence how quickly one finds the best model. This project is also meant to showcase quick scale-out after local prototyping.

Project Specification

This project was a means through which I learned deep learning in Robotics. In this project, I worked with 2 of my colleagues to train a large number of robot agents and compare the performance of those agents with all previous agents we trained.

We leveraged the plugin on Metaflow to parallelize training and simulation of multiple robot agents on Kubernetes clusters. Metaflow also enabled fast versioning of all models and hence easy reproducibility of the same results.

Problem And Solution

The problem we were trying to solve was training a robot to reach a point in space.

We leveraged RLBench as a robotics RL simulation gym, similar to OpenAI Gym. We built a stable abstraction over RLBench to quickly plug and play new agents, environments, reward functions, etc. This abstraction was essential because it provided experimentation flexibility when used with Metaflow.

We established Metaflow-based pipelines to quickly scale out experimentation with agents. Our goal was to find the agent which could learn the most generalized “Policy” for solving the task, meaning the robot should solve the task across different types of environments, perturbations, etc.

We created two flows to train and evaluate the robots. These flows were built with the flexibility to plug as many agents as we could create.

One flow leveraged Reinforcement Learning. In this flow, the robot was trained using an RL algorithm called DDPG and then simulated in different environments to evaluate performance.
One flow leveraged Imitation Learning (learning from expert demonstrations). In this flow, expert demonstration data was used for supervised training of the robot. Then the robot was evaluated in different environments.

This entire flexible training setup enabled me and my colleagues to create and experiment with various neural network architectures and features.

Interesting Observations Around Training

Low Setup Overhead

We successfully trained more than 50 model variations over the course of a few weeks, across different training sessions, while comparing performance evaluations against past data. We were able to achieve this thanks to the fast cluster setup from the automation I built for Kubernetes and a metrics logging strategy for quick analysis after a flow completed.

Focus On Analytics Instead of Scale

Gradients collected from a deep learning experiment showing average gradient flowing across different layers at different epochs. Each plot represents a different network that was trained.

Metaflow enabled the focus on the analytics of the models instead of worrying about the overhead of versioning and scale-out setup. And so, I built lots of visualizations as a part of the project. These visualizations helped me understand how to tune and tweak the networks.

Gradient collection and visualization proved to be very helpful in tuning our neural networks. Because of flexible code design we were able to collect gradients for all experiments to analyze how well the networks were learning with each epoch.
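
A minimal sketch of this kind of gradient logging, using only the standard library. It is not the project's code: `GradientLog`, the layer names, the threshold, and the numbers are all made up for illustration, and in a real training loop the per-layer gradients would come from the deep learning framework.

```python
from collections import defaultdict
from statistics import mean

class GradientLog:
    """Collects the mean absolute gradient per layer, one value per epoch."""

    def __init__(self):
        self.history = defaultdict(list)   # layer name -> list with one entry per epoch

    def record(self, layer_grads):
        """layer_grads maps a layer name to a flat list of gradient values for one epoch."""
        for layer, grads in layer_grads.items():
            self.history[layer].append(mean(abs(g) for g in grads))

    def vanishing_layers(self, threshold=1e-4):
        """Layers whose latest average gradient magnitude fell below the threshold."""
        return [layer for layer, values in self.history.items() if values[-1] < threshold]

log = GradientLog()
# Made-up gradients for two epochs of a two-layer network.
log.record({"dense_1": [0.2, -0.4], "dense_2": [0.01, -0.03]})
log.record({"dense_1": [0.1, -0.1], "dense_2": [0.00001, -0.00003]})

print(dict(log.history))
print(log.vanishing_layers())
```

Storing `log.history` as a property of a step is what makes it available later for plotting, so networks can be compared layer by layer after the flow has finished.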

This greatly helped in picking and tuning the best model out of all the models we trained. The figure on the left shows the gradient plots which helped finally tweak the best model. My analysis became a lot clearer because I was logging all data and building reusable code which could quickly plot results to help understand different evaluation metrics for a model.

Best running agent which could reach the goal 75% of the time.

💡 Vision For the Future

Metaflow has leveraged a neat design paradigm through which it can have a large productivity impact. But there is still a lot of room for improvement. This section highlights some aspects where Metaflow can be enriched by the opensource community to support richer features catering to a larger audience.

Distributed Deep Learning Support

SOTA in Deep Learning is evolving exponentially, and so is the number of parameters in the models.

With the increasing size of DL models, speed in training is a productivity bottleneck. The experiments I conducted leveraged Data Parallelism for model training. Data parallelism can work well for a monolith server. But it can be made even faster if the training is distributed across multiple machines or separate processes that work over individual GPUs.
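
The core idea of data parallelism fits in a few lines: each worker computes a gradient on its own shard of the batch, and the results are averaged. The sketch below is a toy with a one-parameter linear model and made-up data. It runs the "workers" sequentially and assumes the batch divides evenly across them.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_gradient(w, batch, num_workers):
    """Split the batch into equal shards, compute one gradient per shard, then average."""
    shard_size = len(batch) // num_workers    # assumes len(batch) is divisible by num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    shard_grads = [mse_gradient(w, shard) for shard in shards]   # one per worker or GPU
    return sum(shard_grads) / num_workers                         # the averaging ("all-reduce") step

batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 1.5
full = mse_gradient(w, batch)
parallel = data_parallel_gradient(w, batch, num_workers=2)
print(full, parallel)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, which is why the work can be spread across GPUs or machines without changing the update. The cost moves into communicating and averaging the gradients, which is the part libraries for distributed training aim to make efficient.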

In February 2020 Microsoft announced DeepSpeed, a deep learning optimization library that makes distributed training easy, efficient, and effective. Microsoft packages the concepts of distributed data parallelism, pipeline parallelism, etc. in a very simple manner to build really large neural networks. DeepSpeed was used to build a language model (Turing NLG) with over 17 billion parameters. As of May 2020, this is the largest language model (SOTA changes so fast!).

Metaflow doesn’t have built-in support for distributed training. But the library’s design is such that a special fork could cater only to deep learning and integrate a wrapper around currently available platforms like DeepSpeed. Integration with DeepSpeed could add a lot of value.

In-Built Metrics Collection For Flows

Metrics collection for generalized deep learning experiments can have direct or plugin-able integration into a Flow. Metrics can include loss of models, gradients, accuracies, analytics, etc. The means to do this is open for discussion and contribution in opensource. Utility tooling like this makes the process of testing and prototyping faster.

New Schedulers for more Fault-Tolerant Scaling

I used Metaflow’s native scheduler to schedule the different steps in the flow. This scheduler could be swapped for other production-grade DAG schedulers like Argo and Airflow.

GCP Integrations

The Kubernetes setup and automation for Metaflow-ready GPU clusters have been tested on AWS. Support for GCP-based clusters in the future can make the library very powerful. An open-source integration into Kubernetes with GCP is also underway at Freenome.

GUI For Metaflow

GUI will always be a value add.

Conclusion

Fast Scalable Compute Comes Cheap If Used Mindfully

Deep Learning is very tightly coupled with experimentation. Good solutions to problems require more than one test/iteration in modeling/data-transformation.

There can be no objectivity without enough data/experimentation

By leveraging technologies like Metaflow and Docker images, research ideas for models can quickly be evaluated and iterated on with small datasets sampled from really large datasets. The computational requirements for this type of training can be satisfied with cheaper computational options.
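
One simple way to get such a small dataset is a seeded random sample, so that every prototype run sees the same subset. This is only a sketch: the function name, the fraction, and the list standing in for a large dataset are hypothetical.

```python
import random

def sample_for_prototyping(records, fraction, seed=0):
    """Return a reproducible random subset of `records`."""
    rng = random.Random(seed)                     # fixed seed: the same subset on every run
    k = max(1, int(len(records) * fraction))      # always keep at least one record
    return rng.sample(records, k)

full_dataset = list(range(100_000))               # stand-in for a really large dataset
small = sample_for_prototyping(full_dataset, fraction=0.01, seed=7)
again = sample_for_prototyping(full_dataset, fraction=0.01, seed=7)
print(len(small), small == again)
```

Keeping the seed alongside the other hyperparameters of a run means the exact subset can be rebuilt later when comparing models.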

Actual computation investment only happens for models with good promising results, even on small datasets. As large object stores like S3 are cheap for storage, if training on really large datasets is reserved for models which tested best with sparse data, then storage/compute costs remain nominal. Minio is another opensource datastore alternative that can shave S3 costs.

The cherry on top: it takes 5 lines of code to spin up and tear down a production-grade n-server Kubernetes cluster on AWS/GCP/OpenStack. Otherwise, AWS Batch is always there to ensure HA job deployments on AWS with Metaflow.

Role of Design in Scalability

Good design leads to quick scalability. Libraries like Metaflow, Airflow, etc., can help yield greater productivity by providing tools to leverage good design. But, ultimately, it’s in the hands of the developer to ensure efficient design for scalable ML workflows.

Bad Design Choices don’t get fixed by Metaflow.

I came to see such a bottleneck when training large models. Metaflow operates with the assumption of unfettered data access. Data growth can be very quick if models are large. Metaflow stores data based on the properties set during the flow. I was storing 8.5GB of data for a Flow that contained over 15 very large neural networks. In such situations, design choices can make a huge difference in terms of storage and access.
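
As a hypothetical illustration of such a design choice (not the example from my project), compare keeping every trained network as flow data with keeping metrics for all of them and weights only for the best one. The network names, scores, and fake weights below are invented, and pickled size is used as a rough proxy for what would land in the datastore.

```python
import pickle

def fake_weights(n_values, fill):
    """Stand-in for a trained network's weights."""
    return [fill] * n_values

# Hypothetical results from training 15 candidate networks.
candidates = [
    {"name": f"net_{i}", "score": 0.5 + 0.03 * i, "weights": fake_weights(50_000, float(i))}
    for i in range(15)
]

# Option A: keep every network, weights included.
store_everything = {"models": candidates}

# Option B: keep metrics for all networks, but weights only for the best one.
best = max(candidates, key=lambda c: c["score"])
store_lean = {
    "metrics": [{"name": c["name"], "score": c["score"]} for c in candidates],
    "best_model": best,
}

size_all = len(pickle.dumps(store_everything))
size_lean = len(pickle.dumps(store_lean))
print(f"everything: {size_all} bytes, lean: {size_lean} bytes")
```

The metrics are what most later analysis needs, and they cost almost nothing to keep. Weights for the discarded candidates can be regenerated from the versioned code and hyperparameters if they are ever needed again.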

With AI and deep learning on the rise, the need for quick, scalable compute is becoming more and more urgent. As the complexity and size of deep learning models grow, thorough empirical research is required to ensure safe AI systems. Libraries like Metaflow ease a lot of pain in that process and, in turn, ensure that scientific rigor is not lost because of an inability to scale or version experiments.

⚖️ Trade-offs and Related Works

This section describes the various trade-offs and limitations of Metaflow and some related, widely used open-source technologies. It also covers a design comparison with various other available technologies. This section would have made the blog post really long, so it’s kept separately from the blog post.

References and Special Credits

Talk is cheap. Show me the Code.

Core Library integration Fork
Kubernetes Setup Documentation and Scripts
Tensorflow Training Demo With Metaflow and Kubernetes: MNIST Dataset Classification
Deep Learning In Robotics with RLBench
Mini-Image Training on Kubernetes Cluster for Large Classification Models Like VGG and ResNet.

Special Credits

A special thanks to Kamalesh Kalirathinam (kamalesh1@asu.edu), Shravan S Rai (srai25@asu.edu), and Jacob Fang (zfang29@asu.edu) who have been a great support while building both these projects.