🚀 Watch Me Scale: A Story of Productivity in Deep Learning with Metaflow and Kubernetes

💡 Introduction

Exponential Growth of Compute for Training Smarter AI Systems Every Year
Exponential Growth of Compute Needed in AI

❓What is Metaflow?

The Hierarchy of Needs in Data Science/Machine Learning/Artificial Intelligence.
An example of how Metaflow represents a given data flow graph as code (Metaflow Website)

Containerization for Reproducibility

The Ultimate Developer Excuse

Horrors of Vendor Lock-In

One major drawback of Metaflow was its tight coupling to AWS. Anyone using the library’s capabilities is locked into AWS’s solution. Businesses call this vendor lock-in: a scenario where a business is restricted in making changes to its systems because of tight coupling with certain vendors (e.g. cloud services like AWS, Digital Ocean, etc.). Vendor lock-in is dangerous for businesses. Four years of working for a growing startup in India teaches you how nasty the costs can get when you are locked in to a vendor. (THANKS AWS!) With open source, moving systems around becomes much simpler and, in turn, affects developer productivity less. This plugin will enable Metaflow to operate independently of any cloud platform by integrating it with an open-source container orchestration framework.

🌟 The How: Workings of Metaflow

The architecture of Metaflow
  • Runtime i.e. when the code gets run.
  • Result-Time i.e. when the results of the run get used.

Development-Time Abstractions

Metaflow works on the Dataflow Paradigm.

  • Decorators can be at a Flow level or at a Step Level.
  • Every Step can access properties set by previous Steps when executed in a distributed fashion.
  • Flow can contain parameters. Parameters can be files loaded at the start or constants set for the flow. Parameters can’t change over the course of the flow.
  • Steps decorated with special decorators like @kube or @batch would be executed on the specified distributed platform like Kubernetes or AWS Batch, respectively.
  • Environments for Steps are managed by decorators such as @kube, @batch, or @conda. @conda can help manage Python environments, while @kube and @batch can help manage the environment through Docker images.
  • Is it repeatable? Under certain circumstances, a task might be retried, rerun… etc. with the same inputs. It’s expected to produce the same outputs every single time.
  • Is it a pure function — i.e. does it have side-effects that are not known to the system (e.g. calls a web-service)? It’s strongly advisable to avoid side-effects in tasks. When side-effects are required, ensure that those operations are idempotent.
  • Is the base cost of re-computation (time/energy) high enough to justify preprocessing?
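
The questions above about repeatability and side-effects can be made concrete with a small sketch. This is plain Python, not Metaflow's API: `input_key`, `run_task`, and `demo` are hypothetical names invented for illustration. The assumed idea is that a task derives its output location purely from its inputs, so a retry with the same inputs overwrites the same file with the same content instead of producing a second, diverging result.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def input_key(inputs: dict) -> str:
    """Derive a stable key from the task inputs (sorted so key order is irrelevant)."""
    payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def run_task(inputs: dict, out_dir: Path) -> Path:
    """Hypothetical task: scale a list of numbers and persist the result.

    The output path depends only on the inputs, so a retry overwrites the
    same file with the same bytes instead of creating a second copy.
    """
    result = [x * inputs["scale"] for x in inputs["values"]]
    out_path = out_dir / f"result-{input_key(inputs)}.json"
    out_path.write_text(json.dumps(result))
    return out_path


def demo() -> tuple:
    """Run the task twice with identical inputs and report what was written."""
    inputs = {"values": [1, 2, 3], "scale": 2}
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp)
        first = run_task(inputs, out_dir)
        second = run_task(inputs, out_dir)  # simulated retry
        files = sorted(p.name for p in out_dir.iterdir())
        contents = json.loads(first.read_text())
    return first == second, len(files), contents
```

Calling `demo()` runs the task twice and reports whether both runs wrote to the same path, how many files exist afterwards, and what the file contains. A side-effect such as a web-service call would need an equivalent guard (for example, a request key derived from the inputs) to stay idempotent.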

Runtime Abstractions and Kubernetes Integration

When a flow is executed, a runtime (Run) is created, which manages the scheduling of Tasks (Steps) during the flow. Tasks are executed either in the local environment as parallel Python processes or on a distributed computation platform, for example inside a Docker container. This behavior is determined by the decorators set at development time. The runtime schedules and monitors the individual tasks and the flow as a whole.
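
To illustrate what "scheduling tasks" means here, below is a minimal sketch of a runtime that runs steps once their upstream steps have finished. It is not Metaflow's implementation: the step functions, `STEPS`, `DEPENDS_ON`, and `run_flow` are hypothetical, and the single shared `artifacts` dict is a simplification of how properties flow from one step to the next. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def start(artifacts):
    artifacts["numbers"] = [1, 2, 3]


def square(artifacts):
    artifacts["squares"] = [n * n for n in artifacts["numbers"]]


def total(artifacts):
    artifacts["total"] = sum(artifacts["numbers"])


def end(artifacts):
    artifacts["report"] = (artifacts["squares"], artifacts["total"])


# Hypothetical flow: each step name maps to the steps it depends on.
STEPS = {"start": start, "square": square, "total": total, "end": end}
DEPENDS_ON = {
    "start": set(),
    "square": {"start"},
    "total": {"start"},
    "end": {"square", "total"},
}


def run_flow(steps, depends_on):
    """Run every step once all of its upstream steps have finished."""
    artifacts = {}
    order = []
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    while sorter.is_active():
        # get_ready() yields steps whose dependencies are all done. This
        # sketch runs them one by one; a real runtime could instead launch
        # them as parallel processes or containers.
        for name in sorter.get_ready():
            steps[name](artifacts)
            order.append(name)
            sorter.done(name)
    return artifacts, order
```

`run_flow(STEPS, DEPENDS_ON)` returns the final artifacts and the order in which steps ran. `square` and `total` become ready at the same time, which is the point where a runtime can choose between local parallel processes and a distributed platform.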

Running Metaflow on the local machine with a file-system-based Metadata store and Datastore
Metaflow Kubernetes-based job execution.

Result-Time Abstractions

Each executed or executing Run can be accessed using the Metaflow Client. Every Run object contains the properties set by its steps, which can be used for analysis in notebooks. For examples, check out the code examples at the end. For more information on Run, see the Metaflow documentation.

📈 Metaflow-Powered-Project: Deep Learning In Robotics

My robotics project has been an excellent contributor to my understanding of deep learning. I mention it here to showcase how versioned machine learning experimentation can influence how quickly one finds the best model. The project is also meant to showcase quick scale-out after local prototyping.

Project Specification

This project was the means through which I learned deep learning in robotics. I worked with two of my colleagues to train a large number of robot agents and compare the performance of each new agent with all the agents we had trained before.

Problem And Solution

The problem we were trying to solve was training a robot to reach a point in space.

  • One Flow leveraged Imitation Learning (learning from expert demonstrations). In this flow, expert data was used for supervised training of the robot. The robot was then evaluated in different environments.

Interesting Observations Around Training

Low Setup Overhead

Gradients collected from a deep learning experiment showing average gradient flowing across different layers at different epochs. Each plot represents a different network that was trained.
Best running agent which could reach the goal 75% of the time.

💡 Vision For the Future

Metaflow has leveraged a neat design paradigm through which it can have a large productivity impact. But there is still a lot of room for improvement. This section highlights some areas where the open-source community can enrich Metaflow with features that cater to a larger audience.

Distributed Deep Learning Support

SOTA in Deep Learning is evolving exponentially, and so is the number of model parameters.

In Built-Metrics Collection For Flows

Metrics collection for generalized deep learning experiments could be integrated into a Flow either directly or as a plugin. Metrics can include model losses, gradients, accuracies, analytics, etc. The means to do this is open for discussion and contribution in open source. Utility tooling like this makes testing and prototyping faster.

New Schedulers for more Fault-Tolerant Scaling

I used Metaflow’s native scheduler to schedule the different steps in the flow. This scheduler could be swapped for other production-grade DAG schedulers like Argo and Airflow.

GCP Integrations

Kubernetes setup and automation for Metaflow-ready GPU clusters have been tested on AWS. Support for GCP-based clusters in the future could make the library very powerful. An open-source integration into Kubernetes with GCP is also underway at Freenome.

GUI For Metaflow

A GUI will always be a value add.


Fast Scalable Compute Comes Cheap If Used Mindfully

Deep Learning is very tightly coupled with experimentation. Good solutions to problems require more than one test/iteration in modeling/data-transformation.

There can be no objectivity without enough data/experimentation

Role of Design in Scalability

⚖️ Trade-offs and Related Works

References and Special Credits

Talk is cheap. Show me the Code.

Special Credits

A special thanks to Kamalesh Kalirathinam (kamalesh1@asu.edu), Shravan S Rai (srai25@asu.edu), and Jacob Fang (zfang29@asu.edu) who have been a great support while building both these projects.


