A Review of Netflix’s Metaflow

Axel Goblet
Dec 20, 2019

tl;dr Metaflow is a framework that alleviates several infrastructure-related pains data scientists experience in their projects. It takes care of versioning, dependency management, and management of compute resources. If you use AWS, Metaflow can help you structure your data science projects from early development to production. For problems that it does not solve, other tools exist that do the job. If you are not using AWS, Metaflow is not that useful.


Recently, Netflix open-sourced Metaflow, their internal data science platform. The platform was built to reduce the time-to-market of data science initiatives at Netflix by tackling common pain points related to infrastructure and software engineering. Building a data product requires a broad set of skills. Data scientists generally do not possess all of these skills. The figure below visualizes this well. Data scientists want to focus more on solving business problems by building machine learning models, rather than setting up an excellent infrastructure to facilitate this process.

Figure 1: aspects of a data product

Metaflow takes some of this infrastructure-heavy work off data scientists' shoulders. Being a platform rather than a point solution, it does not solve a single problem; it serves as a generic framework for increasing data scientists' productivity during both experimentation and productionization.

In this article, I will briefly highlight Metaflow's features. Afterwards, I will share my opinion on it and explain when it is useful for you. At first sight, it reminded me of Apache Airflow, so I will also compare the two tools on a few facets.

Note: I have not worked with Metaflow in a real-world scenario. The review is based solely on my first impressions of reading the Metaflow documentation and code.


Metaflow Features

Structure

Figure 2: an example Metaflow DAG

Compute Resources

Metaflow takes care of serializing, persisting, and de-serializing objects between steps. Take the graph above as an example. In the start step, self.data is filled with the loaded data. In the fitA step, self.data is then used to fit a model. When running this workload locally using a single Python interpreter, self.data is obviously available in both steps. When running on AWS Batch, multiple Python interpreters will be used, potentially on multiple compute nodes. Metaflow handles the communication between steps for you by pickling objects that are passed to subsequent steps and storing them in S3. Metaflow also ships with a high-throughput S3 client, which is integrated with Metaflow's versioning feature.
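
To make the state passing concrete, here is a minimal sketch of such a flow. It is my own toy example rather than code from the article or the Metaflow documentation; the data and the "model" are placeholders.

from metaflow import FlowSpec, step

class TrainFlow(FlowSpec):
    # Toy flow: self.data assigned in start is transparently available in fitA.

    @step
    def start(self):
        # Metaflow pickles self.data after this step and, when running on
        # AWS Batch, persists it to S3 before fitA starts.
        self.data = [1.0, 2.0, 3.0, 4.0]
        self.next(self.fitA)

    @step
    def fitA(self):
        # The deserialized self.data is available here, even if this step
        # runs in a different interpreter or container than start did.
        self.model = sum(self.data) / len(self.data)  # placeholder "model"
        self.next(self.end)

    @step
    def end(self):
        print("model:", self.model)

if __name__ == "__main__":
    TrainFlow()

You would run this locally with python train_flow.py run, or on AWS Batch with python train_flow.py run --with batch.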

The serialized intermediate results of your DAG are part of Metaflow's versioning functionality. Whenever the DAG is run, the serialized objects are stored in a new, unique location. This is handled by the Metaflow library. Alongside the data, metadata about each run is stored, and this metadata can be queried via the Metaflow Python client library.
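
As an illustration of the client library (reusing the toy flow sketched above, so the flow and artifact names are my own), inspecting a past run and its artifacts looks roughly like this:

from metaflow import Flow

run = Flow("TrainFlow").latest_run        # most recent run of the flow
print(run.id, run.finished, run.created_at)

# Artifacts are loaded and deserialized lazily on access.
print(run["start"].task.data.data)        # the list loaded in the start step
print(run["fitA"].task.data.model)        # the "model" produced in fitA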

In Metaflow, you can use the @resources decorator to define the required resources of a step. These resources will then be provided by AWS Batch, if they are available in the cluster. When running in local mode, the decorator is ignored. This gives you a way to tweak resources per step, and it also allows you to request GPUs. In addition to the @resources decorator, Metaflow provides the parallel_map function for multi-core processing, similar to Python's multiprocessing library. Splitting a step over multiple compute nodes is also possible, using Metaflow's foreach feature.
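
A quick sketch of what this looks like in code (the resource numbers, step names, and configs are arbitrary): @resources annotates a step, and foreach fans a step out over a list.

from metaflow import FlowSpec, step, resources

class ScaleFlow(FlowSpec):

    @step
    def start(self):
        self.configs = ["small", "medium", "large"]
        # Fan out: one train task per element of self.configs.
        self.next(self.train, foreach="configs")

    @resources(memory=16000, cpu=4, gpu=1)  # honored on AWS Batch, ignored locally
    @step
    def train(self):
        self.result = "trained " + self.input  # self.input is the current foreach item
        self.next(self.join)

    @step
    def join(self, inputs):
        # Collect the results of the parallel branches.
        self.results = [inp.result for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        print(self.results)

if __name__ == "__main__":
    ScaleFlow()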

Error Handling

Dependency Management

Many Python packages require non-Python dependencies, such as compilers and database drivers. These dependencies are usually not available to Metaflow steps, as steps are executed in a vanilla Python Docker image by default. If you need non-Python dependencies, you can run your steps in a custom Docker image.
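
For example (the image name and the ODBC driver scenario below are hypothetical), you can point an individual step at your own image via the @batch decorator, or the whole run via the command line:

from metaflow import FlowSpec, step, batch

class DbFlow(FlowSpec):

    @batch(image="my-registry/python3.7-with-odbc:latest")  # hypothetical custom image
    @step
    def start(self):
        import pyodbc  # available because the image ships the ODBC driver
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    DbFlow()

Or, for all steps at once: python db_flow.py run --with batch:image=my-registry/python3.7-with-odbc:latest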


The Review

Interface

State

This state sharing is different from e.g. Airflow, where every step generally starts and ends with data at rest in a data store. Airflow does offer state sharing through XComs, but it requires you to explicitly send and retrieve the state in your steps. An alternative to XCom in Airflow is to use the output of the previous step as the input of the next step. If you wish to pass state between steps in Airflow, you have to think about (de-)serialization yourself. Because of this, Airflow comes with various components that handle the interaction with many popular external systems for you. This provides quicker integration with data stores other than S3, but adds more complexity to your DAG as well. This difference between Airflow and Metaflow stems from their design goals: Airflow has a strong focus on ETL pipelines, whereas Metaflow is built for Python-based machine learning workflows.
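
For comparison, a minimal Airflow (1.10-style) DAG with explicit XCom passing might look like the sketch below; the task names and the value being passed are made up.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def produce(**context):
    # Explicitly push a small, serializable value to XCom.
    context["ti"].xcom_push(key="row_count", value=42)

def consume(**context):
    # Explicitly pull it back in the downstream task.
    print(context["ti"].xcom_pull(task_ids="produce", key="row_count"))

with DAG("xcom_example", start_date=datetime(2019, 1, 1), schedule_interval=None) as dag:
    t1 = PythonOperator(task_id="produce", python_callable=produce, provide_context=True)
    t2 = PythonOperator(task_id="consume", python_callable=consume, provide_context=True)
    t1 >> t2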

Cloud Environment

Missing Features

Metaflow does not help you serve your models in production, for example via a REST API or a Kafka consumer. Again, this problem is tackled by open-source (Seldon) and AWS-based (SageMaker) tools.

Scalability

For the cases where horizontal scaling is useful, Metaflow provides you with a straightforward way of achieving it in your steps. Under the hood, AWS Batch spawns a container on ECS for every Metaflow step that is run. If individual steps take long enough, the scaling benefits should outweigh the overhead of spawning these containers.

You cannot specify the type of GPU requested with the @resources decorator. AWS offers various types of GPUs, and they can be added to your AWS Batch cluster, but selecting the GPU you would like to use for your machine learning model is not possible in Metaflow. As the right GPU depends on the model architecture being trained, being able to specify this would be a welcome feature.

Dependency Management

Local packages are bundled with your DAG code for remote execution and added to the path. This makes it easy for the various steps of your DAG to reference that code. It also allows you to develop your code as a Python package without having to install it on your remote hosts or bake it into a Docker image, which makes imports and testing very easy.
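
As a sketch of what this enables (the layout, package, and helper names are my own), a flow file can live next to a local package and import from it directly; Metaflow ships the package along with the flow code when steps run remotely.

# Project layout (illustrative):
#   flow.py
#   mypackage/
#       __init__.py
#       preprocessing.py

# flow.py
from metaflow import FlowSpec, step
from mypackage.preprocessing import clean  # hypothetical helper, shipped with the flow code

class PackagedFlow(FlowSpec):

    @step
    def start(self):
        self.rows = clean(["  a ", " b"])
        self.next(self.end)

    @step
    def end(self):
        print(self.rows)

if __name__ == "__main__":
    PackagedFlow()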

The option to provide your own Docker container to run your steps in greatly increases the flexibility of the product. By using this, you should be able to handle most data science project dependencies.

Lock-in

Most Metaflow features can be added to a step using decorators. Metaflow provides base classes for step- and DAG-level decorators, which allow you to build your own decorators with custom logic. This extensibility greatly reduces the lock-in of the product, although no community-maintained plugins exist yet.


The Verdict


About the author

Axel Goblet
bigdatarepublic
DATA SCIENCE | BIG DATA ENGINEERING | BIG DATA ARCHITECTURES

Thanks to Dick Abma, Steven Reitsma, and Ruurtjan Pul
