Easy life with Metaflow for data scientists

Dmytro Vedetskyi
DevOops World … and the Universe
6 min read · Apr 5, 2021

Overview

We’re going to review a modern and simple solution for data scientists that also makes life easier for developers and DevOps engineers.

Let’s start the discussion with what Metaflow actually is.

Metaflow is a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects. Metaflow was originally developed at Netflix to boost the productivity of data scientists who work on a wide variety of projects, from classical statistics to state-of-the-art deep learning.

Metaflow provides a unified API to the infrastructure stack that is required to execute data science projects, from prototype to production.

Architecture Diagram

The library uses several components that make it easier to execute Metaflow jobs both on a local machine and on AWS.

Here is the list of the components used by Metaflow:

+============+==================+==============================+
| Service    | Local            | AWS                          |
+============+==================+==============================+
| Datastore  | Local Directory  | S3                           |
+------------+------------------+------------------------------+
| Compute    | Local Process    | Batch                        |
+------------+------------------+------------------------------+
| Metadata   | Local Directory  | Fargate + RDS                |
+------------+------------------+------------------------------+
| Notebooks  | Local Notebook   | SageMaker Notebooks          |
+------------+------------------+------------------------------+
| Scheduling | -                | Step Functions + EventBridge |
+------------+------------------+------------------------------+

Datastore

The datastore is a centralized repository for all data that is leveraged by and generated by Metaflow flows. Data artifacts are stored in a local directory when local mode is used. Metaflow can also integrate with Amazon S3 for cloud-scale storage, so you can process and persist larger amounts of data more easily.
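To make this concrete, here is a minimal sketch (ArtifactFlow is a made-up name, not part of the example repository): anything assigned to self inside a step is persisted as a data artifact in the configured datastore, a local .metaflow directory by default or Amazon S3 in cloud mode.

from metaflow import FlowSpec, step

class ArtifactFlow(FlowSpec):

    @step
    def start(self):
        # Assigned to self, so Metaflow persists it as an artifact
        # in the configured datastore (local directory or S3).
        self.numbers = list(range(10))
        self.next(self.end)

    @step
    def end(self):
        # The artifact is loaded back from the datastore automatically.
        print('sum is %d' % sum(self.numbers))

if __name__ == '__main__':
    ArtifactFlow()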

Compute

In local mode, Metaflow executes each step of the flow as a separate local process. For bigger workloads that need resources which might be unavailable on a laptop (think GPUs or hundreds of GBs of RAM), Metaflow integrates with AWS Batch to seamlessly run each step of the flow as one (or many) separate AWS Batch job(s), independently.
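To give a feel for how this looks in code, here is a minimal sketch (HeavyFlow and the resource numbers are made up for illustration): the @resources decorator declares what a step needs, and those requirements are honoured when the flow is run with --with batch.

from metaflow import FlowSpec, step, resources

class HeavyFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # Ignored when running locally, but enforced when the flow is
    # executed with: python heavy_flow.py run --with batch
    @resources(memory=32000, cpu=8, gpu=1)
    @step
    def train(self):
        print('training on a beefy AWS Batch instance')
        self.next(self.end)

    @step
    def end(self):
        print('done')

if __name__ == '__main__':
    HeavyFlow()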

Metadata

To keep track of all flow executions in one centralized place, Metaflow ships with a tiny metadata service. This service is not strictly required: Metaflow can use a local directory to track all executions from your laptop, even if you are using Amazon S3 as the datastore or AWS Batch for compute. At Netflix, all executions are logged in the Metaflow service and all data artifacts are stored in Amazon S3, so that any data scientist can interface with anybody’s work via the client and collaborate fruitfully. A centralized Metaflow service along with a datastore like Amazon S3 makes it easy for data scientists to use hosted notebooks and set up dashboards to monitor their flows.
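For example, once the metadata provider (local or the central service) is configured, the Metaflow Client API lets you browse past executions. Here is a minimal sketch that assumes the ParameterFlow example shown later in this article has already been run at least once:

from metaflow import Flow, namespace

# Switch to the global namespace to see runs from all users,
# not only your own.
namespace(None)

flow = Flow('ParameterFlow')
run = flow.latest_successful_run
print('latest successful run: %s' % run.id)
print('alpha used in that run: %f' % run.data.alpha)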

Notebooks

Netflix is a big fan of notebooks. With the help of Metaflow, users can create custom dashboards to monitor the execution of their Metaflow flows and track how their models behave in a very seamless manner. This can be done on their laptops with a local notebook or in the cloud with a hosted notebook solution, such as SageMaker Notebooks on AWS. For notebooks hosted in the cloud, you should make sure that you have configured the Metaflow service and are using Amazon S3 as the datastore.
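In a notebook cell, such a dashboard can start out as something as simple as listing recent runs and their status; again a small sketch, assuming the ParameterFlow example has been executed a few times:

from itertools import islice

from metaflow import Flow

# Iterating over a Flow yields its runs, newest first.
for run in islice(Flow('ParameterFlow'), 5):
    status = 'successful' if run.successful else 'failed or running'
    print(run.id, status, run.created_at)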

Scheduling

Metaflow helps users develop, prototype, and execute flows from their laptops that can scale easily by leveraging elastic storage and compute capabilities in the cloud. Quite often, these flows need to run autonomously, without any user involvement. Metaflow makes it simple to move flow execution to AWS Step Functions and leverage the full feature set of a production-grade scheduler: high availability, monitoring, reliability, and so on. As a big advantage, with AWS EventBridge users can set up triggers to execute these flows automatically on a schedule.

Deployment type

The easiest way to deploy Metaflow is to use the AWS CloudFormation template, which deploys the whole Metaflow stack and outputs the API URL used for executing jobs.
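As a rough sketch, the deployment boils down to a single CloudFormation call; the stack name and template file name below are placeholders, and the actual template lives in the metaflow-tools repository linked at the end of this article:

aws cloudformation create-stack \
    --stack-name metaflow \
    --template-body file://metaflow-cfn-template.yml \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM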

Executing jobs

Jobs can be executed from your local machine or through a CI/CD tool.

You can download an example from the metaflow-example repository (linked at the end of this article) and test everything.

The repository contains an example job along with an AWS CloudFormation file for the AWS deployment.

First of all, you need to install the metaflow library:

pip install metaflow

Example of a Metaflow job

The most basic type of transition is a linear transition. It moves from one step to the next.

Here is a graph with two linear transitions:

Here is the actual script:

from metaflow import FlowSpec, Parameter, step, schedule

class ParameterFlow(FlowSpec):
    alpha = Parameter('alpha',
                      help='Learning rate',
                      default=0.01)

    @step
    def start(self):
        print('alpha is %f' % self.alpha)
        self.next(self.end)

    @step
    def end(self):
        print('alpha is still %f' % self.alpha)

if __name__ == '__main__':
    ParameterFlow()

Generate a Metaflow config to connect to the AWS Metaflow API:

{
    "METAFLOW_BATCH_CONTAINER_IMAGE": "AWS-ACCOUNT-ID.dkr.ecr.REGION.amazonaws.com/container-name:latest",
    "METAFLOW_BATCH_JOB_QUEUE": "arn:aws:batch:REGION:AWS-ACCOUNT-ID:job-queue/job-queue-metaflow-1",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://s3-bucket-name/metaflow",
    "METAFLOW_DATATOOLS_SYSROOT_S3": "s3://s3-bucket-name/metaflow/data",
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DEFAULT_METADATA": "service",
    "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam::AWS-ACCOUNT-ID:role/metaflow-1-BatchS3TaskRole",
    "METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE": "arn:aws:iam::AWS-ACCOUNT-ID:role/metaflow-1-EventBridgeRole",
    "METAFLOW_SERVICE_INTERNAL_URL": "http://URL-TO-ELB.elb.REGION.amazonaws.com/",
    "METAFLOW_SERVICE_URL": "https://some-name.execute-api.REGION.amazonaws.com/api/",
    "METAFLOW_SFN_DYNAMO_DB_TABLE": "metaflow-1-StepFunctionsState",
    "METAFLOW_SFN_IAM_ROLE": "arn:aws:iam::AWS-ACCOUNT-ID:role/metaflow-1-StepFunctionsRole"
}
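If you prefer not to write this file by hand, Metaflow ships with an interactive helper that prompts for these values and stores them in ~/.metaflowconfig/config.json:

metaflow configure aws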

Executing a job on the AWS stack

python metaflow-example.py --no-pylint run --max-workers 5  --with batch
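Here, --with batch attaches the batch decorator to every step so that each one runs as an AWS Batch job, --max-workers 5 limits how many tasks run in parallel, and --no-pylint skips the pylint check before execution.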

Executing a job locally

python metaflow-example.py run

Example of the job logs

% python metaflow-example.py run
Metaflow 2.2.7 executing ParameterFlow for user:username
Creating local datastore in current directory (~/username/metaflow-example/.metaflow)
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
2021-03-31 15:42:09.982 Workflow starting (run-id 1617194529970328):
2021-03-31 15:42:09.987 [1617194529970328/start/1 (pid 80648)] Task is starting.
2021-03-31 15:42:10.384 [1617194529970328/start/1 (pid 80648)] alpha is 0.010000
2021-03-31 15:42:10.433 [1617194529970328/start/1 (pid 80648)] Task finished successfully.
2021-03-31 15:42:10.438 [1617194529970328/end/2 (pid 80652)] Task is starting.
2021-03-31 15:42:10.793 [1617194529970328/end/2 (pid 80652)] alpha is still 0.010000
2021-03-31 15:42:10.847 [1617194529970328/end/2 (pid 80652)] Task finished successfully.
2021-03-31 15:42:10.847 Done!

Scheduling

How to schedule a job?

By default, a flow on AWS Step Functions does not run automatically. You need to set up a trigger to launch the flow when an event occurs.

Metaflow provides built-in support for triggering Metaflow flows through time-based (cron) triggers. Use a time-based trigger if you want to trigger the workflow at a certain time.

Time-based triggers are implemented at the FlowSpec level using the @schedule decorator.

Here is an example with an hourly schedule:

from metaflow import FlowSpec, Parameter, step, schedule

@schedule(hourly=True)
class ParameterFlow(FlowSpec):
    alpha = Parameter('alpha',
                      help='Learning rate',
                      default=0.01)

    @step
    def start(self):
        print('alpha is %f' % self.alpha)
        self.next(self.end)

    @step
    def end(self):
        print('alpha is still %f' % self.alpha)

if __name__ == '__main__':
    ParameterFlow()

You can define the schedule with @schedule in one of the following ways:

@schedule(weekly=True) runs the workflow on Sundays at midnight.
@schedule(daily=True) runs the workflow every day at midnight.
@schedule(hourly=True) runs the workflow every hour.
@schedule(cron='0 10 * * ? *') runs the workflow at the given cron schedule, in this case at 10am UTC every day. You can use the AWS EventBridge schedule expression rules (linked at the end of this article) to define the schedule for the cron option.

Then you can schedule the job on the Metaflow cluster by executing the command with the step-functions create parameters:

python metaflow-example.py --with retry step-functions create
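This deploys the flow, together with the EventBridge rule defined by @schedule, to AWS Step Functions. If you want to kick off an execution immediately instead of waiting for the next scheduled run, you can trigger it by hand:

python metaflow-example.py step-functions trigger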

Summary

Metaflow is a modern and powerful framework that helps you run ML jobs faster and more easily.

The main advantages are:

  • Deployment and maintenance of the infrastructure using AWS CloudFormation
  • Local job execution
  • Versioning
  • Autoscaling compute cluster
  • Job scheduling
  • Parallelizing your steps over multiple instances
  • Actively developed and supported by Netflix

If you have a data science team and your goal is to improve and optimize its workflows, Metaflow is an excellent library for meeting your business needs and making them a success.

URLs:

https://metaflow.org/
https://docs.metaflow.org/going-to-production-with-metaflow/scheduling-metaflow-flows
https://docs.metaflow.org/metaflow-on-aws/metaflow-on-aws
https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-schedule-expressions.html
https://github.com/Netflix/metaflow
https://github.com/Netflix/metaflow-tools/tree/master/aws/cloudformation
https://github.com/helli0n/metaflow-example
