Machine Learning Observability

Best Practices in Building an Ops-Ready ML Pipeline

Josh Benamram
Databand, an IBM Company
5 min read · Apr 2, 2020


Our company Databand focuses on data engineering observability. We help teams monitor large-scale production data pipelines for ETL and ML use cases. We’re excited to lead a workshop on Machine Learning Observability and MLOps at the upcoming ODSC conference.

By the end of the workshop, you’ll learn how to structure data science workflows for production automation and introduce standard logging for performance measurement. In other words, make the process observable!

This blog describes what we’re covering in the workshop and the earlier webinar. You can find more info on the ODSC sessions and resources at the end of the page. Hope to see you there!

The Challenge

Nine out of every ten data science projects fail to make it into production. The problem is so widely discussed that the statistic has become a cliché. But for the lucky 10%, making it into production is not the end of the story. Once you’re in production, you’re in the business of maintaining your system.

Why is that hard to do? The “Ops” practices for data science and engineering are not yet well defined. On top of that, the Ops professionals who manage production applications are accustomed to a certain way of working and have expectations about the systems they operate. They have tools and best practices for monitoring applications, with measurable performance indicators and testing procedures that give them confidence in deploying services to production. Nothing of the sort exists today to help Ops teams manage machine learning in production.

Our goal in the workshop will be to make a data science workflow “Ops-ready” for production.

What is Production Anyway?

Before we go any further, let’s define what we mean by “production” in the ML context because it’s not always straightforward.

There are usually two related activities in ML production:

  • Running the model — the process of using your model(s) to generate predictions on a live data set. Done as either an online (real-time) or offline (batch) process.
  • Maintaining the model — the process of running your model training workflow on new production data to retrain your model.

Retraining tells you if you need to update your model and is usually a scheduled process that runs on a weekly basis (give or take). Without retraining, models will degrade in performance as your data naturally changes.

For the workshop, we are focusing on maintenance/retraining. Why? First, if your model is not maintainable, all the value of “getting into production” will be short-lived. Second, once you have the right fundamentals and tools in place, maintenance is not difficult to do. So the net gain is high.

The Workshop

During the workshop, we’ll start from a Python model training script in a Jupyter Notebook and transform it into a production retraining pipeline that’s observable, measurable, and manageable from the Ops perspective.

We are going to follow three steps to transform our training code into an observable pipeline.

  1. Functionalize our workflow
  2. Introduce logging
  3. Convert our code to a production pipeline (DAG)

To introduce logging and measurement into our script we’ll use DBND, Databand’s open source library for tracking metadata and operationalizing workflows.

For running the production pipeline, we’ll use Apache Airflow, our team’s go-to scheduler for managing production workflows. Airflow is great at two things: orchestration and scheduling. Orchestration is the ability to run workflows as atomic, interconnected tasks that execute in a particular order; a workflow in Airflow is called a DAG. Scheduling is executing DAGs at particular times. NOTE FOR THE WORKSHOP: We don’t expect attendees to be running Airflow, but we’ll use it from the presenter side to demonstrate our process in production.
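To make orchestration and scheduling concrete, here is a bare-bones Airflow sketch (the DAG id, task names, and commands are illustrative, and the import path assumes a recent Airflow release): the `>>` operator expresses orchestration, and `schedule_interval` handles scheduling.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # import path may differ in older Airflow versions

with DAG(
    dag_id="example_dag",
    start_date=datetime(2020, 4, 1),
    schedule_interval="@daily",  # scheduling: run this DAG once a day
    catchup=False,
) as dag:
    pull_data = BashOperator(task_id="pull_data", bash_command="echo 'pull new data'")
    retrain = BashOperator(task_id="retrain", bash_command="echo 'retrain the model'")

    # orchestration: retrain only runs after pull_data succeeds
    pull_data >> retrain
```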

Functionalizing

The first thing we’ll do to operationalize our workflow is functionalize it, splitting our steps into discrete functions. This makes the script more modular and easier to debug. In production, it will be easier to isolate problems, especially as the workflow grows in complexity over future iterations.
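As a rough sketch of what that can look like (the function and column names are illustrative, not the workshop’s actual code), each step of the notebook becomes its own function:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def prepare_data(path: str) -> pd.DataFrame:
    """Load raw data and apply basic cleaning."""
    return pd.read_csv(path).dropna()


def train_model(data: pd.DataFrame, target: str = "label"):
    """Split the data, fit a model, and return it with a held-out test set."""
    features = data.drop(columns=[target])
    X_train, X_test, y_train, y_test = train_test_split(features, data[target], test_size=0.2)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model, X_test, y_test


def evaluate(model, X_test: pd.DataFrame, y_test) -> float:
    """Compute a simple performance metric on the held-out set."""
    return accuracy_score(y_test, model.predict(X_test))


if __name__ == "__main__":
    cleaned = prepare_data("training_data.csv")
    model, X_test, y_test = train_model(cleaned)
    print(f"accuracy: {evaluate(model, X_test, y_test):.3f}")
```

Each function can now be tested on its own, and a failure in production points to a specific step rather than one long script.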

Logging

Adding logging to our script will enable us to persist metrics and artifacts to an external system, collecting them every time the Python script runs. This is where DBND comes in. DBND will collect and store our metrics in our file system so that we can measure performance in a standardized way.

DBND will track our workflow on three levels:

  • Function inputs and outputs (in our example, DataFrames and the Python model)
  • Data structure and schema
  • User-defined performance metrics

Using these artifacts and metrics will make the workflow Ops-ready by enabling us to reproduce any run, maintain a historical record of performance, and make sure results are consistent across the stages of the development lifecycle (research, testing, production). We’ll be able to introduce standards that Ops can use going forward to monitor the process for issues.

To introduce tracking and logging, all we need to do is annotate our functions with DBND decorators and define our metrics with DBND’s logging API.
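As a hedged sketch based on the open source dbnd package’s decorator and logging API (exact names and signatures may differ by version), functions like the ones sketched earlier might be tracked like this:

```python
import pandas as pd
from dbnd import task, log_metric, log_dataframe
from sklearn.metrics import accuracy_score


@task
def prepare_data(path: str) -> pd.DataFrame:
    """The @task decorator lets DBND track the function's inputs, outputs, and their schemas."""
    data = pd.read_csv(path).dropna()
    log_dataframe("prepared_data", data)  # data structure and schema
    return data


@task
def evaluate(model, X_test: pd.DataFrame, y_test) -> float:
    accuracy = accuracy_score(y_test, model.predict(X_test))
    log_metric("accuracy", accuracy)  # user-defined performance metric
    return accuracy
```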

In research, we can visualize the metrics as a report directly in our Jupyter Notebook. In the workshop, we’ll show more operations-oriented tools for observing metrics and performance in production.

Operationalizing

Our last step is transforming the workflow into a pipeline that we can run on Airflow as a scheduled DAG. Because we’re already using the DBND library, all we need to do is add a DAG definition that defines our functions as tasks and set the CRON schedule for the pipeline. With the DAG definition in place, each of our decorated functions will run as an Airflow task. As the pipeline runs on its schedule, DBND will continue to track inputs/outputs, dataset information, and logged metrics, and store them in our file system.
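Here is a hedged sketch of what that DAG definition might look like, using plain PythonOperators rather than DBND’s own Airflow integration; the module name and file paths are hypothetical, and `prepare_data`/`train_model` refer to the decorated functions from the earlier sketches.

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator  # import path may differ in older Airflow versions

# Hypothetical module containing the DBND-decorated functions sketched above.
from retraining_pipeline import prepare_data, train_model


def prepare_step():
    # DBND tracking fires inside the decorated call; hand the result to the next task via a file.
    prepare_data("production_data.csv").to_csv("/tmp/prepared.csv", index=False)


def train_step():
    train_model(pd.read_csv("/tmp/prepared.csv"))


with DAG(
    dag_id="model_retraining",
    start_date=datetime(2020, 4, 1),
    schedule_interval="0 6 * * 1",  # CRON schedule: retrain every Monday at 06:00
    catchup=False,
) as dag:
    prepare = PythonOperator(task_id="prepare_data", python_callable=prepare_step)
    train = PythonOperator(task_id="train_model", python_callable=train_step)

    prepare >> train
```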

Wrapping Up

At the end of the workshop, we’ll have transformed a model training workflow built in a notebook into a retraining DAG that runs on a regular schedule. We’ll have introduced standard logging and tracking that make the process reproducible, testable, and measurable. With this infrastructure, data scientists will be more productive in pushing research to market, and Ops will feel confident that the production ML system is maintainable.

What’s next? Rinse and repeat!

Resources

Here is a link to our repo for the workshop.

The ODSC webinar is scheduled for 1:00pm EST on April 7th. Here is the link for signup: https://register.gotowebinar.com/register/5659030622578494477

The interactive workshop itself will be held on April 15th at 9:30am EST. You can visit the ODSC website for more info on how to register.
