MLOps Lab #3 : Continuous Delivery of a TensorFlow Model on Red Hat® OpenShift (OKD) with SAS® Model Manager and SAS® Workflow Manager

Ivan Nardini
MLOps.community
Published in
9 min read · Oct 27, 2020

Premise: This article summarizes a proof of concept I’ve been working on over the last month. It involves several concepts and technologies. Depending on your interest, I would consider a virtual session to answer all your questions, so feel free to leave comments.

Figure 1. ModelOps Process, Environments and Tools

This is the third article in my MLOps Lab series.

But, compared to the previous ones,

  1. MLOps Lab #1 : Batch scoring with Mlflow Model (Mleap flavor) on Google Cloud Platform
  2. MLOps Lab #2 : Deploy a Recommendation System as Hosted Interactive Web Service on AWS

this time it’s a bit different. At least for three reasons.

First, I feel more involved: it lets me show what I do at work every day, and what it means to be a passionate person who develops analytics applications with open source at SAS.

Second, because I worked on it with my partner in crime, Artjom Glazkov, it’s a real example of how collaboration is one of the most powerful values at the company.

Last but not least, it’s my first article for the MLOps Community. Thanks, Demetrios Brinkmann.

That said, here is the Table of Contents:

  1. The Scenario
  2. The Project: Business Case, Process, Environments and Tools
  3. SAS® for Continuous Delivery Machine Learning Models
    3.1 SAS® Model Manager as Model Registry
    3.2 SAS® Workflow Manager for Automation
  4. Final Considerations
  5. Summary
  6. References

So, let’s jump into the scenario.

1. The Scenario

Since I started working on ModelOps, customers have been asking for help integrating their machine learning environments.

As far as I’m concerned, ModelOps should solve machine learning system integration challenges!

No matter where they run (on-premises or in the cloud) or what technologies are involved (free and open source or proprietary software), the conversation goes something like this:

Assume that I orchestrate Model Training at scale with <Orchestrator> in <Training environment>…

Now Model Deployment is in <Production environment>…

But, I do need a Model Governance framework to manage the entire model lifecycle automatically between <Training environment> and <Production environment>…

In particular, in this scenario the customer says:

…But I do need a Model Registry to version TensorFlow models and deploy the associated serving Docker images on OpenShift. And once they are in production, I want to monitor them as well.

So the question:

Figure 2. Customer Question

Let’s see if we can do that ;)

2. The Project: Business Case, Process, Environments and Tools

For POC purposes, I consider a credit scoring business application in which the consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. The model is based on HMEQ data collected from recent applicants granted credit through the current loan underwriting process. The dataset consists of 12 predictor (input) variables and a response (target) variable, BAD, which indicates whether an applicant defaulted.
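To fix ideas, here is a minimal sketch of loading the data (the file path is an assumption; the column split follows the dataset description above):

```python
import pandas as pd

# Load the HMEQ sample data (path is an assumption for this sketch)
hmeq = pd.read_csv("data/hmeq.csv")

# BAD is the binary target: 1 = applicant defaulted, 0 = loan repaid
target = "BAD"
predictors = [c for c in hmeq.columns if c != target]  # the 12 input variables

X, y = hmeq[predictors], hmeq[target]
```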

Below is the high-level architecture of the solution I propose.

Figure 3. Solution Architecture

Then, I assume the following:

  1. A data scientist runs TensorFlow model experiments in the development environment and tracks them using MLflow.
  2. He/she registers the champion candidate in SAS® Model Manager with the SAS pzmm and sasctl libraries. The champion model goes through a validation process. If it passes, SAS® Workflow Manager deploys it on Red Hat OpenShift (OKD), using Google’s TensorFlow Serving image, in an OKD project previously created by the IT cluster admin.
  3. For the demo, IT deploys a client application stack to simulate scoring requests. It includes a dedicated sidecar container that pushes logs directly to a backend; logs are stored in a PostgreSQL database.
  4. The logs are consumed by a performance monitoring service that sends a notification if the model underperforms.
  5. Time passes and the model starts underperforming, so SAS® Workflow Manager triggers automated retraining based on the field data and sends a message in Microsoft Teams.
  6. The data scientist receives the notification and starts a new training process.

Now that you know more about the project, we can dive into the role of SAS® Model Manager and SAS® Workflow Manager.

3. SAS® for Continuous Delivery Machine Learning Models

In a famous blog article, Martin Fowler states that

A Continuous Delivery orchestration tool…governs how models and applications are deployed to production

At SAS, we have two guys that do that job

  • SAS® Model Manager is a ModelOps platform to register, validate, deploy to production, monitor, and retrain your models.
  • SAS® Workflow Manager is the ModelOps orchestrator. It provides task automation (for example, sending email notifications or executing specific jobs) through workflow definitions, which can represent both directed acyclic graphs (DAGs) and cyclic graphs. This matters for making parts of the process repeatable (such as constantly reviewing model performance).

The SAS Persuaders for CD4ML!

And, because of their integration, they cover the end-to-end Continuous Delivery for Machine Learning (CD4ML) process described by Fowler.

3.1 SAS® Model Manager as Model Registry

In our scenario, because the TensorFlow model was previously validated, SAS® Model Manager acts simply as a model registry.

To version the champion model, I create a zip package of the model with the minimum requirements (model variables and model properties) using the SAS® pzmm module, and then I register it with SAS® sasctl, a package that enables easy communication between the SAS® Viya platform and a Python runtime.

Here is the register function I coded.
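Conceptually, it looks like the minimal sketch below, with placeholder host and credentials (the actual function is in the project repository):

```python
from pathlib import Path

import sasctl.pzmm as pzmm
from sasctl import Session
from sasctl.services import model_repository as mr

def register_model(model_dir, model_name, project, host, user, password):
    """Zip the model artifacts and register them in SAS Model Manager."""
    # Bundle the artifacts written earlier to model_dir (variables and
    # properties JSON files plus the model files) into <model_name>.zip
    pzmm.ZipModel.zipFiles(Path(model_dir), model_name)
    zip_path = Path(model_dir) / f"{model_name}.zip"

    # Import the zip into the target project over an authenticated session
    with Session(host, user, password, verify_ssl=False):
        with open(zip_path, "rb") as f:
            mr.import_model_from_zip(model_name, project, f)

# Hypothetical usage:
# register_model("build/model", "Tensorflow_BoostedTreesClassifier",
#                "sas_modelops_tensorflow_openshift",
#                "viya.example.com", "sasdemo", "*****")
```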

Below is the Tensorflow_BoostedTreesClassifier model, versioned in the sas_modelops_tensorflow_openshift project.

Figure 4. Model Registration

3.2 SAS® Workflow Manager for Automation

Once the model is versioned, the rest of the demo is “orchestrated” by SAS® Workflow Manager.

In fact, for continuous delivery, we need to automate a process that

  1. Allows a user to validate the model as champion.
  2. Builds a serving image with the validated champion model, using Google’s TensorFlow Serving base image.
  3. Deploys the image by registering it in Red Hat OpenShift (OKD)’s Docker registry.
  4. Monitors the model in production.
  5. Retrains the model if it starts underperforming.

And, of course, it sends email and Microsoft Teams notifications to the data scientists and IT people at each step.
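For instance, a Teams notification boils down to a REST call to an incoming webhook on the target channel, which a service task can invoke; here is a sketch with a placeholder webhook URL:

```python
import requests

# Incoming webhook created on the target Teams channel (placeholder URL)
WEBHOOK_URL = "https://outlook.office.com/webhook/<your-webhook-id>"

def notify_teams(message: str) -> None:
    """Post a plain-text message to the Microsoft Teams channel."""
    response = requests.post(WEBHOOK_URL, json={"text": message})
    response.raise_for_status()

notify_teams("Champion model deployed on OKD: Tensorflow_BoostedTreesClassifier")
```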

Below you can see the workflow definition we built to cover all these steps.

Figure 5. The Workflow Definition

Just to give you some context, the workflow is made up of:

  • sequence flows (the arrows) between workflow diagram elements, which indicate the order in which tasks are executed.
  • processes, which are collections of activities designed to produce a specific output for a particular objective, potentially involving both human interactions (user tasks, a gray box with a little man) and system interactions (service tasks, a gray box with a gear). In particular, service tasks invoke external actions such as REST web services and job executions.
  • subprocesses (yellow boxes), which are compound activities or workflows that help deal with complexity.

All of them are ultimately controlled by

  • gateways (a diamond with an X), which control the execution path through an instance. In our case, we use an exclusive gateway for “if-then-else” logic.

For each project, you can start an instance of a workflow definition, which executes each task.

Now assume that we start the workflow above in our project.

Because I don’t want this article to be tedious, let me focus on its core components.

Those are:

  1. Pre-Build TF Serving Image task
  2. Build TF Serving Image task
  3. Deploy TF Serving on Openshift (OKD) task
  4. Production stage subprocess

STEP 1: Pre-Build TF Serving Image task

The “Pre-Build TF Serving Image” task uses a SAS job to execute 0_prebuild.sh, which is paired with the prebuild.py custom package. The package downloads the model artifact to the server using the ModelRepository REST service and an environment.yaml configuration file.

Below you can find the SAS job that wraps 0_prebuild.sh.
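On the Python side, here is a minimal sketch of what prebuild.py does, under my assumptions about the environment.yaml keys (the actual package is in the project repository):

```python
import yaml
from sasctl import Session, get
from sasctl.services import model_repository as mr

# environment.yaml carries connection and model settings;
# the keys below are assumptions for this sketch
with open("environment.yaml") as f:
    cfg = yaml.safe_load(f)

with Session(cfg["host"], cfg["user"], cfg["password"], verify_ssl=False):
    # Look up the champion model registered earlier
    model = mr.get_model(cfg["model_name"])

    # Download every file attached to the model, e.g. the zipped
    # TensorFlow SavedModel that will be baked into the serving image
    for item in mr.get_model_contents(model):
        data = get(
            f"/modelRepository/models/{model.id}/contents/{item.id}/content",
            format="content",  # raw bytes
        )
        with open(f"{cfg['output_dir']}/{item.name}", "wb") as out:
            out.write(data)
```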

And this is a view of the “Pre-Build TF Serving Image” service task in SAS® Workflow Designer.

Figure 6. Inside Prebuild service task

As you might have guessed, all service tasks exploit SAS® Viya’s ability to call OS executables from SAS job code (the XCMD property).

So I won’t spell it out again below.

STEP 2: Build TF Serving Image task

The “Build TF Serving Image” task executes 1_build.sh, which builds the image in the local registry based on the model artifact name and a temporary TensorFlow Serving container.

Here is the content of the 1_build.sh script.
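In essence, it follows the standard TensorFlow Serving recipe: start a temporary container from Google’s base image, copy the SavedModel in, and commit the result. A Python transcription of that recipe, with assumed image and registry names, looks like this:

```python
import subprocess

def sh(cmd):
    """Run a shell command and fail fast, like `set -e` in the script."""
    subprocess.run(cmd, shell=True, check=True)

MODEL = "tensorflow_boostedtreesclassifier"  # model artifact name (assumption)
IMAGE = f"localhost:5000/{MODEL}:latest"     # local registry tag (assumption)

# Start a temporary container from Google's TensorFlow Serving base image
sh("docker run -d --name serving_base tensorflow/serving")

# Copy the downloaded SavedModel into the container's model directory
sh(f"docker cp ./model/{MODEL} serving_base:/models/{MODEL}")

# Commit the container as a new image, pointing TF Serving at our model
sh(f'docker commit --change "ENV MODEL_NAME {MODEL}" serving_base {IMAGE}')

# Remove the temporary container and push the image to the local registry
sh("docker rm -f serving_base")
sh(f"docker push {IMAGE}")
```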

STEP 3: Deploy TF Serving on Openshift (OKD) task

The “Deploy TF Serving on Openshift (OKD)” task executes 2_deploy.sh, which remotely pushes the new champion model serving image to the OpenShift Container Platform.

Here is an example of such a registration.
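It boils down to authenticating against the OKD integrated registry with an OpenShift token and pushing the tagged image. A sketch, with placeholder registry route, project, and token:

```python
import subprocess

def sh(cmd):
    """Run a shell command and fail fast, like `set -e` in the script."""
    subprocess.run(cmd, shell=True, check=True)

# Placeholders: the exposed OKD registry route, the target project created
# by the cluster admin, and a token obtained with `oc whoami -t`
REGISTRY = "default-route-openshift-image-registry.apps.example.com"
PROJECT = "sas-modelops"
MODEL = "tensorflow_boostedtreesclassifier"
TOKEN = "<oc-token>"

# Authenticate Docker against the OKD integrated registry
sh(f"docker login -u unused -p {TOKEN} {REGISTRY}")

# Tag the locally built serving image for the target project's image stream
sh(f"docker tag localhost:5000/{MODEL}:latest {REGISTRY}/{PROJECT}/{MODEL}:latest")

# Push it: OKD updates the image stream and the deployment rolls the new image
sh(f"docker push {REGISTRY}/{PROJECT}/{MODEL}:latest")
```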

Below are the notification I receive on Teams and the image registered in OKD’s Docker registry.

Figure 7. An example of MSTeams Notification

So,

The TensorFlow model is in production on OKD!

Thanks, SAS Workflow Manager =)

STEP 4: Production stage subprocess

Now the model is in production. It serves scoring requests, and the logs are stored in a PostgreSQL container on OKD.

Figure 8. Scoring the TF model.
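For the curious, a client scoring request against TensorFlow Serving’s REST classification API might look like the sketch below (the route host, model name, and feature values are illustrative):

```python
import requests

# TF Serving endpoint exposed by an OKD route (host and model name assumed)
URL = ("http://tf-serving-route.apps.example.com"
       "/v1/models/tensorflow_boostedtreesclassifier:classify")

# One HMEQ applicant, using the dataset's 12 input variables
payload = {
    "examples": [{
        "LOAN": 1100, "MORTDUE": 25860.0, "VALUE": 39025.0,
        "REASON": "HomeImp", "JOB": "Other", "YOJ": 10.5,
        "DEROG": 0, "DELINQ": 0, "CLAGE": 94.37,
        "NINQ": 1, "CLNO": 9, "DEBTINC": 34.8,
    }]
}

response = requests.post(URL, json=payload)
response.raise_for_status()
print(response.json())  # class probabilities for BAD = 0 / 1
```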

We should be happy, right?

Hell, NO! =)

Indeed, the customer challenges us, asking whether SAS can monitor the model and trigger retraining if needed.

It’s time to monitor the model and retrain if needed…

This sounds a bit complex, right?

But fortunately for us, we have subprocesses, which are useful when you have to deal with complexity in a workflow.

In our case, we define a subprocess to handle the production stage operations you can see below.

Figure 9. Production stage subprocess

In particular,

  1. It runs a model performance monitoring job.
  2. If the job succeeds, it stores the value of one particular statistic (KS in this case).
  3. Finally, if the KS value is under a minimum threshold (0.45 in the example), the model is underperforming: a notification is sent and model retraining is automatically triggered (see the sketch after this list).
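Here is that KS check as a minimal sketch (the data access and notification wiring are omitted; KS is the maximum separation between the TPR and FPR curves):

```python
import numpy as np
from sklearn.metrics import roc_curve

KS_THRESHOLD = 0.45  # minimum acceptable KS, as in the workflow example

def ks_statistic(y_true, y_score):
    """KS = maximum separation between the TPR and FPR curves."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.max(tpr - fpr))

def model_is_underperforming(y_true, y_score):
    """True when the monitored KS falls below the workflow threshold."""
    return ks_statistic(y_true, y_score) < KS_THRESHOLD
```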

Of course, the “run retraining” task trains the model on the field data collected in the PostgreSQL database for performance monitoring, and registers a new version of the model once retraining ends.

Here is an example of the performance monitoring dashboard we get.

Figure 10. An example of Monitoring

Below is the build_train_pipeline function of train.py
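In spirit, it looks like this sketch built on the TF 2.x estimator API (the column vocabularies follow the HMEQ dataset; missing-value handling and tuning are omitted, and the real function lives in the project repository):

```python
import tensorflow as tf

NUMERIC = ["LOAN", "MORTDUE", "VALUE", "YOJ", "DEROG", "DELINQ",
           "CLAGE", "NINQ", "CLNO", "DEBTINC"]
CATEGORICAL = {"REASON": ["HomeImp", "DebtCon"],
               "JOB": ["Other", "Office", "Sales", "Mgr", "ProfExe", "Self"]}

def build_train_pipeline(train_df, target="BAD", n_trees=100):
    """Build feature columns and train a BoostedTreesClassifier on HMEQ data.

    Assumes missing values in train_df have already been imputed.
    """
    feature_columns = [tf.feature_column.numeric_column(c) for c in NUMERIC]
    for name, vocab in CATEGORICAL.items():
        cat = tf.feature_column.categorical_column_with_vocabulary_list(name, vocab)
        feature_columns.append(tf.feature_column.indicator_column(cat))

    n_examples = len(train_df)

    def input_fn():
        # Boosted trees train layer by layer on in-memory batches,
        # so feed the whole training set as a single repeated batch
        features = dict(train_df[NUMERIC + list(CATEGORICAL)])
        dataset = tf.data.Dataset.from_tensor_slices((features, train_df[target]))
        return dataset.shuffle(n_examples).repeat().batch(n_examples)

    estimator = tf.estimator.BoostedTreesClassifier(
        feature_columns, n_batches_per_layer=1, n_trees=n_trees, max_depth=6)
    estimator.train(input_fn, max_steps=100)
    return estimator
```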

and here is some code from register.py.
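In essence, it bumps the model version in SAS® Model Manager after retraining; a possible sketch with sasctl (host, credentials, and artifact names are placeholders):

```python
from sasctl import Session
from sasctl.services import model_repository as mr

# Placeholder connection details; in the project these come from environment.yaml
with Session("viya.example.com", "sasdemo", "*****", verify_ssl=False):
    model = mr.get_model("Tensorflow_BoostedTreesClassifier")

    # Create a new version of the champion model...
    mr.create_model_version(model)

    # ...and attach the retrained artifact to it
    with open("build/model.zip", "rb") as f:
        mr.add_model_content(model, f, name="model.zip")
```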

Both are executed by the “run retraining” task through the associated job.

In the end, because our model underperforms, retraining is triggered and the new version of the model is registered in SAS® Model Manager. And of course, the production stage ends successfully and a new model life cycle can start again!

Figure 11. Production stage ends successfully

4. Final Considerations

Honestly, I’m out of final considerations this time, so I’ll jump straight to the summary.

Just let me say one thing

What a hell of a project!

5. Summary

I started the article with the customer’s question:

Can SAS operationalize TensorFlow models on OpenShift?

Well, the answer is

Figure 12. Me and Artem. A great squad.

Indeed, I showed how the integration between SAS® Model Manager and SAS® Workflow Manager allows us to cover the end-to-end Continuous Delivery for Machine Learning process.

As always, I personally learned a lot of new things:

  • How to train a TensorFlow model using the TF framework
  • How to manage an open source project with colleagues using Git, Trello, and Teams
  • The different ways SAS can integrate with third-party systems

and I could go on and on…

For now, you can find the project repository on GitHub.

And, as I mentioned at the beginning, we’re considering a virtual session to answer all your questions.

So if you’re interested, clap for this article or leave a comment.

Enjoy, and feel free to reach out to me on LinkedIn.

6. References

  1. https://documentation.sas.com/?docsetId=mdlmgrug&docsetTarget=titlepage.htm&docsetVersion=15.3&locale=en
  2. https://documentation.sas.com/?cdcId=wfscdc&cdcVersion=2.3&docsetId=wfswn&docsetTarget=n1p5gi4d815tr7n1r3zeto7jqcfw.htm&locale=en
  3. https://access.redhat.com/documentation/en-us/red_hat_container_development_kit/3.0/html-single/getting_started_guide/index#minishift-delete
  4. https://www.openshift.com/blog/remotely-push-pull-container-images-openshift
  5. https://communities.sas.com/t5/SAS-Communities-Library/SAS-Viya-3-5-SAS-Studio-and-SAS-Compute-Server-non-functional/ta-p/616617
  6. https://learn.openshift.com/
  7. https://github.com/derekmahar/docker-compose-wait-for-file/tree/master/ubuntu-wait-for-file
  8. https://blogs.sas.com/content/sasdummy/2019/09/05/sas-microsoft-teams/
