Automating Your ML Models Like a Pro Using Airflow, SAS Viya, & Docker

Sathish Gangichetty
Published in The Startup
Aug 2, 2020 · 9 min read

Ok, here’s a scenario: You’re the lone data scientist/ML engineer in your consumer-focused unicorn startup, and you have to build a bunch of models for a variety of different business use cases. You don’t have time to sit around and sulk about the nitty-gritty details of any one model. So you’ve got choices to make. Decisions. Decisions that make you move fast, learn faster, and yet build for resilience, all while gaining a unique appreciation for walking the talk. If you do this right (even partly), you end up becoming walking gold for your company. A unicorn at a unicorn 😃. Why? Because you put the customer feedback you observed through their data trail back to work for your company, instead of letting it rot in the dark rooms of untapped logs and data dungeons (a.k.a. databases). These micro-decisions you enable matter. They eventually add up to push your company beyond the inflection point that is needed for exponential growth.

So, that is where we start from. And build. We’ll assume we can choose tech that simplifies everything for us, yet lets us automate all we want. When in doubt, we’ll simplify and remove until the effort is justified by the outcome, to avoid over-engineering stuff. That is exactly what I’ve done for us here, so we don’t get stuck in analysis/choice paralysis.

When presented with a choice, always pick the balance between realities and possibilities

Note: everything we use here is assumed to be running on Docker unless mentioned otherwise. With that in mind, we’ll use …

  • Apache Airflow for orchestrating our workflow: Airflow has quickly become the de-facto standard for authoring, scheduling, monitoring, and managing workflows — especially in data pipelines. We know that today at least 350 companies in the broader tech industry use Airflow, along with a variety of executors and operators, including Kubernetes and Docker.
  • The usual suspects in the Python ecosystem: for glue code, data engineering, etc. The one notable addition is vaex, for processing large parquet files quickly and doing some data prep work.
  • Viya in a container & Viya as an Enterprise Analytics Platform (EAP): SAS Viya is an exciting technology platform that can be used to quickly build business-focused capabilities on top of the foundational analytical and AI models that SAS produces. We’ll use two flavors of SAS Viya — one as a container for building and running our models, and another running on virtual machine(s) that acts as the enterprise analytics platform the rest of our organization uses to perform analytics, consume reports, track and monitor models, etc. For our specific use case, we’ll use the SAS platform’s autoML capabilities via the DataSciencePilot action set so that we can go full auto-mode on our problem.
  • SAS Model Manager to inventory, track, & deploy models: This is the model management component on the Viya Enterprise Analytics Platform that we’ll use to eventually push the model out into the wild for scoring.

Now that we’ve lined up all the basic building blocks, let’s address the business problem: We’re required to build a churn detection service so that our fictitious unicorn can detect potential churners and follow up with some remedial course of action to keep them engaged, instead of trying to reactivate them after the window of opportunity lapses. Because we plan to use Viya’s DataSciencePilot action set for training our model, we can simply prep the data and pass it off to the dsautoml action which, as it turns out, is just a regular method call using the python-swat package. If you have access to Viya, you should try this out if you haven’t already.
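For context, here is roughly what that hand-off looks like from a notebook before the dsautoml call. This is a minimal sketch, assuming the Viya/CAS container is reachable locally; the host, port, credentials, file name, and ID column below are placeholders, not the exact values from my setup.

import swat

# Connect to the CAS server running inside the Viya container
# (host, port, and credentials here are placeholders).
sess = swat.CAS('localhost', 5570, 'username', 'password')

# Load the action set that carries the autoML capabilities.
sess.loadactionset('dataSciencePilot')

# Upload the prepped churn table into CAS memory.
out = sess.upload_file('abt_churn.csv',
                       casout=dict(name='ABT_CHURN', replace=True))

# Candidate inputs: everything except the target and the (assumed) ID column.
effect_vars = [c for c in out.columns if c not in ('CHURN', 'CUSTOMER_ID')]

With sess, out, and effect_vars in hand, the dsautoml call shown further below runs as-is.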

Also, if you didn’t pick up on it yet, we’re trying to automate everything, including the (re-)training process for models developed with autoML. We want to do this at a particular cadence, as we fully expect to create fresh models whenever possible to keep up with the changing data. So, automating autoML. Like Inception…😎

Anyway, remember: You’re the lone warrior in the effort to spawn artificial intelligence & release it into the clan of back-office services that attack emerging customer data and provide relevant micro-decisions. So there’s not much time to waste. Let’s start.

We’ll use a little Makefile to start our containers (see below) — it just runs a small script that starts up the containers by setting the right params and flags when ‘docker run’ is called. Nothing extraordinary, but it gets the job done.

Start the containers for model development

Now, just like that, we’ve got our containers live and kicking. Once we have our notebook environment, we can call autoML via the dsautoml action after loading our data. The syntactic specifics of this action are available here. Very quickly, a sample method call looks like this:

# sess is the session context for the CAS session
sess.datasciencepilot.dsautoml(
    table = out,
    target = "CHURN",
    inputs = effect_vars,
    transformationPolicy = {"missing": True, "cardinality": True,
                            "entropy": True, "iqv": True,
                            "skewness": True, "kurtosis": True, "outlier": True},
    modelTypes = ["decisionTree", "GRADBOOST"],
    objective = "AUC",
    sampleSize = 20,
    topKPipelines = 10,
    kFolds = 2,
    transformationout = dict(name="TRANSFORMATION_OUT", replace=True),
    featureout = dict(name="FEATURE_OUT", replace=True),
    pipelineout = dict(name="PIPELINE_OUT", replace=True),
    savestate = dict(modelNamePrefix='churn_model', replace=True))

I’ve placed the entire notebook in this repo for you to take a look at, so worry not! This particular post isn’t about the specifics of dsautoml. If you’re looking for a good intro to autoML, you can head over here. You’ll be convinced.

As you will see, SAS DataSciencePilot (autoML) provides fully automated capabilities spanning multiple actions, including automatic feature generation via the feature machine, which auto-resolves the transformations needed and then uses those features to construct multiple pipelines in a full-on leaderboard challenge. Additionally, the dsautoml method call produces two binary files: one capturing the feature transformations that were performed, and another for the top model. This means we get the score code for the champion and the feature transformations, so that we can deploy them easily into production. This is VERY important. In a commercial use case such as this one, model deployment is more important than development.
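As a small aside, pulling that leaderboard back to the client for reporting takes only a couple of lines with python-swat. A minimal sketch, assuming the PIPELINE_OUT table named in the call above and a throwaway local path:

# Fetch the pipeline leaderboard produced by dsautoml as a pandas-style frame.
leaderboard = sess.CASTable('PIPELINE_OUT').to_frame()

# Persist it locally so it can be attached to the stakeholder email later on.
leaderboard.to_csv('/tmp/churn_automl_leaderboard.csv', index=False)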

If your models don’t get deployed, even the best of them perish doing nothing. And when that is the deal, even a 1-year-old will pick something over nothing.

What your response SHOULDN’T be to “how many ML models do you actually deploy?”

This mandates that we always choose tools and techniques that meet the ask, and that potentially increase the range of deployable options while avoiding re-work. In other words, your tool and model should be able to meet the acceptable scoring SLA of the workload for the business case. And you should know this before you write a single line of code. If this doesn’t happen, then any code we write is wasteful and serves no purpose other than satisfying personal fancies.

So, now that we have a way to automatically train these models on our data, let’s get this autoML process deployed for automatic retraining. This is where Airflow will help us immensely. Why? When we hand off “retraining” to production, a bunch of new requirements pop up, such as:

  • Error handling — How many times to retry? What happens if there is a failure?
  • Quick and easy access to consolidated logs
  • Task Status Tracking
  • Ability to re-process historic data due to upstream changes
  • Execution Dependencies on other processes: For example, process Y needs to run after process X, but what if X does not finish on time?
  • Tracing Changes in the Automation Process Definition Files

Airflow handles all of the above elegantly. And not just that! We can quickly set up Airflow in containers and run it with docker-compose using this repo. Obviously, you can edit the Dockerfile or the compose file as you see fit. Once again, I’ve edited these files to suit my needs and dropped them in this repo so you can follow along if you need to. At this point, when you run docker-compose, you should see Postgres and the Airflow webserver running.
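To make that concrete: most of the requirements above (retries, failure alerts, a weekly cadence, backfills over historic data) reduce to a handful of arguments on the DAG itself. The sketch below is illustrative only; the owner, email address, dag_id, and dates are made up for this example.

from datetime import datetime, timedelta
from airflow import DAG

# Defaults applied to every task: retry twice, wait five minutes between
# attempts, and email the owner if a task still fails.
default_args = {
    'owner': 'lone-ml-engineer',            # placeholder owner
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email': ['ds-team@example.com'],       # placeholder address
    'email_on_failure': True,
}

# Weekly cadence; flip catchup to True if you want Airflow to backfill
# DAG runs for historic weeks.
dag = DAG(
    dag_id='churn_automl_retrain',
    default_args=default_args,
    start_date=datetime(2020, 7, 1),
    schedule_interval='@weekly',
    catchup=False,
)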

Next, let’s look at the Directed Acyclic Graph (DAG) we’ll use to automatically rebuild this churn detection model weekly. Don’t worry, this DAG is also provided in the same repo.

ML DAG set up to run weekly

Now, we’ll click into the graph view and understand what the DAG is trying to accomplish step by step (a rough sketch of how these steps wire up as tasks follows the list).

Airflow DAG for automating autoML and registering models to SAS Model Manager
  1. We start by checking if a particular file exists. In our case, we start with a parquet file that we expect to see in a specific directory every week before we begin the process
  2. Then we check for the readiness of the CAS container (viya container)
  3. Following this, we prepare the analytics base table using vaex to quickly add a couple of additional columns to our file
  4. We then summon autoML and make DataSciencePilot figure out what the best model is for the data
  5. Next, we save all the relevant model artifacts, including a summary of the top 10 pipelines generated, so we get a view of the leaderboard. We also check the readiness of our Viya Enterprise Analytics Platform.
  6. Now, we send an email out to our data science stakeholders with the leaderboard as an attachment to keep them informed about the automated run
  7. Finally, we register these new champion model artifacts to SAS Model Manager, which is a part of our Viya Enterprise Analytics Platform.
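For a feel of how those seven steps translate into tasks, here is a stripped-down sketch of the wiring, reusing the dag object from the earlier sketch. The operator choices, file paths, and callables are illustrative stand-ins (steps 2 and 7 are omitted); this is not the exact contents of the repo’s DAG.

import vaex
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.python_operator import PythonOperator
from airflow.operators.email_operator import EmailOperator

def prep_abt(**_):
    # Step 3: use vaex to add a couple of derived columns and write the ABT back out.
    df = vaex.open('/data/churn_week.parquet')        # placeholder path
    df['tenure_years'] = df.tenure_days / 365.0       # illustrative derived column
    df.export_parquet('/data/churn_abt.parquet')

def run_automl(**_):
    # Step 4: connect to the CAS container and fire the dsautoml call shown earlier,
    # then save the leaderboard and model artifacts (step 5).
    ...

with dag:
    # Step 1: wait for this week's parquet drop.
    wait_for_file = FileSensor(task_id='wait_for_parquet',
                               filepath='/data/churn_week.parquet')

    # Steps 3-5: prep the ABT, train via autoML, save artifacts and leaderboard.
    prep = PythonOperator(task_id='prep_abt', python_callable=prep_abt)
    train = PythonOperator(task_id='run_automl', python_callable=run_automl)

    # Step 6: mail the leaderboard to the data science stakeholders.
    notify = EmailOperator(task_id='email_leaderboard',
                           to='ds-team@example.com',   # placeholder
                           subject='Weekly churn autoML leaderboard',
                           html_content='Leaderboard attached.',
                           files=['/tmp/churn_automl_leaderboard.csv'])

    wait_for_file >> prep >> train >> notify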

And that’s it! Our process is ready to be put to the test!

A sample post-run Gantt chart

When the process finishes successfully, all the tasks should report success, and the Gantt chart view in Airflow should resolve to something that looks like the one above (execution times will obviously be different). And just like that, we’ve gotten incredibly close to the finish line.

We’ve just automated our entire training process, including saving our models for deployment and sending emails out whenever the DAG runs! If you look back, our original goal was to deploy these models as consumable services. We could’ve easily automated that part as well, but our choice of technology (SAS Model Manager in this case) allows us to add additional touch points if we so desire. It normally makes sense to have a human-in-the-middle “push button” step before engaging in model publish activities, because it builds in a buffer if upstream processes go wonky for reasons like crappy data, sudden changes in baseline distributions, etc. More importantly, pushing models to production should actively bring conscious human mindfulness to the activity. Surely, we wouldn’t want an out-of-sight process impacting the business wildly. This ‘human-in-the-middle’ step also significantly reduces the need to engage in post-hoc explanations, as backtesting comes to the fore.

Ok, let’s see how all of this works real quick:

Deploying our autoML model

Notice that SAS Model Manager is able to take the model artifacts and publish them as a module in a micro analytic service, where models can be consumed using scoring endpoints. And just like that, you’re able to flip the switch on your models and make them respond to requests for inference.
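To give a flavor of what consumption looks like, a published module is typically hit over REST. The sketch below is purely illustrative: the host, token, module name, input fields, and even the exact endpoint path are assumptions for this example rather than anything generated by this particular model.

import requests

# Placeholder values: your Viya host, an OAuth access token, and the published module name.
VIYA_HOST = 'https://viya.example.com'
TOKEN = '<oauth-access-token>'
MODULE = 'churn_model'

# One name/value pair per model input; the field names here are made up.
payload = {
    'inputs': [
        {'name': 'tenure_days', 'value': 412},
        {'name': 'avg_weekly_sessions', 'value': 3.5},
    ]
}

resp = requests.post(
    f'{VIYA_HOST}/microanalyticScore/modules/{MODULE}/steps/score',
    headers={'Authorization': f'Bearer {TOKEN}',
             'Content-Type': 'application/json'},
    json=payload,
)

print(resp.json())   # scored output, e.g. the churn probability for this customer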

There’s obviously no CI/CD component here just yet. That’s intentional. I didn’t want to overcomplicate this post, since all we have here is a single model. I’ll come back and write a follow-up on that topic another day, with another app. But for now, let’s rejoice in how much we’ve managed to get done automagically with Airflow & SAS Viya in containers.

Through thoughtful, intelligent automation of mundane routines, using properly selected technology components, you can now free yourself up to focus on more exciting, cooler, higher-order projects, while still making an ongoing impact in your unicorn organization through your models. Your best life is now. So why wait, when you can automate? 🤖

Connect with Sathish on LinkedIn
