Machine Learning Operations

Automate model development (Part 1)

Why you should and how you can

Alex Hasha
Mission Lane Tech Blog

--

It’s 2024, so your team probably has some impressive automation for testing and deploying any software to a production environment, including Machine Learning and AI models. Can you say the same for your model development environment?

Automating model development versus “AutoML”

When we talk about automating model development, we’re not suggesting it’s time to fire all the Data Scientists and replace them with AI. We’re also not talking about “AutoML” tools, which are products used to rapidly identify and tune the highest-performing machine learning algorithm for a given predictive problem. Instead, we’re talking about automating the full sequence of steps required to extract and transform data to train and evaluate a candidate model. These steps are frequently executed manually dozens of times before a champion model is selected for deployment.

In Part 1 on “Why to Automate”, we’ll discuss the compelling reasons why Data Science teams should invest in automating model development.

Part 2 on “How to Automate” describes how Mission Lane builds fully automated model training and evaluation pipelines using open-source software and how that changes the way our Data Science team delivers value for the organization. In later posts, we’ll offer tips and tricks to help you automate your model development processes with these tools.

Let’s dive in!

[Image: A futuristic spaceship, representing your model deployment environment, tethered to old-fashioned horse-drawn buggies representing your model development environment.]
“My model development architecture may be metaphorically equivalent to a horse-drawn carriage, but at least it’s AI-generated and has a team of five-legged horses!”

Model development workflows accumulate complexity rapidly

From our shared experience with Data Science teams at multiple organizations, the single most common software architecture for model development is a git repository full of Jupyter notebooks, with the associated data stored outside of version control in a shared file system or cloud bucket storage. This architecture produces projects that unfold more or less like this:

  1. Someone comes to the data team with a question or hypothesis. Can we use this new dataset or algorithm to get a more accurate prediction of the thing?
  2. The Data Scientist, ever eager to investigate, fires up a fresh notebook. She finds and modifies a related SQL query to extract the relevant data from the warehouse, writes a bunch of pandas logic to transform it for analysis, then trains a model using the latest hotness from scikit-learn (see the sketch after this list).
  3. The results look good! Everyone is excited and the project is green-lighted! Time to incorporate the other datasets that it made sense to skip for the sake of prototyping, and do a more careful round of exploratory data analysis. The notebook’s getting too long, so she splits it up: one for data extraction, another for transformation, and another to train and test the model.
  4. Now she has a de-facto “model pipeline” consisting of 3–4 notebooks that need to be run in the right order, writing intermediate datasets and charts to a folder hierarchy that she made up as she went along.
  5. Every time she runs “the pipeline”, the stakeholders come back with new questions and ideas, each of which requires another run through the pipeline. She’s the only one on the team who can do it quickly, because she knows all the places where the cells have to be executed in a particular order, or where a filename has to be changed by hand to use that dataset with the third-quarter assumptions.
  6. The project has half a dozen stakeholders who each want to see the model’s performance evaluated on segments they care about, using their favorite plot, with slightly different assumptions. These requests add a few more notebooks to “the pipeline”.
  7. Because notebook diffs look awful in GitHub, and because notebooks mash together code and results, the Data Scientist is reluctant to use git “correctly” and overwrite notebooks from previous iterations of analysis. It’s easier just to copy and tweak the notebooks that need to change for each iteration.
  8. Let a manual workflow develop organically like this, and you can easily end up with 15 to 150 notebooks in a late-stage model development project. Some of these are part of the core “pipeline”, others are stale versions from previous iterations, and still others are one-off analyses or dead ends. Only their author knows which is which.
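
For concreteness, here is a minimal sketch of the kind of prototype described in step 2. The connection string, table, and column names are hypothetical placeholders, not a real schema:

```python
# A typical first-pass prototype: extract, transform, and train in one pass.
# The connection string, table, and column names below are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Extract: a lightly modified SQL query borrowed from a related project.
query = """
    SELECT account_id, feature_a, feature_b, outcome_flag
    FROM analytics.shiny_new_dataset
    WHERE snapshot_date >= '2023-01-01'
"""
raw = pd.read_sql(query, "postgresql://warehouse.example.com/analytics")

# Transform: ad hoc pandas logic that tends to grow with every iteration.
df = raw.dropna(subset=["outcome_flag"]).assign(
    ratio=lambda d: d["feature_a"] / (d["feature_b"] + 1e-9)
)

# Train and evaluate a candidate model.
X, y = df[["feature_a", "feature_b", "ratio"]], df["outcome_flag"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

There is nothing wrong with this as a prototype; the trouble starts when that single notebook grows into steps 3 through 8.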

The project code gets complicated because the model development process IS complicated. While notebooks are a simple and flexible tool to get started with prototyping, they don’t give Data Scientists the tools they need to manage complexity effectively as the project matures. It doesn’t help that Data Scientists are used to being praised for their ability to manage complexity in their heads. They can handle it, but they end up squandering intellectual bandwidth maintaining a mental map of the project and carefully orchestrating each iteration, when they should be delegating these tasks to a computer.

Inefficiency isn’t the only drawback to managing model development complexity in your head. It is also more error-prone and makes hand-offs difficult. When it comes time to start building the next version of the model, the next Data Scientist takes one look at that hot mess of a notebooks folder and decides to start fresh with a new notebook…

Analysis is an iterative process

This story illustrates how analysis is an intrinsically iterative activity. Each dive into the data is as likely to raise new questions or generate new hypotheses as it is to yield a conclusive result. Consequently, most analytical code will be run many times with small variations before a project is complete, and a mature analytical pipeline can take several hours to run. As Randall Munroe, the great sage of science nerds, illustrates, this is a sweet spot for automation.

Sources: https://xkcd.com/1459/ and https://xkcd.com/1205/

But what exactly should be automated, and how? Products marketed as “AutoML” tend to focus on the innermost exploratory loop (algorithm selection, hyperparameter tuning, and feature selection), searching for a model that’s optimal given a fixed performance metric and fixed training and evaluation data. “MLOps” tools frequently focus on specializing CI/CD testing, metadata logging, and deployment tools for the Machine Learning context. These tools are useful and should occupy a node in your pipeline, but the entire process, from pulling raw data to producing the decision artifacts and documentation, is an automation opportunity. Anything you do the same way, in the same order, for multiple iterations can and should be automated.
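
To make that scope concrete, here is one low-tech sketch assuming nothing more than plain Python: each repeated step becomes a function, and a single entry point runs them in order. The function names and paths are illustrative placeholders, not any particular tool’s API:

```python
# Sketch of the full, repeated sequence as one entry point. Each stub below
# stands in for logic that often lives in a manually run notebook; the names
# and paths are illustrative placeholders.
from pathlib import Path

ARTIFACTS = Path("artifacts")

def extract_raw_data():
    """Pull raw tables from the warehouse (formerly notebook 1)."""

def build_training_table(raw):
    """Clean and join features and outcomes (formerly notebook 2)."""

def train_candidate_model(table):
    """Fit the candidate model (formerly notebook 3)."""

def evaluate_on_segments(model, table):
    """Compute each stakeholder's segment metrics and plots (formerly notebooks 4+)."""

def render_decision_report(metrics):
    """Write the charts, tables, and documentation that drive the decision."""

def run_pipeline():
    ARTIFACTS.mkdir(exist_ok=True)
    raw = extract_raw_data()
    table = build_training_table(raw)
    model = train_candidate_model(table)
    metrics = evaluate_on_segments(model, table)
    render_decision_report(metrics)

if __name__ == "__main__":
    run_pipeline()  # one command reruns every routine step, in the right order
```

Dedicated workflow tools layer dependency tracking, caching, and parallelism on top of this basic idea (more on that in Part 2), but the scope is the point: everything from raw data to decision artifacts reruns from a single command.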

Another way to think about the correct scope of automation is to strive for automatic reproducibility of the key results that will drive decision-making. A second analyst should be able to reproduce those results using only the original analyst’s documentation, data, and code, without scheduling meetings to track them down. If you achieve this level of automation, you can push complex but routine processes into the background and save precious human attention for interpreting results and choosing which hypothesis to test in your next experiment.
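
One pattern that helps get there, sketched below with hypothetical file and key names, is to pin every run-to-run assumption in a version-controlled config file and seed all randomness from it, so the same commit applied to the same data reproduces the same key results:

```python
# Sketch: all assumptions that vary between iterations live in one
# version-controlled config, and randomness is seeded from it, so the same
# commit plus the same data yields the same results. File and key names
# are hypothetical.
import json
import random

import numpy as np

def load_config(path="configs/q3_assumptions.json"):
    # e.g. {"seed": 42, "train_start": "2023-01-01", "segment": "new_customers"}
    with open(path) as f:
        return json.load(f)

if __name__ == "__main__":
    config = load_config()
    random.seed(config["seed"])
    np.random.seed(config["seed"])
    # run_pipeline(config)  # same entry point as the earlier sketch, now parameterized
```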

Conclusion

In summary, we think there is a simple and compelling case for building automation into your model development environment, not just into your pipeline to production.

  1. Manual model development workflows accumulate complexity rapidly: they are hard to reproduce, even for the original developer, and mistakes are likely.
  2. Model development is iterative: it requires frequent repetition of a long sequence of steps with small modifications.
  3. As a result, automating these repetitive tasks saves time and makes results more consistent and verifiable.

In Part 2, we’ll show that building this automation may be easier than you think. Advancements in the open-source scientific computing ecosystem have dramatically reduced the barrier to entry. We’ll describe how Mission Lane used open-source software to build fully automated model training and evaluation pipelines and how that changes the way our Data Science team delivers value for the organization.

--

Alex Hasha
Mission Lane Tech Blog

Experienced Data Science leader with backgrounds in finance and climate science and expertise in model development, deployment, and risk management.