Machine Learning Operations

Automate model development (Part 2)

Why you should and how you can

Alex Hasha
Mission Lane Tech Blog

--

This is Part 2 of our series on how and why to automate model development. See here for Part 1, which explains how model development automation is different from AutoML and MLOps, and why investing in it is probably more important than you realize. In Part 2, you’ll learn that building this automation may be easier than you think.

Advancements in the open-source scientific computing ecosystem have dramatically reduced the barrier to entry. This post describes how Mission Lane uses open-source software to achieve end-to-end reproducibility from raw data to the key modeling results that drive decision-making. This allows us to push complex but routine processes into the background, saving precious human attention for innovation and problem-solving.

The tools to automate model development are readily available

This level of automation can be achieved with established tools and patterns. First, let’s break down the requirements. Minimally, to reproduce a computational task you need to:

  1. Gather the same input data
  2. Build and install the same code and software libraries
  3. Re-run the process steps in the right order with the same configuration settings

Additionally, to reduce the cognitive load on the humans overseeing these tasks, you’ll want to:

  4. Systematically track the revision history of your project and which revisions were used to produce results
  5. Keep code and documentation in sync
  6. Maintain tidy, well-organized code and data
  7. Automate testing and apply it to every code revision

This may sound like a lot to manage, but if you’re a Python coder, the tools to solve each of these problems are just a “pip install” away (or, if you follow our advice, a “poetry add” away). We’ve developed a standard software stack and project structure that enables the reproduction of a model and its evaluation artifacts in one command (three if you’re setting up the environment for the first time).

Here are the tools we use at Mission Lane:

(1) Gather the same input data, and (3) re-run the same process steps

For us, DVC was the last missing piece enabling a truly automated, reproducible model development workflow. While almost everyone uses a code versioning tool like git, those tools don’t work well with large data sets. This leads to data scientists being urged to version their code but, paradoxically, not their data. Because of this, reproducing someone else’s work usually requires a few days or weeks of archaeological excavation in a colleague’s home directory, trying to understand how the data fits with the code.

Before DVC, most data version control tools were hosted services tightly coupled to particular platforms for data storage and compute, which made them difficult to adopt in the context of a large corporation where hosted infrastructure must go through layers of architectural, budget, and security approvals. In contrast, DVC is an open-source Python package that can be downloaded and installed in seconds. It stores large files and datasets outside of git, while storing lightweight metadata in git that links code versions to data versions. It is extremely flexible and integrates with any major cloud storage provider, such as AWS S3, Google Cloud Storage, or Microsoft Azure Blob Storage. This means it can be adopted without disrupting existing data storage solutions in an organization. DVC users can quickly create a remote data repository to give teammates and stakeholders access to project data as easily as “git pull” gives access to code.
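To make this concrete, the day-to-day data versioning workflow looks roughly like this (the file paths and bucket name are placeholders, not our actual setup):

```
# one-time setup inside an existing git repository
dvc init

# track a large file with DVC instead of git
dvc add data/raw/transactions.csv
git add data/raw/transactions.csv.dvc data/raw/.gitignore
git commit -m "Track raw transactions data with DVC"

# point DVC at shared remote storage and upload the data
dvc remote add -d storage s3://your-bucket/dvc-store
dvc push

# teammates fetch code and data together
git pull && dvc pull
```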

In addition to data versioning, DVC provides a lightweight way to automate complex workflows by defining them as a Directed Acyclic Graph (DAG) of tasks and their dependencies. As with data versioning, most of the more popular DAG schedulers rely on hosted infrastructure and come with steep learning curves, so DVC’s approach is refreshingly simple by comparison.
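To give a flavor of the pipeline feature, stages are declared in a dvc.yaml file; the stage names and scripts below are illustrative, not our actual pipeline:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed
    outs:
      - models/model.pkl
```

Running “dvc repro” executes only the stages whose dependencies have changed since the last run, and records the resulting data versions in dvc.lock.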

Seriously, go try out DVC now. It’s amazing.

(2) Build and install the same code and software libraries

The first step in reproducing an analysis is to match the computational environment where it was run. You need the same tools, the same libraries, and the same versions to make everything play nicely together. Python is notorious for making this difficult, even though there are a bunch of tools that are supposed to do this: pip, virtualenv, conda, etc.

https://python-poetry.org/

Poetry is a tool for dependency management and packaging in Python. Like the other tools mentioned, it allows you to declare the libraries your project depends on, and it will manage (i.e., install and update) them for you. Unlike most Python environment management tools, Poetry ensures repeatable installs using a “lockfile” pattern.

From the perspective of reproducibility, a key thing to consider when choosing among these tools is a subtle difference in their goals. pip and conda are designed for flexibility: they try to let you install a given package alongside as wide a range of versions of other packages as possible, so you can construct environments with as many packages as you need.

This is great news for productivity, but bad news for reproducibility, because by default they’ll install the latest version of any dependency that’s compatible with the other declared dependencies. pip and conda can “pin” requirements to specific versions, but that isn’t the default behavior and it requires quite a bit of manual work, so people don’t reliably do it.

Here’s a specific example. Suppose you want to use a particular version of pandas as follows:
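For example (the pinned pandas version here is arbitrary; any release with a loose lower bound on numpy behaves the same way):

```
pip install "pandas==2.0.3"
```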

When you use pip this way, you get the latest version of numpy compatible with the pandas requirement numpy>=1.21.0, which, as of January 2024, is version 1.26.3. If you run this same installation procedure again after the next numpy release, you will get a different version of numpy, which could behave differently.

Worse, package developers are often sloppy about testing all possible combinations of dependency versions, so it’s not uncommon that packages that are “compatible” in terms of their declared requirements don’t work together. We’ve seen it many times: an environment managed with pip or conda that worked a few months ago no longer works after a clean install today.

Managing the dependency environment with Poetry avoids these issues because it automatically generates a “lock file” listing the exact version of every package installed. Subsequent installs follow the lock file, so you always get numpy version 1.26.3 no matter when you try to recreate the environment.
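With Poetry, the same scenario looks something like this (the version specifier is again just for illustration):

```
# declare the dependency; Poetry resolves it and writes the result to poetry.lock
poetry add "pandas==2.0.3"

# later, on another machine or a CI runner, install exactly what the lock file says
poetry install
```

Committing poetry.lock alongside the code is what makes a clean install next year come out the same as the environment you built today.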

(6) Tidy, well-organized code

Writing code in a consistent, standard style makes your code easier for others to read, debug, and maintain. While there will always be a few holdouts who continue to argue over whether to indent with spaces or tabs, the Python community has aligned on the PEP8 style conventions. When we were first learning Python, we had to learn these conventions in detail to survive a code review with an experienced developer. Now, adding the following tools to your project will automate 98% of the effort needed to follow the style standard, and will gently prompt you through the rest.
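The exact tools matter less than wiring them into the project so they run automatically; a typical combination (not necessarily the exact set in our template) is black for formatting, isort for import ordering, and flake8 for linting:

```
# reformat code and imports in place, then report any remaining style issues
black src tests
isort src tests
flake8 src tests
```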

(7) Test automation

https://docs.pytest.org/en/8.0.x/

pytest — The pytest framework makes it easy to write small, readable tests, and can scale to support complex functional testing for applications and libraries. The majority of Python projects use either pytest or unittest, and in my opinion, pytest is more modern and flexible, and easier to use.
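To give a sense of the style, a pytest test module is just a file of plain functions whose names start with test_ (the toy function below stands in for real project code):

```python
# tests/test_buckets.py
def bucket_income(income: float) -> str:
    """Toy function standing in for real project code."""
    return "low" if income < 25_000 else "high"


def test_low_income_maps_to_low_bucket():
    assert bucket_income(15_000) == "low"


def test_high_income_maps_to_high_bucket():
    assert bucket_income(50_000) == "high"
```

Running “pytest” from the project root discovers and runs these functions automatically.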

https://docs.github.com/en/actions

GitHub Actions — Build automation tools like GitHub Actions make it simple to run your automated tests every time someone tries to update the project’s code, so that the code in the repository is always guaranteed to pass your tests.
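A minimal workflow along these lines lives in .github/workflows/ (the Python version and install commands are placeholders for whatever the project actually uses):

```yaml
name: tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install poetry
          poetry install
      - name: Run tests
        run: poetry run pytest
```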

Bringing it all together with a project template

It takes time to master all of these tools to the point where you can set up a project with them from scratch. On the other hand, most data scientists are comfortable modifying an existing project that gives them a clear pattern to follow. At Mission Lane, we have developed a data science project template that takes care of all the boilerplate needed to configure and integrate the tools. It lets a user get started with just a few standard setup commands.

Using project templates this way is common practice in other corners of the software engineering world, though it hasn’t caught on broadly with data scientists. Major web development frameworks like Django or Ruby on Rails offer good examples. Nobody sits around before creating a new Rails project to figure out where they want to put their views; they just run “rails new” to get a standard project skeleton like everybody else. Because that default project structure is logical and standard across most projects, it is much easier for somebody who has never seen a particular project to figure out where they would find the various moving parts. Ideally, that’s how it should be when a colleague opens up your data science project.

Our template is implemented using the popular cookiecutter tool, which uses jinja templating to dynamically generate a new model development project customized to project requirements. To start a new project, a data scientist simply runs a command and answers a short set of questions as shown below:
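The session looks something like this (the template location and the specific prompts are illustrative, since ours is internal):

```
$ cookiecutter gh:mission-lane/ds-project-template   # hypothetical template location
project_name [My New Model]: credit-risk-refresh
package_name [credit_risk_refresh]:
python_version [3.11]:
description []: Refresh of the credit risk model with new data
```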

The cookiecutter creates a project folder with a standard organizational structure based on drivendata’s Cookiecutter Data Science. There is a simple folder structure for organizing versioned project data. DVC enables this to be backed by remote storage, but tracked data is always easy to find within the project workspace. Project code is structured as a standard, installable Python package for portability. The top-level directory contains functional default configurations for all of the automation tools discussed above, and a Makefile for automating typical project operations such as installation, testing, and running the pipeline.
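The resulting layout looks roughly like this (abbreviated, and following Cookiecutter Data Science conventions rather than reproducing our template exactly):

```
├── Makefile             <- install, test, and pipeline commands
├── pyproject.toml       <- Poetry-managed dependencies and tool configuration
├── dvc.yaml             <- pipeline stages tracked by DVC
├── data/
│   ├── raw/             <- immutable input data, versioned with DVC
│   └── processed/       <- intermediate and final datasets
├── models/              <- trained models and evaluation artifacts
├── notebooks/           <- exploratory analysis
├── src/<package_name>/  <- installable Python package containing pipeline code
└── tests/               <- pytest test suite
```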

The project template includes a high-level pipeline structure that is appropriate for most model builds, enabling a data scientist to fill in project-specific details without reinventing the wheel.

By following the pattern, a data scientist achieves end-to-end reproducibility: a single command takes them from raw data to a candidate model with all the evaluation artifacts needed by the business. Because the template makes this achievable, we include reproducibility in the “definition of done” for a new model and verify it in our model validation process. This not only makes model development more transparent, it also lets the data scientist be more responsive to the questions and concerns that inevitably arise as more stakeholders are engaged later in model development. In the past, there were often painful trade-offs between addressing every issue and the days of manual effort required to run through the pipeline again. By converting days of human effort into hours of computer time, we can dot more i’s and cross more t’s before moving forward.

Another major benefit of this approach is that it has converted a model “refresh” from a multi-week project to an automated monitoring process. Because our portfolio has grown rapidly in recent years, we can often achieve significant model performance improvement simply by re-fitting using the same methodology and more data. Without model development automation, it was a difficult judgment call to decide when there was enough new data to justify spending the time to do this. Now, our internal “auto-model” platform can run these refreshes automatically on a regular cadence, giving us a clear signal when new data has provided enough of a performance boost to consider releasing an update.

Finally, a consistent project structure allows us to be more flexible and responsive with our staffing decisions. It lets a newcomer dive into a project without having to spend a lot of time getting oriented. They don’t have to read all of the code or schedule knowledge transfer meetings to know where to look for standard project elements. Conventional problems are solved in conventional ways, leaving more energy to focus on the unique demands of each project. This means our managers have more flexibility to move team members between projects and domains because less “tribal knowledge” is required to get up to speed.

Conclusion

Hopefully, we’ve convinced you that automating model development can be a game changer for any Data Science team. Through Mission Lane’s automation journey using open-source tools, you’ve seen that it doesn’t have to be expensive or complicated. Stay tuned for upcoming posts, where we’ll share practical tips and tricks to help you apply these tools successfully in your model development projects!

--

Alex Hasha is an experienced data science leader with backgrounds in finance and climate science and expertise in model development, deployment, and risk management.