🛂 Checkpoint: A Better Process for Promoting Models to Production

Kevin Flansburg · Published in domino-research · 5 min read · Sep 16, 2021

Today we are excited to introduce another tool to support the RegistryOps pattern — 🛂 Checkpoint. Checkpoint provides a “pull request for machine learning” workflow that enables better governance for promoting models to production. Skip ahead to read up on Checkpoint, or check out the open-source project on GitHub.

The case for RegistryOps

A few weeks ago, Domino Research shared a post introducing our current thinking on best practices for managing model deployments.

Briefly, our thesis is:

Model deployments should be managed declaratively, similar to GitOps, and the model registry makes the most sense as the source of truth for storing that configuration.

We dubbed this pattern “RegistryOps” and introduced a tool — 🌉 Bridge — to enable it by syncing AWS SageMaker deployments with configuration specified in MLflow (a popular model registry).

Once the model registry is adopted as the source of truth, governance becomes very important. In this post I will introduce Checkpoint, which adds such a governance layer to your model registry. Checkpoint introduces a process we are calling “Promote Requests” (a nod to the now-ubiquitous “PR” on GitHub), in which:

  1. Data scientists request that a model version be promoted to a deployment stage (e.g. Staging or Production).
  2. Team members can then review this request with changes in parameters and metrics highlighted.
  3. Once approved, the model stage will be updated in the model registry.

Skip ahead to how Checkpoint works, or keep reading for how we got here.

Background on RegistryOps

Declarative infrastructure management (infrastructure as code and/or GitOps) has become increasingly popular over the last decade for traditional software applications. With this approach, all infrastructure and application configuration is defined as code and stored in version control. When code changes are merged, tools like Terraform or Argo CD are used to sync infrastructure to conform with the defined state. This has many advantages over imperative configuration:

  • The state of the system can be easily inspected by viewing code.
  • Pull requests can be used to review changes as a team.
  • Sync operations are idempotent and configuration does not “drift”.
  • An audit trail of changes (and who made them) is kept.
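To make the sync step concrete, here is a minimal Python sketch of the reconcile loop these tools run. Every function in it is a hypothetical placeholder for illustration, not a real Terraform or Argo CD API:

```python
# Minimal sketch of declarative reconciliation, in the spirit of Terraform
# or Argo CD. All functions here are hypothetical placeholders.

def desired_state() -> dict:
    """The declared configuration, e.g. parsed from files in version control."""
    return {"web": {"replicas": 3}, "worker": {"replicas": 1}}

def actual_state() -> dict:
    """The live system's current configuration, queried from its API."""
    return {"web": {"replicas": 2}, "worker": {"replicas": 1}}

def apply_change(name: str, spec: dict) -> None:
    """Update one component of the live system to match its declared spec."""
    print(f"syncing {name} -> {spec}")

def reconcile() -> None:
    desired, actual = desired_state(), actual_state()
    for name, spec in desired.items():
        if actual.get(name) != spec:
            apply_change(name, spec)
    # Running reconcile() again after a successful sync changes nothing,
    # which is what makes the operation idempotent.

reconcile()  # prints: syncing web -> {'replicas': 3}
```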

We believe that model deployments, like any other hosted and configured application, should be configured declaratively. This can be achieved with traditional GitOps, but that approach is not ideal for a number of reasons:

  • Source code does not fully capture the information needed to reproduce a model version, such as hyper-parameters and training data.
  • Data scientists should not have to learn and worry about the particulars of infrastructure configuration when a model version changes. This is unavoidable when submitting to a GitOps repository.
  • Each model version is produced by a time-consuming and expensive training run that should happen only once (and before review), unlike traditional software builds and tests, which rerun cheaply on every change.
  • Model metrics are more important than code when choosing to promote a model version.

Given these requirements, we settled on the model registry as the natural system of record for managing model version deployments. The registry captures the information needed to fully reproduce a model version, as well as the resulting model artifact and performance metrics. It is also a system data scientists are already comfortable with, because they already use it for experiment tracking.

Many model registries have the capability for tagging model versions with deployment stages (e.g. Production and Development). This capability enables the aforementioned RegistryOps pattern, wherein the model version deployed to each stage is managed declaratively using the stage tag in the model registry.
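In MLflow, for example, the stage tag is set through the registry client. A minimal sketch, using a hypothetical registered model name and assuming a reachable tracking server:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()  # reads MLFLOW_TRACKING_URI from the environment

# Declare that version 3 of a (hypothetical) registered model should serve
# Production. With Bridge running, the live deployment is synced to match.
client.transition_model_version_stage(
    name="churn-model",
    version="3",
    stage="Production",
)
```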

With this in mind, we developed 🌉 Bridge, which we introduced in our first post mentioned above. Bridge is like Terraform for RegistryOps, and automatically configures model deployments to match the configuration stored in your model registry.

A framework for model governance

With a tool like Bridge deployed, it is now easy to manage model deployments using the model registry. This, however, makes controlling access to model stage tags essential. An individual updating one of these tags would be similar to a developer pushing directly to the main branch in Git.

It makes sense for teams to adopt a process of reviewing these changes, just like they do for code changes, to double check their work and reduce mistakes. A team could adopt the honor system for implementing this process, but there are still some sharp edges:

  • There is no audit trail of these peer reviews.
  • Most registries do not have a built-in ability to compare model versions in a single view.
  • Accidental changes can still occur.

For these reasons, we envisioned a tool that would integrate with a model registry and enforce a mechanism similar to pull requests, but tailored to model version management.

🛂 Checkpoint

Enter Checkpoint, an implementation of our proposed governance solution, with current support for MLflow model registries. Checkpoint introduces the concept of a model “promote request”: a pull-request-style workflow tailored to comparing model versions.

Checkpoint operates as a proxy between MLflow and users. This allows Checkpoint to intercept requests to change model stage tags, redirecting users to create a promote request. In the future, Checkpoint will be able to introduce authentication and role-based approval permissions.
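Since Checkpoint fronts MLflow as a proxy, adopting it should amount to pointing your MLflow client at Checkpoint’s address instead of the registry’s. A sketch with a hypothetical hostname:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Hypothetical address of the Checkpoint proxy, which forwards ordinary
# registry traffic through to the MLflow server behind it.
mlflow.set_tracking_uri("http://checkpoint.internal.example.com:5000")

# A stage-change call like this is intercepted by Checkpoint, which
# redirects the user to open a promote request instead of applying
# the change immediately.
client = MlflowClient()
client.transition_model_version_stage(
    name="churn-model", version="3", stage="Staging"
)
```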

A team member can request that a new version be promoted to Production

Once a promote request is created, other team members can view the request and double check the decision. Checkpoint introduces a “model diff” view, comparing any parameters and metrics that may have changed between the two model versions.

Promote Request Model Diff Interface
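Conceptually, the diff boils down to comparing the params and metrics logged on the runs behind the two model versions. A rough sketch of that idea using the standard MLflow client (not Checkpoint’s actual implementation):

```python
from mlflow.tracking import MlflowClient

def model_diff(name: str, old_version: str, new_version: str) -> dict:
    """Compare logged params and metrics between two registered model versions."""
    client = MlflowClient()
    old_run, new_run = (
        client.get_run(client.get_model_version(name, v).run_id)
        for v in (old_version, new_version)
    )
    diff = {}
    for field in ("params", "metrics"):
        old, new = getattr(old_run.data, field), getattr(new_run.data, field)
        diff[field] = {
            key: (old.get(key), new.get(key))
            for key in set(old) | set(new)
            if old.get(key) != new.get(key)
        }
    return diff

# e.g. model_diff("churn-model", "2", "3") might return:
# {"params": {"alpha": ("0.1", "0.5")}, "metrics": {"rmse": (0.9, 0.82)}}
```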

A reviewer can either close the request to reject the change, or approve the request, at which point Checkpoint will automatically complete the requested stage change in the model registry. At any time, completed requests can be reviewed to see a history of when model versions were deployed to different stages.

Try It Out!

Like Bridge, Checkpoint is open-sourced in the Domino Research GitHub repository. If you already have an MLflow registry, Checkpoint is easy to try out locally by running a single Docker container, or you can deploy it on Heroku. If you don’t have an MLflow registry, we have also compiled a quick guide to setting up MLflow for local testing.
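Once you have a registry running, you will want a registered model version to promote. A minimal sketch that logs and registers a toy scikit-learn model, assuming a local MLflow server on the default port with a database-backed store (which the model registry requires):

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LinearRegression

mlflow.set_tracking_uri("http://localhost:5000")  # assumed local MLflow server

# Train a toy model just so there is an artifact to register.
X, y = np.arange(10).reshape(-1, 1), np.arange(10)
model = LinearRegression().fit(X, y)

with mlflow.start_run() as run:
    mlflow.log_param("fit_intercept", True)
    mlflow.log_metric("rmse", 0.0)
    mlflow.sklearn.log_model(model, "model")

# Creates version 1 of a (hypothetical) registered model named "demo-model".
mlflow.register_model(f"runs:/{run.info.run_id}/model", "demo-model")
```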

If you have comments or feature requests, feel free to submit them to our GitHub repository, or contact us @domino_research on Twitter!
