🌉 Bridge — bridging the gap between model registries and production hosting

Josh Broomberg · domino-research · Aug 13, 2021

(Cover photo: Meric Dagli / Unsplash)

The Domino R&D team is open-sourcing Bridge, a tool that turns your model registry into the declarative source-of-truth for your model deployment and hosting.

With Bridge:

  • Data scientists manage the lifecycle of their models exclusively through the built-for-purpose API and user interface of their model registry.
  • DevOps and machine learning engineering teams use Bridge to automate the often complex and frustrating management of model hosting resources.

Below, we share the research insights that led us to declarative ‘RegistryOps’ as the right way to structure the path from model development to production.

(TL;DR? Check out this 7-minute Loom demo of Bridge in action.)

Hundreds of interviews, two observations

Over the last year, we’ve spoken to hundreds of data science and machine learning teams across organizations of all sizes. When we synthesized these conversations, two things became clear:

Registries (not repos) are the source-of-truth

Most deployed software systems use a code repository (usually a git repo) as their source-of-truth. If your serving infrastructure were completely destroyed, you’d still be able to rebuild and redeploy your application using just the source code in your repo. CI/CD systems, the gold standard for safe, scalable software deployment, bake this deeply into their design: on a push to your repo, artifacts are rebuilt, and those artifacts are then deployed for hosting.

Code-repo-as-source-of-truth does not work for model development for three reasons:

  1. Model development is creative and ad hoc: building and improving a model requires experimenting with different features, architectures, hyperparameter values, etc. This work is best performed in notebooks, which don’t play nicely with git. Even if experimental code is git-versioned, one set of code may generate multiple candidate model versions, and these versions are defined by artifacts, metrics, and parameters that are hard to organize and store as plain text for git versioning. And unlike regular software development, the most recent version is very rarely the best version.
  2. Data is as important as code, and changes often: the same modeling code will produce different outputs when run on different data. Data changes frequently over time, so even if the code is untouched, the model will evolve in a way that is not reflected in a git repo.
  3. Compute is significant and stochastic: for regular software, the compute required to build artifacts is an afterthought because it is relatively small, cheap, and mostly deterministic. In model development, compute is expensive, and the same inputs can produce different outputs due to stochastic optimization. This makes it crucial to store the results of each compute run rather than treating them as disposable.

Model registries address these issues by storing ‘model versions’ that capture the inputs (code, parameters, data) and outputs (artifacts, metrics) of compute, regardless of where it happens. If your production infrastructure were destroyed, the content of your registry (your best model version), and not the content of your git repo, would be what you needed to redeploy your model.
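To make this concrete, here is a minimal sketch of what capturing a model version looks like, using MLflow (which we come back to later in this post) as the example registry. The model name, data path, and features are illustrative, not part of Bridge.

```python
# Sketch: log one training run and register the result as a model version in
# MLflow. The registry captures the inputs (parameters, data reference) and
# outputs (metrics, serialized artifact) of this compute run.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("data/churn_2021_08.csv")  # the data snapshot matters as much as the code
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=["churned"]), data["churned"]
)

with mlflow.start_run():
    # Inputs: parameters (plus, implicitly, the code and data used)
    params = {"n_estimators": 200, "max_depth": 8}
    mlflow.log_params(params)
    mlflow.log_param("training_data", "data/churn_2021_08.csv")

    model = RandomForestClassifier(**params).fit(X_train, y_train)

    # Outputs: metrics and the serialized model artifact
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="churn-classifier",  # creates a new version of this registered model
    )
```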

Based on our research, we expect the vast majority of data science teams to adopt model registries as the core pillar of their model development process.

Model lifecycles must be decoupled from app lifecycles

The vast majority of models are consumed as part of a broader application or service. The exact integration pattern depends on the scale and latency requirements of the application: teams do everything from loading model artifacts into application or database memory so they can be called locally, to running asynchronous batch processes that read and write data in the database.

Regardless of how consumption happens, it is important to separate the lifecycle of the consuming application from that of the model itself, because the two change on different timescales and with different requirements. In the vast majority of teams, the app runtime and results-consumption logic are relatively stable, but the model must be updated frequently as a result of data drift and new, improved model versions. This is true even without considering more advanced patterns like shadow and A/B testing, which require prediction-by-prediction changes to the model running in the same context.

The upshot is that high-quality model deployment patterns must allow models to be updated independently from the lifecycle of the calling code.
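One way to picture this decoupling: the application calls the model through a stable, named interface, so new model versions can go live without the application being rebuilt or redeployed. The endpoint URL and payload schema below are hypothetical and only illustrate the idea.

```python
# Sketch: an application consuming predictions through a stable, named model
# endpoint. The URL and payload schema are hypothetical; the point is that the
# app never references a specific model version, so the model behind
# "churn-classifier" can be updated on its own schedule.
import requests

CHURN_MODEL_URL = "https://models.internal.example.com/churn-classifier/invocations"

def churn_risk(customer_features: dict) -> float:
    response = requests.post(
        CHURN_MODEL_URL, json={"inputs": [customer_features]}, timeout=2
    )
    response.raise_for_status()
    return response.json()["predictions"][0]
```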

The Problem for Modeling Teams

The two observations above leave teams stuck between a rock and a hard place when it comes to getting models from development to production. We see two patterns:

  • The modeling team maintains separate hosting/runtime infrastructure for models, typically microservice-like APIs or batch jobs. This achieves model-app independence but creates an infrastructure management burden that distracts from model development. Advanced teams may trigger these updates based on webhooks from their model registry, but this requires DevOps configuration that demands at least one, if not more, full-time engineers.
  • The modeling team does not maintain separate infrastructure. Models are bundled into the consuming application using the regular build process, by fetching the correct model from the registry during the application build and deploy (sketched after this list). In this pattern, it is hard to update the model independently of the consuming app. Unless the app is configured with true Continuous Deployment, the modeling team relies on the support of a shared DevOps team to get model updates into production.
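As a rough sketch of that second pattern, the application build (or startup) resolves whichever model version the registry currently marks for production and loads it for in-process use. The `models:/<name>/<stage>` URI is a real MLflow convention; the model name is illustrative.

```python
# Sketch: run during the application's build or startup, this resolves the
# model version currently in the "Production" stage of the MLflow registry and
# loads it for in-process prediction. Because the version is fixed at
# build/deploy time, shipping a new model means redeploying the whole app.
import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/churn-classifier/Production")

# At serving time the app calls model.predict(...) locally.
```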

In both scenarios, modeling teams are forced to reason about DevOps concerns that have little to do with their core model-development job. Worse, the model registry (which should be the source-of-truth for which model version is running in production) is likely to be out of sync with the hosting infrastructure. If a manual deployment is not made or a pipeline is not triggered, changes to the registry will not be reflected in what is actually running.

Introducing Bridge

Bridge is designed to enable declarative model management, with your model registry as the source of truth. It is a lightweight tool that runs via an installed CLI or in a Docker container, watching a specified model registry and updating target infrastructure as changes occur.

With this ‘RegistryOps’ approach:

  • Data scientists manage the lifecycle of their models exclusively through the built-for-purpose API and user interface of their model registry. Stage labels (dev/staging/prod/etc.) in the registry become a declarative specification of which models should be deployed and to which environment they should be deployed (see the sketch after this list).
  • DevOps and machine learning engineering teams use Bridge to automate the often complex and frustrating management of hosting resources. You manage Bridge; Bridge herds the infrastructure cats and manages the repetitive wrapper code.
  • Both teams can be confident that the models tagged in the registry are the models being served, without having to dig through git and CI logs or worry about keeping things up to date manually.
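Under this approach, the only action a data scientist takes to promote a model could be a stage transition in the registry, shown here with MLflow’s client API (the model name and version are illustrative). Bridge, watching the registry, then reconciles the hosting infrastructure to match.

```python
# Sketch: promoting a model is just a registry operation. With Bridge watching
# the registry, this stage label, not a CI pipeline or a manual deploy, is what
# determines which version gets served. Model name/version are illustrative.
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="churn-classifier",
    version="7",
    stage="Staging",  # or "Production"; Bridge maps registry stages to target environments
)
```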

Try the quick start: you’ll be deploying from your existing registry in under 10 minutes. If you’d like to set up MLflow to test Bridge out, check out our 5-minute guide to setting up MLflow.

We’ve started with MLflow and SageMaker because these are tools used by a number of teams today and a great place for new teams to start out. But we’ve also designed Bridge so that the functionality above can be extended with support for other registries and other deployment targets.

If you have an idea or run into a problem, open an issue. We look forward to hearing from you!
