CI/CD vs RegistryOps for deploying ML models

Josh Broomberg
Published in domino-research
2 min read · Oct 5, 2021

For many years, software developers have treated CI/CD (continuous integration, continuous deployment) as a basic principle of DevOps. It is widely agreed that you should be able to develop locally until you’re happy with your results and then, without additional manual effort, put your new code into production. Local code may (and should!) be tested, packaged, and deployed to staging prior to production release. But this is automated and uses the exact same code: nothing is rewritten or handed to another team.

The problem is that CI/CD does not work for DS/ML use cases:

  • Models need to be ‘rebuilt’ (retrained) when code OR data changes, so triggering is tricky.
  • The ‘build’ (training) process is both more compute-intensive and more experimental: not everything that is built is deployed, and runs may be paused, adjusted, and resumed.
  • ‘Tests’ for quality are complex and stochastic, and they require expert human judgement. It is never as simple as ‘is this one number greater than X?’ Perhaps you’re trading off accuracy for fairness, or accepting worse results in order to handle a new class.
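
To make the triggering problem in the first bullet concrete, here is a minimal sketch of one way to detect that a retrain is due when either the code or the data changes. The function names and the idea of a string "data reference" (for example, a dataset version tag) are assumptions made for illustration; they are not taken from any particular tool.

```python
import hashlib


def fingerprint(code: bytes, data_ref: str) -> str:
    """Combine the training code and a data version reference into one hash.

    `data_ref` is a hypothetical identifier for the dataset version,
    not the data itself.
    """
    h = hashlib.sha256()
    # Hash each input separately first so their boundaries cannot blur together.
    h.update(hashlib.sha256(code).digest())
    h.update(hashlib.sha256(data_ref.encode("utf-8")).digest())
    return h.hexdigest()


def needs_retrain(last_fingerprint: str, code: bytes, data_ref: str) -> bool:
    """A retrain is due if the code OR the data reference has changed."""
    return fingerprint(code, data_ref) != last_fingerprint


# Hypothetical inputs: the same code run against two dataset versions.
last = fingerprint(b"def train(): ...", "dataset@v1")
print(needs_retrain(last, b"def train(): ...", "dataset@v1"))
print(needs_retrain(last, b"def train(): ...", "dataset@v2"))
```

Even with a trigger like this, the other two bullets still apply: a detected change only says that a new model could be trained, not that the result should be deployed.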

We think the registry is the solution to these problems. It sits in the middle of two very different worlds:

  • The messy world of local experimentation. This may happen in notebooks or on large clusters. Runs can take seconds or weeks. Ultimately, the only thing that matters is that each unique model version ends up in a registry that stores the code, a reference to the data, and the artifacts, parameters, and metrics associated with that specific version.
  • The stable world of production hosting, where consumers query stable API endpoints to perform inference using the best version of each model.

The registry stores all the information needed to make the complex decision about which model versions to promote. And, once this decision has been made, it has all the artifacts and metadata required to put the model into production. So… it should. This is RegistryOps. The data in your registry is treated as a declarative source of truth for your hosting infra.
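
The idea of treating the registry as a declarative source of truth can be sketched as a small reconciliation step: read the desired state from the registry, compare it with what is currently deployed, and compute the actions that close the gap. This is an illustrative sketch only, not how Bridge or MLflow is implemented; the `ModelVersion` type, the stage names, and the action tuples are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    stage: str  # assumed stage labels, e.g. "Staging" or "Production"


def desired_state(registry):
    """Map each model name to its highest-numbered Production version."""
    desired = {}
    for mv in registry:
        if mv.stage == "Production" and mv.version > desired.get(mv.name, 0):
            desired[mv.name] = mv.version
    return desired


def reconcile(registry, deployed):
    """Return the actions needed to make `deployed` match the registry.

    `deployed` maps a model name to the version currently being served.
    """
    desired = desired_state(registry)
    actions = []
    for name, version in sorted(desired.items()):
        if name not in deployed:
            actions.append(("create", name, version))
        elif deployed[name] != version:
            actions.append(("update", name, version))
    for name in sorted(deployed):
        if name not in desired:
            actions.append(("delete", name, deployed[name]))
    return actions


# Hypothetical registry contents and hosting state.
registry = [
    ModelVersion("churn", 1, "Archived"),
    ModelVersion("churn", 2, "Production"),
    ModelVersion("fraud", 1, "Staging"),
]
deployed = {"churn": 1, "legacy": 3}
print(reconcile(registry, deployed))
```

Run on a schedule or in response to registry changes, a step like this means promoting a version in the registry is the only manual decision; the hosting layer follows from it.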

If you’re interested in trying RegistryOps for yourself, follow our 4-step guide for continuous model deployment directly from MLflow using Bridge.
