Continuous Delivery for ML Models

TL;DR: if ML is part of your core software, automating the process around it is as important as TDD and CI/CD.

At Onfido we use ML to serve our customers better and to make verifying people’s identity a more seamless experience.

Since ML is at the core of our business, we have made great efforts to allow the development, training, deployment, and maintenance of our ML services to be as routine as with any other less complex service. Here is an overview of some of the components and strategies that make it possible.

It all starts with gathering data

One of the first and most important steps in order to solve problems with ML is to have a constant stream of quality data, which our research team can use to develop innovative solutions to existing problems. For this reason, we have created a robust self-service data pipeline along with a set of utility tools that make gathering and building datasets a much faster and simpler task.

Continually re-train your models

In our experience, once an ML service has been integrated with a customer-facing production feature (even in alpha or beta), we need to be able to sustain it over time. Part of sustaining a model is being able to retrain it, or even recreate it from scratch.

For this purpose we have created a pipeline structure built on Docker, Jenkins, Luigi and AWS Batch that can train models with minimal to no human intervention.

Here is a brief explanation of why we chose each:

  • Luigi: this tool provides an elegant way to build complex pipelines, define dependencies between their steps and handle workflow management. Given some steps in our pipeline are computationally expensive (training with TensorFlow on GPU or preprocessing very large datasets), the ability to resume the pipeline from checkpoints is great.
  • Docker: this is our most used tool to package dependencies and code in a standard way. Along with the service code we also ship a specific training Dockerfile and pipeline.
  • Jenkins: used mostly to kick off batch training with specific parameters.
  • AWS Batch: we were looking for a self-managed clustering mechanism to run our Docker images with GPU capabilities, and this was the perfect fit.
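The checkpoint-and-resume behaviour mentioned for Luigi can be illustrated with a small, dependency-free sketch. This is not Luigi's API and not the production pipeline: the `Step` class, the `run` function and the step names are hypothetical. It only shows the underlying idea that each step declares its dependencies and an output file, and that a step whose output already exists is skipped.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One pipeline step; its output file doubles as its checkpoint."""
    name: str
    output: str
    action: Callable[[str], None]  # must create the file at `output`
    requires: List["Step"] = field(default_factory=list)

    def complete(self) -> bool:
        return os.path.exists(self.output)


def run(step: Step, executed: List[str]) -> None:
    """Run dependencies first, then the step, skipping anything already done."""
    if step.complete():
        return
    for dependency in step.requires:
        run(dependency, executed)
    step.action(step.output)
    executed.append(step.name)


def write(text: str) -> Callable[[str], None]:
    """Build an action that writes `text` atomically, so a crash mid-write
    never leaves a partial file that looks like a finished checkpoint."""
    def _action(path: str) -> None:
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as handle:
            handle.write(text)
        os.replace(tmp_path, path)
    return _action


def build_pipeline(workdir: str) -> Step:
    """Hypothetical three-step training pipeline: preprocess -> train -> evaluate."""
    preprocess = Step("preprocess", os.path.join(workdir, "dataset.txt"), write("clean data"))
    train = Step("train", os.path.join(workdir, "model.txt"), write("weights"), [preprocess])
    return Step("evaluate", os.path.join(workdir, "report.txt"), write("accuracy"), [train])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        executed: List[str] = []
        run(build_pipeline(workdir), executed)
        print(executed)
```

If the evaluation report is deleted and the pipeline is run again, only the evaluation step executes, because the expensive preprocessing and training outputs are still on disk.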

The training phase takes the following steps:

  1. An engineer sets up a parametrised build in a Jenkins job to kick off the process.
  2. The Jenkins job builds a Docker image (if necessary) with the new code.
  3. With some of the parameters from (1) and the Docker image built in (2), we are ready to kick off training in AWS Batch. At this stage the Batch job can do all that’s needed without human intervention.
  4. Once finished, Batch stores the accuracy test results (showing how good the model is) and the associated models in S3.
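As a rough illustration of step 3, the sketch below turns build parameters into the keyword arguments of an AWS Batch SubmitJob request. The queue name, job definition name, entry-point command and parameter names are all hypothetical, and the job definition is assumed to reference the image built in step 2. Only the request shape (`jobName`, `jobQueue`, `jobDefinition`, `containerOverrides`) follows the Batch API.

```python
from typing import Any, Dict


def build_batch_job_request(
    model_name: str,
    image_tag: str,
    params: Dict[str, str],
    gpus: int = 1,
) -> Dict[str, Any]:
    """Translate build parameters into keyword arguments for Batch SubmitJob."""
    return {
        "jobName": f"train-{model_name}-{image_tag}",
        "jobQueue": "ml-training-gpu",              # hypothetical queue name
        "jobDefinition": f"{model_name}-training",  # hypothetical job definition
        "containerOverrides": {
            # Hypothetical entry point that runs the training pipeline.
            "command": ["python", "-m", "training.pipeline"],
            # Build parameters are handed to the container as environment variables.
            "environment": [
                {"name": key, "value": value} for key, value in sorted(params.items())
            ],
            # Ask Batch to place the job on an instance with this many GPUs.
            "resourceRequirements": [{"type": "GPU", "value": str(gpus)}],
        },
    }


if __name__ == "__main__":
    request = build_batch_job_request(
        "doc-classifier", "a1b2c3d", {"EPOCHS": "20", "DATASET": "v3"}
    )
    print(request)
```

With boto3 installed and credentials configured, the resulting dictionary could be passed as `boto3.client("batch").submit_job(**request)`.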

We now use this flow and setup as a template that different teams can adopt to standardise their training process.

Deployment of ML services to production

Now that we have a model that can be trained with just a few clicks, we wrap it in an API that implements the following service template:

  • HTTP API: usually Flask on top of Gunicorn and gevent
  • Monitoring: Datadog and Datadog Trace
  • Container definition using Dockerfile
  • Container orchestration configuration using Kubernetes templates
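To give a feel for the HTTP layer of such a template, here is a minimal sketch of a prediction endpoint. To keep it dependency-free it uses a plain WSGI callable instead of Flask, and the `predict` function is a stand-in rather than a real model. The routes, payload shape and `call` helper are assumptions made for illustration.

```python
import io
import json
from typing import Any, Callable, Dict, Iterable, List, Tuple

StartResponse = Callable[[str, List[Tuple[str, str]]], Any]


def predict(features: List[float]) -> Dict[str, float]:
    """Stand-in for a real model: returns the mean of the inputs as a score."""
    return {"score": sum(features) / len(features)}


def app(environ: Dict[str, Any], start_response: StartResponse) -> Iterable[bytes]:
    """WSGI entry point exposing POST /predict and GET /health."""
    path = environ.get("PATH_INFO", "")
    method = environ.get("REQUEST_METHOD", "GET")

    if method == "GET" and path == "/health":
        status, body = "200 OK", {"status": "ok"}
    elif method == "POST" and path == "/predict":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        try:
            payload = json.loads(environ["wsgi.input"].read(length))
            status, body = "200 OK", predict(payload["features"])
        except (ValueError, KeyError, TypeError, ZeroDivisionError):
            status = "400 Bad Request"
            body = {"error": "expected JSON with a non-empty 'features' list"}
    else:
        status, body = "404 Not Found", {"error": "not found"}

    data = json.dumps(body).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(data))),
    ])
    return [data]


def call(method: str, path: str, body: bytes = b"") -> Tuple[str, Dict[str, Any]]:
    """Invoke the app in-process, without a server, for a quick check."""
    captured: Dict[str, str] = {}

    def start_response(status: str, headers: List[Tuple[str, str]]) -> None:
        captured["status"] = status

    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }
    chunks = app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))


if __name__ == "__main__":
    print(call("POST", "/predict", b'{"features": [1.0, 2.0, 3.0]}'))
```

Because `app` is a standard WSGI callable, Gunicorn could serve it with its gevent worker class, for example `gunicorn -k gevent service:app`, where `service` is a hypothetical module name.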

Jenkins takes care of combining the code, bundling the ML models (more info in this post about CI for ML) and then triggering deployment to Kubernetes.

On k8s we also use CPU-based horizontal autoscalers to handle variable traffic, since most of these services are CPU-bound.
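For readers unfamiliar with how a CPU-based horizontal autoscaler decides on a replica count, the sketch below reproduces the scaling rule described in the Kubernetes documentation: the desired count is the current count multiplied by the ratio of observed to target utilisation, rounded up, and no change is made when that ratio is within a tolerance of 1.0 (0.1 by default). The utilisation figures and replica bounds used here are made-up examples, not the values from any real service.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_utilisation: float,
    target_cpu_utilisation: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Replica count a CPU-based horizontal autoscaler would aim for."""
    ratio = current_cpu_utilisation / target_cpu_utilisation
    # Close enough to the target: leave the deployment alone to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Never scale outside the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Four pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```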

Figure: Service request metrics in Datadog

As for monitoring, we use Datadog for live service metrics and alerts, which has been crucial to understanding how these models behave in real production use cases.

An example where we use this information is for evaluating new services before we push them into general availability. We usually run new services in “ghost mode” in production — passing them real traffic, but not returning their results — in order to collect metrics on how they behave:

  • system (latency, CPU and memory usage)
  • product (‘is inference as accurate as we expected?’)

This lets us define a more appropriate baseline of resources, test our model against real data, and ultimately provide a much smoother rollout.
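The ghost-mode idea can be sketched in a few lines. In this hypothetical `GhostModeRouter`, the live model answers the request, while the new model receives the same input in the background and only its latency and agreement with the live result are recorded. The class, its in-memory `metrics` list (standing in for a real metrics backend) and the toy models are assumptions for illustration, not the production setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Model = Callable[[Any], Any]


class GhostModeRouter:
    """Serve the live model; shadow the new one without returning its output."""

    def __init__(self, live: Model, ghost: Model) -> None:
        self._live = live
        self._ghost = ghost
        self._pool = ThreadPoolExecutor(max_workers=2)
        self.metrics: List[Dict[str, Any]] = []

    def _shadow(self, request: Any, live_result: Any) -> None:
        started = time.perf_counter()
        try:
            ghost_result, error = self._ghost(request), None
        except Exception as exc:  # a failing ghost must never affect callers
            ghost_result, error = None, repr(exc)
        self.metrics.append({
            "latency_s": time.perf_counter() - started,
            "agrees_with_live": error is None and ghost_result == live_result,
            "error": error,
        })

    def handle(self, request: Any) -> Any:
        result = self._live(request)
        # Fire and forget: the caller only ever sees the live result.
        self._pool.submit(self._shadow, request, result)
        return result

    def drain(self) -> None:
        """Wait for outstanding shadow calls, e.g. before reading metrics."""
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    router = GhostModeRouter(live=lambda x: x > 0.5, ghost=lambda x: x > 0.7)
    answers = [router.handle(x) for x in (0.2, 0.6, 0.9)]
    router.drain()
    print(answers, router.metrics)
```

Running the shadow call off the request path keeps the new model's latency and failures invisible to callers, which is what makes it safe to feed it real traffic.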

Wrapping up

ML is a big part of what we do at Onfido, and we are working towards making it as commonplace as any of the other tools in our toolbox. So far this has meant identifying the main complexities of sustainable production ML and developing a set of mechanisms and practices to help us move forward.

This set of practices took us from a release cycle that lasted multiple weeks for just a handful of existing models to being able to train, test and deploy several models a day.

I hope this description of our pipeline can help you start making your ML releases as routine as ours are!


If you’re interested in helping us solve these challenges, please take a look at our open positions!