Continuous Integration for ML Projects

Alvaro Fernando Lara
Onfido Product and Tech
5 min read · Oct 30, 2017

Over the last year we have deployed quite a few services that contain Machine Learning components. This post shares what we learnt in that process and what helped us minimise risk and time to production.

What does our usual development cycle look like?

We use many different programming languages at Onfido, but the most common ones are Ruby, Python, Elixir and JavaScript. The development process of each service/application is quite unified:

  • Use some version of Gitflow to manage branches
  • Containerize application using Docker
  • Deploy application to Kubernetes cluster

It’s important to mention that Docker is not only our packaging format of choice but also our development environment. This helps us minimise the risk of “works on my machine” situations and simplifies dependency management (especially useful for scientific libraries in Python).

Jenkins is our continuous integration system of choice and we use it for building and deploying applications. All of our repos contain a Jenkinsfile which defines the pipeline that Jenkins will execute. The most common steps of the pipeline are:

  • Build a Docker image
  • Run our unit/integration tests (within an instance of that Docker image)
  • Run acceptance tests (end-to-end, may require some orchestration)
  • Deploy to staging and production

Every time a developer pushes their code to the remote git server, Jenkins reads the pipeline file and follows the appropriate steps. Defining this pipeline per repository has proven to give us a lot of flexibility to tailor the steps to each service.

Here is a visualization of these processes:

Typical release cycle

Enter Machine Learning in the service

Adding machine learning components to new or existing services means that now we need to resolve a few things:

  • How do we associate the code with the (usually) large files needed for the models?
  • How can we increase confidence in changes to models or inference-related code?
  • Where does training fit into our development lifecycle?

To answer the first question, one of our engineers wrote about the approach we took. To summarise: our models live in S3 and are linked to the code using a dependency file, which is easily staged in the Git repo.
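
To make this concrete, here is a minimal sketch of what such a resolution step could look like, assuming a hypothetical plain-text dependency file (`models.txt` below) that maps local paths to S3 objects; the file name, format and bucket layout are illustrative rather than the actual scheme from the linked post:

```python
"""Minimal sketch: resolve model dependencies before building/testing.

Assumes a hypothetical `models.txt` where each non-comment line is
`<local_path> <s3_url>`, e.g.:

    models/classifier.pkl s3://my-models-bucket/classifier/v3/classifier.pkl
"""
import os
from urllib.parse import urlparse

import boto3

s3 = boto3.client("s3")


def resolve_model_dependencies(dependency_file="models.txt"):
    with open(dependency_file) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            local_path, s3_url = line.split()
            parsed = urlparse(s3_url)
            bucket, key = parsed.netloc, parsed.path.lstrip("/")
            if os.path.dirname(local_path):
                os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(bucket, key, local_path)
            print(f"Fetched {s3_url} -> {local_path}")


if __name__ == "__main__":
    resolve_model_dependencies()
```

The image build step (or an entrypoint script) can run something like this before the tests, so the container always carries exactly the models the code expects.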

Testing models before deploying

As we do with all other software, before we release changes to ML models we want a certain degree of confidence that our changes have not negatively impacted how our system behaves (at least not in unexpected ways). But how can we be confident that our models perform as we expect them to?

The solution we found is to introduce a new type of test in our test suite: accuracy tests.

An accuracy test exercises a model’s inference code against a test data sample to verify that the resulting accuracy is above an expected threshold.
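
As a rough illustration, such a test could look like the pytest sketch below; `load_model`, `predict`, the sample file and the threshold are hypothetical placeholders for the service’s real inference code and data:

```python
"""Sketch of an accuracy test; module names and paths are placeholders."""
import csv

import pytest

from my_service.inference import load_model, predict  # hypothetical module

ACCURACY_THRESHOLD = 0.95  # illustrative value


@pytest.fixture(scope="module")
def model():
    # Model files are expected to have been resolved from S3 already.
    return load_model("models/classifier.pkl")


def test_accuracy_above_threshold(model):
    with open("test_data/labelled_sample.csv") as f:
        rows = list(csv.DictReader(f))

    correct = sum(1 for row in rows if predict(model, row["input"]) == row["label"])
    accuracy = correct / len(rows)

    assert accuracy >= ACCURACY_THRESHOLD, (
        f"Accuracy {accuracy:.3f} is below the expected {ACCURACY_THRESHOLD}"
    )
```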

If you remember our CI lifecycle above, we made the following changes:

  • Allow the Docker image build step to resolve the model dependencies
  • Run unit/integration tests (fast to fail)
  • Run acceptance test (usually slower than the previous set)
  • Download the test dataset (currently stored in S3; a sketch of this step follows the list)
  • Trigger the accuracy tests (speed can vary greatly depending on hardware, sample size, etc.)
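
For the dataset download step, one option is a session-scoped pytest fixture that pulls the sample from S3 on demand, so the accuracy stage stays self-contained; the bucket, prefix and local layout below are assumptions for illustration:

```python
"""conftest.py sketch: fetch the accuracy-test dataset from S3 on demand."""
import os
import pathlib

import boto3
import pytest

TEST_DATA_BUCKET = "my-test-datasets"        # hypothetical bucket
TEST_DATA_PREFIX = "document-classifier/v1"  # hypothetical prefix
LOCAL_DATA_DIR = pathlib.Path("test_data")


@pytest.fixture(scope="session")
def test_dataset():
    """Download the labelled sample once per test session."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=TEST_DATA_BUCKET, Prefix=TEST_DATA_PREFIX):
        for obj in page.get("Contents", []):
            target = LOCAL_DATA_DIR / os.path.relpath(obj["Key"], TEST_DATA_PREFIX)
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(TEST_DATA_BUCKET, obj["Key"], str(target))
    return LOCAL_DATA_DIR
```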

With this setup, we have a high degree of confidence when making changes to our services that the performance of our models is unaffected, enabling us to move a lot faster.

As this runs inside a container, we can also run these tests locally.

In some scenarios, we have had to make different compromises:

  • For services using small and fast inference models, we can use small (but still statistically significant) test datasets which — with some parallelising — can run on every build and finish in seconds/minutes
  • For other services whose models have higher resource needs or slower inference times, we chose to run the accuracy tests on a schedule against integration branches. This balances feedback time with the certainty that we’ll still catch regressions before they reach production (one way to wire up such a split is sketched below)
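
One way to implement this split is with pytest markers, so the cheap accuracy tests run on every build and the heavier ones only from the scheduled job; the marker names and commands are assumptions rather than our exact setup:

```python
"""Sketch: split accuracy tests by cost using pytest markers.

The `accuracy` and `slow` markers would need registering in pytest.ini;
test bodies are elided.
"""
import pytest


@pytest.mark.accuracy
def test_small_sample_accuracy():
    ...  # small but statistically significant sample; runs on every build


@pytest.mark.accuracy
@pytest.mark.slow
def test_full_sample_accuracy():
    ...  # heavier run, triggered on a schedule against integration branches


# Per-build stage:  pytest -m "accuracy and not slow"
# Scheduled stage:  pytest -m "accuracy and slow"
```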

At this point we have a system that allows us to add new models, make changes with confidence and deploy to production in a streamlined way. The pipeline looks similar to before:

Pipeline after the introduction of ML models

Fitting training into this flow

From the perspective of the CI pipeline, adding new models simply means modifying the code and the dependency files we mentioned, and we are good to go.

One thing we are interested in is systematically training new models, lowering the knowledge barrier to do so and ultimately automating the process. So far, what we have found helps is to:

  • Move all code required for training into the same git repository
  • Use a dedicated Docker image for training
  • Structure your training steps with libraries like Luigi or Airflow; this makes it a lot simpler to refactor later on, apart from other goodies like the ability to resume from a failed step (a minimal sketch follows this list)
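
As an example of the last point, here is a minimal Luigi sketch of a two-step training flow; the task names, paths and training logic are placeholders:

```python
"""Minimal Luigi sketch: prepare a dataset, then train a model on it."""
import luigi


class PrepareDataset(luigi.Task):
    def output(self):
        return luigi.LocalTarget("artifacts/dataset.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("...preprocessed training data...")  # placeholder


class TrainModel(luigi.Task):
    def requires(self):
        return PrepareDataset()

    def output(self):
        return luigi.LocalTarget("artifacts/model.pkl")

    def run(self):
        # Train on the prepared dataset and serialise the resulting model.
        with self.input().open() as data, self.output().open("w") as out:
            out.write(f"...model trained on {len(data.read())} bytes...")  # placeholder


if __name__ == "__main__":
    # If a run fails, re-running resumes from the steps whose outputs already exist.
    luigi.build([TrainModel()], local_scheduler=True)
```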

Moving the code into the same repo means that we can use the accuracy tests as one of the steps in the training pipeline and share code (at least to a certain degree).

Docker was a natural decision based on our existing workflows, and provided us with the flexibility to:

  • Run training locally, especially during the early stages of a project.
  • Ensure the right dependencies are installed (images will differ between inference and training if one needs GPUs and the other CPUs).
  • Take advantage of services like AWS Batch, which can handle all the infrastructure management (including GPU nodes) with little to no effort (sketched below).
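
For the AWS Batch case, kicking off a containerised training run can be as small as the boto3 sketch below; the job queue, job definition and environment variables are hypothetical names, not our real configuration:

```python
"""Sketch: submit a training job to AWS Batch using the training image."""
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="train-document-classifier",  # hypothetical job name
    jobQueue="ml-training-gpu",           # hypothetical GPU-backed queue
    jobDefinition="trainer-image:3",      # job definition wrapping the training Docker image
    containerOverrides={
        "environment": [
            {"name": "DATASET_S3_PREFIX", "value": "s3://my-datasets/docs/v2"},
        ],
    },
)
print("Submitted Batch job:", response["jobId"])
```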

We will soon be writing more on this specific topic, since there is a lot to learn about continuously training models with little to no human intervention.

Summary

Handling ML is not too different from other components in your systems, but it does require solving some particular problems to get going.

I believe it’s extremely important to streamline this process right from the start: setting up CI that combines all these components will make it a lot easier to keep growing your solutions and improving them in production.

