Version Control Of Machine Learning Models In Production

Chris Coulson
hipages Engineering
7 min read · Jul 23, 2020

Introducing hipages’ tool which lets us track which versions of our machine learning pipelines were used to arrive at a prediction

You’ve built your model, and you’re ready for it to be used in primetime, which could mean deploying to your spanking new API serving infrastructure or a new data pipeline. So far, so good. You’ve deployed to production, sat, twiddled fingers, got bored and started playing with the model again. After much coffee, some emotional outbursts, and more than a 5 minute stretch of swearing, you’ve come up with a new feature and you want to A/B test your new model against the old one. So how do you track which model was used for each inference in production?

Photo by Yancy Min on Unsplash

Version control in machine learning

Typically when tracking models in production you record the version at each inference. Every time you poke your serving infrastructure with a request, or run your data pipelines alongside the model’s inference, you return some unique identifier which lets you track how you came by that result.

This unique identifier is typically associated with version control. When version controlling machine learning projects we really want to answer one simple question: “How the hell did we come up with that result? It’s b̶a̶t̶-̶s̶h̶*̶t̶ ̶c̶r̶a̶z̶y̶ amazingly insightful!”. To answer these riddles, and trace fully back through our inference’s origin story, we’d typically want to know:

  • Which code was used to create the model and prepare our data for training
  • Our final hyperparameter set
  • The dataset used to fit and test our model’s performance.

If we’re serving up our model through an API we might also want to track:

  • The versions of the API serving infrastructure
  • How our features are prepared in the API (this might differ from how we prepared them in bulk for training)
  • The configuration properties of our API infrastructure.

DVC is a great tool which helps conquer a lot of the problems laid out above. It tackles version control in the data science world by extending the functionality of git to cover extra elements, such as versioning our training data alongside versions of our code.

That’s nice, so why not write an article about DVC given it’s sooooooo good?

When we started the journey at hipages of thinking about version control in our machine learning products our first thought was simple: we’d use DVC to solve all our problems (mainly because it is awesome, and if it wasn’t for the existence of this testing library it would be my favourite thing right now).

Our thinking was that by using DVC to trace the versions of our models we could work back to find their exact lineage. However, we quickly ran into problems. Whenever we committed anything to the project repo the version of that repo changed, and because of the CI/CD pipeline this flowed through to an update in the version deployed into production. This meant that the version of the code being served through our API infrastructure also changed, which caused us problems when we were tracing the performance of a model. These constant changes made tracking the versions of the code used for inference difficult: we’d need to trace the lineage of every change and work out which updates actually affected our tests and which didn’t matter.

Let’s take an example. Imagine we’re running a test on a new model and we have a simple project layout such as:

| - README.md
| - setup.py
| - notebooks
|   | - super_amazing_model_fitting.ipynb
| - api_infrastructure
|   | - serving.py

We assume here that:

  • DVC has been used to track our model and data lineage
  • We’ve deployed our (future award winning) model into production, and we’re serving API requests which return both our inference and the version of our code for tracking our test.
  • We’re running a test to see how well the model performs against a trained octopus randomly selecting balls from a jar.

Cool, so our test is live. Now our resident data engineer leans over and starts whining about something called tech-debt. Time, she says, to update the README beyond its current detailed entry which comfortingly only contains two words: “DON’T PANIC”.

In this case, if we update the README we’ll want to commit to the master branch of our project. This update will change the overall version of our project, thus making the tracking of our test more difficult because now we need to note somewhere that this change didn’t affect anything to do with our model, or how the inference was made.

Conversely, if we decided to extend our test and investigate a small tweak to a hyperparameter during training we’d like to note this change, and consider this new version of our code for tracking our A/B test.

Thus, it became clear that we’d need to identify and track the specific sections of the codebase which we’re interested in. We define bits of code as ‘interesting’ if they affect any part of the data processing chain. These interesting components therefore range from the serialized model through to the API serving layers, or data pipeline components. This led us to create an approach which lets us track changes to individual modules within our codebase, thereby giving us a granular view of the versions of the code used in inference.

We therefore built a Python application which inspects the code to identify classes and methods that have been labelled as requiring version tracking. The code is inspected before runtime to generate a json file containing the latest git hashes of the modules of interest. At runtime this json file can be loaded and appended with dynamic settings, such as the configuration of the model itself or the hostname of the instance executing the inference. These versions are captured for each inference our model undertakes.

Enough chatter, show me the code!

Here we’re going to show how our tool lets you easily label all the bits of code you want to track by applying simple decorators in Python. Then we’re going to run an application which generates a json file capturing our static dependencies, and finally we’re going to create an API endpoint which captures some dynamic dependencies alongside our static ones.

In this example we’re going to use FastAPI (it’s awesome, go play with it immediately after reading this) to create an example of a simple API which takes a number and multiplies it, and in doing so we’re going to track a few different bits of code and log the versioning information.

First let’s create a simple package containing a class which applies a multiplier to a number:
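Something like the sketch below does the trick. Note this is illustrative only: the track_version decorator is a stand-in for the registration decorator that ships with hip-data-tools, whose real name and import path may differ.

# version_tracking_example/transformers.py (illustrative sketch)
# `track_version` stands in for the hip-data-tools registration decorator;
# the real decorator's name and import path may differ.

def track_version(obj):
    """Hypothetical marker decorator: flags an object for static version tracking."""
    obj.__version_tracked__ = True
    return obj


@track_version
class Multiplier:
    """Applies a configurable multiplier to a number."""

    def __init__(self, multiplier_value: int = 5):
        self.multiplier_value = multiplier_value

    def multiply(self, value: float) -> float:
        return value * self.multiplier_value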

Next we use FastAPI to provide us with a simple API endpoint:
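Again, this is a sketch rather than the exact gist: loading the static json and appending the dynamic entries by hand is an assumption for illustration (as is the versions.json filename); hip-data-tools provides its own helpers for this.

# version_tracking_example/serving.py (illustrative sketch)
import json
import socket
from datetime import datetime

from fastapi import FastAPI

from version_tracking_example.transformers import Multiplier, track_version

app = FastAPI()
multiplier = Multiplier(multiplier_value=5)


@app.get("/multiply")
@track_version  # register the endpoint for static version tracking
def multiplier_endpoint(value: float):
    # Static versions written by the version-tracker CLI (filename assumed)
    with open("versions.json") as f:
        versions = json.load(f)

    # Dynamic entries captured at inference time
    versions["multiplier_value"] = multiplier.multiplier_value
    versions["hostname"] = socket.gethostname()
    versions["versioning_timestamp"] = datetime.utcnow().isoformat()

    return {
        "multiplied_result": multiplier.multiply(value),
        "versions": versions,
    }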

Now for the magic!

The folder structure for our example project looks like this:

|- requirements.txt
|- version_tracking_example
|  |- __init__.py
|  |- serving.py
|  |- transformers.py

Let’s take our example code and initialise a new git repo in this folder:

> git init

Now we’ve created a new git repo, we can add our files:

> git add ./version_tracking_example
> git commit -a -m example_commit

After installing hip-data-tools we can head to the root directory of our project and run:

> pip install hip-data-tools
> version-tracker -p version_tracking_example

Our tool then runs through the codebase looking for the registration decorators and extracts the latest commit hashes of the files that contain them. Finally, it writes out a json file which captures these hashes. In our case it might look something like this:

{"multiplier_endpoint": "18d019bdb8021a7bee92b51e57a729b5930fa784", "Multiplier": "6735f0884cac12dacbecd40e88218e6037c88d99"}

So we’ve captured our static dependencies into a json file, now let’s use them in our API.

FastAPI can be started from the example’s root path using:

> uvicorn version_tracking_example.serving:app --reload

Having started up, we can head off to the Swagger UI (on your local machine you’ll find it at http://127.0.0.1:8000/docs) and see what we get if we call the endpoint.

The result of our call is:

{
  "multiplied_result": 2,
  "versions": {
    "multiplier_endpoint": "18d019bdb8021a7bee92b51e57a729b5930fa784",
    "Multiplier": "6735f0884cac12dacbecd40e88218e6037c88d99",
    "multiplier_value": 5,
    "aggregated_version": "2b11bf0bb0c8fd48c1c7291e83d46803",
    "hostname": "AHost_v1",
    "versioning_timestamp": "2020-07-21T07:58:54.495278"
  }
}

Note here we’ve got a few extra bits of information we didn’t have in our static versions. The extra information gives us a little more insight about how the inference was made.

The aggregated version gives us a single hash which captures all the versions which relate to the code — so all of the configurations, objects and software versions. It excludes the versioning timestamp and hostname. This version hash is useful because it gives us a way of tracking everything we’re interested in with one identifier.
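For intuition, one way such an aggregate could be computed (the exact scheme inside hip-data-tools may differ) is to hash a canonical serialisation of the code-related entries, leaving out the hostname and timestamp:

# Illustrative only: the real aggregation logic lives in hip-data-tools.
import hashlib
import json

EXCLUDED_KEYS = {"hostname", "versioning_timestamp"}

def aggregate_version(versions: dict) -> str:
    # Keep only the entries that describe code, configuration and objects
    code_related = {k: v for k, v in versions.items() if k not in EXCLUDED_KEYS}
    # A sorted, compact json dump gives a stable string to hash
    canonical = json.dumps(code_related, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()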

How do we add models?

The versioning library includes a way of getting a consistent hash for any serialised object. So to add a model version all we do is load the model from persistent storage, and then add our version tracker:
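As a sketch of the idea (the real helper in hip-data-tools may have a different name and signature, and model.pkl is just a hypothetical path), we hash the serialised bytes so the same artefact always yields the same version string:

# Illustrative sketch of adding a model version alongside the static versions
import hashlib
import json
import pickle

def serialized_object_version(raw_bytes: bytes) -> str:
    """Consistent hash for any serialised object (e.g. a pickled model)."""
    return hashlib.md5(raw_bytes).hexdigest()

# "model.pkl" is a hypothetical path to our persisted model
with open("model.pkl", "rb") as f:
    raw_model = f.read()

model = pickle.loads(raw_model)  # the model we'll actually serve

# Append the model's version to the static versions generated earlier
with open("versions.json") as f:  # filename assumed, as above
    versions = json.load(f)
versions["model"] = serialized_object_version(raw_model)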

What’s next?

In this example we’ve returned the versioning information with the API response. In practice we found that this just made the microservice contracts somewhat confusing. In our production infrastructure we instead pass the result of this version control to a Kafka producer, and asynchronously capture all versioning information for every inference. But … that is for another day.
