Retracing your steps in Machine Learning: Versioning

André Targino
Published in The Launchpad · 9 min read · Sep 13, 2018

Anyone who has tried to build a machine learning model has seen first-hand how fragile a new prediction system can be when faced with change. In the course of running ML experiments, I have seen companies experience a dramatic drop in the accuracy of their prediction model (e.g. from above 90% to below 20%). The troubleshooting process was long, and while the company knew what they had changed, they didn’t know the root cause of the performance drop. It’s not uncommon for a seemingly incremental change to render a model effectively useless at making predictions.

Because most newly-built models are fragile, one of the most critical early steps for any company building an ML product or feature is to set up a robust system for versioning the key components of that system. Versioning makes it easy to retrace your steps when the fragile model breaks, which happens more often than most people admit.

Versioning in traditional (if/then) software development is very similar to a “save as” function in a word-processing document. You can make a change, and then save a copy of the new document. If something breaks after you make the change, you can revert to a previous version. While there are many things that can be changed in the code, it’s relatively straightforward to change only one at a time and move incrementally forward.

When you apply ML, though, versioning is much closer to a “save as” function in a family of spreadsheets with many dependencies and formulas. Seemingly small changes that you make to one part (like adding a row or slightly changing a format in one of the sheets) can break a subsequent link in the chain, and propagate through the rest of the system.

At Portal Telemedicina, we have developed a process of versioning which has dramatically increased the rate at which we are able to run machine learning experiments, interpret the results, plan future experiments, and deploy models in production. We believe this framework will be useful for any company applying machine learning. In this post, we cover what ML versioning entails, why it is important, and the improvements we have made since standardizing this process.

There are lots of different paths for ML experiments (Image Source)

What goes into ML Versioning?

There are several components of an ML system that need versioning infrastructure to streamline the development process. For simplicity, we have broken them down into several categories:

  • Code (architecture) — this is one of the main ingredients in building the trained model. It determines the type of model used (for example, Inception vs. Visual Geometry Group (VGG)) as well as the preprocessing steps of the input pipeline. Each of these components (which include all the frameworks you are using, like Kubernetes and Docker) can and should be adjusted to find the optimal performance of the system. Every time a change is implemented, a new “version” of the code should be created and frozen until it is tested. Such a versioning system should account for the individual versions of each component as well.
  • Data (training) — changes in the training data (discussed in a previous Lever post) can have profound effects on the way the model behaves, so the training data set also needs to be labeled with a version number.
  • Data (validation) — validation data tells you when to stop training the model to prevent overfitting to the training data. Overfitting is your worst nightmare in machine learning, since the whole point of your endeavor is to make a system that works well when it is launched on data it hasn’t seen before… and overfitting means it’ll be far better at old data than new data, much like a student memorizing answers instead of understanding general principles. Such a student gets a good grade on problems they’ve seen before, but is no good on the job. The nuances of validation data sets are beyond the scope of this post, but it is important to version them since they are an important tool for tuning the model.
  • Other “minor” parameters — there are many other parameters that go into building an ML model, including things like learning rate, padding, and kernel size. If the architecture and training data are the meat and potatoes of the dish, these ingredients are the salt and pepper — they are very important, even if they are not the main attraction, and they can still ruin a dish if applied improperly. Versioning makes it possible to keep track of what was done in the past to learn for future experiments; a sketch of recording all of these ingredients together follows this list.
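
As a rough sketch, the four ingredients above could be captured in a single record that travels with the trained model. The class and field names below are purely illustrative, not a standard API or the exact scheme we use:

```python
# A minimal sketch of recording the "ingredients" of one experiment together.
# All names and values here are illustrative placeholders.
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class ExperimentSpec:
    code_version: str             # e.g. a git commit hash or tag
    training_data_version: str    # label of the frozen training set
    validation_data_version: str  # label of the frozen validation set
    hyperparameters: dict         # learning rate, padding, kernel size, ...


spec = ExperimentSpec(
    code_version="a1b2c3d",
    training_data_version="train-v12",
    validation_data_version="val-v3",
    hyperparameters={"learning_rate": 1e-4, "kernel_size": 3, "padding": "same"},
)

# Persist the spec next to the trained model so the experiment can be retraced later.
with open("experiment_spec.json", "w") as f:
    json.dump(asdict(spec), f, indent=2)
```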

These four ingredients come together to produce a machine learning Model, which is a fancy name for a piece of software that predicts or classifies things. During the development process, you might change any of the above “ingredients”, which will yield a slightly different model (in the same way that different ingredients used during cooking produce slightly different dishes). If you keep all the ingredients the same, but change the code (architecture), you will end up with a different model, and that different model should have a different version number — ideally one that encapsulates the ingredients used to build it.
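
One illustrative way for a model version number to encapsulate its ingredients is to derive it from them, for example by hashing a canonical serialization of the ingredient versions (a sketch of one possible scheme, not necessarily the one used in practice):

```python
# A sketch of deriving a model version identifier from the ingredient versions.
# The values below are illustrative placeholders.
import hashlib
import json

ingredients = {
    "code_version": "a1b2c3d",
    "training_data_version": "train-v12",
    "validation_data_version": "val-v3",
    "hyperparameters": {"learning_rate": 1e-4, "kernel_size": 3},
}

# Hash a canonical serialization so the same ingredients always map to the same
# model version, and any change to an ingredient yields a new one.
canonical = json.dumps(ingredients, sort_keys=True)
model_version = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
print("model version:", model_version)
```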

Once you have your model, version number and ingredients, it’s time to taste the dish and see how you did. This is accomplished using a Testing Data Set, which is another core component that should be versioned. Since the whole point of an ML model is being able to provide insight about environments it has never seen before, a testing data set takes data that are truly new to the model and tests how successful the model is at predicting the right answer. This testing data set is meant to be an accurate representation of what the model will experience in the real world. Over time, startups increase the size of their testing data to more accurately represent the messy reality, which can dramatically change the results of the model.

The last step in the experiment is to record the performance of the model (which is built from all the ingredients) on one particular testing data set (complete with version number). The performance of the model will be based on several result metrics, which depend on the goal of the system.
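
A minimal sketch of recording one such result, assuming a plain CSV file as the experiment tracker (the file name, versions, and metric values are all placeholders):

```python
# A hypothetical sketch of logging one experiment's outcome: which model
# version was evaluated, against which testing data set version, and the
# metrics observed. All values are placeholders.
import csv
from datetime import datetime


def record_result(path, model_version, test_set_version, metrics):
    """Append one row to a simple CSV-based experiment tracker."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            datetime.utcnow().isoformat(),
            model_version,
            test_set_version,
            *(f"{name}={value}" for name, value in sorted(metrics.items())),
        ])


record_result(
    "experiments.csv",
    model_version="3f9c2a7b01de",
    test_set_version="test-v5",
    metrics={"accuracy": 0.91, "recall": 0.88},  # placeholder numbers
)
```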

Below is a very basic example of a series of experiments made by taking different ingredients (like code and training data) to create models, which are given new data in the form of testing data, and which provide a set of results. Even a small change to the system can yield a dramatic change in model output, hence the need for rigorous version control for all these elements.

Sample model tracker with versions and key results recorded

How to implement ML Versioning?

Code (architecture) — Tools for versioning code are the most mature, since we have been using them the longest. There are several companies, like GitHub and Bitbucket, that provide you with the infrastructure to version your code. Through pull requests and merges, teams can drastically reduce the number of problems and the time spent on software development.
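
As a small illustration, assuming the training script runs inside a git repository, the exact code version can be captured automatically at training time and stored with the rest of the experiment record:

```python
# A sketch of capturing the exact code version at training time, assuming the
# script is executed from inside a git working copy.
import subprocess


def current_commit() -> str:
    """Return the git commit hash of the code being run."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


code_version = current_commit()
print("training with code version", code_version)
```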

Data (training, validation and test) — While there are some commercial tools available for versioning datasets, traditional databases like SQL Server, MongoDB, and Oracle are not ideal for ML systems. Since there are many independent pieces that need to be versioned separately (as opposed to one monolithic database), using these tools creates many manual processes for tracking and controlling versions of datasets.
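
In the absence of a dedicated tool, one workaround is to freeze each dataset and fingerprint its contents, so that any change yields a new, identifiable version. The sketch below assumes the dataset is a directory of files; the paths are illustrative:

```python
# A sketch of fingerprinting a frozen dataset so that any change to its
# contents produces a new, identifiable version label. Paths are illustrative.
import hashlib
from pathlib import Path


def dataset_fingerprint(root: str) -> str:
    """Hash every file under `root` in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(root)).encode("utf-8"))
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]


print("training data version:", dataset_fingerprint("data/train_v12"))
```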

Other ingredients — There are a lot of other ingredients that go into making a model work the way it does. Although they are beyond the scope of this post, it is important to integrate them into the versioning system.

Model — The good news is that if you’ve successfully versioned the code, datasets, and other ingredients, you’ve already done most of the work of versioning the resulting model. The only thing left to do is to give the model a version number that can be linked back to the constituent ingredients. Google provides a good model versioning tool through their Cloud ML Engine — it organizes everything, providing you with model and version resources.
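
As an illustration only (a stand-in for what a managed service like Cloud ML Engine handles for you, not its API), a minimal local registry could link each model version back to its ingredients and the location of the trained artifact:

```python
# A hypothetical local "model registry": one JSON record per model version,
# linking it back to the ingredients that produced it. Names and paths are
# illustrative, not a real service's API.
import json
from pathlib import Path


def register_model(registry_dir, model_version, artifact_path, ingredients):
    """Write a small record tying a model version to its ingredients."""
    Path(registry_dir).mkdir(parents=True, exist_ok=True)
    record = {"artifact": artifact_path, "ingredients": ingredients}
    with open(Path(registry_dir) / f"{model_version}.json", "w") as f:
        json.dump(record, f, indent=2)


register_model(
    "registry",
    model_version="3f9c2a7b01de",
    artifact_path="gs://example-bucket/models/3f9c2a7b01de/",  # illustrative path
    ingredients={"code_version": "a1b2c3d", "training_data_version": "train-v12"},
)
```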

Metrics/results — The metrics and results are where you benchmark the performance of different models against a given testing data set. The selection of these metrics is extremely important, and depends on the goal of the ML system being used.
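
For a classification model, for example, the results might include accuracy, precision, and recall computed on the versioned testing data set. The sketch below assumes scikit-learn is available and uses placeholder labels:

```python
# A sketch of computing a few common classification metrics on the testing
# data set. The labels and predictions below are placeholders.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]  # ground-truth labels from the versioned test set
y_pred = [1, 0, 1, 0, 0, 1]  # predictions from one model version

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
}
print(metrics)
```

Which of these matters most depends on the goal of the system; in a medical triage setting, for instance, recall on positive cases may be weighted far more heavily than raw accuracy.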

Portal’s model tracker in GCP ML Engine

With the above infrastructure in place, you can start building tools to streamline parts of the ML experiment process. An example of this is linking your code to your model through tags in the commits (sets of related changes in a repository), which allows you to see the exact code that generated any specific model. Another advantage of a versioned ML system is that making simple modifications to a previous model becomes very fast — simply pull the ingredients out of the library of versions you have produced in the past to create a new model. Finally, and perhaps most importantly, a well-versioned pipeline allows teams to build tools that automate parts of the process, like a parameter tuning approach that tries multiple experiments automatically, as well as auto-tagging or CI/CD for your ML pipeline.
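
A sketch of the commit-tagging idea mentioned above, assuming the script runs inside the repository that produced the model (the tag format is just an example):

```python
# A sketch of auto-tagging: once a model version exists, tag the commit that
# produced it so the exact code can be recovered later.
import subprocess


def tag_commit_with_model(model_version: str) -> None:
    """Create a git tag pointing at the current commit for this model version."""
    subprocess.check_call(["git", "tag", f"model-{model_version}"])
    # Optionally push the tag so teammates and CI can see it:
    # subprocess.check_call(["git", "push", "origin", f"model-{model_version}"])


tag_commit_with_model("3f9c2a7b01de")
```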

Similarly, on the data side, versioning allows you to link your model to the data through labels in the deployment (models that are ready to be used). This way you know the exact data that was used to train your model, which allows your data science team to see the impact of data on model performance. You can also automate relevant parts of the process, like having your pipeline prepare and consume different versions of datasets while keeping track of everything.

Why ML versioning is important

In summary, ML versioning is important because it:

  1. Makes troubleshooting much more straightforward: a versioned ML system is much easier to troubleshoot since differences between experiments are much more easily identifiable (e.g. changing the amount or type of data used for training). As with the spreadsheet vs. document example in the introduction, the differences between models can be checked with git diff, a concise way of seeing the exact changes between the code that produced each model, instead of a tedious and error-prone process of reconstructing how each model was generated.
  2. Provides transparency about what produced the model: You will be able to avoid many problems, like feedback loops (e.g. a recommender system that is trained using data generated by people who accepted its own recommendations), much more easily since you know the exact constituents that were used to create a specific model.
  3. Allows for faster experimentation: Speed is the name of the game for startups, and a robust versioning system allows for much faster and more intelligent experimentation than random, unsystematic changes. Logging and visualization tools like Stackdriver and Tableau, when combined with a standardized versioning system, can greatly increase the speed of experimentation.

We learned many things in the process of moving from a startup with a few prototypes to a company that delivers reliable AI solutions. The one that stands out the most is that ML versioning deeply impacts the speed of AI progress in your startup.

Good luck!

Andre Targino is the Lead AI Researcher at Portal. He has a Ph.D. in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign, broad international experience including several startups, and 15 years of experience in Artificial Intelligence, Data Science, and Image Processing.

Portal Telemedicina connects specialist physicians with healthcare providers by allowing them to diagnose online. Using the automations in our ML-powered platform, our doctors can diagnose up to 100 exams per hour, carrying out thousands of diagnostics per day, more than any hospital in the world. Our mission is to guarantee universal access to quality healthcare by using telemedicine to overcome geographical barriers. Portal graduated from the second class of Google Developers Launchpad Studio, focused on the applications of ML in healthcare and biotechnology.
