ML Ops: Data Science Version Control

Data versioning primer for model, data and code.

Machine learning is an iterative process. Viewed from the top down, a model has many moving parts: how the data is used, hyperparameters, parameters, algorithm choice and system architecture. The right combination of these is the actual solution to the problem the team is trying to solve, and finding it is a matter of trial and error. Even the most talented professionals in the data science field need to tweak all of these aspects to reach the final solution.

Each of these tenets can and will change. Hence, when versioning a machine learning system, we have to think about the model, the data and the code.

Let us look at each component from an operational perspective and arrive at an approach to data versioning. ML Operations is a relatively nascent field. Hence, instead of dealing in absolutes, this article aims to help any data team come up with its own strategy for data science version control.


The Code

Since engineers are more familiar with code, let us start there. An ML system has two distinct types of code: implementation code and modelling code. Implementation code could be glue code, code used to build APIs, or system integration code that connects the core ML system to the actual application. Modelling code is the code used for model development. I have seen scenarios where the implementation code and the modelling code are written in different languages. This is a function of how data science teams are organized within an organization.

As the system scales, infrastructure updates may happen after deployment. In that scenario, we can rely on Infrastructure as Code (IaC), which evolved to solve the problem of environment drift in the release pipeline. Without IaC, teams have to maintain the settings of each deployment environment individually. With IaC, the infrastructure definition is itself code and can be versioned with the rest of the system.

Most compiled machine learning artifacts have code dependencies, so maintaining the dependencies becomes just as important as maintaining the model.
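One way to keep dependencies attached to a model is to record them in a small manifest that is committed next to the artifact. The following is a minimal sketch using only the Python standard library; the function names and the manifest layout are assumptions made for illustration, not a prescribed format.

```python
import json
import sys
from importlib import metadata


def capture_environment(packages):
    """Return the interpreter version and the installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "packages": versions,
    }


def write_environment_manifest(path, packages):
    """Write the snapshot as JSON so it can be committed next to the model artifact."""
    snapshot = capture_environment(packages)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(snapshot, handle, indent=2, sort_keys=True)
    return snapshot
```

Calling write_environment_manifest with the list of packages the model was trained with gives a file that can be diffed between model versions, which makes a dependency change visible in review.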

The versioning process should cover the scenarios described above. How thoroughly each one is addressed is for the team to decide, based on team maturity, ML product maturity and the flexibility the team needs.

The Data

Scenarios change. Technology changes. Models change. There is no reason to expect that our data will always look the same, so we must keep retraining our models as well. Hence, the data tenet is critical and needs versioning.

When we look closely at data, two main aspects emerge that need to be considered: metadata (data about data) and values (the actual values of the records).

Metadata tells us about the nature of a certain column (or feature) in our dataset. In a more traditional database world, this tells us whether the value is a string, number, boolean or decimal. Sometimes metadata changes without the underlying data changing. For example, a column could be declared as a string even though it stores only numbers. Because it is extremely important to feed this data into the model in the appropriate format, the metadata needs to be tightly versioned with both the model and the code.
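To make a metadata version concrete, a team could derive it from the column definitions themselves. The sketch below hashes a column-to-type mapping into a short identifier and checks a record against it. The type names, the function names and the 12-character identifier are all assumptions chosen for this example.

```python
import hashlib
import json

# Hypothetical mapping from metadata type names to Python types.
TYPE_MAP = {"string": str, "number": int, "decimal": float, "boolean": bool}


def schema_fingerprint(schema):
    """Hash a {column_name: type_name} mapping into a short, stable version id."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def validate_record(record, schema):
    """Return a list of problems found when checking one record against the schema."""
    problems = []
    for column, type_name in schema.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        # Exact type comparison, so that a boolean is not accepted as a number.
        elif type(record[column]) is not TYPE_MAP[type_name]:
            problems.append(f"{column}: expected {type_name}")
    return problems
```

Redefining a column, say zip_code from "number" to "string", produces a different fingerprint even though no stored value changed, which is exactly the case where the model and the code need to know which metadata version they were built against.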

The team should be able to retrain a historic model on the data values from the same version. This becomes more important if we discover a fault in an updated model that is already deployed to production and needs a rollback. To successfully deploy the older model, we need to be able to retrain it on the data as it stood at that point in the past. Consequently, we should think of the actual values of our dataset as a version axis.
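Treating values as a version axis can be sketched as a content hash plus a snapshot store. The example below is an in-memory stand-in, assuming rows are JSON-serializable dictionaries; SnapshotStore and values_version are hypothetical names, and a real system would persist snapshots in a data lake or object store.

```python
import hashlib
import json


def values_version(rows):
    """Return a content hash of the rows that does not depend on row order."""
    row_hashes = sorted(
        hashlib.sha256(json.dumps(row, sort_keys=True).encode("utf-8")).hexdigest()
        for row in rows
    )
    digest = hashlib.sha256()
    for row_hash in row_hashes:
        digest.update(row_hash.encode("ascii"))
    return digest.hexdigest()[:12]


class SnapshotStore:
    """In-memory stand-in for snapshot storage, keyed by values version."""

    def __init__(self):
        self._snapshots = {}

    def save(self, rows):
        version = values_version(rows)
        # Copy through JSON so later edits to `rows` cannot alter the snapshot.
        self._snapshots[version] = json.loads(json.dumps(rows))
        return version

    def load(self, version):
        # Return a copy so callers cannot mutate the stored snapshot.
        return json.loads(json.dumps(self._snapshots[version]))
```

With this in place, a rollback means loading the snapshot whose version was recorded when the older model was trained and retraining on exactly those rows.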

We could link the data values to their respective metadata in a particular version. This helps the team estimate missing values in a dataset as well as set default values. It is also easy for teams to query the entire table from the data lake to fetch the dataset for model training. On the flip side, if the two are tightly coupled, we cannot make incremental changes to just the metadata or just the values. Both have to move in lockstep, which brings its own set of problems.

If we do not link data values and metadata, we could write functions that set default values statistically. Where there is a metadata version mismatch, we can also write transformer functions that handle changes to metadata and values gracefully. Hence, within data, metadata and values could have their own versions. Again, it is up to the team to decide which approach fits best.
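A rough sketch of the decoupled approach follows: one function that fills a missing value with a statistical default, and a chain of transformer functions that upgrades a record from one metadata version to another. The version numbers, the column names and the specific changes between versions are invented for illustration.

```python
from statistics import median


def fill_missing_with_median(rows, column):
    """Replace None in `column` with the median of the observed values.

    Raises statistics.StatisticsError if no value is observed at all.
    """
    observed = [row[column] for row in rows if row.get(column) is not None]
    default = median(observed)
    return [
        {**row, column: default} if row.get(column) is None else dict(row)
        for row in rows
    ]


def v1_to_v2(record):
    """Hypothetical change: v2 stores zip_code as a 5-character string."""
    upgraded = dict(record)
    upgraded["zip_code"] = str(record["zip_code"]).zfill(5)
    return upgraded


def v2_to_v3(record):
    """Hypothetical change: v3 adds a 'country' column with a default."""
    upgraded = dict(record)
    upgraded.setdefault("country", "US")
    return upgraded


# Each entry upgrades a record from the keyed metadata version to the next one.
MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}


def upgrade(record, from_version, to_version):
    """Apply transformers in order until the record matches to_version."""
    if from_version > to_version:
        raise ValueError("downgrades are not supported in this sketch")
    for version in range(from_version, to_version):
        record = MIGRATIONS[version](record)
    return record
```

The design choice here is that values written under an old metadata version are never rewritten in place. They are upgraded on read, so metadata and values can each advance on their own schedule.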

The Model

Model building is an iterative process. We may want to gradually improve the accuracy or the area of application of the model. Hence, the model is ever evolving like the code and should be treated that way. The choice of model (and its hyperparameters) should have its own versioning.

When it comes to versioning, we should treat the model the way we treat code. The model development code mentioned earlier would be versioned like any other code. With this approach, you would be able to roll back models in the same way you would roll back code.

If the model is coupled with metadata and values, there may be some drift due to one or more elements. Those issues can be addressed by moving the versions of the other elements along appropriately. If the model is coupled tightly with the code, every time we version the model we need the appropriate code version to go along with it. This tight coupling could result in modelling getting intertwined with coding sprints.
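As an illustration of rolling back a model the way one would roll back code, here is a minimal in-memory registry. ModelRegistry and its methods are hypothetical and exist only for this sketch; a real team would more likely rely on a model registry tool or on tagged releases.

```python
class ModelRegistry:
    """Tracks registered model versions and the order in which they were deployed."""

    def __init__(self):
        self._versions = {}  # version -> algorithm choice and hyperparameters
        self._history = []   # deployed versions, oldest first

    def register(self, version, algorithm, hyperparameters):
        if version in self._versions:
            # Versions are immutable: a changed model needs a new version.
            raise ValueError(f"version {version!r} is already registered")
        self._versions[version] = {
            "algorithm": algorithm,
            "hyperparameters": dict(hyperparameters),
        }

    def deploy(self, version):
        if version not in self._versions:
            raise KeyError(version)
        self._history.append(version)

    def rollback(self):
        """Drop the latest deployment and return the version now in effect."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier deployment to roll back to")
        self._history.pop()
        return self._history[-1]

    def current(self):
        return self._history[-1] if self._history else None
```

Because a registered version can never be overwritten, rolling back is only a pointer move, much like checking out an earlier commit.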

So far, we have a version for every iteration of code, model, metadata and values. Let us also consider the aspects where code, model and data (metadata and values) intersect with each other.
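The four versions can be tied together in a single record per release. The sketch below assumes each axis already has some string identifier (a Git commit, a registry version, a hash); the class and field names are illustrative only.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    """One version per axis: code, model, metadata and values."""

    code_version: str
    model_version: str
    metadata_version: str
    values_version: str

    def release_id(self):
        """Derive a single identifier from the combination of all four versions."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

    def changed_axes(self, other):
        """List the axes on which two releases differ, in field order."""
        mine, theirs = asdict(self), asdict(other)
        return [axis for axis in mine if mine[axis] != theirs[axis]]
```

Comparing the manifest of a faulty release with the previous one through changed_axes shows at a glance whether the code, the model, the metadata or the values moved.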

Intersections between code, models and data

We will usually have transformers and fetch functions built around data, so code and data are intertwined in this regard. In these areas, it is important to treat the elements as one cohesive piece. Usually, this can be achieved with config files that specify the version of the data to use. The config files are versioned with the code as well. This does mean that a snapshot of each data version has to be saved so that it can be used whenever needed, which makes this a storage-heavy approach. The intersection of model and code was already discussed in the model section.
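A config file that pins the data version might look like the sketch below, where a fetch function resolves the pinned versions against stored snapshots. The JSON layout, the key names and the version labels are assumptions for this example, and the snapshots dictionary stands in for real snapshot storage.

```python
import json

# Hypothetical config, committed to the same repository as the code.
CONFIG_TEXT = """
{
  "data": {"metadata_version": "m-2", "values_version": "v-7"},
  "model": {"name": "example-model"}
}
"""


def load_pinned_versions(config_text):
    """Read the metadata and values versions that this code revision expects."""
    data = json.loads(config_text)["data"]
    missing = [
        key for key in ("metadata_version", "values_version") if key not in data
    ]
    if missing:
        raise KeyError(f"config does not pin: {', '.join(missing)}")
    return data["metadata_version"], data["values_version"]


def fetch_dataset(snapshots, config_text):
    """Return the snapshot matching the pinned versions, or fail loudly."""
    key = load_pinned_versions(config_text)
    if key not in snapshots:
        raise LookupError(f"no snapshot stored for {key}")
    return snapshots[key]
```

Since the config travels with the code, checking out an old commit also restores the data versions that commit was written against, provided the corresponding snapshot was kept.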

Tools

Choosing the right tools will help the team align on a strategy. An open source tool called DVC (Data Version Control) provides Git-like semantics for versioning different data components. It can also be used for experiment tracking: different Git branches track the different experiments in source control. Pachyderm uses containers to execute the different steps of the pipeline, and addresses data versioning and data provenance by tracking data commits and optimizing the pipeline. MLflow Projects defines a file format to specify the environment and the steps of the pipeline, and provides both an API and a CLI tool to run the project locally or remotely. In organizations where the implementation and modelling languages differ, the team can use a tool like H2O to export the model as a POJO in a Java JAR library, which can then be added as a dependency of the application.

Choosing the right strategy

Depending on team maturity and product maturity, the data science manager or leader should build a roadmap for versioning the data science components. Because this is versioning, it is extremely important to choose a strategy and stick with it rather than continuously changing the roadmap. Anyone from the software engineering world is aware of the issues caused by continuous changes to a versioning process. The value of a data science versioning strategy can only be realized when you stick with the plan and see it through. Plans that are abandoned or pivoted frequently will result in version chaos (version hell) and could result in lost productivity.

Conclusion

The notion of a “version” of a machine learning application has (at least) four possible axes on which it can drift: code, model, metadata and values. This poses a challenge for continuous delivery practices.

In the end, keeping an eye on continuous delivery principles is important. Data scientists should be integrated into delivery teams and developers should be integrated into the data science teams to achieve a cohesive vision. Treating machine learning like functional software and not a black box method is the right mindset to start version control for data science teams.


Engineering Manager | Founder of Acing AI