Rapid & Reliable ML Experiments using MLOps Best Practices.
Context:
Machine learning model development can get messy if we don't follow a structured process during model building. The choices are plenty when we start solving a business problem using machine learning.
A data scientist has to choose from various permutations and combinations of data, features, parameters, hyper-parameters, metrics, loss functions and algorithms. This necessitates a series of ML experiments with different choices of these moving parts, as well as comparison and evaluation of the experiments performed, before making the final model selection decision.
Better and more efficient management of this initial series of experiments in the ML experimentation life cycle forms the foundation for subsequent phases of MLOps such as model serving, monitoring and re-training.
In this blog, we will touch upon a few key questions related to MLOps best practices in the ML experimentation life cycle and demonstrate how we can implement those best practices using popular open-source tools and libraries for model lifecycle management, configuration management and data versioning.
- How to log and manage various moving parts of ML experiments using a typical Model life cycle management library?
- How to maintain various configurations of ML model development using a configuration management library such as a YAML library?
- How to manage and collaborate with different versions of large data files using a data versioning library such as DVC?
Logging Model Parameters using a Model Lifecycle Management Library:
Every ML experiment in the model development lifecycle is associated with its respective version of data, parameters, hyper-parameters, metrics and model output artefacts. Traditionally, many data scientists capture all these experiment logs and results in a spreadsheet and compare them to decide on the best tuning parameters.
This approach has a few fundamental limitations. Because logs are captured manually, it is prone to error. It is difficult to collaborate on and share experiment results reliably. And the approach is simply not feasible when the scale of experimentation becomes very large.
Now we will see how we can make use of any popular open-source model lifecycle management library to overcome the limitations of spreadsheet logging with just a few lines of code. We can leverage the experiment logging module of such a library to log, query and manage various ML experiment runs, and eventually do away with manual spreadsheet logging.
The experiment logging module of a standard model lifecycle management library provides easy-to-use functions to record the various information associated with model development experiments, such as code version, time of experiment, model parameters, metrics and model artifacts (model objects, metric visualization plots, output data files). These can be recorded with just a few lines of code, as shown below.
# Logging Parameters using a model lifecycle management library
# Log parameters, which are arbitrary key-value pairs
import any_typical_model_management_library as mlm

mlm.log_model_parameter("num_dimensions", 8)
mlm.log_model_parameter("regularization", 0.1)

# Log metrics related to the experiment
mlm.log_model_metric("accuracy", 0.8)
mlm.log_model_metric("r2", 0.4)

# Log artifacts associated with the experiment
mlm.log_model_artifact("precision_recall.png")
API calls like the above ones can be inserted anywhere in the code where the user is interested in recording parameters. These API calls log the parameters associated with the experiments to a directory (local or any remotely configured location). This allows data scientists to collaborate using a centralized tracking server and compare results from multiple data scientists in the team.
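As a concrete illustration, here is a minimal sketch of the same idea using MLflow, one of the popular open-source options (see the MLflow papers in the references). The tracking server URI, run name and file names are illustrative assumptions, not values prescribed by this blog.

# Illustrative sketch: logging the same information with MLflow
import mlflow

# Point the client to a tracking server (a local folder or a remote server URI)
mlflow.set_tracking_uri("http://my-tracking-server:5000")  # hypothetical URI

with mlflow.start_run(run_name="baseline-experiment"):  # hypothetical run name
    # Log parameters, which are arbitrary key-value pairs
    mlflow.log_param("num_dimensions", 8)
    mlflow.log_param("regularization", 0.1)

    # Log metrics related to the experiment
    mlflow.log_metric("accuracy", 0.8)
    mlflow.log_metric("r2", 0.4)

    # Log artifacts associated with the experiment
    mlflow.log_artifact("precision_recall.png")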
Once users have completed logging parameters, model lifecycle management libraries can also be used to query, compare and contrast experiment logs through a simple, intuitive web-based UI, similar to the one shown in the diagram below.
To summarize, in this section we touched upon the traditional way of logging model parameters using a spreadsheet, the fundamental limitations of this approach, and how we can overcome these limitations by leveraging any standard off-the-shelf open-source model lifecycle management library with just a few lines of code.
Model Configuration Management using YAML Library:
Machine learning systems are often associated with a variety of configurable options, such as training data location, input or output file locations and algorithm-specific settings. We often see data scientists using hard-coded values for these configurable parameters at multiple places in the script.
Data science scripts with hard-coded values for these configurable options are not easy to maintain and iterate over, because with hard-coded values there is no single central place to refer to when parameter values have to change. Data scientists have to browse through multiple files (training.py, data_prep.py, inference.py) to identify and update the parameters of interest. This approach is prone to manual errors, omissions and oversights, and eventually results in a poor development experience for data scientists.
In this section, we will demonstrate how we can do away with the practice of having disparate hard-coded values and adopt a systematic and scalable approach to model configuration management using a YAML library.
Step#1: ML Configuration Management:
The initial step is to create a stand-alone parameters.yaml file, like the one below, in your workspace. This file captures values for all configurable parameters in an ML experiment and acts as a central configuration store for the data science project.
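A minimal sketch of such a parameters.yaml file is shown below. The parameter names, paths and values here are illustrative assumptions; every project will have its own set of configurable options.

# parameters.yaml - central configuration store (illustrative values)
data:
  train_path: data/train.csv
  test_path: data/test.csv
features:
  num_dimensions: 8
model:
  algorithm: logistic_regression
  regularization: 0.1
output:
  model_path: models/model.pkl
  metrics_path: reports/metrics.json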
Step#2: ML Configuration Management:
Once we have a central parameters.yaml file in place from Step#1, we can retrieve the required parameters in training, inference, data preparation or any other scripts using code similar to that outlined below. And if we need to perform a new experiment with different parameter values, we just need to change the parameter values in the central configuration file parameters.yaml.
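A minimal sketch of how a training, inference or data preparation script could read these values, assuming the PyYAML package and the illustrative keys from the parameters.yaml sketch above:

# Reading configuration values from parameters.yaml (assumes PyYAML is installed)
import yaml

with open("parameters.yaml") as config_file:
    params = yaml.safe_load(config_file)

# Retrieve parameters instead of hard coding them in the script
train_path = params["data"]["train_path"]
regularization = params["model"]["regularization"]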
In the preceding sections, we talked about two key operational challenges during the ML experimentation life cycle — ML experimentation logging and ML configuration management. In the next section, we will talk about how to keep track of data versions during model development.
Versioning Data like Code using Data Versioning Library — DVC:
In software engineering projects, it is primarily the code that goes through multiple versions during the course of the project. And Git is the go-to choice for software engineering teams to effectively manage and collaborate on these multiple versions of code.
However, in data science projects, in addition to code, data files (both model objects and training data) go through multiple versions during the course of the work. Although we leverage Git to manage data science source code versioning, it is not effective for, and is not designed for, versioning the large data files associated with that source code.
So, it is common to see data scientists taking recourse to manual versioning and sharing of data files with version-specific naming conventions (e.g. model.pkl, model_log_reg.pkl, data_train_v2.csv, data_train_features_v1.csv) when they have to collaborate with peers on large data sets and model objects. This manual versioning and management of data can be tedious. DVC, an open-source data versioning library, is essentially designed to solve this problem by making versioning of large data files easy and seamless. This eventually helps data scientists focus on their core model building effort without any manual overhead for data versioning.
DVC is built on the core design principle of separating code storage from data storage. Code is stored in the Git/code server, while the data associated with the corresponding code versions is stored in a separate data server. The data server could be any location specified by the user, such as a cloud storage location or a remote self-managed storage location. Thus, DVC provides much-needed flexibility around where the user wants to keep data files.
DVC stores the actual data in the specified data server (a separate location from code storage) and maintains a meta file about the data to be tracked (its storage location, version etc.) in a DVC data meta file. DVC puts these data meta files in the code server (Git). So, every code commit to Git has its corresponding DVC data meta file, which holds metadata about the actual data. This associated data meta file, in turn, makes it possible to retrieve the actual data corresponding to a code version even though we do not store the actual data in the Git code server. The diagram below summarizes this core DVC design principle.
Now we will go through the detailed steps of putting this data versioning concept into practice using the DVC library. The good part is that DVC is built on top of Git and has a syntax very similar to Git's.
Step#1: DVC Library - Initialize Data Versioning:
The first step in using DVC for data versioning is to initialize the DVC system inside a Git repo. To achieve that, first get inside the Git repo where you want to implement data versioning. Then, execute the command below after successfully installing DVC in your workspace.
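A minimal sketch of this step, assuming DVC is already installed (for example via pip) and the repo path is a placeholder:

# Initialize DVC inside an existing Git repo
cd my-ml-project        # hypothetical path to your Git repo
dvc init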
The above command creates a set of internal files required for DVC to implement data versioning. These files need to be committed to Git to complete the data versioning set-up in your code repo. In the step below, we commit the required DVC internal files to the Git server.
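A sketch of that commit step; the file names are the ones dvc init creates by default in current DVC versions:

# Commit the internal files created by dvc init
git add .dvc .dvcignore
git commit -m "Initialize DVC for data versioning"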
Step#2: DVC Library — Set up Remote Data Storage Location
In the previous step, we set up data versioning capability in our workspace. Now, we will configure a default storage location for data files using the DVC commands below. In accordance with the DVC design principle, the data storage location needs to be different from the code server (Git server) where the code is stored.
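A sketch of configuring a default (-d) remote storage location; the remote name and the S3 bucket path are illustrative assumptions, and any DVC-supported storage (cloud bucket, SSH, or a local path) is configured the same way:

# Configure a default remote data storage location for DVC
dvc remote add -d data_remote s3://my-bucket/dvc-storage   # hypothetical bucket
git add .dvc/config
git commit -m "Configure DVC remote storage"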
After we have configured remote storage for data files in the step above, we can add and check in data files to DVC storage using the DVC commands below. This data check-in, check-out and commit process in DVC is analogous to the way we check code in to the Git server.
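A sketch of checking a large data file in to DVC storage; the file name is an illustrative assumption:

# Track a large data file with DVC and push it to the remote storage
dvc add data/train.csv                       # hypothetical data file
git add data/train.csv.dvc data/.gitignore   # commit the DVC meta file to Git
git commit -m "Track training data with DVC"
dvc push                                     # upload the actual data to the remote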
In this section, we saw how to put the core DVC principle (separating code storage from data storage) into practice. The actual data is stored in a different location, but the information about where the data is stored is added to the Git code version control system in the form of a DVC data meta file. The DVC data meta file makes it possible to version large data files using Git, even though these large files are not stored in the Git code server. The diagram below illustrates this concept visually.
Step#3: DVC Library — Check In / Check Out Large Data Files
In the previous steps, we discussed how to initialize data version control using the DVC library in our Git workspace. We also discussed how to set up a remote storage location where we store data files such as large data sets and large trained model objects.
In this step, we will explore the DVC commands used to retrieve the required version of data into our workspace as and when required.
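A sketch of this retrieval step; the commit reference is a placeholder for whichever version you need:

# Retrieve a code version (with its DVC meta file) from Git,
# then retrieve the matching data version
git checkout <commit-or-branch>   # placeholder for the desired version
dvc checkout                      # restores the matching data into the workspace
# (use dvc pull first if the data is not yet in the local DVC cache)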
Here, the git checkout command switches the workspace to the specified commit or branch and brings in the associated DVC meta file (the file that records where the actual data is stored) along with the source code. After that, the dvc checkout command retrieves the actual data associated with that DVC meta file into the workspace.
So, to summarize, in this section we discussed the need for a robust data versioning system in machine learning projects. We also touched upon the challenges of versioning large data files with Git code version control and how we can overcome these challenges using a data version control library called DVC.
Parting Words:
In this blog, we highlighted three key operational challenges during the iterative ML experimentation life cycle — ML experimentation logging, configuration management and data versioning. We also demonstrated ways to overcome these challenges with MLOps best practices using popular open-source libraries in the areas of model lifecycle management, configuration management and data versioning. We hope this blog can work as a good starting point for instilling MLOps best practices in the ML experimentation stage and for building a reliable ML experimentation framework.
Written by :
- Sateesh Panda — Senior data scientist, Walmart Global Tech
- Abhishek Mishra — Senior Manager(Data Science), Walmart Global Tech
References:
1: Git Repo to download commands used in this medium blog: https://github.com/Sateesh009/mlops-blog
2: https://cs.stanford.edu/~matei/papers/2018/ieee_mlflow.pdf
3: https://cs.stanford.edu/people/matei/papers/2020/deem_mlflow.pdf