An owner’s manual for your new machine learning system
Congratulations, you are a member of the proud operations team benefiting from the new AI-infused architecture that is changing the way our modern infrastructure works! YAY! But you may ask yourself, “What does that mean?” or “How does my job of maintaining an orderly IT ecosystem change because we are using AI?” You have experience with deterministic business processes; these are still vitally important components of your IT system. But now you also get to manage an element of your system whose behavior changes as a function of a model, and this model makes predictions based on data.
Think of a model as containing two fundamental pieces: an algorithm and training data. Training data acts as the configuration files of machine learning (ML) algorithms, and this data should be representative of the data sent at run-time to your ML component. Your new system, which utilizes this machine learning component, is going to live or die based upon the accuracy of the model’s predictions. If the model doesn’t behave as expected on unseen data, no amount of slick UI/UX will save you. How will you control this chaos?
Introducing: Data Dev Ops
Data Dev Ops is a strategy to manage the data that are used to create and deploy ML models that reside in our system. We don’t need to understand all of the mathematics inside our ML APIs and ML libraries; however, we do need to characterize the performance as a score and track the score of our ML system as a function of the data it uses. If we can measure it, we can optimize it, and get closer to the desired ML performance over time.
Our mission is to track and maintain model performance scores by:
- Keeping model training data in version control
- Capturing run-time model inputs and predictions (the data exhaust)
- Running experiments to:
  - Estimate prediction performance on unseen data
  - Regression test
  - Evaluate predictions captured at run time from the live system
  - Optimize run-time parameters
- Promoting models only when they have met or exceeded expectations
- Monitoring prediction scores over time
Using version control is a lot like eating raw vegetables. Everyone knows that theoretically, it is a good idea, but in reality, your office stocks vanilla crème cookies in its pantry (for free!) and nothing goes better with your afternoon coffee than vanilla crème cookies. So, why should we use version control for the data that goes into the model? How will this help maintain my ML system? The answer: version control lets you track changes, back up working copies, and enables sharing and automation, and data is just as important as code in maintaining your system. So, eat your vegetables. And don’t forget to “git commit” the data used to deploy a model.
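If you want to go one step past “git commit,” here is a minimal sketch of recording a content hash of the training data alongside each model release, so you can always answer “which data built this model?” The file names and the manifest format here are hypothetical, not a prescribed layout:

```python
import hashlib
import json

def fingerprint(path: str) -> str:
    """Return a SHA-256 content hash of a training data file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def record_release(data_path: str, model_version: str, manifest_path: str) -> dict:
    """Append a (model version, data hash) entry to a release manifest (JSON Lines)."""
    entry = {"model_version": model_version, "data_sha256": fingerprint(data_path)}
    with open(manifest_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

The hash changes whenever the data changes, so a diff of the manifest tells you exactly which releases shipped with which training data.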
Model Training Basics
Whether your particular ML component utilizes supervised or unsupervised machine learning, it requires a data set. In both cases, the data set you use for training should be representative of the data that the deployed ML system will see at runtime. The difference is that supervised training data also contains the desired prediction of the ML system alongside the example input. This training data set has been called different things by different professions; for instance, in remote sensing applications the data set is often called ground truth, since it literally refers to the geolocation of a signal. In natural language processing, this data set is sometimes called the annotations, since people mark the natural language with the associated categorical data they want to train their ML to predict. Generically, today we’ll call the ground truth or annotations the label.
Invariably, before you go to production you will have a chicken-and-egg problem with your training data. For ML to work, it needs to be trained on the same type of data that it will see at runtime. In the remote sensing community, this run time data is called in situ data, or data collected in the context of the deployed system. However, if you haven’t deployed the system yet, how do you collect in situ data? This problem has to be solved on a case-by-case basis but often involves using simulated data in your training data set. If you do use simulated data to bootstrap your training data before you deploy your system, make sure it is carefully tracked and gradually phased out of the data used to create your ML models. After your system is deployed, there will be plenty of in situ data collected (see the next section on Data Exhaust) to train future generations of the model.
Once you have selected the set of data that you will use to train the model, the ML algorithm will create the model and store it for use during the run time predictions. As illustrated in Figure 1, the model is a combination of the data used to configure the machine learning component and the algorithm that utilizes that training data. The ML run-time configuration can be realized in a myriad of ways, unique to the ML algorithm you deployed. This configuration is sometimes referred to as “weights,” “coefficients,” or even “model” (but for my purposes in this blog the “model” is the configuration AND the algorithm that act together at run-time to make a prediction), and its actualization is dependent on the ML algorithm you are using. This could be a file, an object in memory, or it could be completely opaque to you as a user, as something that is stored on the other side of an API call.
As visualized in Figure 1, the model is the combination of the ML run time process, and the ML run time configuration. It is created at train time when the ML run time configuration is created and utilized at runtime when the model is passed data and creates a prediction. This means that you could have the smartest, slickest, trendiest machine learning component in your system, but, if your training data isn’t appropriate to your run time data, your prediction accuracy can fail to meet expectations. This dependency on the training data means that the training data is just as critical as source code in the maintenance of the system. You can also improve the performance of your ML component by retraining and deploying a new run time configuration without touching code.
A Place for Everything, and Everything in a Place
The Training Data Set (Figure 1) is a critical piece of infrastructure in your ML system; however, it might not be an exhaustive set of all the data available to you. The ML algorithm might have a strong dependence on prior distributions, so the population of a training class may have to be carefully controlled. In addition to the version control plan for the Training Data Set, also have a plan for maintaining a database that holds all known data. In future generations of the deployed model, you can experiment with different Training Data Set populations. At a minimum, a project should be able to track what data in the system was simulated to bootstrap the production system and phase out the use of that data over time. Data that is collected in situ should be folded into the Training Data Set in set release cycles. In a supervised ML system, that means a human must review the in situ data and define the appropriate desired ML output. Figure 2 shows a conceptual relationship between the types of data related to the ML system. The set of all possible training data should be maintained in a well-known place and be backed up as you would back up any precious resource.
We use machine learning to find patterns in data, so by its very definition, you need data for ML to work. We replace complicated hard-coded rules for classifying input with data containing examples of that input and its classification. For instance, if I wanted a classifier that would tell me whether a given image was of a pastoral landscape or of a house in a city, and I was using a rules-based approach, I would design a set of tests: “How many green pixels do I see in the lower half of the image?”, “Can I detect something that is car-like?”, “Are there visible signs?”. For each of those, I’d have to write code to implement the test using an image as input. In a data-driven ML approach, I no longer have to code rules for a decision; I just give the algorithm examples of pastoral scenes and examples of city scenes, and it defines its own decision logic based on the given input. Easy peasy, right? While this approach is often sold as a way to make systems more robust and easier to configure, it also adds the burden of curating data sets to train the ML. Pretend I am a land surveyor who depends on the output of the image classification to apply the correct land use for tax purposes, the training data is from Florida, and the data I’m sending it is from Wisconsin in February. While our human perception can easily tell that grass covered in snow is still a field, if the ML was only trained on images that contained green pixels in pastoral scenes, it doesn’t know that snow exists, or that things covered in snow still exist under the snow. For the ML component to have a chance of providing a useful prediction, the data used for training must be similar to the data passed to the run-time ML component for prediction.
The data observed during run-time, the in situ data, should be captured and cherished. As shown in Figure 2, this data is critical for training, and it should also be used to capture run time prediction scores of your ML system. Some frameworks call the capture of run time data, and the corresponding ML predictions, collecting the data exhaust, or sometimes, more simply, data logging. Whatever you call it, you should persist as much of the in situ data as possible to enable model evaluation and retraining.
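A minimal sketch of data exhaust capture, assuming a JSON Lines log file and a hypothetical `log_prediction` helper your run-time code would call after each prediction:

```python
import json
import time

def log_prediction(log_file, features: dict, prediction: str, confidence: float) -> None:
    """Append one run-time input and its prediction to the data exhaust (JSON Lines)."""
    record = {
        "timestamp": time.time(),   # when the prediction was made
        "features": features,       # the in situ input the model saw
        "prediction": prediction,   # the label the model returned
        "confidence": confidence,   # the model's confidence in that label
    }
    log_file.write(json.dumps(record) + "\n")
```

One record per line keeps the exhaust easy to sample, annotate, and fold back into future Training Data Sets.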
Bias and Explainability
By design, our ML models are making predictions based on data, but what happens when certain predictions are heavily skewed to favor certain input parameters? In some areas (lending, for example) there may even be compliance requirements to show how one particular prediction was calculated. We use our data exhaust to answer these questions; this is bias and explainability.
To detect bias, we first define what it means in our use case. One person’s bias is another person’s positive correlation. Inside the data passed to the ML model for a prediction, there might be a protected feature that, as a human, you don’t think should contribute to the decision, but due to the nature of the ML algorithm, that feature is being picked up as a strong influencer in deciding between label predictions. If you can define your bias, you can track it, and even correct for it.
Explainability, on the other hand, is the practice of mapping the features of the data that was passed in for a prediction to how much each feature contributed to the prediction decision. It’s just like your 9th-grade algebra teacher asking you to show your steps on a math quiz. The realization of this varies widely depending on your data and your use case.
Palpability, another use-case-dependent concept, is a test designed to detect undesirable behavior in your ML model. First, define a relationship that, when violated, is not palpable. For example, when a consumer pays off the balance of their credit card, their credit rating should go up. Then, take a sample of the data exhaust, vary the parameters in your palpability test, and scrutinize the outcome. In this example, we create a new set of data by changing a sample of the data exhaust such that all data points have a recent payoff. Then, run this perturbed data through the ML model for a new prediction of credit score. How many of these credit scores went down (worse credit) when you changed the input to reflect a positive change (better pay-off history)? Is it acceptable that the algorithm did this for even a single person (no!)? If your use case includes palpability, make sure to include these experiments in your deploy pipeline!
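The palpability experiment can be sketched generically; the `model` and `perturb` callables below are hypothetical stand-ins for your ML component and your positive change. The point is counting predictions that move the wrong way after the perturbation:

```python
import copy

def palpability_violations(model, exhaust_samples, perturb, higher_is_better=True):
    """Return the samples whose prediction moved the wrong way after a positive change.

    model: callable mapping a feature dict to a numeric prediction (e.g. credit score)
    perturb: callable applying the positive change (e.g. marking a recent payoff)
    """
    violations = []
    for sample in exhaust_samples:
        before = model(sample)
        after = model(perturb(copy.deepcopy(sample)))  # never mutate the exhaust
        got_worse = after < before if higher_is_better else after > before
        if got_worse:
            violations.append(sample)
    return violations
```

An empty result means the model passed this palpability test; any violation is a sample worth putting in front of a human before you deploy.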
Experiments, Performance, and Configuration
In traditional software engineering, it is widely accepted best practice to have unit tests for all production software. The goal is clear: a deterministic piece of software should perform exactly as designed for a given set of input parameters. In machine learning, our predictions only ever have a probability of being correct and are designed to be useful on data that wasn’t in the ML Training Data Set, so this complicates the idea of unit testing on several levels. To test ML, we employ experiments and accuracy tolerances to indicate the health of the predictions for a given ML system. In this section, we discuss the three most popular experiments we use in production ML systems; please note, this is not an exhaustive list but a starting point for the types of experiments you may employ in your data dev ops strategy.
For algorithms that use supervised machine learning, the Deployed Training Data contains examples of inputs and the desired prediction of the model given those inputs. In this discussion, I’ll refer to the desired prediction as a label and assume a deployed supervised ML model. If you are using an unsupervised ML model (most common is clustering), or regression analysis to predict a real number, you will need to map the concept of “label” to the “desired prediction” for your use case.
k-Fold Experiment: Estimate Performance on Unseen Data
To test your ML model, you will need examples of input data with known labels, and what better place to get data you have labels for than your own training data? And how do you estimate the variance of performance? K-fold experiments are a great way to answer these questions and still deploy the most robust system possible. The biggest problem with k-fold? It is only a good predictor when the data in the training set is representative of the data you see at runtime. If this assumption is false, then the k-fold results and the True Blind test results could be quite different.
Figure 3 outlines the basic data flow in a k-fold cross-validation experiment. You partition the data into k equal parts. For each equal part, or fold, you train your ML system on the other k-1 partitions and hold out the current fold. Run the current fold through the trained model and compare the result to the data’s assigned label. Voila!
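That flow can be sketched with the standard library alone; `train` and `predict` below are placeholders for whatever ML component you actually deploy:

```python
import random

def k_fold_scores(data, labels, train, predict, k=5, seed=0):
    """Run k-fold cross-validation and return per-fold accuracy.

    train(X, y) -> model; predict(model, x) -> label.
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)          # fixed seed keeps folds reproducible
    folds = [idx[i::k] for i in range(k)]     # k roughly equal partitions
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in idx if i not in held]
        model = train([data[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, data[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores
```

Each fold produces one accuracy number, which is exactly what you need for the median-plus-variance reporting discussed below.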
The k-fold results, the “All Tested data” bucket in Figure 3, are a great place to get insights into your model’s weaknesses. Look at the missed predictions with the highest confidence. Is there training data in the confused prediction label that would explain it? Move the contradictory training data to the correct label. Is this a great example of the label, but the only data in the training set like it? Find some more examples of that label similar to this one and augment the training set. Remember not to get too excited about the model getting a prediction wrong, because when you deploy your run time model, all of the data in the Deployed Training Data will be used.
Assume you ran your data and received the following scores (k=5):
- Fold 1: 82%
- Fold 2: 90%
- Fold 3: 85%
- Fold 4: 81%
- Fold 5: 88%
When tracking your performance from release to release, I like to show the median data point, and the variance in the experiments, like 85% +/- 5%. Then, I know when to get really excited that I’ve improved my model, and when random chance just drew me a slightly better fold structure for testing.
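That reporting convention is easy to compute; here is a small sketch (the helper name is mine):

```python
import statistics

def summarize_folds(scores):
    """Summarize k-fold scores as (median, max deviation from the median)."""
    med = statistics.median(scores)
    spread = max(abs(s - med) for s in scores)
    return med, spread
```

For the five fold scores above, `summarize_folds([82, 90, 85, 81, 88])` returns `(85, 5)`, the “85% +/- 5%” figure from the text.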
You may also notice that, both in Figure 3 and in the example above, I chose k = 5. I prefer the symmetry of 5 folds: you have an unambiguous center, each fold is 20% of the data, and it strikes a balance between run time (higher k means more tests) and the percentage of data in the training set for each fold (lower k means you aren’t using much of your known training data). There isn’t a right answer here; if you don’t have a preference, then choose my favorite, 5.
Golden Blind Experiment: Regression Testing
Sometimes, in our production machine learning systems, we can experience “drift” over time in the meaning of our labels. Also, for completeness, we like to make sure each label is represented equally to test for coverage on all the labels. To track this, we keep a data set with known labels called the “Golden Blind” or just the “Golden Data.”
When creating the Golden Data consider the following rules of thumb:
- Have an equal number of examples for each label
- Try to capture the spirit of the label with data that a human would consider an obvious match for the label.
- Do not use data from the Deployed Training Data, because many ML systems memorize their training data and will bypass the prediction algorithm if they see something from the training set at run-time.
This experiment is a regression test to see if the ML is performing as defined, and the Golden Data is the gold standard.
In a perfect world, you would get 100% successful predictions on your Golden Data, but alas, reality can be tricky sometimes. The Golden Data experiment is your unit test for the ML model, and the score of the Golden Data experiment should exceed your tolerance threshold before you deploy the model to production. In practice, I have found that a tolerance threshold of 95% is usually a good target, but it could be 99%, or even 100%, depending on your data and use case.
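A sketch of that deploy gate follows; the 95% tolerance and the function names are illustrative, not a prescribed API:

```python
def golden_gate(predict, golden_data, tolerance=0.95):
    """Score the model on the Golden Data; return (score, ok_to_deploy).

    predict: callable mapping an input to a label
    golden_data: list of (input, expected_label) pairs
    """
    correct = sum(predict(x) == label for x, label in golden_data)
    score = correct / len(golden_data)
    return score, score >= tolerance
```

Wire this into the deploy pipeline so a model that slips below tolerance never reaches Production without a human looking at the failures first.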
If you see the Golden Data performance drift below the tolerance threshold, then you must explore whether the Golden Data examples are miscategorized, or whether the training data for those labels doesn’t match the Golden Data standard. Update the data (Golden Data or Deployed Training Data) as appropriate. If the training data all looks good, and this is a fair example for the ML model, then it’s time to examine the ML model’s algorithm.
True Blind Measurement: Live System Performance
Experiments like k-fold cross-validation and Golden Blinds are super useful to measure ML performance before it goes to production, but after you deploy your model, you have something much more useful: log data. The process to get a True Blind score is simple:
- Take a random sample of the log data, small enough to have a human look at each label prediction, but large enough to expect good coverage on all the labels
- For each sample, ascertain if the prediction was correct
- If the model was wrong, assign the correct label to the input. You get bonus points for this step because assigning the correct label isn’t necessary for evaluation, but it is necessary to evolve the model.
Since we are trying to emulate a human decision, and humans are prone to errors, when assessing a True Blind, I always prefer to collect more than one human opinion per sample. The measure of all the human opinions on a given sample agreeing is called the inter-annotator agreement. It is amazing how much this can vary from use case to use case, and it is extremely helpful to measure. People who are new to machine learning often expect their predictive models to perform at close to 100% on unseen data. But we are trying to simulate a human prediction, and that means the answer changes depending on who you ask. A typical inter-annotator agreement on a blind set might be 75%, which means that on 25% of the questions different people had different ideas of “correct”. So, to define the correct label when the annotators disagree, I use the consensus opinion of the group.
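Consensus labeling and the agreement measure can both be sketched with a simple majority vote; the function names are mine, not a standard API:

```python
from collections import Counter

def consensus(opinions):
    """Majority-vote label from a list of human opinions for one sample."""
    return Counter(opinions).most_common(1)[0][0]

def inter_annotator_agreement(samples):
    """Fraction of samples on which every annotator gave the same label.

    samples: list of per-sample opinion lists, e.g. [["a", "a"], ["a", "b"], ...]
    """
    unanimous = sum(len(set(opinions)) == 1 for opinions in samples)
    return unanimous / len(samples)
```

Majority vote is the simplest consensus rule; for high-stakes labels you may want an adjudication step when the vote is split.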
There are tons of reasons that the k-fold cross-validation experiment and the True Blind might have different results; however, the first thing I look at is whether the data in the Deployed Training Data is different from the in situ data we are collecting at run time. Next, I check if there is a large disparity between the distribution of the data per label in the Training Set and the distribution of the predicted labels in the run time logs.
In Figure 5, there is an example discrete PDF for examining the distribution of the number of samples per label in the two different sets of data: Deployed Training Data and log data. The most popular label in the Training Data (leftmost bar) represented just over 6% of the training data, but that label occurred in just under 4% of the log data. Some machine learning algorithms use the relative size of the training examples for each label as a feature, so if you have a label that is being hit a lot at run time but doesn’t have the same relative amount of training data (like the blue spike at 6% in Figure 5, seventh from left), then you might want to focus on adding data to that label for the next release cycle.
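A sketch of that comparison, ranking labels by the gap between their training-data frequency and their log-data frequency (helper names are mine):

```python
from collections import Counter

def label_distribution(labels):
    """Per-label fraction of a data set."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def distribution_gaps(train_labels, log_labels):
    """Labels sorted by the absolute gap between training and run-time frequency."""
    train_dist = label_distribution(train_labels)
    log_dist = label_distribution(log_labels)
    all_labels = set(train_dist) | set(log_dist)
    gaps = {l: abs(train_dist.get(l, 0) - log_dist.get(l, 0)) for l in all_labels}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)
```

The labels at the top of the list are the first candidates for data curation in the next release cycle.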
Choose this metric for your report to the brass. This is the rubber-meets-the-road measure of the model’s performance.
Optimizing Tau: Configuring Run-Time Parameters
In all the discussion above, I occasionally threw out terms like “metric” and “score” without attempting to define them. That’s because the score of a machine learning system is complicated. ML predictions of labels don’t just come with a label; they also come with something commonly referred to as the confidence, which generally comes as a float in the range [0,1] (for the sake of discussion, I will stick with this interval). This means the subsequent action is dependent on the predicted label and the confidence the model had in that prediction. If you design a system that only takes an action if the confidence of the label prediction is over a threshold, then the score is dependent on that threshold. For clarity in terminology, this threshold, which is a parameter of the machine learning score, is defined as tau (𝝉). Defining the score function is often a problem-dependent undertaking, and for systems with more than one ML component, there is often more than one tau to configure.
I’m going to skip the statistics, and provide a simple example, with three outcomes:
- Prediction is correct, and prediction confidence is above tau
- Prediction is incorrect, and prediction confidence is above tau
- Prediction confidence is below tau
For your system, you need to assign a worth to each of these outcomes. If you have absolutely no penalty for guessing a wrong answer, then tau is zero. If you are predicting something where a wrong prediction has a cost (e.g., it wastes someone’s time), then build a process into your ML system to handle the case where the ML prediction confidence is below tau. Just think of it as a strategy for contestants on the game show Jeopardy!. There, the contestants only ring in with the question if they are sure they know it, and if they ring in and get the question wrong, they lose money. If they aren’t sure they know the question, they just don’t ring in. Their goal is to have the most money at the end of the show.
Our goal in our ML system is to maximize the prediction’s reputation. To find an optimum tau, first select a set of data that is representative of the run-time system data exhaust. A True Blind with the human-added column of “is the prediction correct” is perfect for this. Then, assign a relative numeric value to each of the three situations above, and calculate a reputation score as the value-weighted sum of those outcomes over the data set.
Now that you have the mechanics of the reputation calculation, find the value of tau that optimizes the reputation of the system on the True Blind data set. Your tau value should be an easy-to-configure parameter of your deployed system and should be routinely recalculated to verify that the reputation score of the deployed system is optimal.
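Putting the pieces together, here is a sketch of a reputation function and a tau sweep. The outcome values (+1 correct, -2 incorrect, 0 below threshold) are purely illustrative; your use case assigns its own worth to each outcome:

```python
def reputation(samples, tau, v_correct=1.0, v_incorrect=-2.0, v_below=0.0):
    """Value-weighted outcome score at threshold tau.

    samples: list of (confidence, was_correct) pairs from the True Blind data.
    """
    total = 0.0
    for confidence, was_correct in samples:
        if confidence < tau:
            total += v_below        # below threshold: no action taken
        elif was_correct:
            total += v_correct      # acted, and the prediction was right
        else:
            total += v_incorrect    # acted on a wrong prediction
    return total

def best_tau(samples, candidates=None):
    """Return the tau in [0, 1] with the highest reputation score."""
    if candidates is None:
        candidates = [i / 100 for i in range(101)]  # simple grid search
    return max(candidates, key=lambda t: reputation(samples, t))
```

A grid search is plenty here because reputation only changes at the observed confidence values; the result is the tau you make an easy-to-configure parameter of the deployed system.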
Data Model Pipeline
In the first section, we covered applying the common coding best practice of version control to your Deployed Training Data, and the data model pipeline is just the logical follow-on to that discussion. Pipeline control is used to manage the three phases: Development, Test, and Production:
In Development, the members of the team who are curating the Training Data use data from Production to modify the proposed Deployed Training Data set. At fixed intervals, the team deploys the proposed Deployed Training Data set to a Test environment and runs all of the experiments and unit tests necessary to pass the health inspection. When testing is complete, the release is promoted to Production. The developers need read access to the data exhaust to make changes in the next release cycle.
Tracking Models Over Time
The last component of your new data dev ops strategy is communicating results to your team and stakeholders. Plot prediction performance as a function of time for all of your measurements and experiments. Other interesting views might include plotting the prediction performance as a function of the size of the Deployed Training Data. Usage statistics are helpful in most use cases, so you know how many predictions your model is making. Sometimes all predictions aren’t equal, and there may be one prediction in particular you need to track as a special situation. Come up with a plan that fits your use case, then track and disseminate this data. It could be a web-based dashboard, or it could be a Jupyter notebook that is exported as a PDF file and emailed to the appropriate stakeholders. You have a lot of creative license here; build what makes sense for your team and make sure you monitor the deployed models.
It takes a variety of skills and roles to maintain your ML system including (but not limited to) data subject matter experts, SREs, back-end engineers, front-end engineers, architects, data scientists, project managers, and the person who makes the coffee. Just like when you adopt any new strategy, your day to day work life might change a little bit. Curate the process to be specific to your business use case, and then make sure everyone on your team understands their role to successfully deploy your data dev ops strategy. Happy predicting! 👋