Machine Learning from P.O.C to Production — Part 4

Stable ML Models from the ML Engineer's point of view

Florian DEZE
CodeShake
7 min read · Feb 16, 2023


Previously, we saw the roles of the Data Engineer (part 2) and the Data Scientist (part 3). Now comes the last role of this series: the ML Engineer.

Performance in ML projects comes from two essential parts: the features and the model. The main problem in production is potential instability in the features at computation time, caused by the upstream data pipeline steps, or by internal or external factors.

Also, when updating the ML model, you will want to compare your new solution with the previous one. But how can you compare them fairly?

Features instabilities

Instability introduced by company collecting systems

Data always has a source, which is not always known by its consumers. Thus, small changes in the collecting systems can introduce bias, a change in distribution, or a lack of information in the final exploited data. A classic example is a customer form. At first, most of the fields are mandatory and the form takes time to fill in, so many clients choose not to fill it at all. To increase fidelity and encourage customers to register, some fields are made optional (or pre-filled) to reduce the time spent filling the form. As a consequence, if your product is based on specific customer information that became optional, you might see a decrease in performance due to missing information. The same happens with pre-filled information, which changes the distribution of the data.

Instability introduced by company business rules

This case can be summarised by one sentence: “you are a consumer, not a producer”. This implies you are subject to changes in the data sources without knowing it. They can introduce bias or inconsistencies, or change the logic of the information behind your features.

Most of the time, we think about table schemas, access paths, etc., but a trickier case is a change in the way the data is created and a modification of the business rules behind it.

Here’s an example from a previous experience: I had an ML product which optimised controls over deliveries. One of the key pieces of information was knowing, for each delivery, whether it had been controlled or not (to build historical features and check performance over time in production). The business app producing this data did not have a “control / no control” field; we were told to use fields like control_duration (in seconds): if the field was 0, no control had been done, otherwise people had taken time to count the delivery content. So far, no problem. We saw that around 20% of deliveries were controlled, which matched the information the business had given us. Then, one day, 99.8% of deliveries showed no control. After a long investigation, we found that a product sitting between the original source and us had introduced a rule truncating control_duration to the minute. Since most control durations were shorter than 60 seconds, they were truncated to 0.
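
In this situation, a simple daily check on the derived “controlled” rate would have raised an alert on the first day of the truncation. Below is a hypothetical sketch in Python, assuming a table with control_duration and delivery_date columns; the column names, the 20% expected rate and the tolerance are illustrative, not the project's actual setup.

import pandas as pd

# Hypothetical monitoring check: rebuild the boolean "was_controlled" feature
# from control_duration and flag the days whose rate deviates too much from
# the historical level (~20% in this example).
def check_controlled_rate(df: pd.DataFrame, expected_rate: float = 0.20,
                          tolerance: float = 0.05) -> pd.DataFrame:
    df = df.assign(was_controlled=df["control_duration"] > 0)
    daily_rate = df.groupby("delivery_date")["was_controlled"].mean()
    suspicious = daily_rate[(daily_rate - expected_rate).abs() > tolerance]
    return suspicious.to_frame("controlled_rate")

# A day at 0.2% controlled (the truncation bug) would be flagged immediately,
# triggering an investigation before the feature silently degrades the model.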

Instability introduced by external environment

External environments or events can have a significant impact on data distribution. Since your model is trained on past data, under the hypothesis that future data will look the same, it can generalise the problem and respond correctly over time; you only need some retraining to compensate for small evolutions. But what if your data evolves so much in a short period of time that your model is no longer able to respond correctly?

One simple example is price data. Hypothetically, a situation like a pandemic, an economic crisis or even a war can create a big inflation in product prices. If this data is used by your model, it will lead to a change of behaviour in the predictions.

This graph represents a change in the price distribution. The blue curve corresponds to prices before inflation and the orange one to prices after inflation.

In the case of a classification problem, your model might have learned that all prices below $30 belong to category A and the others are more likely category B. But with inflation, this threshold is not relevant anymore.

Guessing your next question: why not just retrain the model on the new data?

Well, if the change is too sudden, you might not have a large enough volume of historical data reflecting those new changes: it will be diluted within the older historical data.

Here is an example showing the same distributions without scaling the density to [0, 1]:

As you can see, the orange points (new prices) are few compared to the older ones. It will be difficult to re-train a model so that it takes this inflation into account.

What does an ML engineer need to be aware of in production?

Everything is about today's distribution

In a previous article, describing the role of the data scientist in stability, I summarised their problem as “everything is about distribution”. On the ML engineer's side, it is very similar. The data scientist has to check distributions over large amounts of historical data (and possibly with more features, during the experimentation phase). For the ML engineer, it is about monitoring distributions over time in production: “what is the distribution of today's new feature values compared to past ones?”. It also covers simulating new groups of features, and re-training and deploying systems, to reduce time to market.
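
As a minimal sketch of such monitoring, assuming today's feature values and a historical reference window are already extracted as numeric arrays, a two-sample statistical test can flag a drifting feature; the price figures and the p-value threshold below are illustrative, not a prescribed setup.

import numpy as np
from scipy.stats import ks_2samp

# Flag a feature when today's distribution differs significantly from the
# historical reference, using a two-sample Kolmogorov-Smirnov test.
def drift_alert(reference: np.ndarray, today: np.ndarray,
                p_value_threshold: float = 0.01) -> bool:
    result = ks_2samp(reference, today)
    return result.pvalue < p_value_threshold

# Illustration with the price example: pre-inflation vs inflated prices.
reference_prices = np.random.normal(loc=25, scale=5, size=10_000)
today_prices = np.random.normal(loc=40, scale=8, size=500)
print(drift_alert(reference_prices, today_prices))  # True: the distribution shifted

In practice you would run this kind of check per feature and plug the alerts into the monitoring and alerting workflow listed in the checklist at the end of this article.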

Models comparison and reproducibility

When you use an ML product, the model might not be the only component producing the final result (estimation of a price, decision in a process, transformation of an image, etc.). If you wish to change your ML model or statistical method, you will compare it with the previous method. But how can you compare them if you cannot reproduce the exact same conditions?

Use case 1: Compare the performance of the whole decision process

You have an ML model which analyses the colour of a food product. If the colour is abnormal, the food might be contaminated and will be analysed manually.
- First decision step: your ML model returns a probability of contamination. If the probability is higher than a predefined threshold, the food will be controlled.
- Second decision step: a random decision, based on a uniform distribution, which decides to control the food X percent of the time. It helps to detect false negatives (when the model gives a low probability but the food was contaminated).

The problem: confidence in measuring performance. If you re-run all previous cases with the same ML model, you will not obtain the same results because the random choices will not be the same. If the random percentage X is 5 or 10%, you may already get different performance figures in a simulation with the same model, so comparing against a new ML model is even more problematic when you want to confirm which model is better (unless the performance is drastically different).

Solution: do not use random functions, but pseudo-random ones, such as hashing combined with a modulo. Whether it is a unique id, the input image or anything else that identifies an individual in your dataset, you can decide with the following process:
hash(id of the individual) mod 100 < X, with X the percentage of pseudo-random cases you want to add in your process

NOTE: Be aware that some hashing methods use a seed, so you will not get the same results depending on the instance running your app. Take a look at fingerprint hashing methods, such as Google's FarmHash, which are stable and produce the same hash every time.
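
As an illustration, here is a minimal sketch of the hash-and-modulo rule using Python's hashlib, which is deterministic and seed-free; a fingerprint function like FarmHash would be used the same way. The function name and the 5% value are just an example of the X above, not the project's actual code.

import hashlib

# Deterministically select about percent% of deliveries for control,
# independently of the model's probability: the same id always gives
# the same decision.
def force_control(delivery_id: str, percent: int = 5) -> bool:
    digest = hashlib.sha256(delivery_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

print(force_control("delivery-42"))  # always the same boolean for this id

Because the decision depends only on the identifier, re-running a past simulation reproduces the exact same pseudo-random controls, so two models can be compared under identical conditions.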

Use case 2: Compare the performance of ML models only

As explained in the previous article (from the data scientist's point of view), you need to split your data between train, test and validation sets. Here, you want to compare the ML model currently in production with a new model.

Problem: performance can change depending on the split you use; to compare two ML models, you need the same split for a valid comparison. Also, the training and performance comparison might not occur at the same time for each model: you created one ML model 3 months ago, and today you need to create a new one. So, if you want to compare performance on the same data, you need to remember the split used before or re-test the previous model with the new split.

Solution: again, use a pseudo-random splitting method and not a random one (like Scikit-learn's train_test_split). The same set will always be used for training and the other ones for validation and tests. Also, if you use cross-validation to run multiple performance tests and average them, the result will be more meaningful if the splits are exactly the same.
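
As an illustration, a pseudo-random split can be keyed on a stable identifier, so each individual always lands in the same set every time the split is recomputed. This is a sketch under assumptions (string ids, 80/10/10 boundaries), not the project's actual code.

import hashlib

# Map a stable identifier to a bucket in [0, 100), then to a fixed set.
def assign_split(example_id: str) -> str:
    bucket = int(hashlib.sha256(example_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    elif bucket < 90:
        return "validation"
    return "test"

print(assign_split("customer-123"))  # same set today, 3 months ago, or next year

Re-running assign_split on the same ids months later reproduces the exact same train/validation/test partition, so the old and the new model can be evaluated on identical data.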

In an ideal ML project, past data reflects future data, everything is delivered on time and external events do not affect your data. Well, too bad for us: we live in an imperfect world. If you want to minimise risks in your project, you should be able to answer yes to the following requirements:

For the pipeline execution:

  • I define/execute tests in the pipeline (unit tests, integration tests, coherence tests)
  • I define/use precise versions of the tools used
  • I have a retry policy for the pipeline
  • I monitor and have an alert system on my pipeline
  • I have a scalable pipeline
  • All transformations and predictions are reproducible

For data quality:

  • I have defined quality metrics for each field (range of values, uniqueness, distribution, etc…)
  • I have defined thresholds for each (metric, data) pair when necessary
  • I have a threshold estimation process
  • I have documented the threshold estimation process
  • I can regularly re-run the estimation process to compare with previous runs
  • I can re-adjust thresholds if needed
  • I have documented a decision workflow based on my metric results (alerting, stopping the pipeline, stopping it partially, over-writing or appending data)
  • I can track and explain distribution drift
  • I can explain/link each drop in my model's inference performance to a distribution shift when it is the cause
  • I can simulate all my previous app runs and obtain the same results
  • I can compare ML models over time on the exact same data

Now that we have seen the roles and responsibilities in a project, we will take a closer look at the modules which will compose your project's app. The next part will be about the Feature Store and the Feature Pipeline.
