Machine Learning from P.O.C to Production — Part 3

Stable ML models from a Data Scientist point of view

Florian DEZE
CodeShake
8 min read · Feb 9, 2023


In part 2, we saw the responsibilities of a Data Engineer in ensuring good data quality. Now let's study the responsibilities of a Data Scientist. This article will not cover every aspect of a Data Scientist's work, only the techniques used to prevent instability in production and reduce risks over time.

Source: https://www.kdnuggets.com/2017/08/37-reasons-neural-network-not-working.html

Everything is about “Distribution”

This sentence sums up the main problem a Data Scientist will encounter when trying to ensure a stable model.

When you build a model, you feed it with information based on the hypothesis of "similarity" between past and future information. One use case could be building an image recognition model for humans, where you assume that humans today will look the same tomorrow and over the next months and years. But this hypothesis is not always true: we call that distribution shift (or drift), and it will lead to performance decline over time. You can easily visualise it with data about prices: things like inflation or other market behaviour drastically change prices, making past information obsolete.

Distribution problems in production, by type

Covariate Shift

For supervised ML training (with features and labels), you might encounter a covariate shift. It consists of a change in the distribution of one or more features while the labels remain unchanged. An illustration could be a credit score built by a bank for loans. You take the salary and other revenue (among other features) to determine the risk of a new credit for the customer. Let's say the algorithm determined that a revenue below €1,600 increases the risk. If high inflation occurs, a common measure is to raise salaries at the same rate, which increases the number of people earning more than €1,600, while their wealth and credit risk should not change (because daily life is also more expensive). The model will therefore be more likely to grant credit than before the inflation.

How to detect and prevent it?

  • Rule 1: Remove features with high unpredictable shift

If a feature shifts too often over time, or too drastically, you should consider not including it, unless you have high confidence in your CI/CD/CT pipeline and monitoring to detect it.

NOTE: the definition of 'too drastic' is quite arbitrary. It is hard to judge which level of shift is tolerable. You can run simulations by adding artificial shifts and measuring the variation in performance, but if you have many features to test, it can be time- and resource-consuming. To reduce the number of tests/combinations, measure the feature importance for your model and focus only on the features with a big impact on the decision.
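A minimal sketch of this kind of simulation, assuming a fitted scikit-learn estimator `model` and a validation set `X_val` / `y_val` (hypothetical names): shift the most important features by a few percent and watch how the score degrades.

```python
import numpy as np
from sklearn.inspection import permutation_importance

# Focus only on the features with the biggest impact on the decision
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
top_features = X_val.columns[np.argsort(result.importances_mean)[::-1][:5]]

baseline = model.score(X_val, y_val)
for feature in top_features:
    for shift in (0.05, 0.10, 0.20):            # simulate +5%, +10%, +20% shifts
        X_shifted = X_val.copy()
        X_shifted[feature] = X_shifted[feature] * (1 + shift)
        drop = baseline - model.score(X_shifted, y_val)
        print(f"{feature}: +{shift:.0%} shift -> score drop {drop:.4f}")
```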

  • Rule 2: Re-train the model with fresh data

Re-training the model might help if the shift is smooth over a long period of time: you will have accumulated enough recent data for your model to take the shift into account. You can also use "Data Augmentation" methods to increase the number of recent data samples. Be aware that a drastic shift reduces the possibility of using this technique (not enough recent data).
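A minimal sketch of re-training on a rolling window of fresh data; the `event_date` and `label` column names, the window size and the estimator are assumptions.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def retrain_on_recent_data(df: pd.DataFrame, months: int = 12):
    """Keep only the most recent `months` of data, then re-fit the model."""
    cutoff = df["event_date"].max() - pd.DateOffset(months=months)
    recent = df[df["event_date"] >= cutoff]
    model = GradientBoostingClassifier(random_state=0)
    model.fit(recent.drop(columns=["event_date", "label"]), recent["label"])
    return model
```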

  • Rule 3: Prepare for removing features

If you decide not to follow rule 1 (the drop in performance would be too high compared to the risk of a feature shift) and rule 2 does not cover the risk, you need to at least train a model without the problematic features, to ensure you can still build a model with a performance above the threshold agreed with the business.

If you can't, it means that the day the shift happens, you won't have a backup plan.

Label shift (or Prior probability shift)

For supervised ML, it’s when the distribution of the label changes but the features do not (the exact opposite of Covariate Shift).

For example, in case of an economic crisis, people tend to buy fewer houses, which drives market prices down. If you have an algorithm which estimates the price of a house, then for the same features (house size, number of bedrooms, etc.) the price will change.

How to detect and prevent it?

Detection is quite simple: you can plot several distribution diagrams of your label, chunked by time period (or between train / test), and see if there is any change. This is a graphical approach, but you can also use a statistical one (in Python: scipy.stats.ttest_ind).
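A minimal sketch of both approaches, assuming `y_train` and `y_test` hold the label values of two periods (hypothetical names).

```python
import matplotlib.pyplot as plt
from scipy import stats

# Graphical approach: overlay the two label distributions
plt.hist(y_train, bins=50, alpha=0.5, density=True, label="train")
plt.hist(y_test, bins=50, alpha=0.5, density=True, label="test")
plt.legend()
plt.show()

# Statistical approach: a low p-value suggests the means differ (potential label shift)
t_stat, p_value = stats.ttest_ind(y_train, y_test, equal_var=False)
print(f"t={t_stat:.2f}, p-value={p_value:.4f}")
```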

Correcting this case can be quite a headache because, unlike a covariate shift where you can remove the problematic feature, you cannot remove the label.

  • Rule 1: Reduce history

Most label shifts happen for temporal reasons (house prices change over time). Taking the last 10 years of house transactions is less pertinent than taking only the last year of transactions.

  • Rule 2: Remove abnormal distributions

If the label shift is correlated with time, you can remove the training data whose distribution differs from that of your test set.

  • Rule 3: Last resort

You can adjust past label values by estimating the shift on newer data and applying it to old data. We can illustrate it with the house price use case: group houses with "similar" characteristics (using an unsupervised clustering algorithm). Then, for each group, compute the mean ratio between old prices and new ones, and rewrite the old prices by applying this ratio. This makes old prices comparable to the new ones and reduces the distribution shift.
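A minimal sketch of this re-adjustment for the house price use case; the DataFrame `df`, the feature list and the `price` / `is_recent` columns are assumptions.

```python
from sklearn.cluster import KMeans

features = ["surface", "n_bedrooms", "n_bathrooms"]   # hypothetical feature columns
kmeans = KMeans(n_clusters=20, random_state=0, n_init=10)
df["cluster"] = kmeans.fit_predict(df[features])

for cluster_id, group in df.groupby("cluster"):
    old_mean = group.loc[~group["is_recent"], "price"].mean()
    new_mean = group.loc[group["is_recent"], "price"].mean()
    if old_mean > 0 and new_mean > 0:
        # Rescale old prices so their mean matches recent prices in the same cluster
        ratio = new_mean / old_mean
        mask = (df["cluster"] == cluster_id) & (~df["is_recent"])
        df.loc[mask, "price"] = df.loc[mask, "price"] * ratio
```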

NOTE: it can be hard, or even impossible, to automate these kinds of analyses and re-adjustments. At the very least, you need to monitor and measure these behaviours carefully; this will reduce your team's analysis and correction time.

Concept shift

This drift is a change in the definition of your labels. You can see it in ML projects such as determining what is fashionable, or interpreting a job title which varies across countries, regions, levels of experience, etc.
It is the hardest type of problem to solve, and the resolution method, if one exists and is applicable, will depend on the situation.

  • Method 1: Periodically re-train the model
  • Method 2: Weight the data: put more weight on "representative" or recent data (see the sketch below)
  • Method 3: Apply data augmentation on newer or "representative" samples
  • Method 4: Create a new model. If the concept drift is too sudden (due to an important event like COVID, a war, an economic crisis, etc.), completely re-creating your model can be the only solution (change the algorithm, hyper-parameters, train/validation/test samples)

NOTE: "Representative" is not defined precisely because it depends on the problem and on your knowledge of your data. "Representative" can mean the opposite of outliers, or simply more recent data.
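A minimal sketch of method 2 (weighting recent data more heavily), assuming a DataFrame `df` with an `event_date` column, a `features` list and a `target` column; the exponential decay rate is an arbitrary assumption to tune.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Exponential decay: weight ~1 for the freshest samples, lower for older ones
age_in_days = (df["event_date"].max() - df["event_date"]).dt.days
sample_weight = np.exp(-age_in_days / 365.0)

model = GradientBoostingRegressor(random_state=0)
model.fit(df[features], df["target"], sample_weight=sample_weight)
```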

Distribution problems in Training: Overfitting and Underfitting

These problems are also related to distribution. Overfitting means over-training your model on your data, which removes its ability to generalise once used in production. The problem can be reformulated as having a distribution in your training set which is not the same as in real life. The same reasoning applies to underfitting, where your data distribution is not rich enough to represent all the cases.

How to detect and prevent it?

  • Rule 1: A good split

When you create your datasets for training and testing, you can split them as follows:
- training set: used to train the model
- validation set: used to measure the performance of your trained models between each change of hyper-parameters
- test set: used to measure the performance of your final model

Most overfitting comes from hyper-parameter tuning, when you choose some parameters for your algorithm before training and adjust them to see if it performs better (most of the time using grid search or Bayesian optimisation).
The problem is: "is the performance better in general, or only on your test set?". To prevent this, we introduce a validation set, which is used to compare the performance of each model / simulation. Once we have chosen the best one, we measure the performance again on the test set. If the performance does not drop, the risk of overfitting is lower.
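A minimal sketch of such a split (70% / 15% / 15%), assuming `X` and `y` hold your full features and labels; the `stratify` argument assumes a classification problem.

```python
from sklearn.model_selection import train_test_split

# 70% train, then split the remaining 30% in half for validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp)

# Tune hyper-parameters by comparing scores on (X_val, y_val),
# then measure the retained model once on (X_test, y_test)
```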

  • Rule 2: Ensure a good distribution of labels

If you take human face recognition use cases, you might have seen some problems in the media. Some people could not be recognised by their smartphones, and it happened more frequently to women and people from minorities (from a US and European point of view). The main reason was the image database, which contained mostly white men (more than 50%), while other groups were under-represented. As a result, the model could not train well for those people. But why was it not detected beforehand? That's simple: the global score was high, so the issue went unnoticed until the social media reaction and the number of complaints.
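A minimal sketch of the kind of check that would have caught this earlier: look at the label balance and the score per sub-group instead of only the global score. The DataFrame `df`, the `demographic_group` column, the `features` list and the fitted `model` are assumptions.

```python
# Global label balance
print(df["label"].value_counts(normalize=True))

# Performance per sub-group: a high global score can hide a weak group
for group_value, group in df.groupby("demographic_group"):
    score = model.score(group[features], group["label"])
    print(f"{group_value}: n={len(group)}, accuracy={score:.3f}")
```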

  • Rule 3: A good distribution of features

After splitting your datasets, you can check if there is any covariate shift between them. The method is simple (see the sketch after these steps):
- Step 1: add a new label (0 for the training set, 1 for the test set)
- Step 2: concatenate and shuffle the two datasets
- Step 3: train a classifier to determine whether a row comes from the train or the test set
- Step 4: if the score is high, it means the distribution is not the same between train and test; redo your split
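A minimal sketch of this train-vs-test classifier (often called adversarial validation), assuming `X_train` and `X_test` are feature DataFrames with the same columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle

# Step 1 & 2: label the origin of each row, then concatenate and shuffle
X_all = pd.concat([X_train, X_test], ignore_index=True)
origin = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
X_all, origin = shuffle(X_all, origin, random_state=0)

# Step 3: train a classifier to separate train rows from test rows
clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X_all, origin, cv=5, scoring="roc_auc").mean()

# Step 4: an AUC close to 0.5 means the sets are indistinguishable;
# a high AUC (e.g. > 0.8) suggests a covariate shift -> redo your split
print(f"train-vs-test AUC: {auc:.3f}")
```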

  • Rule 4: Add training difficulty to your model

If you want to generalise and prevent your model from fitting too closely to your training dataset, you can make it harder for the model to learn those specific data points.
Some non-exhaustive techniques (illustrated in the sketch below):
- Adding noise: add a small random value to your features (e.g. a value of feature 1 goes from 0.34 to 0.343). It adds approximation to the data
- Constraints (L1/L2 regularisation): constrain the weights associated with the features to stay within a certain range of values
- Dropout (for neural networks): deactivate some neurons randomly at each training step. It creates approximation in the optimisation process
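A minimal sketch of the first two techniques with scikit-learn (noise injection and an L2 constraint); `X_train`, `y_train` and the noise scale are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Adding noise: small random perturbations make the exact training values harder to memorise
X_noisy = X_train + rng.normal(loc=0.0, scale=0.01, size=X_train.shape)

# L2 regularisation: a smaller C means a stronger constraint on the weights
model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
model.fit(X_noisy, y_train)
```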

  • Rule 5: Cross Validation (CV)

Cross validation regroups methodologies that measure the variation in performance across test sets and check whether it is stable. You train and test your model on different sets of data (with the same hyper-parameters and seed) and check the variation in performance. It is also useful when you have little data. It consists of training and testing the model multiple times, and either asserting a stable performance or reporting a mean performance. The difference between cross validation methods lies in how the train / validation / test sets are split. A non-exhaustive list of methods: K-Fold, Time Series Split, Blocked CV (and Blocked Time Series CV).
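A minimal sketch of such a check, assuming `model`, `X` and `y` already exist; swap KFold for TimeSeriesSplit when your data are ordered in time.

```python
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=42)
# cv = TimeSeriesSplit(n_splits=5)   # alternative for temporal data

scores = cross_val_score(model, X, y, cv=cv)
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")  # a high std signals instability
```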

To conclude, a Data Scientist has to guarantee a good distribution of features and labels, study the risks for each of them, and use overfitting/underfitting prevention methods.

In the next part, we will see the responsibilities of a ML Engineer once the model is created.
