#9 — Train serving skew & data dependency problems

How to mitigate risks from data problems in productive machine learning models.

Nicolas Rodriguez Presta · Mercado Libre Tech · Nov 18, 2021 · 4 min read


Has it ever happened to you that lab results are very good, whereas results in production are not?

Why doesn’t mine look like that?!

You can do your best to ensure your data won't have problems that make it invalid in production; you can add tests everywhere and try to guarantee that the distribution of the training data and the production data stay consistent.

That is all fine, but it would be a mistake to assume these mechanisms will always work. Sooner or later you will find differences between your training data and your production data. The best you can do is anticipate what is likely to go wrong and act accordingly.

This is especially important in large, complex systems with multiple data sources.

Fraud Prevention models in Mercado Libre consume many sources: historical aggregate data over different entities (such as users or devices), real-time aggregate data, contextual transactional data, graph relations, and so on.

Data quality over model stages

As a rule of thumb, I would say that the probability that your training and serving data are fully consistent is about 1 / (2 ^ N), where N is the number of data sources you depend on, as if each source independently had an even chance of misbehaving (with four sources, that's roughly a 6% chance everything is consistent). The more data sources you have, the more likely it is that at least one of them will fail.

If you have managed your technical debt well, the data sources you depend on are probably relatively "under control". But that doesn't mean bad things can't happen: a service crash, a change in an API signature, a field name that stops being respected, or a product change. An affected data dependency (including its own dependencies) can impact the final prediction and the performance of your model at serving time. So good technical management means gaining more knowledge about potential failures.

Let me share some ideas we use at Mercado Libre to help you prepare for the worst:

  • Keep an eye on the output of your model to detect anomalies in its expected outcome (pay special attention to the lowest and highest percentiles of the distribution).
  • Build a system that monitors the signals (features) coming into your model. In an ideal scheme, if we had control over all data producers, they would be the ones detecting errors and reporting them to their consumers. In practice, being more defensive and preparing for the worst is often the best strategy (see the monitoring sketch after this list).
  • Develop procedures to follow when the output or the input signals report issues, so you can fix them or mitigate the impact.
  • Keep an SLA for the most important features (the model's feature importance can help you pick the ones most worth watching).
  • Prepare your model to tolerate noise, especially if the online and offline data sources differ: 1) use regularization during training; 2) during training, apply feature dropout, especially on the features most likely to fail online (see the second sketch after this list).
  • Make sure your data generation (your ETL) takes into account the SLA restrictions of the serving environment (you need to know those restrictions!). In the ETL we don't have issues like timeouts, latency or cache update delays, but in production we do. Try to make your ETL reflect, as much as possible, the data distribution your serving layer will actually see, which is not necessarily "reality". In other words, prepare your model to cope with production "noise" by injecting the same "noise" distribution into the training dataset.
  • If you already have examples of data points that were "affected" at serving time, it is a good idea to correct those errors before using them for training. However, it can also be worth training on some of those points as they were, so the model learns to "defend" itself if the failure recurs. The trade-off is that this can hurt the model's performance on clean data.
  • Check feature sanity on every retrain, so you can fix or drop the features most likely to fail (see the last sketch after this list).
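To make the monitoring idea more concrete, here is a minimal sketch of a drift check that compares a recent serving window against the training baseline, using the Population Stability Index (PSI) for features and the tails of the score distribution for the output. The `psi` helper, the 0.2 threshold and the percentile tolerance are illustrative assumptions, not the actual tooling we use.

```python
# Minimal drift check for serving data, assuming features and scores arrive as
# NumPy arrays. Thresholds (PSI > 0.2, 0.1 tolerance on the tails) are illustrative.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature."""
    edges = np.unique(np.quantile(baseline, np.linspace(0, 1, bins + 1)))
    b = np.histogram(np.clip(baseline, edges[0], edges[-1]), edges)[0] / len(baseline)
    c = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0] / len(current)
    b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)  # avoid log(0)
    return float(np.sum((c - b) * np.log(c / b)))

def check_serving_window(train_features, serve_features, train_scores, serve_scores):
    """Return alerts for drifted features and for moved tails of the model output."""
    alerts = []
    for name, baseline in train_features.items():
        value = psi(np.asarray(baseline), np.asarray(serve_features[name]))
        if value > 0.2:  # common rule-of-thumb threshold for "significant" drift
            alerts.append(f"feature {name}: PSI = {value:.2f}")
    for q in (0.01, 0.99):  # watch the lowest and highest percentiles of the output
        expected = np.quantile(train_scores, q)
        observed = np.quantile(serve_scores, q)
        if abs(observed - expected) > 0.1:
            alerts.append(f"score p{int(q * 100)}: {expected:.2f} -> {observed:.2f}")
    return alerts
```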
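For the "tolerate noise" and "inject production noise into training" points, a minimal sketch of feature dropout follows. It assumes the typical online failure mode is a missing value that serving replaces with a fallback; the feature names, dropout rates and fallback values are hypothetical.

```python
# Sketch of feature dropout: replace training values with the serving-time
# fallback, at roughly the rate each feature is expected to fail online.
# Feature names, rates and fallbacks below are hypothetical examples.
import numpy as np
import pandas as pd

def apply_feature_dropout(df: pd.DataFrame, dropout_rates: dict,
                          fallbacks: dict, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    out = df.copy()
    for feature, rate in dropout_rates.items():
        mask = rng.random(len(out)) < rate           # rows where the source "fails"
        out.loc[mask, feature] = fallbacks[feature]  # what serving would impute
    return out

# Toy usage: the real-time aggregate is the feature most likely to time out online.
train_df = pd.DataFrame({
    "rt_tx_count_1h": np.random.default_rng(1).integers(0, 20, 1000),
    "device_risk_score": np.random.default_rng(2).random(1000),
})
train_noisy = apply_feature_dropout(
    train_df,
    dropout_rates={"rt_tx_count_1h": 0.05, "device_risk_score": 0.02},
    fallbacks={"rt_tx_count_1h": 0, "device_risk_score": -1.0},
)
# Then train on train_noisy with a regularized model (e.g. L2), per point 1) above.
```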
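Finally, for the retrain-time sanity check, here is a minimal sketch that flags features with too many nulls or that have gone constant (often the sign of a dead upstream source). The null-rate threshold is an illustrative assumption.

```python
# Minimal feature sanity check to run before each retrain, assuming the training
# set is a pandas DataFrame. The 10% null-rate threshold is an illustrative default.
import pandas as pd

def feature_sanity_report(df: pd.DataFrame, max_null_rate: float = 0.10) -> dict:
    """Return {feature: reason} for features that look broken in this snapshot."""
    report = {}
    for col in df.columns:
        null_rate = df[col].isna().mean()
        if null_rate > max_null_rate:
            report[col] = f"null rate {null_rate:.0%}"
        elif df[col].nunique(dropna=True) <= 1:
            report[col] = "constant column (possible dead upstream source)"
    return report

# Features flagged here are candidates to fix, drop, or exclude from this retrain.
```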

If you follow these tips, your models (and your metrics) are less likely to be affected, or, if they are affected, less likely to generate an economic impact. That makes it safer to have a model in production!

We are near the end of this series of articles, only one more check ahead.

You can now get ready for a safe landing!
