What do we talk about when we talk about ML robustness?

Ziqi Ma
Data Science at Microsoft
Dec 21, 2021

After finishing a great one-time ML analysis, it is natural to expect it to lead to a reliable, permanent solution. However, this notebook-to-service transformation can be tricky, as I have come to realize while putting ML models into production. The real world is unpredictable and ever-changing. Data can be corrupted. Models might misbehave when exposed to new data.

This is a significant risk in high-stakes applications. But even when bad predictions are not catastrophic and can be fixed retroactively, robustness still matters: A brittle model means a heavy maintenance burden that can quickly exhaust data scientists and engineers, leaving little time to develop new models.

A partial solution

“Let’s just retrain the models dynamically in response to shift. Problem solved.”

The most intuitive solution is to retrain models on a regular schedule so that they remain “fresh” despite shift. However, parameters such as training frequency and how much historical data to use are often decided quite arbitrarily. To address this, I experimented with dynamic retraining using an online Martingale exchangeability test (see this related paper for more information).

The main reason for using a Martingale test rather than better-known distribution shift statistics is to avoid “batching” time-series data into arbitrarily defined intervals for the calculation of distribution statistics. Although our models are not yet in the online learning regime, we aim to keep as little data in memory as possible. This approach only requires updating the betting capital (a real number) and thresholding it, which corresponds directly to a p-value against the exchangeability assumption (no hyperparameters required). The exchangeability test is performed on the model’s loss, which sidesteps the trickiness of estimating shift on a mixture of high-dimensional numerical and categorical data.
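Below is a minimal sketch of this idea: a conformal test martingale computed on a stream of per-example losses. For brevity it bets with a single power martingale and a fixed ε (the approach described above avoids such hyperparameters), and the names, threshold, and unbounded history buffer are illustrative rather than the actual implementation.

```python
import random

def conformal_p_value(scores, rng):
    """Smoothed conformal p-value of the newest score against the history
    (a higher loss counts as stronger evidence against exchangeability)."""
    new = scores[-1]
    greater = sum(s > new for s in scores)
    equal = sum(s == new for s in scores)
    return (greater + rng.random() * equal) / len(scores)

def martingale_alarms(losses, epsilon=0.92, threshold=100.0, seed=0):
    """Power martingale over a stream of losses. Under exchangeability,
    Ville's inequality gives P(capital ever reaches `threshold`) <= 1/threshold,
    so threshold=100 corresponds to a p-value of 0.01."""
    rng = random.Random(seed)
    capital, history = 1.0, []
    for t, loss in enumerate(losses, start=1):
        history.append(loss)  # a production version would bound this buffer
        p = max(conformal_p_value(history, rng), 1e-12)  # guard against p == 0
        capital *= epsilon * p ** (epsilon - 1)  # betting step
        if capital >= threshold:
            yield t  # alarm: retrain, then restart the test on fresh losses
            capital, history = 1.0, []
```

Feeding production losses through martingale_alarms yields the time steps at which exchangeability is rejected at the chosen confidence level, which is the kind of signal that can trigger retraining.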

Figure 1 shows an experiment run on Amazon book review data from 1996 to 2017: dynamic retraining achieves a lower loss than fixed-schedule retraining.

Figure 1: Using an online Martingale exchangeability test for dynamic model retraining.

This is a better alternative to fixed-schedule retraining, but as I came to realize, it’s missing the point.

What is ML robustness?

“Oh, you mean covariate shift? Label shift? Adversarial robustness?”
“Actually, I mean this deployed model will not fail in the next month.”

Underlying the two statements above are two different notions of “ML”: One is a mathematical function, and the other is a software system with an ML component. As part of my work productionizing ML models, I kept a log of all model failures in production for three months. I found that surprisingly few of them were related to distribution shift.

Many issues lie at the intersection of models and software. For example, a miscalibrated anomaly detection model might feed twice the number of anomalies to the next model, causing an out-of-memory error. Even for pure software bugs, the uncertainty and long run-time caused by the ML component make them much harder to detect. Such issues suggest that models and related software components should be considered holistically.

Robustness as “passing all tests”

“ML robustness is not robust models plus robust software wrappers.”

What is a meaningful way to formulate software robustness? Rather than devising a theoretical worst case or an average case, an engineer might simply say the software “passes all tests.” And the testing may very well be comprehensive: unit tests, integration tests, mock traffic, pre-production environments, and so on. In fact, a software component might be run hundreds to thousands of times before being exposed to the user.

However, training a model under all reasonable conditions (assuming one can generate such conditions, which is an interesting problem in itself) would probably take a prohibitively long time. Additionally, while software is generally deterministic, ML depends on ever-changing data, which is inherently unpredictable. In software, mock traffic and automated random tests are often used to increase coverage, in the hope of exposing code to complex conditions that would otherwise go unanticipated. But this is much harder to do when data is involved.

Even if exactly how the data may shift is known, and there is enough time to test on shifted distributions, what would the “correct” result be? The issue is that ML models are attempts to understand the world via data. While software engineers have the freedom to design (and thus the responsibility of ensuring) desired outputs given certain inputs, data scientists can merely make their best attempt at answering questions with the available data and tools. There remains the question of “what is possible?”

It is obviously unreasonable to expect good performance when input data is complete noise. But short of that, how robust can we expect our models to be? This is impossible to answer prior to deployment (and likely after deployment too). This unknowability, coupled with traditional software challenges related to memory, compute time, and dependencies, makes it genuinely complex to define what a robust ML system should look like.

Data consists of more than distributions

“I dislike the term ‘distribution shift’ — it implies too many assumptions.”

Referring to data as “distributions” implies a statistical perspective, which makes it easy to ignore data’s physical attributes: volume, storage, methods of retrieval, ordering of various streams, schema, skewness of partitions, and so on. All these aspects are closely coupled with models, and distribution shift captures only one of the many dimensions of the challenge.

Additionally, model development is usually an iterative process. My original conception of “a static deployed model in changing distributions” was quite far off (at least for the platform I work on). Due to frequently changing business logic, upstream data sources, and stakeholder requirements, data scientists routinely add new features or data sources to improve their models, changing the input space.

My “eye-opening” moment came when my team customized a data drift detection service into a much simpler tool. The emergence of such services in the market shows that data shift is increasingly relevant across ML applications. However, to my surprise, the overwhelming preference of our data science team was for simple and intuitive metrics (such as the change in count or mean per column) over metrics such as KL divergence and Wasserstein distance.

The central issue is that for data drift, the ultimate goal is not merely discovery but correction. If all shifts are natural, KL-type metrics can be helpful. However, when shifts happen due to human or system errors (which are much more frequent than I thought!), straightforward metadata monitoring is more helpful in pinpointing root causes and informing mitigation plans. After all, if an outage of one specific data source causes one feature to suddenly contain 70 percent missing values, KL divergence or Wasserstein distance is unlikely to be helpful. Similarly, for issues related to model dependencies, such as the previously mentioned out-of-memory error, a direct, simple metric would be most helpful.
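To make this concrete, here is a hypothetical pandas sketch of the kind of per-column metadata monitoring described above, comparing a current batch against a baseline window; the column set and thresholds are made up for the example.

```python
import pandas as pd

def column_health(baseline: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    """Per-column metadata deltas that map directly onto likely root causes."""
    rows = []
    for col in baseline.columns:
        base, cur = baseline[col], current[col]
        row = {
            "column": col,
            "row_count_ratio": len(cur) / max(len(base), 1),
            "missing_rate_baseline": float(base.isna().mean()),
            "missing_rate_current": float(cur.isna().mean()),
        }
        if pd.api.types.is_numeric_dtype(base):
            # mean shift measured in baseline standard deviations
            row["standardized_mean_shift"] = float(
                (cur.mean() - base.mean()) / (base.std() + 1e-9)
            )
        rows.append(row)
    return pd.DataFrame(rows)

# If an upstream outage nulls out 70 percent of one feature, missing_rate_current
# jumps to roughly 0.7 and names the column directly, whereas a divergence computed
# on the observed values of that column may barely move.
```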

Additionally, there is the “data-centric versus model-centric” argument. KL-type metrics, although defined on distributions, are in fact model-centric: They need to be calculated separately for each model and recalculated whenever a model adds or removes features. On the other hand, data-centric validation techniques (such as deequ) might simply be an easier solution when multiple models have intersecting inputs and model features are updated frequently.
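As a toy, deequ-flavored illustration (this is not the deequ API), checks can be declared once per shared dataset rather than per model, so they remain valid as individual models add or drop features; the dataset and column names here are invented.

```python
import pandas as pd

# Checks belong to the dataset, not to any particular model that consumes it.
DATASET_CHECKS = {
    "book_reviews": [
        ("unique review ids", lambda df: df["review_id"].is_unique),
        ("ratings in range", lambda df: df["rating"].between(1, 5).all()),
        ("review text mostly present", lambda df: df["review_text"].isna().mean() < 0.05),
    ],
}

def validate(dataset_name: str, df: pd.DataFrame) -> list:
    """Return the names of failed checks for one version of a dataset."""
    return [name for name, check in DATASET_CHECKS[dataset_name] if not check(df)]
```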

Even before the explosion of ML, data engineers had been wrestling with data quality challenges. The sensitivity of ML model performance to data quality highlights both the importance and the trickiness of these challenges. At least on my team, the good old data quality frameworks are still favored over newer ML-specific solutions.

Solutions for the short and long term

“If robustness is so hard, what can we do with our ML systems today? What about tomorrow?”

While we might not be able to solve the robustness problem today, we can still take some steps to mitigate failures or speed up corrections. For example, we can work toward building systems that emit timely and intelligible signals upon failures, adapt to environment changes, and even potentially self-correct. Even though we might still need data scientists or engineers to retroactively fix or tweak these systems, we can build tools to make this process easier. Here are some practices adopted by our MLOps team:

  • Before going to production: In addition to verifying successful model runs in the development environment, we use a customized automated analysis tool to scan the program code for patterns that have frequently caused failures in the past.
  • While in production: We apply data schema and quality checks on datasets, and monitor drift with a customized version of Azure Machine Learning’s data drift service. Violations trigger alerts or stop model training.
  • Post-hoc debugging or correction: We use dependency mapping to track model and data dependencies (a minimal sketch of this idea follows this list). We increase automation in the deployment process to ensure timely correction. Time-travel functionality is in progress to enable easier debugging.
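As referenced in the list above, here is a minimal sketch of the dependency-mapping idea: a directed graph from upstream datasets to the models and datasets that consume them, so that a broken source can be traced to every downstream consumer. The artifact names are invented for the example.

```python
from collections import deque

# upstream artifact -> downstream consumers (illustrative names)
EDGES = {
    "raw_reviews": ["cleaned_reviews"],
    "cleaned_reviews": ["sentiment_model", "anomaly_model"],
    "anomaly_model": ["volume_forecast_model"],
}

def downstream(node: str) -> set:
    """Everything transitively affected if `node` breaks."""
    seen, queue = set(), deque([node])
    while queue:
        for child in EDGES.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# downstream("raw_reviews")
# -> {"cleaned_reviews", "sentiment_model", "anomaly_model", "volume_forecast_model"}
```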

We also envision testing frameworks that can perturb input, vary data load (or simulate outages), and even adversarially shift the data to “stress test” ML systems. To speed things up, some tests might be run on smaller mock datasets. From an organizational perspective, the impact of ML models should be measured not only by their first successful run but also by their reliability over time.
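One possible skeleton for such a stress-test harness, again with invented names: the pipeline is treated as a black-box callable and run on a small mock dataset under a handful of perturbations, with a crash counted as a finding rather than an error in the harness.

```python
import numpy as np
import pandas as pd

def simulate_outage(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Null out one feature, as if its upstream source went down."""
    out = df.copy()
    out[col] = np.nan
    return out

def inflate_volume(df: pd.DataFrame, factor: int = 10) -> pd.DataFrame:
    """Multiply data volume to probe memory and run-time limits."""
    return pd.concat([df] * factor, ignore_index=True)

PERTURBATIONS = [
    ("rating_outage", lambda df: simulate_outage(df, "rating")),
    ("10x_volume", inflate_volume),
]

def stress_test(pipeline, mock_df: pd.DataFrame) -> None:
    """`pipeline` is any callable from a DataFrame to predictions."""
    for name, perturb in PERTURBATIONS:
        try:
            preds = pipeline(perturb(mock_df))
            print(f"{name}: completed with {len(preds)} predictions")
        except Exception as exc:  # a crash under perturbation is itself a finding
            print(f"{name}: FAILED with {type(exc).__name__}: {exc}")
```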

The long-term solution might be more complex. ML robustness is not only a technical issue, but also fundamentally related to our vision of ML-human relations. For example, do we want to build auxiliary tools that help human practitioners tackle these challenges more efficiently, or do we strive for increasingly self-sufficient systems that can automatically adapt to the dynamic environment? Do we hope to empower data scientists to put their models into production in one click (and thus be responsible for potential failures), or do we believe in a strong engineering presence in any software product?

These questions do not have an absolute answer. However, as ML becomes more closely coupled with various systems across domains, the challenges of ML safety, reliability, and maintenance require us to take robustness seriously. Although the reflections in this article are based on one particular ML system, we hope some of the challenges could be relevant to other application scenarios as well. ML robustness is a complex, multidimensional challenge, and its solution depends on the collective wisdom of both engineering and data science.

Special thanks to Paul Mineiro for the discussions and inspiration, and to Huizhen Ji and Venkata Subramaniam for their helpful feedback.

Ziqi Ma is on LinkedIn.
