Could Your Model Recognize a Cow on a Beach? Learning From Multiple Data Distributions
The problem — Do you trust your data?
Before deploying a machine learning (ML) model into a product, we need to evaluate its performance. A well-adopted practice is to split the data into a train set (data used for training the model) and a test set (data used for evaluating the model). This prevents delivering a model which has memorized all the training data, making it unable to infer correctly from new data (i.e. data not seen during training). This protocol decides the fate of the model: is it good enough to be embedded in a product?
What we see here is a common pitfall when deploying ML models: assuming that our data is representative of the data the model will receive in production.
There are many reasons to believe that we should not trust our data too much. Most of the time, the data acquisition pipeline is never as good as we would like it to be. For instance, data may get lost or be unusable (e.g. columns not correctly filled in, files that cannot be opened, wrong reconciliation when joining databases). This means the data in the hands of the data scientists may differ from the real data. An experiment conducted by Beery et al. [1] brings to light that deep learning vision models generalize poorly when classifying images sampled in an unexpected context. For instance, a model which accurately identifies cows in a pasture may be incapable of detecting them on a beach (see Figure 1 in [1]).
Generally speaking, slight shifts in the data create spurious patterns, which the model greedily exploits to boost its performance. Consequently, the model's in-vitro performance is not its production performance, and this gap is often impossible to quantify.
The better your model handles dataset shift,
the more reliable it will be in production
More importantly, the impact of an ML model deployed in production is closely related to its ability to handle a wide variety of cases. Let’s take an example I experienced as a data scientist at Sidetrade. Ultimately, I want my model to deliver value to clients. Since the clients who use our solution are all different, taking their specificities into account is crucial when designing a model. But it took me some time to realize this is not always possible, or even desirable!
First, building a client-specific model requires a lot of engineering. Does the ROI justify such an investment? Second, does each client have clean, valid data in sufficient quantity? Third, and most importantly, do client-specific characteristics change over time?
The more the model relies on invariant features, the more reliably it can infer from data not seen during training (e.g. new clients, time-dependent data).
“Is there a cow in this picture?”
As simple as this question seems, consider how you decided upon your answer, and how you could train a machine to reach the same conclusion. You presumably decided based on distinctive traits such as the udder, horns or tail, which tend to be considered “universal” features of a cow. From a machine learning point of view, the presence of these characteristics is an “argument” for the presence of a cow.
But is our decision as simple as it seems? Is it still a cow if we can’t see its horns? And what about the other elements of the picture? Are the black and white markings part of how we identify a cow? Is the green pasture a salient detail? The problem becomes more manifest if we train a machine on Holstein cows (as shown above) and then show it a brown Tarentaise cow pastured on dry grass. Human intelligence will understand that the Tarentaise is still a cow. Will your AI model be just as clever?
For machines to accomplish this mental leap (generalizing that Holsteins and Tarentaise are cattle despite their differences), the data scientist has to watch out for spurious patterns. This refers to a mathematical relationship where events co-occur (e.g. cow + pasture) but are not causally related (i.e. due to coincidence or outside factors). Models that exclude spurious features are highly valuable, especially when the data is varied and complex.
How can I prevent the model from learning spurious patterns?
We refer to the process which generates the training data as the source distribution, and to the process which generates the test data as the target distribution. The term data shift refers to a change occurring between the source and target distributions. For example, if we trained our model on a 2019 dataset and then tried to generalize to 2020 data, accuracy could fall. The capacity of the model to generalize to a poorly-known distribution (i.e. the target distribution) is called Out-Of-Distribution (OOD) generalization.
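The accuracy drop caused by data shift is easy to reproduce on toy data. Below is a minimal sketch (not from the original experiments, all names and numbers are illustrative): a logistic regression is trained on a source distribution where a nuisance feature happens to correlate with the label (the "pasture"), then evaluated on a target distribution where that correlation is reversed (the "beach").

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, nuisance_corr):
    """Labels in {0, 1}; feature 0 is truly predictive, feature 1 is a
    nuisance whose correlation with the label depends on the domain."""
    y = rng.integers(0, 2, n)
    signal = (2 * y - 1) + 0.8 * rng.standard_normal(n)
    nuisance = nuisance_corr * (2 * y - 1) + 0.8 * rng.standard_normal(n)
    return np.column_stack([signal, nuisance]), y

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Source domain: the nuisance agrees with the label (cow on a pasture).
X_src, y_src = make_data(2000, nuisance_corr=1.0)
# Target domain: the correlation is reversed (cow on a beach).
X_tgt, y_tgt = make_data(2000, nuisance_corr=-1.0)

# Plain logistic regression trained by gradient descent on source data only.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = sigmoid(X_src @ w + b)
    w -= 0.1 * X_src.T @ (p - y_src) / len(y_src)
    b -= 0.1 * np.mean(p - y_src)

acc_src = np.mean((sigmoid(X_src @ w + b) > 0.5) == y_src)
acc_tgt = np.mean((sigmoid(X_tgt @ w + b) > 0.5) == y_tgt)
print(f"source accuracy: {acc_src:.2f}, target accuracy: {acc_tgt:.2f}")
```

The model happily splits its weight between the true signal and the nuisance, so its source accuracy is excellent while its target accuracy collapses toward chance: an OOD generalization failure in a dozen lines.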
Whereas humans are good at OOD generalization, ML models are terrible at it!
A test was run with the three comparable datasets of the Office benchmark [3]. The first contains images from the Amazon website; the second, webcam photos; and the third, pictures shot with a DSLR camera.
To the human eye, there is no doubt that the three sets show the same objects: bikes and cycling accessories. A deep learning model, however, was thrown off by the differences in photographic quality and lighting. The machine had no problem with the sharp DSLR pictures. Performance dropped with the webcam set, and was actually poor with the Amazon set.
How to adapt your model to a new distribution?
Making your algorithm generalize on new data is challenging. The main obstacle is not having information about the target data at training time. As a practitioner, you have two solutions, one ideal and the other more pragmatic:
1. run an annotation campaign to get enough quality labelled target data. The downside is that this is time-consuming and costly. What’s more, quality can be hard to assess.
2. leverage what you have, and reduce the source-target discrepancy as best you can.
Actually, data scientists are used to reducing train/test data discrepancy! Such a process includes data processing, data selection, and variable encoding, to prevent the model from memorizing the training data.
When considering a source and a target distribution, the problem is even harder; the distributions themselves are different! In the best-case scenario, reducing the source-target discrepancy requires a lot of domain expertise. In the worst-case scenario, it may be impossible. For example, manually removing misleading words from a text, or the pasture from a cow photo, would be tedious and impractical.
Unlabeled target data is often affordable
Obviously, reducing the source/target discrepancy needs additional assumptions or better knowledge about the target data. Here, I will focus on the role of unlabeled target data in addressing this problem. Of course, access to a sufficient amount of unlabeled data is not always possible. However, such data is much cheaper to acquire than running an exhaustive annotation campaign, making it an appealing alternative.
Unsupervised Domain Adaptation (UDA) is the paradigm which consists of leveraging labels from the source domain while reducing data discrepancy based on unlabeled target data. This is similar to transfer learning, since we aim to transfer knowledge from a source domain to a target domain without supervision in the target domain.
Learning to weight your loss
One strategy is to understand which source data is similar to the target data: identify source data to reject or to focus on at training time, and weight the contribution of each source sample in the loss used to derive the model’s parameters. The weights are usually computed as the ratio between the target and source densities of the features, which provides a good estimate of the importance of a source sample for learning a good model.
Above, we see a linear regression, where samples in the source domain are weighted in order to better represent target data. When fitting the model to source data, the model generalizes poorly on target data due to over-representation of source data, where x lies in [-3,1]. Conversely, when fitting the model only to source data close to the target data, the model generalizes well on the target data.
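An experiment in the spirit of the figure above can be sketched in a few lines of NumPy. This is an illustrative toy reconstruction, not the original code: it assumes both feature densities are well-approximated by Gaussians fitted to the (unlabeled) samples, so the importance weight is simply the ratio of the two fitted densities, and the linear model is fitted by weighted least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # ground-truth, non-linear relationship
    return x ** 2

# Source samples concentrate around x = -1, target samples around x = 1.
x_src = rng.normal(-1.0, 1.0, 2000)
y_src = f(x_src) + 0.1 * rng.standard_normal(2000)
x_tgt = rng.normal(1.0, 0.5, 2000)
y_tgt = f(x_tgt) + 0.1 * rng.standard_normal(2000)

def gaussian_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Importance weights w(x) = p_target(x) / p_source(x), with each density
# estimated by fitting a Gaussian to the corresponding (unlabeled) sample set.
w = gaussian_pdf(x_src, x_tgt.mean(), x_tgt.std()) / \
    gaussian_pdf(x_src, x_src.mean(), x_src.std())

def weighted_linear_fit(x, y, w):
    """Closed-form weighted least squares for y ~ a*x + b."""
    A = np.column_stack([x, np.ones_like(x)])
    return np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))

a_plain, b_plain = weighted_linear_fit(x_src, y_src, np.ones_like(x_src))
a_w, b_w = weighted_linear_fit(x_src, y_src, w)

def target_mse(a, b):
    return np.mean((a * x_tgt + b - y_tgt) ** 2)

print(f"target MSE, unweighted fit: {target_mse(a_plain, b_plain):.2f}")
print(f"target MSE, weighted fit:   {target_mse(a_w, b_w):.2f}")
```

The unweighted fit is dominated by the bulk of the source data around x = -1 and extrapolates badly; the weighted fit concentrates on the few source samples that fall in the target region and generalizes far better there.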
It is crucial to observe that this strategy is valid provided that the labelling function (i.e. the mapping from the features to the labels) is conserved across domains, a situation referred to as Covariate Shift or Sample Selection Bias. This situation is backed by strong theoretical support, which shows that the target error can be approximated using weighted source samples.
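In simplified notation (a sketch, not the exact statement from the literature): under covariate shift the conditional p_S(y|x) = p_T(y|x), so the target risk rewrites exactly as a weighted source risk,

```latex
\mathbb{E}_{(x,y)\sim p_T}\big[\ell(f(x), y)\big]
  \;=\; \mathbb{E}_{(x,y)\sim p_S}\Big[\tfrac{p_T(x)}{p_S(x)}\,\ell(f(x), y)\Big],
```

where the importance weight w(x) = p_T(x) / p_S(x) is exactly the density ratio discussed above.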
Well, problem solved? Not quite. Computing weights exposes your model to higher variance! To get some insight into this phenomenon, let’s say we have n source samples. Learning theory states that the model’s error carries a variance term proportional to (log n / n)^(1/2): the more data you have, the lower this variance. Without weighting, all samples have the same importance (1/n) when computing the loss. By promoting or rejecting source samples, weighting has the same effect as reducing the number of samples available at train time, thus increasing the variance of the model’s error. In fact, the more the target data lies in low-density regions of the source data, the more source samples the weights reject, and the noisier the learned model.
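One standard way to quantify this effect, borrowed from importance sampling (and not mentioned in the original text), is Kish's effective sample size: the number of "equally useful" samples your weighted dataset is worth. The numbers below are purely illustrative.

```python
import numpy as np

def effective_sample_size(w):
    """Kish's effective sample size: (sum w)^2 / sum (w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

n = 1000
uniform = np.ones(n)                      # every sample counts equally
print(effective_sample_size(uniform))     # -> 1000.0

# Heavy re-weighting: a handful of source samples carry almost all the mass,
# as happens when the target lies in a low-density region of the source.
skewed = np.ones(n)
skewed[:10] = 100.0
print(effective_sample_size(skewed))
```

With uniform weights the effective sample size equals n; with the skewed weights above it collapses to a few dozen samples, even though all 1000 points are still in the dataset, which is precisely the variance increase described in the paragraph.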
Learning domain invariant representations
The situation gets even worse when the source and target data do not overlap, which typically occurs when dealing with high-dimensional data (i.e. text or images, in which a lot of information is encoded). Non-overlapping supports lead to unbounded weights! In this context, the data is so different that no weighting can reduce the source-target discrepancy.
This is where deep learning comes in. The idea is to reconcile two distributions with non-overlapping supports by learning a representation of the data, through a feature extractor, such that it is impossible to tell from which domain a sample comes: a domain discriminator cannot perform better than random when classifying domains from the representations [2].
The target-adapted model is obtained by achieving a trade-off between a low source classification error and extracting features which do not carry domain-specific information, called invariant representations [2]. Learning invariant representations is usually done by fooling a domain discriminator, which is why it is called adversarial learning (the setup is similar to GANs in generative modelling).
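The adversarial mechanics can be caricatured in a few lines of NumPy, in the spirit of gradient-reversal training [2]. This is a deliberately tiny sketch, not a faithful DANN implementation: the feature extractor, label classifier and domain discriminator are all linear, and the toy data encodes the domain in a single "background" feature (the pasture vs. the beach) whose weight the adversarial term should drive toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(n, background):
    """Feature 0 carries the label in both domains; feature 1 is a
    domain-specific 'background' (the pasture vs. the beach)."""
    y = rng.integers(0, 2, n)
    X = np.column_stack([
        (2 * y - 1) + 0.5 * rng.standard_normal(n),
        background + 0.5 * rng.standard_normal(n),
    ])
    return X, y

X_s, y_s = make_domain(1000, background=-2.0)   # source: labeled
X_t, _ = make_domain(1000, background=+2.0)     # target: labels unused

sigmoid = lambda z: 1 / (1 + np.exp(-z))

W = np.array([1.0, 1.0])      # feature extractor: z = X @ W
a, b = 1.0, 0.0               # label classifier on z (source only)
u, v = 0.0, 0.0               # domain discriminator on z
X_all = np.vstack([X_s, X_t])
d_all = np.concatenate([np.zeros(len(X_s)), np.ones(len(X_t))])
lr, lam = 0.05, 1.0

for _ in range(1000):
    z_s = X_s @ W
    p = sigmoid(a * z_s + b)
    z_all = X_all @ W
    q = sigmoid(u * z_all + v)
    # feature gradients, using the current classifier/discriminator params
    grad_cls = X_s.T @ ((p - y_s) * a) / len(y_s)
    grad_dom = X_all.T @ ((q - d_all) * u) / len(d_all)
    # label classifier: descend its classification loss (source labels only)
    a -= lr * np.mean((p - y_s) * z_s)
    b -= lr * np.mean(p - y_s)
    # domain discriminator: descend its domain-classification loss
    u -= lr * np.mean((q - d_all) * z_all)
    v -= lr * np.mean(q - d_all)
    # feature extractor: descend the label loss but ASCEND the domain loss
    # (this sign flip on the domain term is the gradient reversal)
    W -= lr * (grad_cls - lam * grad_dom)

print("feature weights:", W)   # the background weight W[1] should shrink
```

After training, the label-carrying weight W[0] survives while the background weight W[1] is pushed toward zero: the representation z no longer betrays which domain a sample came from, yet remains informative for the classification task.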
Quite surprisingly, learning domain-invariant representations for unsupervised domain adaptation is backed by a strong theory [5]. Roughly, this theory states that the error committed in the target domain is bounded by the source domain error, plus a measure of divergence between the source and target distributions, plus an additional term called the adaptability of the representations. Adaptability embodies our capacity to learn, from the representations, a well-performing model in both domains. In most of the domain adaptation literature, it is assumed to be small… is it?
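In simplified notation (dropping constants, and sketching rather than quoting the exact theorem of [5]), the bound reads:

```latex
\varepsilon_T(h) \;\leq\; \varepsilon_S(h) \;+\; d_{\mathcal{H}}(p_S, p_T) \;+\; \lambda ,
```

where ε_S(h) and ε_T(h) are the source and target errors of a hypothesis h, d_H is a divergence between the two distributions as seen by the hypothesis class, and λ, the adaptability term, is the error of the best single hypothesis on both domains jointly.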
Some pioneering works aim to better understand under which conditions this term remains small [4, 6]. The major challenge comes from the fact that the adaptability of the representations is not tractable at train time (it needs target labels to be evaluated). In the most general case, there is an inherent trade-off: the more domain invariance, the higher the risk of bad adaptability. Therefore, practitioners need to check that domain invariance does not lead to poor representations and, thus, to bad classification in the target domain.
In this post, I have detailed my interest in Unsupervised Domain Adaptation for addressing a real-world problem: the lack of robustness of ML models when facing data shift. I have introduced and detailed in which context weighting source samples in the loss can help to reduce the distribution discrepancy. Unfortunately, this strategy may not always be applicable since, most of the time, the data shift results from non-overlapping data. In this context, learning domain-invariant representations may help to reconcile the source and the target domain. The main drawback is the loss of control over the semantics embedded in the features. Therefore, obtaining sufficient guarantees when leveraging these models is important. Weighting and invariant representations are significant issues, and we must be able to guarantee the robustness of our models before deploying them in production.
[1] S. Beery, G. Van Horn, and P. Perona. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[2] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, 2015.
[3] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision, pages 213–226. Springer, 2010.
[4] V. Bouvier, P. Very, C. Chastagnol, M. Tami, and C. Hudelot. Robust domain adaptation: Representations, weights and inductive bias. arXiv preprint arXiv:2006.13629, 2020.
[5] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137–144, 2007.
[6] H. Zhao, R. T. Des Combes, K. Zhang, and G. Gordon. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–753