Understanding the Role of Dataset Shifts in Domain Adaptation
Responsibly Advancing Data-Driven Decisioning in Machine Learning
Machine learning may be the biggest buzzword in tech right now, and while some other cutting-edge technologies have failed to live up to expectations, the insights and performance that machine learning delivers have earned the attention. The ability to derive powerful conclusions from a wide variety of data is central to what Capital One and many other institutions are seeking to do; machine learning furthers that journey by extending our ability to make data-driven decisions at an even more granular level.
Domain adaptation is a continuation along this journey. It is often associated with transfer learning, the open research area of storing learned knowledge in one problem case and applying it to another problem case. In domain adaptation, we seek to construct an effective model trained on one source dataset and to use this model to make accurate classifications and determinations on another target dataset.
What is Domain Adaptation?
Domain adaptation, or multi-domain learning, is an advancement on single domain learning (the non-domain adapted ML use case). Thus, to better understand domain adaptation, let us first consider the more common machine learning base case of single domain learning:
In single domain learning, both the training set and the test set come from the same distribution. That common distribution is what the label "single domain" refers to: a single distribution equates to a single domain. As a result, a classifier trained within this distribution can be expected, with high likelihood, to deliver accurate class predictions. Single domain learning is the machine learning paradigm most readers will be familiar with.
Multi-domain learning involves a similar infrastructure, but we now have multiple dataset distributions, enabling the use of domain adaptation:
In multi-domain learning, the training set and the test set come from two different distributions. The difference between these distributions is what makes the setting "multi-domain," as each distribution is considered a unique domain. In this model, we once again seek accurate class predictions, even though the model is trained on one domain and evaluated on another.
In an ideal world, this multi-domain model would be as consistently successful as the single-domain model. Unfortunately, the discrepancy between the two domains makes default machine learning techniques inconsistent and introduces newfound risks. As powerful as domain adaptation can be, it is equally important to understand the risks associated with multi-domain learning.
Motivating Domain Adaptation
Before we dive further into the risks involved with domain adaptation, let’s consider a potential use case for multi-domain learning, laying out why it is still worthwhile to deal with these risks in order to expand our machine learning capability.
Imagine this: a startup hopes to provide a new mobile application authentication service that authenticates users through facial recognition. Users snap a photo with the front-facing camera on their smartphone; that image is then sent to the startup's model, which predicts whether the current user is indeed the user attempting to log in to the application.
This model requires training data. Companies that adopt the service need images of their users to build an accurate image classifier; however, collecting brand-new image sets of users is difficult, invasive, and unlikely to succeed. As a result, the startup decides to make use of the data its customers already have: the passport photos that users submitted along with their applications when they first signed up with those companies. The startup uses these images to train a model that classifies users accurately under classic cross-validation metrics and common dataset splits.
Now the startup moves to the real world and wants to make this model actionable. Believe it or not, deploying the model in this real-world setting is an example of domain adaptation. When it transitions to the images captured by users' front-facing cameras, the startup finds that this data comes from a different distribution than the training data. Images seen through the front-facing camera differ from those submitted at signup as a result of factors such as occlusion, lighting, facial positioning, and camera quality.
When properly applied, domain adaptation solves many of the problems that arise here. However, when utilized improperly, with no consideration of the involved risks, domain adaptation leads to inaccurate predictions and users who are unable to use the application effectively.
Domain adaptation is difficult to utilize well, but when done correctly it is tremendously powerful. Thus, it is important to be aware of the risks and how to mitigate them.
Domain Adaptation Risks
In industry, the risks associated with domain adaptation lack a standard terminology. You will often find terms such as data fracture, dataset shifts, or changing environments used to describe the same common group of risks. For the purposes of this post, we will primarily use the term dataset shift.
Dataset shift refers to a difference between the training and test distributions that leads to unreliable predictions. Typically, these differences or shifts fall into three categories:
- Covariate Shift
- Prior Probability Shift
- Concept Shift
Let’s break them down.
Covariate Shifts
Covariate shifts occur "when the data is generated according to a model and where the distribution of x changes between training and test scenarios" [Storkey, 2009]. Essentially, the model processes the data and fits it to a target outcome that appears similar for both the training and test set; however, when considering the input sets themselves, there is a clear discrepancy between their distributions. (Covariate shift is the domain adaptation risk found in the facial recognition example described above.)
Ptraining(y|x) = Ptest(y|x) and Ptraining(x) ≠ Ptest(x)
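To make the definition concrete, here is a minimal sketch of a covariate shift on synthetic data with a hypothetical labeling rule: P(y|x) is identical in both domains, but the input distribution P(x) moves.

```python
import numpy as np

rng = np.random.default_rng(0)

# The same labeling rule P(y|x) governs both domains: y = 1 when x > 1
# (a hypothetical rule chosen purely for illustration)
def label(x):
    return (x > 1.0).astype(int)

# P(x) differs: training inputs centered at 0, test inputs centered at 2
x_train = rng.normal(loc=0.0, scale=1.0, size=10_000)
x_test = rng.normal(loc=2.0, scale=1.0, size=10_000)
y_train, y_test = label(x_train), label(x_test)

# P(y|x) is identical by construction, but P(x) has clearly shifted
print(round(x_train.mean(), 2), round(x_test.mean(), 2))  # input means differ
print(round(y_train.mean(), 2), round(y_test.mean(), 2))  # class balance drifts too
```

Note that even though the rule itself never changes, the class balance drifts as a side effect of the shifted inputs, which is part of what makes covariate shift treacherous in practice.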
Prior Probability Shifts
Prior probability shifts occur when there is a change in the distribution of the class variable. Informally, a prior probability shift describes the situation where the distributions of the outputs (training and test) differ, even though the class-conditional input distributions remain the same.
Ptraining(x|y) = Ptest(x|y) and Ptraining(y) ≠ Ptest(y)
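A similarly minimal sketch, again with synthetic, hypothetical data: the class-conditional input distributions P(x|y) stay fixed, while only the class prior P(y) changes between domains.

```python
import numpy as np

rng = np.random.default_rng(1)

# Class-conditional input distributions P(x|y) are shared between domains:
# x | y=0 ~ N(-1, 1) and x | y=1 ~ N(+1, 1)
def sample(prior_y1, n):
    y = (rng.random(n) < prior_y1).astype(int)
    x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)
    return x, y

# Only the prior P(y) changes: 20% positives in training vs. 70% in test
x_train, y_train = sample(0.2, 10_000)
x_test, y_test = sample(0.7, 10_000)

# P(x|y) matches across domains, but the class balance does not
print(round(y_train.mean(), 2), round(y_test.mean(), 2))
print(round(x_train[y_train == 1].mean(), 2), round(x_test[y_test == 1].mean(), 2))
```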
Concept Shifts
Often considered the hardest form of dataset shift to identify, concept shift occurs when the relationship between the input and the class variable changes. If we consider the distribution mapping inputs to outputs in each domain, concept shift is present when those two distributions differ.
Ptraining(x) = Ptest(x) and Ptraining(y|x) ≠ Ptest(y|x) in X to Y problems
Ptraining(y) = Ptest(y) and Ptraining(x|y) ≠ Ptest(x|y) in Y to X problems
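And a minimal sketch of concept shift on synthetic data: the inputs are indistinguishable across domains, but the hypothetical labeling rule itself changes.

```python
import numpy as np

rng = np.random.default_rng(2)

# P(x) is identical in both domains...
x_train = rng.normal(0.0, 1.0, 10_000)
x_test = rng.normal(0.0, 1.0, 10_000)

# ...but the input-to-label relationship P(y|x) changes between them:
y_train = (x_train > 0.0).astype(int)  # training-time concept
y_test = (x_test > 1.0).astype(int)    # test-time concept: the boundary moved

print(round(x_train.mean(), 2), round(x_test.mean(), 2))  # inputs look alike
print(round(y_train.mean(), 2), round(y_test.mean(), 2))  # labels do not
```

This is why concept shift is hard to catch: any check that looks only at the inputs will report that nothing has changed.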
Causes of Dataset Shifts
Dataset shifts can typically be attributed to two main causes:
- Sample Selection Bias
- Non-stationary Environments
Let’s break them down.
Sample Selection Bias
When dealing with sample selection bias, distribution discrepancies can be directly connected back to biased methods for training data collection. This bias can be attributed to a prioritization for better cross validation results, non-uniform population sampling, or issues regarding data integrity.
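As a toy illustration (the selection rule below is purely hypothetical), a biased collection step alone is enough to induce a shift between the gathered training data and the population the model will later serve:

```python
import numpy as np

rng = np.random.default_rng(5)

# The true population the deployed model will face
population = rng.normal(0.0, 1.0, 50_000)

# Biased collection: the training pipeline preferentially keeps larger values
# (a hypothetical selection rule standing in for any non-uniform sampling)
keep_prob = 1.0 / (1.0 + np.exp(-2.0 * population))  # bigger x => more likely kept
train_sample = population[rng.random(population.size) < keep_prob]

# An unbiased test sample drawn uniformly from the same population
test_sample = rng.choice(population, size=10_000, replace=False)

# The collected training data no longer resembles the population
print(round(train_sample.mean(), 2), round(test_sample.mean(), 2))
```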
Non-stationary Environments
Non-stationary environments involve a temporal or spatial change between the training and test environments. In other words, the environment itself has changed in such a way that the data differs over time or space. This situation is typically found in adversarial classification problems (such as spam filtering or network intrusion detection), where an adversary warps the test set to create a dataset shift.
Handling Dataset Shifts
Managing the existence (or possible existence) of a dataset shift is a two-step process. The engineer must first detect whether a dataset shift is present, and then go about remediation. [Note: remediation is a broad and involved discussion. For brevity, this post will cover only the detection portion.]
Dataset Shift Detection
Many techniques exist for dataset shift detection, with varying degrees of success. In industry, there are two main approaches:
- Correspondence Tracing
- Conceptual Equivalence
Let’s quickly dive into both approaches in greater depth.
In correspondence tracing, a classifier is constructed for each individual feature or rule found in the data. Each classifier makes its determination from a single feature alone, and one is trained per feature for both the training and test data. A dataset shift is identified by comparing these classifiers on a one-to-one basis, feature by feature.
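As a rough sketch of the idea, not the published algorithm: fit a trivial one-feature threshold "classifier" per feature in each domain and compare the fitted rules one-to-one; a feature whose rule diverges across domains is flagged as a likely source of shift. Both the data and the threshold rule are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_threshold(x, y):
    # Trivial single-feature "classifier": the threshold halfway between
    # the two class means (purely illustrative)
    return (x[y == 0].mean() + x[y == 1].mean()) / 2.0

# Toy domains: feature 0 behaves the same in both, feature 1 shifts
def make_domain(shift, n=5_000):
    y = rng.integers(0, 2, n)
    f0 = rng.normal(loc=y * 2.0, scale=1.0)          # stable feature
    f1 = rng.normal(loc=y * 2.0 + shift, scale=1.0)  # shifted feature
    return np.column_stack([f0, f1]), y

X_tr, y_tr = make_domain(shift=0.0)
X_te, y_te = make_domain(shift=3.0)

# Compare the per-feature rules one-to-one across domains; a large gap
# flags the feature as a candidate source of shift
for j in range(2):
    t_tr = fit_threshold(X_tr[:, j], y_tr)
    t_te = fit_threshold(X_te[:, j], y_te)
    print(f"feature {j}: train threshold {t_tr:.2f}, "
          f"test threshold {t_te:.2f}, gap {abs(t_tr - t_te):.2f}")
```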
Conceptual equivalence is a less proven approach that has been found effective in certain use cases. When utilizing conceptual equivalence, the user derives a lower-level representation of the data, with a unique representation for both the source and target domains. The learned representations are then compared, in the hope of more easily identifying a dataset shift at this level. Commonly, autoencoders are used for this purpose, with the autoencoder's encoding being the lower-level representation used for comparison.
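A toy stand-in for this approach: instead of an autoencoder, the sketch below learns a linear lower-level representation (PCA computed via SVD) from the source domain, projects both domains into it, and compares them in representation space. The data, dimensions, and gap statistic are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy source/target domains: the same mixing of five latent factors, but the
# target latents are shifted, so the shift lives in a low-dimensional subspace
mixing = rng.normal(size=(5, 20))
X_src = rng.normal(size=(3_000, 5)) @ mixing
X_tgt = (rng.normal(size=(3_000, 5)) + 0.5) @ mixing

# Learn a lower-level representation from the source domain.
# (The post mentions autoencoders; PCA via SVD is a linear stand-in here.)
mean = X_src.mean(axis=0)
_, _, vt = np.linalg.svd(X_src - mean, full_matrices=False)
components = vt[:5]  # keep the five leading directions

z_src = (X_src - mean) @ components.T
z_tgt = (X_tgt - mean) @ components.T

# Compare the domains in representation space: a large gap between the
# projected means suggests a dataset shift at this level
gap = float(np.linalg.norm(z_src.mean(axis=0) - z_tgt.mean(axis=0)))
print(f"representation-space mean gap: {gap:.2f}")
```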
This post was not meant to discourage the use of multi-domain learning or domain adaptation in any way. Rather, domain adaptation — or the broader field of transfer learning — is one of the most exciting fields within machine learning. Transfer learning can potentially allow us to expand our machine learning capabilities tremendously as we leverage existing data in new environments.
That said, the intention of this post was simply to expand awareness of the risks associated with domain adaptation. When properly conducted, domain adaptation has incredible potential. But it’s important to understand what it takes to properly utilize this methodology. Data-driven decision making is an absolute focus for companies like Capital One, with an imperative to remain innovative while delivering responsible technology solutions with defendable conclusions.
Data will inform the most intelligent decisions and thoughtful selections; however, those same choices must be supported by evidence from the cutting-edge techniques one chooses to use. The pathway to machine learning-enabled products and capabilities will eventually involve mastering techniques such as domain adaptation and transfer learning. To achieve this, it is important to arm yourself with the knowledge needed to best apply these techniques within our industry.
These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2019 Capital One.