Contradistinguisher for Unsupervised Domain Adaptation (CUDA) in a Nutshell

A model that addresses several drawbacks of the domain alignment step used in many domain adaptation methods, while also delivering state-of-the-art results.

Pragati Kumar Singh
Towards Data Science

--

In the deep learning world, we often do not have sufficient labeled data to train our models, which makes domain adaptation a very useful topic to explore. I recently read the ICDM’19 paper “Contradistinguisher for Unsupervised Domain Adaptation”, which proposes a direct approach to unsupervised domain adaptation without the need for domain alignment.

In this blog post, I will try to present my understanding of the paper in a nutshell. The authors have also released an implementation of their method.

Domain Adaptation

In deep learning, we often lack sufficient labeled data to train a model for the task at hand. However, there may exist another, quite similar set of data (with a different underlying distribution) for which an adequate amount of labeled data is available. In that case, it would be great if we could somehow reuse a model trained on that data for our task. To get a feel for what we are talking about, let’s discuss a small example:

Suppose we have a sentiment analysis task at hand and we are given labeled book reviews for training. Let us look at one of the book reviews.

Review-1: “This book has an in depth concept building material for the topics it covers. Totally worthy of your time.”

Now suppose we have to perform sentiment analysis on furniture reviews instead. Let us look at one of the furniture reviews as well.

Review-2: “This furniture has great edge finishing and are very comfortable.”

What we can deduce here is that no furniture review will resemble Review-1, i.e. a sentiment analysis engine trained on book reviews will behave unreliably on furniture reviews due to the difference in the underlying distributions.

A similar situation arises in classification tasks on visual data, where we have two images of the same object but each comes from a different dataset (as shown in Fig 1).

Fig 1: On the left is an image of a bicycle shown on an eCommerce website, on the right is an image of a bicycle taken from a mobile camera.

Going back to our previous example, suppose our furniture reviews are unlabeled; in that case it would be useful if we could somehow exploit the labeled book review dataset. The sub-discipline that deals with this type of knowledge transfer across domains is termed Domain Adaptation. Before going further, we need to understand the meaning of a Domain.

The Domain (D) of a dataset is specified by three properties: its input feature space (X), its label space (Y) and its underlying joint probability distribution over the two.

Fig 2: Representation of a Domain

Assumption :

Domain adaptation involves two distinct domains, the source and the target, which share a common input and output space but exhibit a domain shift. Domain shift is the difference in the joint probability distributions of the two domains over this common input and output space.

Fig 3: Domain shift
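In notation (my own minimal restatement of the assumption above, not taken verbatim from the paper):

```latex
% Source and target domains share the feature space X and label space Y,
% but differ in their joint distributions over them:
D_s = (X, Y, p_s(x, y)), \qquad D_t = (X, Y, p_t(x, y)),
\qquad \text{with } p_s(x, y) \neq p_t(x, y).
```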

Motivation

The general trend in domain adaptation methods is to first achieve the intermediate goal of aligning the source and target domains, but this has some drawbacks. Large domain shifts make the domains harder to align. Another drawback is the use of multiple classifiers, which increases the model’s complexity. Domain alignment can also lead to a loss of domain-specific information, since the domains are transformed in the process.

This paper proposes a method that skips this extraneous intermediate alignment step and learns a single classifier jointly: in a supervised manner on the source data and in an unsupervised manner on the target data. The authors draw their inspiration for solving the problem in one go from Vapnik.

Vapnik’s Idea: Any desired problem should be solved in the most direct way, rather than by solving a more general intermediate task.

This paper specifically discusses the unsupervised version of domain adaptation, where the training data contains labeled source-domain instances and unlabeled target-domain instances.

OBJECTIVE: To learn a classifier on the unlabeled target domain by transferring knowledge from the labeled source domain.

Contradistinguisher (CUDA)

Fig 4: Common decision boundary for both domain distributions.

This model acts as a classifier for both domain distributions. The basic idea behind the approach is to tune the decision boundary learnt on the labeled source data so that it also follows the target domain prior.

We have two distributions at hand. The source distribution can be learned using supervised learning, but we also need the target distribution to build a target-domain classifier. In order to tune the decision boundary learnt on the source distribution, we construct a target joint distribution as a function of the source distribution while enforcing the target domain prior.

Fig 5: Decision boundaries before and after including target unsupervised learning.

The CUDA model consists of an encoder and a classifier, which are trained simultaneously on three sets of data: labeled source instances, unlabeled target instances, and fake (adversarial) samples.

Fig 9: Training CUDA model per epoch.
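To make the training flow concrete, here is a minimal PyTorch-style sketch of what one training epoch could look like. The module and function names (FeatureEncoder, Classifier, source_supervised_loss, contradistinguish_loss, adversarial_regularization_loss) are my own placeholders, not the names used in the official implementation; the three loss terms are sketched individually in the sections below.

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Maps raw inputs to a feature vector (the architecture is domain dependent)."""
    def __init__(self, in_dim, feat_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class Classifier(nn.Module):
    """Single shared classifier used for both source and target domains."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):
        return self.fc(feats)  # raw logits

def train_one_epoch(encoder, classifier, src_loader, tgt_loader, optimizer):
    # One epoch combines the three objectives described in the sections below.
    for (x_s, y_s), (x_t, _) in zip(src_loader, tgt_loader):
        logits_s = classifier(encoder(x_s))
        logits_t = classifier(encoder(x_t))

        loss = (source_supervised_loss(logits_s, y_s)                          # Fig 6
                + contradistinguish_loss(logits_t)                             # Fig 7
                + adversarial_regularization_loss(encoder, classifier, x_t))   # Fig 8

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```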

Finding source joint distribution

As per the definition of domain adaptation, in order to transfer knowledge, the model is first trained on the labeled source data. A cross-entropy loss is used to learn parameters that maximize the conditional log-likelihood of the labels given their respective source instances. Let the source distribution learned in this way be P(θ).

Fig 6: Focusing towards learning from labeled source data.
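In code, this is just the standard cross-entropy objective on a labeled source batch; a minimal sketch (the function name is my own placeholder):

```python
import torch.nn.functional as F

def source_supervised_loss(logits_s, y_s):
    """Maximize log P(y_s | x_s; theta) over labeled source instances,
    i.e. minimize the cross-entropy between predictions and source labels."""
    return F.cross_entropy(logits_s, y_s)
```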

Devising target joint distribution

Since the only difference between the source and target domains is their intrinsic joint probability distribution, and supervised learning has given us parameters that follow the joint probability of the source domain, our task now shifts to obtaining a joint distribution for the target domain.

As we know, any domain’s joint distribution satisfies two conditions: marginalizing over x[t] should give p(y[t]), and marginalizing over y[t] should give p(x[t]). The proposed target distribution must also satisfy these conditions. Since learning on the target data can only be unsupervised, the authors cleverly propose a target joint distribution which:

  • Is a function of P(θ) as well as the target domain prior.
  • Satisfies the above two conditions.
  • Can be used to find the most contrastive features.

To extract the most contrastive features, the proposed joint distribution is maximized. When maximized, it enforces two properties: it increases the probability of an instance being assigned to a class while simultaneously decreasing the probability of all other instances being assigned to that same class. Together, these two forces make the contradistinguisher extract the set of features that uniquely characterizes each instance (a rough sketch of such a loss is given after Fig 7 below).

Fig 7: Focusing towards adapting from unlabeled target data.
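Below is a rough PyTorch-style sketch of such a loss, written purely from the description above: pseudo-label each target instance, then normalize over all instances in the batch for that pseudo-label, so that other instances are pushed away from the same class. The exact formulation in the paper differs in details (for example, how the target prior is obtained), so treat this as an illustration rather than the authors’ implementation:

```python
import torch
import torch.nn.functional as F

def contradistinguish_loss(logits_t, target_prior=None):
    """Unsupervised target loss sketched from the description above.

    logits_t: (batch, num_classes) classifier outputs on unlabeled target data.
    target_prior: optional (num_classes,) tensor holding the target label prior.
    """
    probs = F.softmax(logits_t, dim=1)            # p(y | x_t) for each instance
    pseudo = probs.argmax(dim=1)                  # pseudo-label each instance

    # Probability of the pseudo-label for each instance ...
    p_self = probs[torch.arange(len(pseudo)), pseudo]
    # ... normalized over *all* instances in the batch for that same class,
    # which penalizes other instances claiming the same label.
    p_class_total = probs[:, pseudo].sum(dim=0)

    log_term = torch.log(p_self + 1e-8) - torch.log(p_class_total + 1e-8)
    if target_prior is not None:
        log_term = log_term + torch.log(target_prior[pseudo] + 1e-8)

    return -log_term.mean()                       # maximize => minimize the negative
```

Note that the batch-wise normalization is what “pushes other instances away” from a pseudo-label; there is no explicit pairwise comparison between instances.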

Regularization

A problem that might arise when using pseudo-labels during training is over-fitting to those pseudo-labels. To tackle this issue, the authors also include an adversarial regularization term whose goal is to inject uncertainty when the model is presented with fake negative samples that do not belong to any of the classes. We first build a set of fake negative samples and label them with every class simultaneously. During training, the model then learns to include this uncertainty in its predictions, which is the desired behaviour in our case.

This is akin to entropy regularization, except that instead of minimizing the entropy on real target samples, we minimize the binary cross-entropy loss of a multi-class, multi-label classification task on the fake samples. The way fake samples are generated depends on the domain at hand; the paper proposes two strategies:

  1. For language-like domains, the fake samples are randomly sampled from a Gaussian or uniform distribution.
  2. For image-like domains, the fake samples are produced by a generator network that is trained alongside the overall model.

Fig 8: Focusing towards including adversarial regularization.
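A minimal sketch of this regularizer for a language-like domain, following the description above (random fake samples, multi-label targets of all ones, binary cross-entropy on the logits); the function name and the choice of uniform noise are my own assumptions:

```python
import torch
import torch.nn.functional as F

def adversarial_regularization_loss(encoder, classifier, x_real):
    """Fake-sample regularizer sketched from the description above.

    Fake samples are drawn from random noise (language-like domains) and
    labeled with *every* class at once; the multi-label BCE loss then teaches
    the classifier to remain uncertain on inputs that belong to no real class.
    """
    # Random fake batch with the same shape as a real batch (assumption:
    # uniform noise; the paper also allows Gaussian noise or a trained generator).
    x_fake = torch.rand_like(x_real)

    logits_fake = classifier(encoder(x_fake))     # (batch, num_classes)
    all_classes = torch.ones_like(logits_fake)    # every class marked positive

    return F.binary_cross_entropy_with_logits(logits_fake, all_classes)
```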

The authors also propose an extension to multi-source domain adaptation, where there are multiple source domains and one target domain. The only change required is in the source supervised loss, which now becomes the sum of the individual supervised losses over all source domains.
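In code, this is a one-line change on top of the earlier sketch (again using my placeholder function names):

```python
def multi_source_supervised_loss(logits_per_source, labels_per_source):
    """Sum the supervised cross-entropy loss over every labeled source domain."""
    return sum(source_supervised_loss(logits, labels)
               for logits, labels in zip(logits_per_source, labels_per_source))
```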

Results

The authors carried out detailed experiments on several types of data. The paper reports results on both low-resolution and high-resolution visual datasets, with the proposed model achieving state-of-the-art results on all of them. They also evaluated on language datasets, where the model achieved state-of-the-art results on most of them. For a detailed account of the datasets used and the corresponding results, have a look at the paper.

Conclusion

The method of jointly learning a contradistinguisher proposed in this paper deals with the drawbacks of domain alignment without trading off efficiency. It gives state-of-the-art results in the various scenarios it has been tested on, and has the potential to become a milestone in the domain adaptation field.

--

An M.Tech (2018-20) student and a member of the Statistical Machine Learning Group at IISc, Bangalore.