Unsupervised Domain Adaptation by Backpropagation
Domain adaptation is the problem of learning a classifier or some other predictor when there is a shift between the source distribution and the target distribution.
As can be seen from the image above, a classifier learned on the source distribution (blue) will not perform well on the target distribution (red). Therefore, we need to make the two distributions indistinguishable.
Below are some example pairs of source and target domains.
Problem Statement
Given a labelled source domain and an unlabelled target domain, we would like to train a classifier or predictor that gives accurate predictions on the target domain.
Assumptions
- The probability distribution of the source domain is not equal to the probability distribution of the target domain.
- The conditional probability distribution of the labels given an instance is the same for the source domain and the target domain.
- The source dataset is labelled.
- The target dataset is unlabelled.
Goal
Perform some transformation on the source and target domains that brings the transformed distributions closer together. Then train the classifier on the transformed source distribution; since the two transformed distributions are now similar, the model will achieve better accuracy on the target domain at test time.
In order to perform the transformation, a neural network is used. Let the network performing the transformation be denoted F, with parameters W. Let instances from the source and target domains be denoted s and t. The vectors obtained by applying F with weights W to the source and target instances are denoted Vs and Vt:
F(s,W) = Vs and F(t,W) = Vt.
The goal is to bring the probability distributions of Vs and Vt close to each other, i.e. P(Vs) ≈ P(Vt).
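As a minimal illustration of this shared transformation, the same network F with a single set of weights W is applied to instances from both domains. This is a sketch only; the input and feature dimensions below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical feature extractor F with shared weights W; sizes are assumptions.
F = nn.Linear(100, 64)

s = torch.randn(32, 100)   # batch of source instances
t = torch.randn(32, 100)   # batch of target instances

Vs = F(s)   # F(s, W) = Vs
Vt = F(t)   # F(t, W) = Vt, the same weights W are used for both domains
```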
Approach
The above goal can be achieved using the following components.
- Feature Extractor : a neural network that learns to perform the transformation on the source and target distributions.
- Label Classifier : a neural network that learns to classify the transformed source instances; it can be trained because the source domain is labelled.
- Domain Classifier : a neural network that predicts whether the output of the Feature Extractor comes from the source distribution or the target distribution.
Basic Intuition : The feature extractor tries to perform a transformation on the source and target instances such that the transformed instances appear to come from the same distribution, so that the domain classifier cannot tell which domain a transformed instance came from. This is achieved by training the two networks adversarially: the feature extractor is trained to maximize the domain classification loss, while the domain classifier is trained to minimize it. In other words, the feature extractor tries to confuse the domain classifier by bringing the two distributions closer together. At the same time, the label predictor is trained to predict the labels of the transformed source instances. The feature extractor is therefore trained to minimize the classification loss of the label predictor and to maximize the classification loss of the domain predictor, while the label predictor and the domain predictor are each trained to minimize their respective classification losses.
Using these three components, the feature extractor learns to produce features that are both discriminative and domain-invariant.
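The min-max objective described above can be sketched as follows. This is a minimal PyTorch illustration, not the architecture from the paper; the layer sizes, the input shape (28x28 grayscale images), and the trade-off weight lam are assumptions. It only shows how the two losses are computed; routing the reversed domain-loss gradient into the feature extractor is handled by the Gradient Reversal Layer described in the next section.

```python
import torch
import torch.nn as nn

# Illustrative architectures; the layer sizes are assumptions.
feature_extractor = nn.Sequential(
    nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 64)
)
label_classifier = nn.Linear(64, 10)    # predicts class labels (source only)
domain_classifier = nn.Linear(64, 2)    # predicts source (0) vs target (1)

ce = nn.CrossEntropyLoss()

def losses(xs, ys, xt, lam=0.1):
    """Compute the label loss and the domain loss for one source/target batch pair."""
    vs = feature_extractor(xs)                  # transformed source instances Vs
    vt = feature_extractor(xt)                  # transformed target instances Vt

    label_loss = ce(label_classifier(vs), ys)   # only source instances have labels

    domain_logits = domain_classifier(torch.cat([vs, vt]))
    domain_targets = torch.cat([
        torch.zeros(len(vs), dtype=torch.long),  # source = 0
        torch.ones(len(vt), dtype=torch.long),   # target = 1
    ])
    domain_loss = ce(domain_logits, domain_targets)

    # Feature extractor and label classifier minimize: label_loss - lam * domain_loss
    # Domain classifier minimizes: domain_loss
    return label_loss, domain_loss
```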
Gradient Reversal Layer
To train the feature extractor to maximize the classification loss of the domain predictor, a Gradient Reversal Layer (GRL) is placed between the feature extractor and the domain classifier. The GRL acts as an identity function during forward propagation (its output is the same as its input), but during backpropagation it multiplies the incoming gradient by -1. Intuitively, during backpropagation the GRL makes the feature extractor take the opposite of a gradient-descent step, i.e. it performs gradient ascent on the feature extractor with respect to the classification loss of the domain predictor.
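A common way to implement the GRL is as a custom autograd function that behaves as the identity in the forward pass and flips the sign of the gradient in the backward pass. Below is a minimal PyTorch sketch; scaling the reversed gradient by a factor lam is an assumption about how the trade-off hyper-parameter can be applied.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)          # identity: output is the same as the input

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return -ctx.lam * grad_output, None   # None corresponds to the lam argument

# Usage: insert between the feature extractor and the domain classifier.
# features = feature_extractor(x)
# domain_logits = domain_classifier(GradientReversal.apply(features, 1.0))
```

With the GRL in place, a single backward pass through the total loss (label loss plus domain loss) updates the domain classifier to minimize the domain loss while the reversed gradient pushes the feature extractor to maximize it.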
Results
Pros
- Brings the source and target domains closer together after learning a transformation
- Learns discriminative and domain-invariant features
Cons
- The Gradient Reversal Layer leads to a vanishing-gradient problem once the domain predictor has achieved good accuracy
- The same weights are used for transforming both the source domain and the target domain. Since the source and target domains may have different features, sharing weights leaves fewer parameters for learning domain-specific features when transforming the two distributions.