Ch 8. Adversarial Discriminative Domain Adaptation (ADDA): Quest for Semantic Alignment

Optimizing domain adaptation through toggling data annotation, training frameworks, and pre-training datasets

Lucrece (Jahyun) Shin
14 min read · Dec 23, 2021

In this post, I will introduce the concept of domain adaptation in machine learning and discuss the process of optimizing the Adversarial Discriminative Domain Adaptation (ADDA) framework. Here is the table of contents:

  1. Motivation for Domain Adaptation — Domain Shift
  2. Goal of Domain Adaptation — Semantic Alignment
  3. Web → Xray Domain Adaptation
  4. ADDA — Algorithm
  5. Quick Review of Multi-labels
  6. Experiment #1: Fine-tuning ResNet50 pre-trained on ImageNet with web (source) domain only
  7. Experiment #2: ADDA with encoder pre-trained on ImageNet
  8. Breaktime: What is the ROOT of Web → Xray Domain Shift?
  9. Experiment #3: ADDA with encoder pre-trained on Stylized+Original ImageNet
  10. Domain Adaptation: Perspectives

1. Motivation for Domain Adaptation — Domain Shift

Take a look at this interesting observation:

Classification accuracies of 4 CNN architectures and humans for classifying the images as “cat” 🐱 (Source: https://arxiv.org/abs/1811.12231)

This result, presented in ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness (2019), shows that while most humans can easily recognize all four images as a cat 🐱 despite the shifts in style, the performance of all four CNN-based image classification models (AlexNet, GoogLeNet, VGG16, and ResNet50) falls sharply for the cats depicted in silhouette and edge styles.

In research, different styles/textures are referred to as different “domains” of images. Since neural networks are highly sensitive to the distribution of incoming data, an image classification model trained only on a single domain of images (called “source domain”) will learn to encode images considering the discriminative properties of that specific domain only. Thus the same model is likely to perform poorly when tested on another domain of images (called “target domain”), as observed in the figure above. This issue is known as Domain Shift.

2. Goal of Domain Adaptation — Semantic Alignment

2.1 Input Space and Feature Space

How can we minimize domain shift and adapt the model to generalize well on a new target domain? This task, quite intuitively called Domain Adaptation, can be achieved by looking at two places within a deep learning model:

  1. Input space — we can collect or synthesize a sufficient amount of input data in the target domain and use it to train/fine-tune the model.
  2. Feature space — we can encourage the model to map input data from different domains but of the same class close together in feature space (a task called Semantic Alignment).

2.2 Semantic (Feature) Alignment

If the target domain is niche or unexplored in research, collecting enough data may be expensive or outright impossible due to data scarcity. For that reason, much research in domain adaptation is conducted in feature space, aiming to achieve Semantic Alignment, as illustrated below:

t-SNE visualization of VisDA-2017 dataset using ResNet101 before and after domain adaptation with Drop to Adapt framework; t-SNE hyperparameters are consistent in both visualizations. (Source: Drop to Adapt Paper)

Presented in Drop to Adapt: Learning Discriminative Features for Unsupervised Domain Adaptation (2019), the two plots show feature representations of VisDA-2017 dataset (12-class image classification with synthetic 3D model images as source domain and real photographic images as target domain) before and after domain adaptation. Before domain adaptation (left), source domain features (red) are shown as 12 separated clusters, while target domain features (blue) appear as one big blob. After domain adaptation (right), target domain features show much better separation. Also, although different classes are not labelled in the plots, some pairs of red and blue clusters appear nearby each other (indicated by cyan rectangles), which could represent clusters of the same class, illustrating semantic alignment.

Now, I will give you an example of real-world domain adaptation between the normal camera domain and the Xray camera domain. Let me briefly introduce the project and why I decided to use a domain adaptation approach.

3. Web → Xray Domain Adaptation

Samples of the three classes from web (source) and Xray (target) domains

3.1 Given Task

For my masters research project at the University of Toronto, I was asked to perform automatic threat detection for an airport Xray baggage scanner, i.e. given an Xray scan image like the ones above, detect any gun or knife if present.

3.2 Given Dataset

An international airport provided me with 450 Xray baggage scan images with 3 classes: gun (117 images), knife (33 images), and benign/not harmful (300 images). But the issue was that the number of given images was not enough to train a neural network without overfitting.

3.3 Suggested Research Path

My research supervisor advised me to develop a deep learning model, considering the convolutional neural network's breakthrough performance in computer vision. In particular, he suggested that I work with an image classification objective (classifying the whole image as a class), as opposed to object detection (predicting bounding boxes around objects) or object segmentation (classifying each pixel as belonging to a class or not), in order to keep the model complexity moderate. He also wanted me to take a domain adaptation approach, considering that there were not enough Xray images to train a neural network without overfitting. This approach involved first collecting a large amount of non-Xray, stock photo-like images of the same object classes from the web, using them to train the model, then adapting the model to generalize well on Xray images as well. Finally, PhD students in my research group suggested starting with the Adversarial Discriminative Domain Adaptation (ADDA) framework, for its relatively simple-to-implement yet powerful algorithm.

For more detailed project background, please refer to this project introduction post and the list of my project posts.

4. ADDA — Algorithm

Adversarial Discriminative Domain Adaptation (ADDA) (2017) introduces an effective unsupervised (meaning that target domain data is unlabeled) domain adaptation framework that reduces the difference between source and target domain distributions and thus improves generalization performance.

4.1 GAN vs. ADDA

With “adversarial” and “discriminative” terms in ADDA’s name, you might be reminded of Generative Adversarial Network (GAN) (2014). Let’s compare the two:

GAN — Image Generation

  • Input, Output — latent vector z, fake image generated by G
  • Generator G (deconvolutional layers) — maps 1D latent vector z into 3D fake image
  • Discriminator D (convolutional layers + fully connected layers) — maps 3D image into real or fake binary label (i.e. 0 for fake, 1 for real)
  • D’s objective — Classify input x as real and G’s output as fake
  • G’s objective — Confuse D to classify G’s output as real

ADDA — Image Classification

  • Input, Output — image x, class label c
  • Encoder E (convolutional layers; usually a strong backbone, e.g. ResNet50) — maps a 3D image into a 1D feature vector
  • Classifier C (fully connected layers) — maps E’s 1D feature vector output into class label c
  • Discriminator D (fully connected layers) — maps E’s 1D feature vector output into binary domain label d (1 for source domain, 0 for target domain)
  • D’s objective — Classify source and target domain features as their real domain label (1 for source domain, 0 for target domain)
  • E’s objective — (1) Encode input images in class-discriminative manner and (2) Confuse D to classify source and target domain features as their fake domain label (0 for source domain, 1 for target domain), in order to produce source and target domain features that are indistinguishable in feature space
  • C’s objective — Discriminate between different classes (e.g. cross entropy loss)
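
To make the roles of E, C, and D concrete, here is a minimal PyTorch sketch of the three modules. The class names, hidden sizes, and the choice of a ResNet50 backbone below are illustrative assumptions in line with the description above, not the ADDA paper's reference code.

```python
import torch
import torch.nn as nn
from torchvision import models

class Encoder(nn.Module):
    """E: maps a 3D image into a 1D feature vector (ResNet50 backbone without its FC head)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer

    def forward(self, x):
        return self.features(x).flatten(1)  # (batch, 2048)

class Classifier(nn.Module):
    """C: maps E's feature vector to class logits."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, f):
        return self.fc(f)

class Discriminator(nn.Module):
    """D: maps E's feature vector to a binary domain logit (source vs. target)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2048, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, f):
        return self.net(f)
```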

4.2 ADDA — Asymmetric Mapping

Sequential training with asymmetric mapping (original ADDA)

The ADDA paper proposes having two separate encoders for mapping source and target domain images. As shown in the figure above, the classification and domain adaptation tasks are performed sequentially, one after another. First, the source encoder is pre-trained on the class labels of the labeled source images. Next, a target encoder with an architecture identical to the source encoder's is initialized with the source encoder's pre-trained weights, then trained using binary (source vs. target) domain labels while the source encoder's weights are frozen. Since ADDA performs unsupervised domain adaptation, it assumes that target domain data is unlabeled and does not optimize for class classification on the target domain.

The paper suggests that such asymmetric mapping between source and target encoders is more flexible, as it allows more domain-specific feature extraction to be learned.

4.3 Modified ADDA — Symmetric Mapping

Parallel training with symmetric mapping (modified ADDA)

I slightly modified ADDA to have a single encoder for mapping images from BOTH source and target domains. This eliminates the pre-training stage of the source encoder. As shown in the figure above, a single encoder is simultaneously trained for classification (only using the source domain images and class labels), and domain adaptation (using both source and target domain images + binary domain labels) in a single epoch.

I came up with this modification after a survey paper on deep learning applications in Xray security imaging (2021) reported that a majority of recognized adversarial discriminative domain adaptation models use symmetric mapping.

Comparison of different adversarial discriminative models, where ‘En’ is short for Encoder. ‘shared’ means symmetric mapping with a single encoder sharing weights for both source and target domain, while ‘unshared’ means asymmetric mapping with two separate encoders. Highlighted in yellow is the ADDA paper. (Source: Survey paper on unsupervised domain adaptation, 2021)

I also observed better performance when using symmetric mapping. This could be because, with asymmetric mapping, the source encoder's pre-trained weights may already be too biased towards mapping source domain images according to their class labels, making them a suboptimal starting point for the target encoder. Since symmetric mapping allows optimizing for classification and domain adaptation simultaneously in one training epoch, the encoder can adjust its weights considering both tasks.
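
The sketch below shows what one training step of this modified, symmetric ADDA could look like, assuming the Encoder/Classifier/Discriminator modules sketched in Section 4.1. The batch variables, optimizers, and loss weighting are illustrative assumptions, not the exact settings in my Colab notebook; the inverted domain labels in step (3) are what "confuses" the discriminator.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def train_step(encoder, classifier, discriminator,
               src_imgs, src_labels, tgt_imgs,
               opt_enc_cls, opt_disc):
    device = src_imgs.device

    # (1) Classification loss on labeled source (web) images only
    src_feats = encoder(src_imgs)
    cls_loss = bce(classifier(src_feats), src_labels)        # multi-label targets (see Section 5)

    # (2) Train D to predict the true domain labels: source = 1, target = 0
    tgt_feats = encoder(tgt_imgs)
    d_logits = discriminator(torch.cat([src_feats.detach(), tgt_feats.detach()]))
    d_labels = torch.cat([torch.ones(len(src_imgs), 1),
                          torch.zeros(len(tgt_imgs), 1)]).to(device)
    disc_loss = bce(d_logits, d_labels)
    opt_disc.zero_grad(); disc_loss.backward(); opt_disc.step()

    # (3) Train E (and C): classification loss + inverted domain labels to confuse D,
    #     so that source and target features become indistinguishable
    fool_logits = discriminator(torch.cat([src_feats, tgt_feats]))
    fool_labels = 1.0 - d_labels                              # flipped: source -> 0, target -> 1
    enc_loss = cls_loss + bce(fool_logits, fool_labels)
    opt_enc_cls.zero_grad(); enc_loss.backward(); opt_enc_cls.step()
```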

4.4 ADDA — Reported Performance

(Left) MNIST, USPS, and SVHN samples and ADDA experimental results (Right) Office dataset samples and ADDA experimental results (Source: ADDA paper)

The table above shows ADDA's performance on digit recognition and office object recognition tasks, which is far better than the "Source only" model trained with source domain data only, and comparable to or better than previous domain adaptation frameworks. The paper doesn't mention semantic alignment; however, I will later show how ADDA achieves it for my own problem.

A step-by-step PyTorch implementation of ADDA training (plus functions for t-SNE plotting and defining multi-label datasets) is included in my Colab Notebook. Next, I will discuss my domain adaptation experiments performed in chronological order. At each step, I will present the resulting t-SNE plots of source and target domain features to check for semantic alignment.
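
Before diving in, here is a minimal sketch of how such a t-SNE check can be produced. The `src_loader`/`tgt_loader` DataLoaders, the device, and the fixed perplexity value are hypothetical stand-ins, not my exact Colab code.

```python
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

@torch.no_grad()
def tsne_by_domain(encoder, src_loader, tgt_loader, device="cuda"):
    feats, domains = [], []
    for domain_id, loader in enumerate([src_loader, tgt_loader]):   # 0 = web (source), 1 = Xray (target)
        for imgs, _ in loader:
            feats.append(encoder(imgs.to(device)).cpu().numpy())
            domains.append(np.full(len(imgs), domain_id))
    feats, domains = np.concatenate(feats), np.concatenate(domains)

    # Keep the t-SNE hyperparameters fixed so plots are comparable across experiments
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
    plt.scatter(emb[domains == 0, 0], emb[domains == 0, 1], s=4, c="red", label="web (source)")
    plt.scatter(emb[domains == 1, 0], emb[domains == 1, 1], s=4, c="blue", label="Xray (target)")
    plt.legend(); plt.show()
```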

5. Quick Review of Multi-labels

Xray images and web images containing knife and gun

As shown above, most Xray scan images contain other benign (i.e. not harmful) objects cluttered together with a gun or knife, while most web images show an isolated object. So the model can easily pay attention to other objects in Xray images besides the gun or knife. With this in mind, annotating each image with a standard single label (class 0, 1, or 2) does not allow the model to predict that more than one class is present in the image, e.g. a gun and other benign objects. To fix this, I assigned a multi-label to each image:

Three different types of images and respective target labels

Note that I discovered the model's tendency to classify images as the benign class with higher confidence than the gun and knife classes (quite intuitively, as the benign class represents the universe besides gun and knife 🤨). Since detecting benign objects was not as important as detecting a gun or knife, I weakened the benign signal by giving the benign class a "soft" label of 0.5 while keeping the others at 1. More details on multi-labels are discussed in my previous post about data optimization.
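
As a concrete (and partly hypothetical) illustration, the sketch below assumes the class order [gun, knife, benign] and shows how such softened multi-label targets pair with a per-class sigmoid loss; the exact target vectors are my own reading of the figure above, not copied from my dataset code.

```python
import torch
import torch.nn as nn

# Illustrative multi-label targets, assuming the class order [gun, knife, benign];
# the benign entry is softened to 0.5 as described above.
web_gun               = torch.tensor([1.0, 0.0, 0.0])   # isolated gun from the web
xray_gun_with_clutter = torch.tensor([1.0, 0.0, 0.5])   # gun + benign clutter in an Xray scan
xray_benign_only      = torch.tensor([0.0, 0.0, 0.5])   # benign objects only

# Multi-labels call for one sigmoid per class rather than a single softmax over classes
criterion = nn.BCEWithLogitsLoss()
logits = torch.randn(1, 3)                               # classifier output for one image
loss = criterion(logits, xray_gun_with_clutter.unsqueeze(0))
```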

6. Experiment #1: Fine-tuning ResNet50 pre-trained on ImageNet with web (source) domain only

For my initial experiment, I downloaded a ResNet50 with weights pre-trained on the ImageNet dataset and fine-tuned it on web (source) domain images for both the single-label and multi-label cases. No Xray (target) domain images were used in the process. For both cases, model recall on source domain images reached 0.99+ in less than 10 training epochs. In contrast, Xray (target) domain recalls were poor due to domain shift:

Recall table for Xray images (v1)

Recalls for the multi-label case, although roughly twice as high as in the single-label case, are still far below the desired 100%. Below are the t-SNE plots of source and target domain features encoded by the fine-tuned ResNet50 with multi-label data, with colour labels by domain (left) and domain+class (right).

t-SNE plot of source-only, multi-label model features, distinguished by domain (left) and by domain+class (right)

The left plot shows a similar pattern to Section 2.2's left plot: web (source) domain features in red are tightly clustered by class (gun and knife), while Xray (target) domain features in blue appear as one blob. The right plot shows that the model is far from achieving semantic alignment. We do see Xray gun features (yellow) drifting towards web gun features (red), but they are still more closely attached to the Xray knife and Xray benign features. Xray knife features (pink) are scattered randomly with no sign of getting close to the web knife cluster (cobalt blue). This signifies that the model is not able to look past the Xray texture and detect the objects it was trained to detect within the Xray images.

7. Experiment #2: ADDA with encoder pre-trained on ImageNet

Next, I trained with the ADDA framework. I first gave the encoder the same architecture as ResNet50 and initialized it with weights pre-trained on the ImageNet dataset. Then I trained the encoder using (1) web images and their class labels for classification and (2) both web and Xray images along with 0 vs. 1 domain labels for domain adaptation. Since ADDA performs unsupervised domain adaptation, the labels of Xray images were never used. Below are the resulting t-SNE plots of the encoder's features at several intermediate training epochs (1, 5, 10, 15, and 21):

t-SNE plots of ADDA encoder (pre-trained on ImageNet) features, distinguished by domain (top) and by domain+class (bottom) at training epochs 1, 5, 10, 15, 21

At epoch 1, we see a single blob of Xray features in blue with all three classes clustered together. As ADDA training progresses, however, the single blob starts to break apart. Xray knife features (pink) start migrating towards web knife features (cobalt blue), while Xray gun features (yellow) start migrating towards web gun features (red). At epoch 21, Xray gun and Xray knife features are seen at a small distance from Xray benign features, which is a huge improvement towards achieving semantic alignment! Such qualitative improvement is also reflected in the increased gun and knife recalls:

Updated recall table for Xray images (v2)

8. Breaktime: What is the ROOT of Web → Xray Domain Shift?

Despite such improvement, 78% and 70% recalls for gun and knife are still far from ensuring flight safety. We have already tried ADDA, so what can we do next? 🤔 I tried training several other domain adaptation frameworks introduced more recently (such as Drop to Adapt, Domain Mixup, and DADA) on my dataset, but couldn't obtain any better results. So instead, I thought more about the root of the web → Xray domain shift problem.

Samples of Web images and Xray images containing gun and knife

Looking at web and Xray images, I could point out two major differences:

  1. Texture shift — Xray images have limited colours, increased transparency, and slight blurriness compared to web images
  2. Level of object clutter — Xray images contain many different objects cluttered together, compared to web images with a clear presence of the main object

But how come we humans can detect a gun or knife in Xray images despite such difficulties? Perhaps we clearly remember the SHAPE of a gun and a knife and try to locate it in Xray images. Thinking about this led me back to the paper introduced at the beginning of this post: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness (2019).

This paper proposes the Texture Hypothesis, which states that “object textures are more important than global object shapes for CNN object recognition. Local information such as texture may actually be sufficient to ‘solve’ ImageNet object recognition”. This is a fatal flaw in CNN models, and it is especially relevant to my problem, where a huge texture shift takes place from web images to Xray images. So in order to make the model more sensitive to the shape of objects instead of their texture, the paper suggests pre-training the model on stylized images. “Stylizing” an image means keeping the content/shapes in the image while replacing its style/texture with that of a painting randomly selected from the Painter by Numbers dataset (which contains 79,434 paintings) using AdaIN style transfer. Here’s an example of an image stylized with ten different paintings:

10 stylized samples of an image of a ring-tailed lemur. The samples have content/shapes of the original image on the left and style/texture from 10 different paintings (Source: https://arxiv.org/abs/1811.12231)

The paper reports that the model pre-trained on BOTH the stylized and original ImageNet datasets then fine-tuned on the original performed the best. The model checkpoints and download instructions are available in the paper author’s Github repository.
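
For completeness, here is a hedged sketch of how such a checkpoint can be loaded into a torchvision ResNet50 to serve as the encoder's initialization. The file name is a placeholder for whatever you download following the repository's instructions, and the "module." prefix handling is only needed if the checkpoint was saved from a DataParallel-wrapped model.

```python
import torch
from torchvision import models

backbone = models.resnet50(pretrained=False)

# Placeholder path; replace with the checkpoint downloaded per the repository's instructions
ckpt = torch.load("checkpoints/resnet50_stylized_plus_original_in.pth.tar", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

# Strip the "module." prefix if the checkpoint came from a DataParallel-wrapped model
state_dict = {k.replace("module.", "", 1): v for k, v in state_dict.items()}
backbone.load_state_dict(state_dict, strict=False)   # strict=False tolerates head/key mismatches
```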

9. Experiment #3: ADDA with encoder pre-trained on Stylized+Original ImageNet

So I trained with ADDA again, setting the encoder architecture to ResNet50 as before but initializing it this time with the weights of the best-performing model mentioned above. Here are the final resulting t-SNE plots:

t-SNE plot of ADDA encoder (pre-trained on Stylized+Original ImageNet) features by domain (left) and by domain+class (right)

Do you spot the improvements from previous experiments? For a better perspective, here are the plots from the three experiments:

t-SNE plots of encoded features by domain (top) / by domain+class (bottom) for three different experiments

Looking at the bottom three plots, from left to right, we see progress in the following aspects:

  • Separation of Xray gun and Xray knife features (yellow and pink) from Xray benign features (cyan)
  • Migration of Xray gun features (yellow) towards web gun features (red)
  • Migration of Xray knife features (pink) towards web knife features (blue)

Such progress shows that this model has achieved semantic alignment. The model is now able to look past the Xray texture and detect guns and knives despite the texture shift from normal camera to Xray. The following table shows the increased gun and knife recalls for the final model:

Updated recall table for Xray images (v3)

10. Domain Adaptation: Perspectives

In summary, I considered various perspectives for domain adaptation, including the model training framework, the pre-training data, a qualitative metric (semantic alignment), and a quantitative metric (recall). I conveniently divided up my research process into three experiments for this post, but in fact hundreds of experiments were performed in between (plus thousands of thought experiments in the shower) to test, to name a few, the effectiveness of multi-labels, the importance of learning object shapes, and the effectiveness of t-SNE plots in checking for semantic alignment (each elaborated in my other posts as linked). Also note that the ADDA paper never mentions semantic alignment, but I recognized it as the main goal of domain adaptation after digging through many other research papers on domain adaptation.

As machine learning practitioners, we often aim to find a research paper that solves a problem similar to our own, implement and run the algorithm on our data, and boom! accuracies shoot up and the problem is solved. But if the results are not good enough, we must diligently look for different perspectives from which to view the problem. I figured that my data might have very different characteristics from the datasets used in the ADDA paper, and looked for ways to accommodate that difference.

Again, a step-by-step PyTorch implementation of ADDA training (plus functions for t-SNE plotting and defining multi-label datasets) is included in my Colab notebook. You can contact me with any questions or feedback 😊. Thanks for reading and happy machine learning! 🦋🦋
