Introducing ADA: Another Domain Adaptation Library

TL;DR: Check out our new library on Github, toy around, and scale up domain adaptation algorithms!

Anne-Marie
Criteo Tech Blog

--

What is domain adaptation?

As you may have read in our recent blog post giving an insight into what we do at the Criteo AI Lab, one of our topics of research is Domain Adaptation, the problem of “seeing expertise acquired in some situations carried over to another situation”.

More formally, given an input space X for features, an output space Y for labels, and a probability distribution p over these spaces, the usual supervised learning assumption is that both the training data and the test data are sampled i.i.d. from p. This means that if you split your training data into a train and a validation set, train your model on the train set and evaluate it on the validation set, you get a good estimate of the test error. However, in practice, the model may be used on (test) data which is slightly different from the data it has been trained on.

Domain adaptation may be defined as the case when the training data is sampled from a source distribution pS, and the test data is sampled from a different target distribution pT, with no or few target samples labeled. Unsupervised domain adaptation assumes that there are no labeled samples for the target domain. Semi-supervised domain adaptation assumes part of the target samples are labeled, usually only a few.

Several cases are usually studied. If pT(x)≠pS(x) and pT(y|x)=pS(y|x), this is known as covariate shift. If pT(y)≠pS(y) and pT(x|y)=pS(x|y), this is known as label shift or target shift. In the real world, the situation is usually much more complex: label shift and covariate shift can be mixed together, leading to the more general notion of generalized target shift.
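To make these two notions concrete, here is a tiny synthetic illustration in Python (toy numbers of my own, not taken from any benchmark):

import numpy as np

rng = np.random.default_rng(0)

def label(X):
    # the same labeling rule p(y|x) in both domains
    return (X[:, 0] > 0).astype(int)

# Covariate shift: p(x) changes, p(y|x) stays the same.
X_source = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
X_target = rng.normal(loc=1.5, scale=1.0, size=(1000, 2))   # shifted features
y_source, y_target = label(X_source), label(X_target)

# Label shift: p(y) changes (e.g. 50/50 classes in the source vs 90/10 in the
# target) while the class-conditional distribution p(x|y) stays the same.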

For example, the popular MNIST dataset is used in domain adaptation with other datasets that also represent digits. We can say that a “Domain adaptation dataset” is actually a family of datasets, which may be treated either as source or target to simulate real-world use-cases. For instance, you may train your algorithm on MNIST and evaluate it on MNIST-M, a modified version of MNIST, or train your algorithm on SVHN (Street View House Numbers) and evaluate it on MNIST.

Figure modified from Fig.6 of the famous DANN paper¹

As you can guess from the image, the classification task is the same for all the datasets of the family. However, the distribution of the features (images) and labels may be quite different.

If you want to know more and get the latest updates on domain adaptation, I can only recommend this awesome list of resources.

Why another domain adaptation library?

Almost a year ago, when I started working on domain adaptation, I knew close to nothing about this area of research. I had fine-tuned models for transfer learning before, I had heard a few presentations about it, and that’s it. As I dived into this topic over the last year with the goal of solving new problems under generalized target shift, one thing I wanted to do was to play around with toy data to gain a better understanding of the different algorithms at hand. Problem: I had to do it all by hand. The only domain adaptation library I could find was salad, which didn’t fit my needs at the time.

As I read the seminal paper by Ganin et al. on Domain-Adversarial Training of Neural Networks (DANN)¹, I wanted to reproduce their figure 2:

This figure helps visualize key ideas behind domain-adversarial adaptation:

  • a) How the proper alignment of feature distributions in the latent space (as shown by the PCA of the representation) allows the classifier to learn a boundary that better separates the two classes for both source and target data
  • b) How “proper” alignment of features can be achieved by maximizing the domain classifier error.

Besides salad, I could only find standalone scripts, usually targeted at the Digits dataset, or small libraries that were hard to adapt to my use case. Of course, none would allow me to reproduce this kind of figure.

So I developed ADA, Another Domain Adaptation library, with these 3 key goals in mind:

  • Visualization: Allow visualizing the algorithms on toy data,
  • Reuse: Easily reuse the same code to run on the other classical domain adaptation datasets (Digits, Office31, Visda…)
  • Reproducibility: Have a clean evaluation protocol, where evaluating performance over multiple seeds is the default.

My interpretation of reproducibility in the last point may appear controversial, as it means multiplying the training and evaluation time. However, very early in my experiments, I observed high variance in the results, to the point that a change of seed could reverse conclusions. Indeed, as most domain adaptation algorithms are based on some kind of mini-max optimization, I found they are very unstable. The clean way to deal with this is to quantify the uncertainty of the performance results, which means using confidence intervals computed over several runs.

After teaming up with a colleague to beef it up, I am happy to present how this turned into Another Domain Adaptation library. Feel free to try it!

Visualization: toying with Streamlit

If you don’t know about Streamlit.io, check it out! I found it a great way to play with toy data and generate interesting problem settings. Domain adaptation algorithms come with assumptions about the underlying data, including the dependency relations between the features and the labels, or between source and target. To model these dependencies, I like to take a causal approach, which inspired how I built a toy data generation module. You can play with it in a Streamlit application and, of course, you can reproduce and visualize the example from the DANN paper (figure 2 I was writing about earlier):

Try out all the algorithms with toy examples thanks to Streamlit!
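If you prefer to stay in a notebook, a source/target pair of toy datasets is also easy to build by hand. The sketch below uses scikit-learn’s two moons and a simple rotation of the target features (an illustration only; ADA’s toy data module has its own generators):

import numpy as np
from sklearn.datasets import make_moons

# Source domain: the classic two-moons problem.
X_s, y_s = make_moons(n_samples=500, noise=0.1, random_state=0)

# Target domain: same labeling rule, but the cloud of points is rotated,
# i.e. a covariate shift between source and target.
theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
X_t, y_t = make_moons(n_samples=500, noise=0.1, random_state=1)
X_t = X_t @ rotation.T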

Code reuse: software engineering best practices

The main practice implemented in ADA is “divide and conquer”: code has been factorized as much as possible, keeping independent parts of the code in different places.

After reading a few domain adaptation papers and their implementations, I noticed a recurring pattern. Most methods aim to learn a common representation space for source and target domain, splitting the classical end-to-end deep neural network into a feature extractor with parameters Φ, and a task classifier with parameters θy. Alignment between source and target feature distributions is obtained by adding an alignment term Ld to the usual task loss Lc. This alignment term is controlled by a parameter λ which grows from 0 to 1 during learning.

A very common loss decomposition for domain adaptation.
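Spelled out in (pseudo-)PyTorch, with hypothetical names for the blocks (feature_extractor, task_classifier, alignment_loss — not ADA’s actual API), the decomposition boils down to:

import torch.nn.functional as F

def domain_adaptation_loss(feature_extractor, task_classifier, alignment_loss,
                           x_source, y_source, x_target, lambd):
    # A minimal sketch of L = L_c + lambda * L_d.
    z_s = feature_extractor(x_source)   # shared representation (parameters phi)
    z_t = feature_extractor(x_target)
    task_loss = F.cross_entropy(task_classifier(z_s), y_source)  # L_c on labeled source data
    align_loss = alignment_loss(z_s, z_t)                        # L_d between feature distributions
    return task_loss + lambd * align_loss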

Three types of alignment terms are implemented in ADA, giving you access to 3 families of methods:

  1. Adversarial methods, similar to DANN, use a so-called domain classifier with parameters θd as an adversary to align the features (see the gradient reversal sketch after this list),
  2. Optimal-transport based methods, in which the domain classifier, called a critic, estimates a divergence (the Wasserstein distance) between the source and target feature distributions, which the feature extractor then learns to minimize,
  3. Kernel-based methods, which minimize the maximum mean discrepancy in the kernel space to align features.
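To give a flavor of the adversarial family: the trick introduced by DANN is a gradient reversal layer placed between the feature extractor and the domain classifier, so that the feature extractor is updated to fool the domain classifier. A minimal PyTorch version (not ADA’s actual implementation) looks like this:

from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass, multiplies the gradient by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the feature extractor;
        # no gradient is needed for lambd itself.
        return -ctx.lambd * grad_output, None

# Usage: reversed_feats = GradReverse.apply(feats, lambd) before the domain classifier.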

In practice, a domain adaptation algorithm will have an architecture with 2 or 3 blocks: the feature extractor (dark-green box), the task classifier (blue box), and optionally a critic (red box). The architecture of these blocks is chosen by the user depending on the dataset (e.g. a ResNet50 without its last layer for the feature extractor, and linear classifiers for the task and domain classifiers).

A common architecture for domain adaptation algorithms.

As a consequence, we made the following decisions:

  • The network architecture is configured and built with parameters that depend on the dataset (dimension of the inputs, network type),
  • A domain adaptation algorithm defines how the 2 or 3 blocks are connected and how their outputs are used to compute the loss function,
  • The learning schedule for lambda is independent of the algorithm and may be used in the same way for all the methods (the classic schedule is sketched after this list),
  • As much boilerplate as possible is moved outside of the algorithm definitions: iterating over the dataset, logging, checkpointing, moving from CPU to GPU, etc.
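For reference, the schedule used in the DANN paper makes lambda grow smoothly from 0 to 1 with the training progress p (ADA’s exact schedule may differ; this is just the classic formula):

import math

def lambda_schedule(progress, gamma=10.0):
    # progress is the fraction of training completed, in [0, 1]; gamma = 10 in the DANN paper
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0

lambda_schedule(0.0) returns 0 and the value approaches 1 as training progresses, so the alignment term is switched on gradually rather than destabilizing the early epochs.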

As for the boilerplate, most of it is handled by PyTorchLightning, a library designed just for that: a “lightweight PyTorch wrapper for ML researchers”, which I recommend you check out if you want to implement new algorithms using ADA.

ADA builds on top of PyTorch and PyTorchLightning to bring you features most useful for domain adaptation:

  • Parallel iteration over the source and target data, handling unsupervised and semi-supervised target data (a generic sketch of this pattern follows this list),
  • Simulation of class imbalance with source and/or target class reweighting,
  • Built-in domain adaptation learning tricks such as a source “warm-up” stage and hyperparameter schedules.
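As an aside, a common way to write such a parallel iteration in plain PyTorch (a generic pattern, not necessarily what ADA does internally, and assuming source_loader and target_loader are standard DataLoaders whose target labels are ignored) is:

from itertools import cycle

# Cycle the target loader so every labeled source batch gets an unlabeled target batch.
for (x_s, y_s), (x_t, _) in zip(source_loader, cycle(target_loader)):
    ...  # forward pass through the feature extractor, compute L_c and L_d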

As of this writing, methods from 3 different families are implemented, namely DANN, CDAN, WDGRL, DAN, and JAN. For the semi-supervised setting, we already have a naive adaptation of DANN, as well as MME. Please check out the documentation if you want to know what is hidden behind these acronyms 😃!

The separation of concerns between dataset, network architecture, and domain adaptation method is reflected in how an experiment’s parameters are defined, which involves 3 configuration files:

  • One to describe the dataset to use,
  • One to describe the network architecture and the corresponding learning hyperparameters, with defaults available for the main datasets,
  • One to describe the methods to be run, with their parameters.

Default configuration files are available for 3 dataset families:

  • Toy data: either Gaussian blobs or 2 moons,
  • Digits: defaults to MNIST→USPS, with any combination of MNIST, MNIST-M, USPS, SVHN possible,
  • Office31: defaults to Amazon → Webcam, with any combination of Amazon, Webcam, DSLR possible.

Reproducibility: Built-in evaluation protocol

Something frustrating when reading papers and reproducing experiments is the gap between the evaluation protocol described in the paper and the code. For instance, the authors’ script would output a single number, but the paper gives an average and a confidence interval. How are they computed? On how many runs? This information is rarely found in the code.

Our library provides a function to loop over several runs, using the exact same seeds for every method, so that at the end of the warm-up stage you can check that all methods start on an equal footing. We automatically record each individual result and present the final performance as the average and confidence interval computed over all the runs. Several files are generated in a directory defined by the dataset name (e.g. MNIST → USPS):

  • A csv file recording each individual result as a row, one csv file per configuration setting,
  • A markdown file summarizing the results as a table, appending a new result to the file for each configuration,
  • Optionally, image files showing PCA, TSNE, UMAP plots for high-dimensional datasets, and more for toy datasets.
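The aggregation behind those numbers is nothing exotic; here is a rough sketch with hypothetical per-seed accuracies (not ADA’s reporting code):

import numpy as np
from scipy import stats

# Per-seed accuracies for one method (hypothetical values, e.g. read from the csv file).
accuracies = np.array([0.912, 0.897, 0.905, 0.921, 0.889])
mean = accuracies.mean()
# Half-width of a 95% confidence interval based on the Student t distribution.
half_width = stats.t.ppf(0.975, df=len(accuracies) - 1) * stats.sem(accuracies)
print(f"{mean:.3f} ± {half_width:.3f}")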

This is the kind of output you get for free: compare your results in Markdown (as a next step, you can run pandoc to convert it to LaTeX and copy-paste it in your paper). We refer you to the documentation for a description of all the algorithms listed here.

Auto-generated markdown with the results 😄

This sounds great, how can I get it?

Check it out on Github and run:

git clone git@github.com:criteo-research/pytorch-ada.git

install it with pip install -e adalib, go to the script directory and run:

  • python run_simple.py to loop over all the methods on a small toy example; add -h to see all the available parameters, or use -d ../configs/datasets/digits.json to run on MNIST → USPS,
  • pip install streamlit and streamlit run run_toys_app.py to launch a server with the streamlit application and toy around,
  • python run_full_options.py -h to see the full set of options.

Additionally, you can check out the documentation on pytorch-ada.readthedocs.io, and you can contribute your own methods, bug reports, etc. on the Github repository. We can’t wait to see your contributions and feedback!

If this library is useful for your research please cite:

@misc{adalib2020,
  title={(Yet) Another Domain Adaptation library},
  author={Tousch, Anne-Marie and Renaudin, Christophe},
  url={https://github.com/criteo-research/pytorch-ada},
  year={2020}
}

Graphics made with the awesome Excalidraw.

[1] Ganin, Yaroslav, et al. “Domain-adversarial training of neural networks.” The Journal of Machine Learning Research (2016) https://arxiv.org/abs/1505.07818

And if you are interested in contributing to our libraries, head over to our career page, or reach out to us directly!
