GSoC 2021 with ML4Sci: Domain Adaptation for Decoding Dark Matter
--
Hey! In this post I’m going to talk a little bit about my experience with Google Summer of Code (GSoC) 2021 and the project that I developed under the Machine Learning for Science (ML4Sci) umbrella organization. You can check out more about ML4Sci and its other GSoC projects on ML4Sci’s website.
In my project I implemented different Unsupervised Domain Adaptation algorithms for the DeepLense pipeline.
DeepLense and gravitational lenses
DeepLense is a deep learning pipeline that combines state-of-the-art deep learning models with strong lensing simulations. In this context, my project aims to continue the work published by my mentors in “Decoding Dark Matter Substructure without Supervision”, where the team implemented unsupervised machine learning techniques to infer the presence of dark matter substructure in strong gravitational lenses. Pranath Reddy provides a great explanation of this project in his blog post.
A gravitational lens occurs when we have a galaxy (or other massive object) between us, the observers, and a source object. Due to the mass of the intermediate object, light from the source is bent, creating an effect very similar to the lenses we use in day-to-day life, where the source object now appears to be in another location (image).
But how does this relate to dark matter?
Dark matter is a form of matter that generates mass but doesn’t interact via the electromagnetic force. That means it doesn’t interact with light, making it very difficult to identify and study. Luckily, a promising way to identify the nature of dark matter is to study it through dark matter halos, and strong gravitational lenses have shown encouraging results in detecting the existence of dark matter substructure!
Unfortunately, there isn’t a lot of data on strong gravitational lenses available, which means that, if we want to train a machine learning model to identify the different kinds of dark matter substructure, we need to use simulations. The problem, though, is that a model trained on simulated data does not generalize well to real data, and its performance suffers badly.
This project aims to fix this problem by using Unsupervised Domain Adaptation (UDA) techniques to adapt a model trained on simulated data to real data!
A bit about me
I’m Marcos Tidball (nice to meet you!), a junior Physics student at the Universidade Federal do Rio Grande do Sul (UFRGS) in Brazil (I’m not good at soccer though). Before GSoC I was a research intern developing a method that uses convolutional neural networks to identify low surface brightness galaxies. And pretty soon I’m going to start an internship at BTG Pactual, the largest investment bank in Latin America.
I was interested in this project as soon as I read about it! Since I was already working on machine learning models for astronomical data (and suffering from the drop in accuracy when applying a model trained on simulations to real-world data), it seemed like the perfect fit for me!
The code
All my code is available at the ML4Sci DeepLense repository. While you’re there you can also check out the code and projects of the other students that contributed to DeepLense!
The data
The dataset I used throughout my project consists of simulated strong gravitational lens images generated with PyAutoLens. The parameters of these simulations can be found in “Decoding Dark Matter Substructure without Supervision”.
In this dataset there are three classes:
- No substructure: gravitational lenses simulated without dark matter.
- Spherical substructure: gravitational lenses simulated with subhalos of cold dark matter.
- Vortex substructure: gravitational lenses simulated with vortices of superfluid dark matter.
As a proof of concept, we don’t adapt from simulated to real data. Instead, we use “Model A” simulations as the source domain (what we train on) and “Model B” simulations as the target domain (what we want to adapt to). Model B’s simulations are more complex and more representative of real-world data, while Model A’s are simpler. In more practical terms, Model B is simulated with a variable redshift and signal-to-noise ratio, while Model A has these parameters fixed.
The dataset has 30,000 grayscale images of size 150x150 for each domain. All images are stored as NumPy arrays. More information about the dataset is available on the repository.
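To get a feel for the data, we can inspect the NumPy arrays directly. Here’s a quick sketch (the file names below are hypothetical; use the paths of your local copy of the dataset):

import numpy as np

images = np.load("model_a_train_data.npy")    # hypothetical file name
labels = np.load("model_a_train_labels.npy")  # hypothetical file name

print(images.shape)       # expected: (num_samples, 150, 150) grayscale images
print(np.unique(labels))  # 3 classes: no, spherical and vortex substructure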
Here are some of the images that come from the source dataset:
Unsupervised Domain Adaptation
Unsupervised domain adaptation is a problem in which one attempts to transfer knowledge gained from a labeled source dataset to a distinct unlabeled target dataset, within the constraint that the objective (e.g. digit classification) must remain the same. One of the most common baseline datasets for this kind of technique is the VisDA2017 dataset:
I have studied and implemented many UDA models. The choice of which models to implement was made by comparing their results on baseline datasets, with the intention of implementing some of the best-performing algorithms. In this section, I’ll be talking about the four models I’ve implemented: ADDA, Self-Ensemble, CGDM and AdaMatch.
Preparations
In order to use the algorithms, I’m going to use the package I created during this project: deeplense_domain_adaptation. The first step is to install it:
pip install --upgrade deeplense_domain_adaptation
After that, we must also define the paths to our data. In our dataset, the image data is stored separately from the label data. As such, we’ll define the paths to both our source domain (called model_f in the code) and our target domain (model_j):
# source domain: model_f
model_f_train_data_path
model_f_train_labels_path
model_f_test_data_path
model_f_test_labels_path

# target domain: model_j
model_j_train_data_path
model_j_train_labels_path
model_j_test_data_path
model_j_test_labels_path
Now we can start looking at the methods and also how to use the deeplense_domain_adaptation package in order to train these algorithms!
ADDA
ADDA (from “Adversarial Discriminative Domain Adaptation” by Eric Tzeng, Judy Hoffman, Kate Saenko, Trevor Darrell) is an adversarial domain adaptation method, where the goal is to minimize the domain discrepancy distance through an adversarial objective with respect to a discriminator.
We want the discriminator to be unable to distinguish between the source and the target distributions!
ADDA learns a discriminative representation using the labels in the source domain and then learns a separate encoding that maps the target data to the same space. Our goal is to fool the domain discriminator so that it is unable to distinguish the source from the target.
In order to train ADDA, we must use an encoder and a classifier pre-trained on the source domain as initial inputs. While ADDA is the only algorithm that requires this kind of transfer learning, the technique is also beneficial to the other algorithms.
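To make the adversarial objective a bit more concrete, here’s a minimal sketch of one ADDA update step in PyTorch (my own illustration, not the code used inside deeplense_domain_adaptation; the encoders and discriminator are assumed to be standard nn.Modules):

import torch
import torch.nn.functional as F

def adda_step(source_encoder, target_encoder, disc, x_source, x_target, opt_disc, opt_target):
    # 1) train the discriminator to tell source features (label 1) from target features (label 0)
    with torch.no_grad():
        feat_source = source_encoder(x_source)  # the source encoder stays frozen
    feat_target = target_encoder(x_target)

    logits = disc(torch.cat([feat_source, feat_target.detach()])).squeeze(1)
    labels = torch.cat([torch.ones(len(feat_source)), torch.zeros(len(feat_target))]).to(logits.device)
    loss_disc = F.binary_cross_entropy_with_logits(logits, labels)
    opt_disc.zero_grad()
    loss_disc.backward()
    opt_disc.step()

    # 2) train the target encoder to fool the discriminator (flipped labels)
    logits_target = disc(target_encoder(x_target)).squeeze(1)
    loss_enc = F.binary_cross_entropy_with_logits(logits_target, torch.ones(len(x_target)).to(logits_target.device))
    opt_target.zero_grad()
    loss_enc.backward()
    opt_target.step()

At inference time, the adapted target encoder is combined with the classifier that was trained on the source labels.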
To use the algorithm we must first load the data:
from deeplense_domain_adaptation.data import augmentations
from deeplense_domain_adaptation.data.dataset import get_dataloader

# get ADDA transforms
train_transform_source, train_transform_target, test_transform = augmentations.adda_augmentations()

# load data
bs = 100

source_dataloader = get_dataloader(model_f_train_data_path, model_f_train_labels_path, train_transform_source, bs)
source_dataloader_test = get_dataloader(model_f_test_data_path, model_f_test_labels_path, test_transform, bs)
target_dataloader = get_dataloader(model_j_train_data_path, model_j_train_labels_path, train_transform_target, bs)
target_dataloader_test = get_dataloader(model_j_test_data_path, model_j_test_labels_path, test_transform, bs)
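As a quick sanity check (assuming the dataloaders yield (image, label) batches), we can peek at one batch:

images, labels = next(iter(source_dataloader))
print(images.shape)  # e.g. a batch of 100 grayscale 150x150 images
print(labels[:10])   # integer class labels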
Then we can instantiate the network architectures:
from deeplense_domain_adaptation.networks import resnet
from deeplense_domain_adaptation.networks import discriminator

# the source_encoder and classifier should be pre-trained on source
source_encoder = resnet.Encoder('18')
target_encoder = resnet.Encoder('18')
classifier = resnet.Classifier()
discriminator = discriminator.Discriminator()
And finally train it!
from deeplense_domain_adaptation.data import hyperparams
from deeplense_domain_adaptation.algorithms import adda

# get hyperparameters
hparams = hyperparams.adda_hyperparams()

# instantiate ADDA
adda = adda.Adda(source_encoder, target_encoder, classifier, discriminator)

# train ADDA
epochs = 100
save_path = "./adda.pt"
encoder, classifier = adda.train(source_dataloader, target_dataloader, target_dataloader_test, epochs, hparams, save_path)
Then we’re able to plot the training metrics and also evaluate ADDA on the test dataset:
# plot training metrics
adda.plot_metrics()

# evaluate on test set
## returns accuracy on the test set
print(f"accuracy on test set = {adda.evaluate(target_dataloader_test)}")

## returns a confusion matrix plot and a ROC curve plot (that also shows the AUROC)
adda.plot_cm_roc(target_dataloader_test)
The confusion matrix in this case is:
And the ROC curve plot is:
This pipeline is used for all algorithms. If you’re interested in learning more about how to use deeplense_domain_adaptation, check out the tutorial on the repository!
Self-Ensemble
Self-Ensemble (from “Self-ensembling for visual domain adaptation” by Geoffrey French, Michal Mackiewicz, Mark Fisher) is based on the mean teacher model used in semi-supervised learning. The mean teacher model has two networks: a student (trained with gradient descent) and a teacher (whose weights are an exponential moving average of the student’s weights).
You can see the training pipeline of this model in the following image:
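As a rough sketch of the teacher update (my own illustration, not the package’s exact code), the teacher’s parameters track an exponential moving average (EMA) of the student’s:

import torch

@torch.no_grad()
def update_teacher(student, teacher, alpha=0.99):
    # teacher <- alpha * teacher + (1 - alpha) * student, parameter by parameter
    for p_teacher, p_student in zip(teacher.parameters(), student.parameters()):
        p_teacher.mul_(alpha).add_(p_student, alpha=1 - alpha)

The student itself is trained with the usual supervised loss on source images plus a consistency loss (e.g. a squared difference between the student’s and the teacher’s class probabilities) on differently augmented views of the same target image.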
CGDM
CGDM (from “Cross-Domain Gradient Discrepancy Minimization for Unsupervised Domain Adaptation” by Zhekai Du, Jingjing Li, Hongzu Su, Lei Zhu, Ke Lu) is a bi-classifier adversarial learning method.
CGDM minimizes the discrepancy of gradients generated by source and target samples. To compute the gradients of the target samples, it uses a clustering-based strategy to obtain more reliable pseudo-labels. It then uses self-supervised learning on the pseudo-labels in order to optimize the model with data from the source and the target domain.
Given its bi-classifier nature, after training CGDM we obtain an encoder and two classifiers, which are used together when evaluating a new data point.
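For instance, inference with a bi-classifier model might look like the following sketch (my reading of the idea, not necessarily how deeplense_domain_adaptation implements it), combining the softmax outputs of the two classifiers:

import torch

@torch.no_grad()
def bi_classifier_predict(encoder, classifier_1, classifier_2, x):
    features = encoder(x)
    probs = torch.softmax(classifier_1(features), dim=1) + torch.softmax(classifier_2(features), dim=1)
    return probs.argmax(dim=1)  # predicted class for each input image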
AdaMatch
AdaMatch (from “AdaMatch: A Unified Approach to Semi-Supervised Learning and Domain Adaptation” by David Berthelot, Rebecca Roelofs, Kihyuk Sohn, Nicholas Carlini, Alex Kurakin) is a novel method that unifies UDA with semi-supervised learning and semi-supervised domain adaptation.
This method augments each image twice, once with a weak augmentation and once with a strong augmentation. From those images we extract logits, which are randomly interpolated. The method then performs a distribution alignment to find target pseudo-labels.
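As a simplified sketch of the distribution alignment step (my reading of the paper, not the package’s exact code), the target predictions are rescaled by the ratio between the expected source and target class distributions before confident pseudo-labels are kept:

import torch

def align_and_pseudo_label(target_probs, source_probs, tau=0.9):
    # rescale target predictions by the ratio of the expected source and target class distributions
    expected_source = source_probs.mean(dim=0)
    expected_target = target_probs.mean(dim=0)
    aligned = target_probs * expected_source / (expected_target + 1e-6)
    aligned = aligned / aligned.sum(dim=1, keepdim=True)

    # keep only confident pseudo-labels (AdaMatch uses a relative confidence threshold;
    # a fixed threshold tau is used here for simplicity)
    confidence, pseudo_labels = aligned.max(dim=1)
    mask = confidence > tau
    return pseudo_labels, mask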
Current results
In order to train my models I used early stopping with a patience of 15 epochs to save computational resources, and kept the model with the best accuracy on a validation set for inference. All models were trained for up to 100 epochs on Kaggle’s GPUs, using a ResNet18 network as their backbone. The current best results for each method are:
Meanwhile, if we try to use a model that was trained only on the source dataset to infer on the target dataset, we get:
We can conclude that UDA actually helps a bunch!
Future work and final thoughts
Though the current results show a clear increase in the accuracy of our models over not applying any kind of domain adaptation, there’s still a lot of room for improvement. As seen in “DeepMerge II: Building Robust Deep Learning Algorithms for Merging Galaxy Identification Across Domains”, even though a model + UDA algorithm might perform well when adapting from an easier simulation to a harder one, it’s hard to get a large boost in performance when adapting to real data.
The first step on my list of priorities is to test equivariant neural networks, since this architecture achieves promising results in classification tasks related to gravitational lenses. Trying out other convolutional neural network architectures such as ResNet50 and EfficientNet could also prove very beneficial.
Another very important step is to make more thorough hyperparameter searches. Though time-consuming, it’s extremely important to find the best hyperparameters in order to achieve good results. Since two of the UDA algorithms we use are very dependent on the augmentations used (Self-Ensemble and AdaMatch), further exploration of augmentations is also very important.
Of course, keeping an eye out for novel UDA algorithms is also always a good practice to keep in mind, especially since this field is so hot right now!
All in all, I have to say that this project was the best opportunity I’ve gotten so far in my professional career. Being able to work with my mentors while exploring the field of UDA was amazing! And I honestly cannot even begin to fathom the amount of opportunities that will open up thanks to this project :)
I’d like to thank Pranath Reddy, Michael Toomey, Sergei Gleyzer, Anna Parul and Sourav Raha for helping me out and mentoring me during the program!
And finally, thank you Google and the people organizing Google Summer of Code for this amazing opportunity!