Symmetric Heterogeneous Transfer Learning

Tomáš Chobola
Published in researchsummer
Oct 4, 2019

Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. It can also be applied when solving the same problem using different but somehow related domains. Let us consider a source and a target domain that differ in their feature space representations. We would like to unify these spaces using feature-based approaches to transfer learning.

Symmetric heterogeneous transfer learning aims to create a common latent feature space for both domains. This latent space can then be used for training a classifier. Labels may be available in both domains. The topic is closely related to representation/feature learning.

The aim of this research work is to experimentally apply state-of-the-art methods based on generative adversarial networks (GANs) and variational autoencoders (VAEs) to the symmetric heterogeneous transfer learning task, and to modify them so that the latent representation is as well suited as possible for learning the classification task.

“The coolest idea in deep learning in the last 20 years.”

— Yann LeCun on GANs

Background

Many computer vision problems can be viewed as image-to-image translation problems: mapping an image in one domain to a corresponding image in another domain. Colourisation is one such problem, as a grey-scale image has to be mapped to a corresponding colour image. Translation problems can be studied in supervised and unsupervised settings. The supervised setting provides pairs of corresponding images in the two domains; in the unsupervised setting the lack of this pairing makes the task significantly harder, but the applicability is remarkably wider. We are focusing on the latter.

The key to this approach is to find a joint distribution of images in the two domains given only the marginal distributions in the individual domains. According to coupling theory, infinitely many joint distributions can yield the given marginals, so without further assumptions this problem has numerous solutions.

We focused our attention on the UNsupervised Image-to-image Translation (UNIT) framework presented in the paper Unsupervised Image-to-Image Translation Networks, which is based on generative adversarial networks (GANs) and variational autoencoders (VAEs).

“We model each image domain using a VAE-GAN. The adversarial training objective interacts with a weight-sharing constraint, which enforces a shared-latent space, to generate corresponding images in two domains, while the variational autoencoders relate translated images with input images in the respective domains.” [arXiv:1703.00848]

Figure 1: Model for image transfer

The model consists of two domain image encoders E1 and E2 with a shared layer, two domain image generators G1 and G2 with a shared layer, and two domain adversarial discriminators D1 and D2.
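In the notation of the UNIT paper, the shared-latent space assumption behind this design can be summarised as follows: for a pair of corresponding images x1 and x2 there exists a common code z such that

z = E1(x1) = E2(x2),  x1 = G1(z),  x2 = G2(z)

so a translation from domain 1 to domain 2 is obtained as x1→2 = G2(E1(x1)), and analogously x2→1 = G1(E2(x2)).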

The encoders are responsible for mapping an input image to a code in the latent space, which is then passed to a generator that reconstructs the image. The discriminators are trained to differentiate between real and fake images, whereas the generators are trained to fool them.
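As a rough illustration of the weight-sharing constraint, the same Keras layer instance can be reused inside both encoders, so that the high-level part of the mapping into the latent space is shared between the domains. This is only a minimal sketch in TensorFlow 2 with made-up layer sizes, not the exact UNIT architecture:

import tensorflow as tf
from tensorflow.keras import layers

# Domain-specific front layers (separate weights per domain)
enc1_front = layers.Conv2D(64, 5, strides=2, padding="same", activation=tf.nn.leaky_relu)
enc2_front = layers.Conv2D(64, 5, strides=2, padding="same", activation=tf.nn.leaky_relu)

# Shared high-level layer: the very same instance is called inside both
# encoders, so both domains pass through identical weights.
shared = layers.Conv2D(256, 1, strides=1, padding="same")

def build_encoder(front, name):
    inp = tf.keras.Input(shape=(28, 28, 1))
    h = front(inp)
    z = shared(h)  # weight-sharing constraint
    return tf.keras.Model(inp, z, name=name)

E1 = build_encoder(enc1_front, "E1")
E2 = build_encoder(enc2_front, "E2")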

The complete description of the model and exact objective functions can be found in the original paper mentioned above.

First results

The structure of our work is based on weekly meetings that define the next steps needed to achieve the desired results in the long run.

The main task in the initial phase was to reimplement the UNIT framework presented in the original paper Unsupervised Image-to-Image Translation Networks in TensorFlow 2.0 and reproduce the described translations. We tested the newly created model on various datasets including orange2apple, summer2winter, and horse2zebra.

Figure 2: Transfer from an orange to an apple

The size of the datasets proved to be an issue, since the model required a lot of computing resources, which slowed down training. To accelerate the process we used only parts of the datasets to train the model. The results (examples shown in the following figures) assured us that the model was working correctly and we were able to match the transfers presented in the original work. All images presented in this article were produced by our networks.
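For reference, restricting training to a subset can be as simple as taking the first few hundred images from each split. A sketch using TensorFlow Datasets (assuming the cycle_gan/horse2zebra config; this is not necessarily how we loaded the data in our experiments):

import tensorflow as tf
import tensorflow_datasets as tfds

# Load one side of the horse2zebra data and keep only a small subset
# to reduce the compute required per epoch.
ds = tfds.load("cycle_gan/horse2zebra", split="trainA")
subset = (ds.take(300)  # use only the first 300 images
            .map(lambda ex: tf.image.resize(ex["image"], (256, 256)) / 127.5 - 1.0)
            .shuffle(300)
            .batch(1))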

Figure 3: Winter to summer transfer
Figure 4: Summer to winter transfer

The next steps followed shortly as we redesigned the model according to the description in the original paper to create transfers between images of digits from the MNIST and USPS datasets. The performance of the model was measured by classifiers trained on the original images. Since the datasets consist of grey-scale 28x28 pixel images, we were able to use them in their entirety and improve the accuracy of the model. We achieved even better classification accuracies than those reported in the original paper.

MNIST classifier accuracy: 0.9917
USPS classifier accuracy: 0.9882

USPS → MNIST accuracy: 0.9620 (predicted 0.9358)
MNIST → USPS accuracy: 0.9719 (predicted 0.9597)
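The evaluation classifiers themselves are ordinary CNNs trained on the original datasets. A minimal sketch of such an MNIST classifier (the exact architecture we used may differ):

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0

clf = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
clf.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))

# The USPS → MNIST transfer is then scored by evaluating this classifier
# on the generated images, e.g. clf.evaluate(translated_images, true_labels).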

Figure 5: Progress of the MNIST → USPS transfer for an image of the digit 7 (70 epochs)
Figure 6: Progress of the USPS → MNIST transfer for an image of the digit 4 (70 epochs)

As stated in the introduction of this article, we wanted to test whether the latent space can be used for classification. For this we created a classifier which takes images encoded by the trained image-to-image transfer model as input and tries to classify them based on their representation in the latent space. The classifier was trained on encoded images from both domains and later tested on images from each domain separately. Unfortunately, we were not able to use the whole datasets during training due to the latent space dimension and limited computing resources.

Latent space classifier accuracy: 0.9737

Original MNIST accuracy: 0.9714
Original USPS accuracy: 0.9100
Original MNIST and USPS combined accuracy: 0.9407

The results confirmed our hypothesis that the latent space can be used for classification and yields accurate predictions.
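For illustration, such a latent space classifier can be a small dense network trained on codes produced by the frozen encoders. A minimal sketch, where the encoder handles E1/E2 and the data variables are placeholders for the trained models and datasets:

import tensorflow as tf

latent_clf = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(7, 7, 1024)),  # latent resolution of the original model
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
latent_clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Encode images from both domains with the trained (frozen) encoders and
# train the classifier purely on the latent codes.
z_mnist = E1.predict(mnist_images)
z_usps = E2.predict(usps_images)
z = tf.concat([z_mnist, z_usps], axis=0)
y = tf.concat([mnist_labels, usps_labels], axis=0)
latent_clf.fit(z, y, epochs=10)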

Figure 7: Redesigned model for MNIST ↔ USPS transfer

We tried to apply the same approach to create a transfer from The Street View House Numbers (SVHN) dataset to MNIST. The problem was that the MNIST dataset consists of images of black digits on a white background, whereas SVHN contains images of variously coloured digits on varied backgrounds. This resulted in correct transfers for SVHN images where the digits were dark on a light background and incorrect transfers for inversely coloured images. To help the model with training we created a new dataset with inverted copies of the original MNIST images and merged the two together. The transfers were still not successful, though, as the generator appeared to only invert the colours of the original SVHN image.
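Creating the inverted variant of MNIST is straightforward; a minimal sketch:

import numpy as np
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()

# Merge the original images with their colour-inverted copies so the model
# sees both dark-on-light and light-on-dark digits during training.
x_inverted = 255 - x_train
x_merged = np.concatenate([x_train, x_inverted], axis=0)
y_merged = np.concatenate([y_train, y_train], axis=0)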

Next, we redesigned the model again, as we wanted to test its behaviour with three domains instead of two. The training was based on the MNIST, USPS and SVHN datasets, with images resized to 32x32 pixels and converted to RGB. Unfortunately this did not bring the desired results, as the generated images did not form legible digits. To our surprise the USPS → SVHN and MNIST → SVHN transfers were the most successful, but the majority of the generated images were unreadable.
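The preprocessing for this three-domain setup can be done with standard TensorFlow image ops; a sketch of the transformation we assume here:

import tensorflow as tf

def preprocess(image):
    # Grey-scale 28x28 MNIST/USPS images (with a channel dimension) are
    # resized to 32x32 and repeated across three channels to match SVHN.
    image = tf.image.resize(image, (32, 32))
    if image.shape[-1] == 1:
        image = tf.image.grayscale_to_rgb(image)
    return image / 127.5 - 1.0  # scale to [-1, 1]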

Following work

In the second half of the work we experimented with the output dimensionality of the layers in the model. These tests were performed on the MNIST and USPS datasets, as those are similar datasets on which we could easily run classifications and rank the performance of the model. The latent resolution of the original model for the 28x28 images is (7,7,1024).

  • Latent resolution: (7,7,16)
    Lowering the latent resolution drastically reduced the time needed to train the model. Even though the output dimensionality of the shared encoder is 64 times smaller, the results are not as much worse as one might expect. For the USPS → MNIST and MNIST → USPS transfers the classification accuracy was only 2% lower, and for the latent space classification it was 4% lower.
  • Latent resolution: (784,)
    Furthermore, we replaced the final convolutional layer of the shared encoder with a densely-connected layer (both variants are sketched after this list). The classification results of the USPS → MNIST and MNIST → USPS transfers were very similar to those of the original model, differing by only ~0.3%. The latent space classification accuracies were unfortunately much lower: the classifier achieved an accuracy of only 0.8220 during training, which is 15% lower than with the original model.
  • Other
    We also tried lowering the output dimensionality of various other layers in the model, but the results were essentially identical to the model with latent resolution (7,7,16). Replacing more convolutional layers with densely-connected layers did not work properly, as the spatial structure of the generated images became fragmented.
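A sketch of the two bottleneck variants described above (layer names are illustrative, not taken from our code):

import tensorflow as tf
from tensorflow.keras import layers

# Variant (7,7,16): the last shared-encoder convolution uses 16 filters
# instead of 1024, shrinking the latent tensor 64 times.
bottleneck_conv = layers.Conv2D(16, 1, strides=1, padding="same")

# Variant (784,): the final convolution is replaced by a densely-connected
# layer producing a flat 784-dimensional latent code.
bottleneck_dense = tf.keras.Sequential([
    layers.Flatten(),
    layers.Dense(784),
])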

Next, we altered the described model and created two new versions. Both worked without the VAE, which was an essential part of the original model; the second one additionally had a domain-confusion discriminator and used a gradient reversal layer to maximise the domain classification loss (see the sketch below). We fed both the same MNIST and USPS datasets to compare the results.
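In TensorFlow 2 a gradient reversal layer can be written with a custom gradient. A minimal sketch of the mechanism, not our exact implementation:

import tensorflow as tf

@tf.custom_gradient
def gradient_reversal(x):
    def grad(dy):
        return -dy  # flip the gradient sign on the backward pass
    return tf.identity(x), grad

class GradientReversalLayer(tf.keras.layers.Layer):
    def call(self, inputs):
        return gradient_reversal(inputs)

# The domain discriminator reads the latent code through this layer, so
# minimising its domain classification loss pushes the encoders towards
# domain-confusing (indistinguishable) latent representations.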

Figure 8: Comparison of classification accuracies of the generated images over training epochs for the three models with latent resolution (7,7,1024)

We also tried to apply the same approach to the model with latent resolution of (7,7,16).

Figure 9: Comparison of classification accuracies of the generated images over training epochs for the three models with latent resolution (7,7,16)

This led us to the conclusion that with a latent resolution of (7,7,1024) the model is highly dependent on the VAE, but with the domain-confusion discriminator it is possible to raise the performance of the model. The original model still performed better, though, and was not surpassed by either of the two new versions. Furthermore, for the latent resolution of (7,7,16) the performance of the two new models was nowhere near that of the original model.

Layer transitions

In the original model, most of the layers are shared. We wanted to test the model's behaviour with a lower number of shared layers. The first version follows the original structure of the model, but the output dimensionality of its layers is compressed. We named it Version 1; it had the following structure.

Encoder
CONV-(N64,K5,S2), BatchNorm, LeakyReLU
________________________________________
Shared encoder
CONV-(N64,K5,S2), BatchNorm, LeakyReLU
CONV-(N128,K5,S2), BatchNorm, LeakyReLU
CONV-(N128,K5,S1), BatchNorm, LeakyReLU
CONV-(N256,K1,S1)
________________________________________
Shared generator
DCONV-(N256,K4,S1), BatchNorm, LeakyReLU
DCONV-(N128,K4,S2), BatchNorm, LeakyReLU
DCONV-(N64,K4,S2), BatchNorm, LeakyReLU
________________________________________
Generator
DCONV-(N64,K4,S2), BatchNorm, LeakyReLU
DCONV-(N1,K1,S1), TanH
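Translated into Keras, the encoder part of Version 1 could look roughly as follows (CONV-(N64,K5,S2) means 64 filters, kernel size 5, stride 2; the padding choice is our assumption):

import tensorflow as tf
from tensorflow.keras import layers

def conv_block(filters, kernel, stride):
    return tf.keras.Sequential([
        layers.Conv2D(filters, kernel, strides=stride, padding="same"),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
    ])

# Unshared domain encoder: CONV-(N64,K5,S2)
encoder = conv_block(64, 5, 2)

# Shared encoder: CONV-(N64,K5,S2) → CONV-(N128,K5,S2) → CONV-(N128,K5,S1) → CONV-(N256,K1,S1)
shared_encoder = tf.keras.Sequential([
    conv_block(64, 5, 2),
    conv_block(128, 5, 2),
    conv_block(128, 5, 1),
    layers.Conv2D(256, 1, strides=1, padding="same"),
])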

To create the following models we moved the first layer of the shared encoder and the last layer of the shared generator into the unshared blocks. Version 2 looked like this.

Encoder
CONV-(N64,K5,S2), BatchNorm, LeakyReLU
CONV-(N64,K5,S2), BatchNorm, LeakyReLU
________________________________________
Shared encoder
CONV-(N128,K5,S2), BatchNorm, LeakyReLU
CONV-(N128,K5,S1), BatchNorm, LeakyReLU
CONV-(N256,K1,S1)
________________________________________
Shared generator
DCONV-(N256,K4,S1), BatchNorm, LeakyReLU
DCONV-(N128,K4,S2), BatchNorm, LeakyReLU
________________________________________
Generator
DCONV-(N64,K4,S2), BatchNorm, LeakyReLU
DCONV-(N64,K4,S2), BatchNorm, LeakyReLU
DCONV-(N1,K1,S1), TanH

Version 3 followed the same transition approach.

Encoder
CONV-(N64,K5,S2), BatchNorm, LeakyReLU
CONV-(N64,K5,S2), BatchNorm, LeakyReLU
CONV-(N128,K5,S2), BatchNorm, LeakyReLU
________________________________________
Shared encoder
CONV-(N128,K5,S1), BatchNorm, LeakyReLU
CONV-(N256,K1,S1)
________________________________________
Shared generator
DCONV-(N256,K4,S1), BatchNorm, LeakyReLU
________________________________________
Generator
DCONV-(N128,K4,S2), BatchNorm, LeakyReLU
DCONV-(N64,K4,S2), BatchNorm, LeakyReLU
DCONV-(N64,K4,S2), BatchNorm, LeakyReLU
DCONV-(N1,K1,S1), TanH

Again, we compared the quality of the generated images in the USPS → MNIST and MNIST → USPS transfers using the trained classifiers. From the results it was clear that even though all the models are able to produce quality images, the model with the most shared layers performed best.

Figure 10: Comparison of classification accuracies of the generated images over training epochs for the three models with different numbers of shared layers

Adding third domain

Even though we explored the idea of adding a third domain to the model in the first stages of our work, we wanted to pursue it further. Instead of training the model on three domains at once, we trained the model on two domains, then ‘froze’ the unshared blocks of the model and added a third encoder and generator, which were trained from scratch on the third domain. We started by training the model on MNIST and SVHN (Street View House Numbers) and later added the USPS domain. To compare the accuracy we used the latent space classifier. We found that the accuracy was only a few hundredths of a percent better when the USPS dataset was fed through the newly trained encoder than when it was passed through the already trained MNIST encoder. This is due to the similarity of the two datasets.

When we explored the same approach with training the model on the MNIST and USPS datasets first and later adding the SVHN domain, the SVHN classification through the latent layer was much worse and did not even reach 11% accuracy. This issue needs to be analysed further and the model redesigned, for example by also updating the shared layers during the training of the additional third unshared encoder. The SVHN dataset also represents a much more complex task for the model, as the images often show two- or even three-digit numbers, whereas MNIST and USPS show only single digits. This leads to confusion inside the model and causes errors that are propagated further during training.
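A rough sketch of the freezing step, with placeholder names for the already trained blocks (they do not correspond to our actual variable names):

import tensorflow as tf

# Freeze the previously trained unshared blocks (and, in our first attempt,
# also the shared layers) before training the new USPS encoder/generator.
for block in (enc_mnist, enc_svhn, gen_mnist, gen_svhn, shared_encoder, shared_generator):
    block.trainable = False

# Fresh, trainable encoder and generator for the third (USPS) domain,
# trained from scratch against the frozen shared layers.
enc_usps = tf.keras.layers.Conv2D(64, 5, strides=2, padding="same",
                                  activation=tf.nn.leaky_relu)
gen_usps = tf.keras.layers.Conv2DTranspose(1, 1, strides=1, padding="same",
                                           activation="tanh")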

We also tried converting the SVHN images from RGB to greyscale, but the performance of the model did not show any improvement. The same approach was applied to a transfer between photos and two different artistic styles. This task is much more time-demanding, as the resolution of the images is several times higher. The initial phases of training showed some potentially acceptable results, but the model would need to run far longer than with the MNIST/USPS/SVHN datasets.

Related work

  1. Unsupervised Image-to-Image Translation Networks [arXiv:1703.00848]
    This paper proposes the unsupervised image-to-image translation framework based on Coupled GANs on which we based our work. The authors describe the composition of the presented framework together with the exact objective functions.
  2. Generative Adversarial Networks for Image-to-Image Translation on Multi-Contrast MR Images — A Comparison of CycleGAN and UNIT [arXiv:1806.07777]
    The authors evaluate two unsupervised GAN models (CycleGAN and UNIT) for image-to-image translation of MR images by comparing generated synthetic MR images to ground truth images. They used the UNIT model proposed in the paper mentioned above.
  3. Generative Adversarial Nets [arXiv:1406.2661]
    This paper proposes a framework for estimating generative models via an adversarial process. It describes the simultaneous training of two models: a generative model that captures the data distribution, and a discriminative model that estimates the probability that a sample came from the training data rather than from the generative model.
  4. Image-to-Image Translation with Conditional Adversarial Networks [arXiv:1611.07004]
    This paper investigates conditional adversarial networks as a general-purpose solution to image-to-image translation problems and demonstrates synthesising photos from label maps, reconstructing objects from edge maps, and colourising images, among other tasks.
  5. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks [arXiv:1703.10593]
    The authors present an approach for learning to translate an image from a source domain to a target domain in the absence of paired examples using generative adversarial networks and demonstrate their results.
  6. Transfer Learning — Machine Learning's Next Frontier [http://ruder.io/transfer-learning/]
    The article describes the problem of transfer learning and an approach to model design featuring the domain-confusion discriminator.
