Review — UNIT: Unsupervised Image-to-Image Translation Networks (GAN)

With Shared-Latent Space Assumption, Extending CoGAN, Outperforms CoGAN on Domain Adaptation

Sik-Ho Tsang

Published in Nerd For Tech, Apr 10, 2021

In this story, Unsupervised Image-to-Image Translation Networks (UNIT), by NVIDIA, is reviewed. In this paper:

A shared-latent space assumption is made: a pair of corresponding images in different domains can be mapped to the same latent representation in a shared-latent space.
Image-to-image translation is performed through that shared-latent space. (In this unsupervised setting, there are no paired examples showing how an image could be translated to a corresponding image in another domain.)

This is a paper in 2017 NIPS with over 1500 citations. (Sik-Ho Tsang @ Medium)

Outline

Shared-Latent Space Assumption
UNIT: Framework
Training Loss
Image-to-image Translation Results
Domain Adaptation Results

1. Shared-Latent Space Assumption

It is assumed that, for any given pair of images x1 and x2 from two different domains X1 and X2 respectively, there exists a shared latent code z in a shared-latent space Z.

E1 and E2 are 2 encoding functions, mapping images to latent codes.
G1 and G2 are 2 generation functions, mapping latent codes to images.
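
In the paper's notation, the assumption can be written compactly as below. The starred symbols denote the hypothesized ideal functions, and F* denotes the translation function obtained by composing the encoder of one domain with the generator of the other. This is a transcription, so please check the paper for the exact form.

```latex
\begin{aligned}
z   &= E_1^*(x_1) = E_2^*(x_2), \\
x_1 &= G_1^*(z), \qquad x_2 = G_2^*(z), \\
x_2 &= F_{1\to 2}^*(x_1) = G_2^*\big(E_1^*(x_1)\big), \qquad
x_1  = F_{2\to 1}^*(x_2) = G_1^*\big(E_2^*(x_2)\big).
\end{aligned}
```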

2. UNIT: Framework

2.1. Variational Autoencoder (VAE)

The encoder–generator pair {E1, G1} constitutes a VAE for the X1 domain, termed VAE1.

VAE1 first maps the input image x1 to a code in the latent space Z via the encoder E1, and then decodes a randomly perturbed version of the code to reconstruct the input image via the generator G1.

It is assumed the components in the latent space Z are conditionally independent and Gaussian with unit variance.
Similarly, {E2, G2} constitutes a VAE for X2: VAE2.
Self-reconstruction, i.e. X1 → X1 and X2 → X2 through Z, is called the image reconstruction stream.
This image reconstruction stream can be trained in a supervised manner, since the input image itself serves as the ground truth.
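
The encode, perturb, decode path of VAE1 can be sketched as below. This is only a toy sketch: the linear maps `encode_mean` and `decode` and the weights `w` are made-up stand-ins for the CNNs E1 and G1, and the only property taken from the text is that the code is perturbed with unit-variance Gaussian noise.

```python
import random

def encode_mean(x, w):
    """Toy stand-in for E1: a linear map from image pixels to latent means."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def sample_latent(mu, rng):
    """Draw z ~ N(mu, I): add unit-variance Gaussian noise to each component."""
    return [m + rng.gauss(0.0, 1.0) for m in mu]

def decode(z, w):
    """Toy stand-in for G1: maps the latent code back to pixels (uses w transposed)."""
    n_pixels = len(w[0])
    return [sum(w[k][j] * z[k] for k in range(len(z))) for j in range(n_pixels)]

rng = random.Random(0)
x1 = [0.2, 0.5, 0.9, 0.1]                          # a 4-pixel "image"
w = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]   # 2 latent dimensions (hypothetical weights)

mu = encode_mean(x1, w)        # E1: image -> latent mean
z1 = sample_latent(mu, rng)    # randomly perturbed code
x1_rec = decode(z1, w)         # G1: latent code -> reconstructed image
```

Since x1 is both the input and the target, the reconstruction error between `x1_rec` and `x1` needs no paired data from the other domain.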

2.2. Weight Sharing Constraint

E1, E2, G1 and G2 are CNNs, and the shared-latent space assumption is implemented using a weight sharing constraint (illustrated using dashed lines): the connection weights of the last few layers (high-level layers) in E1 and E2 are tied, and the connection weights of the first few layers (high-level layers) in G1 and G2 are tied.

The weight sharing constraint originally comes from CoGAN. Thus, the authors are actually extending CoGAN for image-to-image translation.
And the shared-latent space constraint implies the cycle-consistency constraint from CycleGAN.
Through adversarial training, image-to-image translation can be performed using the above VAEs.
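
A minimal sketch of what "tied" means in practice, and of how translation goes through the shared code. The `Layer` class and all weight values are hypothetical; real layers would be convolutions, but the point is only that both networks hold the very same layer object.

```python
class Layer:
    """Hypothetical one-parameter layer: multiplies every value by `weight`."""
    def __init__(self, weight):
        self.weight = weight

    def __call__(self, values):
        return [self.weight * v for v in values]

def run(layers, values):
    """Apply the layers in order."""
    for layer in layers:
        values = layer(values)
    return values

shared_enc = Layer(0.5)   # last (high-level) encoder layer, tied across domains
shared_gen = Layer(2.0)   # first (high-level) generator layer, tied across domains

E1 = [Layer(1.0), shared_enc]   # domain-specific low-level layer + shared layer
E2 = [Layer(3.0), shared_enc]
G1 = [shared_gen, Layer(1.0)]   # shared layer + domain-specific low-level layer
G2 = [shared_gen, Layer(3.0)]

x1 = [1.0, 2.0]
z = run(E1, x1)            # encode in domain 1
x1_to_2 = run(G2, z)       # decode with the domain-2 generator: translation X1 -> X2
x1_to_1 = run(G1, z)       # decode with the domain-1 generator: reconstruction

shared_enc.weight = 0.25   # one update to the tied layer changes both encoders
```

Because `E1[-1]` and `E2[-1]` are the same object, any gradient step on the shared layer moves both encoders together, which is what pushes the two domains toward a common latent space.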

2.3. GAN

There are two generative adversarial networks: GAN1 = {D1, G1} and GAN2 = {D2, G2}.
For real images sampled from the first domain, D1 should output true, while for images generated by G1, it should output false.
The images generated by G1 can come from 2 sources: the reconstruction stream (VAE) and the translation stream (GAN).
Image-to-image translation, i.e. X1 → X2 and X2 → X1 through Z, is called the image translation stream.
In contrast to the self-reconstruction, this image translation stream is adversarially trained.

2.4. Cycle-Consistency (CC)

The cycle-consistency constraint can be enforced in the proposed framework to further regularize the ill-posed unsupervised image-to-image translation problem.
The resulting information processing stream is called the cycle-reconstruction stream.

(Since there are VAE and GAN, more mathematical expressions are used in the paper, e.g. how VAE draws the latent vector based on the input image. If interested, please read the paper.)

3. Training Loss

The image reconstruction streams, the image translation streams, and the cycle-reconstruction streams by VAE1, VAE2, GAN1 and GAN2 are jointly trained:
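
In the paper's notation, the joint problem is a minimax over six loss terms, three per domain. The function arguments are omitted here, and this is a transcription, so please check the paper for the exact form.

```latex
\min_{E_1, E_2, G_1, G_2}\;\max_{D_1, D_2}\;
\mathcal{L}_{\mathrm{VAE}_1} + \mathcal{L}_{\mathrm{GAN}_1} + \mathcal{L}_{\mathrm{CC}_1}
+ \mathcal{L}_{\mathrm{VAE}_2} + \mathcal{L}_{\mathrm{GAN}_2} + \mathcal{L}_{\mathrm{CC}_2}
```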

3.1. Image Reconstruction Streams

The loss function of the VAE is the negative log-likelihood with a regularizer:
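
Transcribed from the paper's notation (please check the paper for the exact form), with q1 and q2 the latent code distributions produced by the encoders, pη the zero-mean Gaussian prior, and pG1, pG2 the likelihoods defined by the generators:

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{VAE}_1} &= \lambda_1\,\mathrm{KL}\big(q_1(z_1 \mid x_1)\,\|\,p_\eta(z)\big)
  - \lambda_2\,\mathbb{E}_{z_1 \sim q_1(z_1 \mid x_1)}\big[\log p_{G_1}(x_1 \mid z_1)\big], \\
\mathcal{L}_{\mathrm{VAE}_2} &= \lambda_1\,\mathrm{KL}\big(q_2(z_2 \mid x_2)\,\|\,p_\eta(z)\big)
  - \lambda_2\,\mathbb{E}_{z_2 \sim q_2(z_2 \mid x_2)}\big[\log p_{G_2}(x_2 \mid z_2)\big].
\end{aligned}
```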

The KL divergence terms penalize deviation of the distribution of the latent code from the prior distribution.
It is quite a standard VAE loss. (Please feel free to read Tutorial — What is a variational autoencoder?)
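
The two parts of this loss can be sketched numerically as below. Assumptions, not taken from this story: the posterior is N(mu, I) and the prior is N(0, I), so the KL term has the closed form 0.5·||mu||²; the generator likelihood is taken to be Laplacian, so the negative log-likelihood reduces to an absolute (L1) difference up to constants; and the L1 term is averaged over pixels, which is an arbitrary scaling choice. The default weights are the 0.1 and 100 quoted in this story.

```python
def kl_unit_gaussian(mu):
    """KL( N(mu, I) || N(0, I) ) = 0.5 * ||mu||^2 (closed form)."""
    return 0.5 * sum(m * m for m in mu)

def l1_reconstruction(x, x_rec):
    """Mean absolute difference; stands in for the negative log-likelihood
    under the assumed Laplacian likelihood (constants dropped)."""
    return sum(abs(a - b) for a, b in zip(x, x_rec)) / len(x)

def vae_loss(mu, x, x_rec, lam_kl=0.1, lam_nll=100.0):
    """Weighted sum of the KL regularizer and the reconstruction term."""
    return lam_kl * kl_unit_gaussian(mu) + lam_nll * l1_reconstruction(x, x_rec)

# Toy numbers: KL = 1.0, L1 = 0.25, so the loss is 0.1 * 1.0 + 100 * 0.25.
loss = vae_loss(mu=[1.0, -1.0], x=[0.0, 1.0], x_rec=[0.5, 1.0])
```

With these weights, the reconstruction term dominates unless the latent means drift far from zero.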

3.2. Image Translation Streams

The GAN loss is a conditional GAN (CGAN) loss, since the latent code z is obtained from the input image x:
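
Transcribed from the paper's notation (function arguments omitted; please check the paper for the exact form). Note that the fake images scored by D1 are produced by G1 from codes z2 of domain-2 images, i.e. they are translated images:

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{GAN}_1} &= \lambda_0\,\mathbb{E}_{x_1 \sim P_{X_1}}\big[\log D_1(x_1)\big]
  + \lambda_0\,\mathbb{E}_{z_2 \sim q_2(z_2 \mid x_2)}\big[\log\big(1 - D_1(G_1(z_2))\big)\big], \\
\mathcal{L}_{\mathrm{GAN}_2} &= \lambda_0\,\mathbb{E}_{x_2 \sim P_{X_2}}\big[\log D_2(x_2)\big]
  + \lambda_0\,\mathbb{E}_{z_1 \sim q_1(z_1 \mid x_1)}\big[\log\big(1 - D_2(G_2(z_1))\big)\big].
\end{aligned}
```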

They are used to ensure that the translated images resemble images in the respective target domains.

3.3. Cycle-Reconstruction Stream

**The figure is from CycleGAN (We can treat X as X1, Y as X2, or treat X as X2, Y as X1 here.)**

A VAE-like objective function is used to model the cycle-consistency constraint:
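
Transcribed from the paper's notation for the first domain (the second is symmetric; please check the paper for the exact form), with x1^{1→2} denoting x1 translated into the second domain:

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{CC}_1} ={}& \lambda_3\,\mathrm{KL}\big(q_1(z_1 \mid x_1)\,\|\,p_\eta(z)\big)
  + \lambda_3\,\mathrm{KL}\big(q_2(z_2 \mid x_1^{1\to 2})\,\|\,p_\eta(z)\big) \\
 & - \lambda_4\,\mathbb{E}_{z_2 \sim q_2(z_2 \mid x_1^{1\to 2})}\big[\log p_{G_1}(x_1 \mid z_2)\big]
\end{aligned}
```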

where the negative log-likelihood objective term ensures a twice translated image resembles the input one.
The KL terms penalize the latent codes deviating from the prior distribution in the cycle-reconstruction stream (Therefore, there are two KL terms).
λ0 = 10, λ3 = λ1 = 0.1 and λ4 = λ2 = 100.

3.4. Alternating Gradient Update Scheme

The first player is a team consisting of the encoders and generators. The second player is a team consisting of the adversarial discriminators.
In addition to defeating the second player, the first player has to minimize the VAE losses and the cycle-consistency losses.
An alternating gradient update scheme is used.

Specifically, a gradient ascent step is first applied to update D1 and D2 with E1, E2, G1, and G2 fixed. Then, a gradient descent step is applied to update E1, E2, G1, and G2 with D1 and D2 fixed.
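
The alternating scheme can be illustrated on a toy two-player problem. The objective below, g² − d² + g·d, is NOT the UNIT loss; it is a made-up scalar saddle problem in which `g` plays the role of the encoders/generators (minimizing) and `d` the role of the discriminators (maximizing). The learning rate and step count are arbitrary.

```python
def value(g, d):
    """Toy saddle objective: g minimizes it, d maximizes it."""
    return g * g - d * d + g * d

def grad_g(g, d):
    """Partial derivative of value() with respect to g."""
    return 2.0 * g + d

def grad_d(g, d):
    """Partial derivative of value() with respect to d."""
    return g - 2.0 * d

def alternate(g, d, lr=0.1, steps=200):
    for _ in range(steps):
        d = d + lr * grad_d(g, d)   # gradient ascent on d, with g held fixed
        g = g - lr * grad_g(g, d)   # gradient descent on g, with d held fixed
    return g, d

g_final, d_final = alternate(g=1.0, d=-1.0)
```

On this toy problem both players settle at the saddle point (0, 0). Real GAN training has no such guarantee, which is one reason the extra VAE and cycle-consistency terms are useful as regularizers.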

4. Image-to-image Translation Results

**Network architecture for image-to-image translation**

Each mini-batch consisted of one image from the first domain and one image from the second domain.
For the network architecture, the encoders consisted of 3 convolutional layers as the front-end and 4 basic residual blocks as the back-end.
The generators consisted of 4 basic residual blocks as the front-end and 3 transposed convolutional layers as the back-end.
The discriminators consisted of stacks of convolutional layers.
LeakyReLU is used for nonlinearity.
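
The layer layout described above can be written down as a small sketch. The layer names are just labels, and placing the tied layers among the residual blocks next to the latent code is an assumption made for illustration; channel counts, strides and kernel sizes are deliberately left out.

```python
def build_layout(n_shared=1):
    """Return (encoder, generator) layer-name lists for one domain.

    Layers tagged 'shared' are the ones whose weights would be tied
    across the two domains (assumed to sit next to the latent code).
    """
    n_res = 4
    enc = ["conv"] * 3 + ["res"] * (n_res - n_shared) + ["res(shared)"] * n_shared
    gen = ["res(shared)"] * n_shared + ["res"] * (n_res - n_shared) + ["deconv"] * 3
    return enc, gen

encoder, generator = build_layout(n_shared=1)
# encoder:   3 conv layers, then 4 residual blocks, the last one shared
# generator: 4 residual blocks, the first one shared, then 3 transposed conv layers
```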

4.1. Map Dataset for Ablation Study

(a): Illustration of the Map dataset. Left: satellite image. Right: map.
A pixel translation was counted correct if the color difference was within 16 of the ground truth color value.

The average pixel accuracy is measured.
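
A minimal sketch of this metric. How the "color difference" is computed is not spelled out here, so the sketch assumes a pixel is correct when every RGB channel is within 16 of the ground truth; the toy pixel values are made up.

```python
def pixel_accuracy(pred, truth, tol=16):
    """Fraction of pixels whose every RGB channel is within `tol` of the ground truth."""
    correct = 0
    for p, t in zip(pred, truth):
        if max(abs(pc - tc) for pc, tc in zip(p, t)) <= tol:
            correct += 1
    return correct / len(truth)

# Four toy pixels: the second one is off by 28 in the green channel.
truth = [(255, 255, 255), (0, 128, 0), (10, 20, 30), (200, 100, 50)]
pred = [(250, 240, 255), (0, 100, 0), (26, 20, 30), (200, 100, 50)]
acc = pixel_accuracy(pred, truth)
```

The average pixel accuracy would then be the mean of this value over the test images.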

(b): The number of weight-sharing layers is varied from 1 to 4, and different numbers of layers are used for the discriminators.
The shallowest discriminator architecture led to the worst performance.
It is found that the number of weight-sharing layers had little impact. This was due to the use of the residual blocks.

Based on this result, in the rest of the experiments, VAEs with 1 sharing layer and discriminators of 5 layers are used.

(c): Translation accuracy versus different hyper-parameter values.
In general, a larger weight value on the negative log likelihood terms yielded a better translation accuracy.
It is also found that setting the weights of the KL terms to 0.1 resulted in consistently good performance.

Thus, λ1 = λ3 = 0.1 and λ2 = λ4 = 100 are used.

(d): Impact of weight-sharing and cycle-consistency constraints on translation accuracy.
When removing the weight-sharing constraint (as a consequence, the reconstruction streams are removed in the framework), the framework was reduced to the CycleGAN architecture. The model achieved an average pixel accuracy of 0.569.
(I hope I can review CycleGAN in the coming future.)
When removing the cycle-consistency constraint and only with the weight-sharing constraint used, it achieved 0.568 average pixel accuracy.
But when using the full model, UNIT achieved the best performance of 0.600 average pixel accuracy.

For the ill-posed joint distribution recovery problem, more constraints are beneficial.

4.2. Qualitative Results

**Street scene image translation results. For each pair, left is input and right is the translated image.**

UNIT is applied to several unsupervised street scene image translation tasks including sunny to rainy, day to night, summery to snowy, and vice versa.
For the real-to-synthetic translation, UNIT made the cityscape images cartoon-like. For the synthetic-to-real translation, UNIT achieved better results in the building, sky, road, and car regions than in the human regions.

Dog images in ImageNet are used to learn to translate dog images between different breeds.

Similarly, cat images in ImageNet dataset are used to learn to translate cat images between different species.

**Attribute-based face translation results**

The CelebA dataset is used for attribute-based face image translation.
The attributes include blond hair, smiling, goatee, and eyeglasses.
The translated face images were realistic.

5. Domain Adaptation Results

UNIT is applied to the problem of adapting a classifier trained using labeled samples in one domain (source domain) to classify samples in a new domain (target domain), where labeled samples in the new domain are unavailable during training.

The framework is trained to translate images between the source and target domains, and classify samples in the source domain using the features extracted by the discriminator in the source domain.

The weights of the high-level layers of D1 and D2 are tied. This allows the classifier trained in the source domain to be adapted to the target domain.
Also, for a pair of generated images in different domains, the L1 distance between the features extracted by the highest layer of the discriminators is minimized, which further encourages D1 and D2 to interpret a pair of corresponding images in the same way.
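
The feature-matching term can be sketched as a plain L1 distance. The feature vectors below are made-up numbers standing in for the top-layer outputs of D1 and D2 on a pair of corresponding generated images.

```python
def l1_distance(feat_a, feat_b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(feat_a, feat_b))

# Hypothetical top-layer discriminator features for a pair of corresponding generated images.
feat_d1 = [0.2, 1.5, -0.3]
feat_d2 = [0.1, 1.0, 0.3]
penalty = l1_distance(feat_d1, feat_d2)   # added to the training loss and minimized
```

The penalty is zero only when both discriminators produce identical features for the pair.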

**Unsupervised domain adaptation performance**

UNIT is applied to several tasks including adapting from the Street View House Number (SVHN) dataset to the MNIST dataset and adapting between the MNIST and USPS datasets.
UNIT achieved a 0.9053 accuracy for the SVHN→MNIST task, which was much better than 0.8488 achieved by the previous state-of-the-art method [26].
UNIT also achieved better performance for the MNIST↔USPS tasks than CoGAN, which was the state-of-the-art.
The digit images had a small resolution. Hence, a small network was used.
It is also found that the cycle-consistency constraint was not necessary for this task.

Reference

[2017 NIPS] [UNIT]
Unsupervised Image-to-Image Translation Networks

Generative Adversarial Network (GAN)

Image Synthesis [GAN] [CGAN] [LAPGAN] [DCGAN] [CoGAN]
Image-to-image Translation [Pix2Pix] [UNIT]
Super Resolution [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
Blur Detection [DMENet]
Camera Tampering Detection [Mantini’s VISAPP’19]
Video Coding [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]