# Review — UNIT: Unsupervised Image-to-Image Translation Networks (GAN)

In this story, **Unsupervised Image-to-Image Translation Networks** (UNIT), by NVIDIA, is reviewed. In this paper:

- **A shared-latent space assumption** is made, which assumes that a pair of corresponding images in different domains can be mapped to the same latent representation in a shared-latent space.
- **Image-to-image translation is performed through that shared-latent space.** (The setting is unsupervised: there exist no paired examples showing how an image could be translated to a corresponding image in another domain.)

This is a paper in **2017 NIPS** with over **1500 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Shared-Latent Space Assumption**
2. **UNIT: Framework**
3. **Training Loss**
4. **Image-to-image Translation Results**
5. **Domain Adaptation Results**

# 1. Shared-Latent Space Assumption

It is assumed that, for any given pair of corresponding images *x*1 and *x*2 from two different domains *X*1 and *X*2 respectively, there exists a shared latent code *z* in a shared-latent space *Z*.

- *E*1 and *E*2 are 2 **encoding functions, mapping images to latent codes.**
- *G*1 and *G*2 are 2 **generation functions, mapping latent codes to images.**
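The assumption above can be written compactly as follows. This is a sketch of how I read the paper, using the same symbols as above; the translation functions F are only named here for illustration.

```latex
\begin{aligned}
z &= E_1(x_1) = E_2(x_2), \qquad x_1 = G_1(z), \qquad x_2 = G_2(z) \\
F_{1 \to 2}(x_1) &= G_2\bigl(E_1(x_1)\bigr), \qquad F_{2 \to 1}(x_2) = G_1\bigl(E_2(x_2)\bigr)
\end{aligned}
```

In other words, translating from *X*1 to *X*2 amounts to encoding with *E*1 and decoding with *G*2, and vice versa.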

# 2. UNIT: Framework

## 2.1. Variational Autoencoder (VAE)

- The encoder–generator pair **{*E*1, *G*1}** constitutes a VAE for the *X*1 domain, termed **VAE1**.

VAE1 first maps an input image *x*1 to a code in the latent space *Z* via the encoder *E*1, and then decodes a randomly perturbed version of the code to reconstruct the input image via the generator *G*1.

- It is assumed that the components in the **latent space *Z* are conditionally independent and Gaussian with unit variance.**
- Similarly, **{*E*2, *G*2}** constitutes a VAE for *X*2: **VAE2**.
- The self reconstruction, i.e. *X*1 → *X*1 and *X*2 → *X*2 through *Z*, is called the image reconstruction stream. **This image reconstruction stream can be trained in a supervised manner**, since the input image itself serves as the ground truth.
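To make the image reconstruction stream concrete, here is a minimal sketch in plain Python. It assumes toy linear maps in place of the CNN encoder and generator; the weight matrices `E1_weights` and `G1_weights` and the helper names are hypothetical, not taken from the paper's code. It shows the three steps: encode to a mean code, perturb it with unit-variance Gaussian noise, and decode.

```python
import random

def encode(x, weights):
    """Toy linear encoder: returns the mean of the latent code for image x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def sample_latent(mean, rng):
    """Randomly perturb the code: z = mean + eps, with eps ~ N(0, 1) per component."""
    return [m + rng.gauss(0.0, 1.0) for m in mean]

def decode(z, weights):
    """Toy linear generator: maps a latent code back to image space."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]

rng = random.Random(0)                               # fixed seed for repeatability
x1 = [0.2, 0.8, 0.5]                                 # a flattened 3-pixel "image" (made up)
E1_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # hypothetical 3 -> 2 encoder
G1_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # hypothetical 2 -> 3 generator

mean = encode(x1, E1_weights)        # code in the latent space Z
z1 = sample_latent(mean, rng)        # randomly perturbed version of the code
x1_recon = decode(z1, G1_weights)    # reconstruction, compared against x1 in training
```

Because the target of the reconstruction is the input `x1` itself, this stream does not need any paired data.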

## 2.2. Weight Sharing Constraint

*E*1, *E*2, *G*1 and *G*2 use CNNs and implement the shared-latent space assumption using a weight sharing constraint (illustrated using dashed lines in the paper's framework figure): the connection weights of the last few layers (high-level layers) in *E*1 and *E*2 are tied, and the connection weights of the first few layers (high-level layers) in *G*1 and *G*2 are tied.

- The weight sharing constraint originally comes from CoGAN. Thus, the authors are actually extending CoGAN for image-to-image translation.
- And the shared-latent space constraint implies the cycle-consistency constraint from CycleGAN.
- Through adversarial training, image-to-image translation can be performed using the above VAEs.
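One simple way to picture weight tying is to let both networks hold a reference to the same layer object. The sketch below is only an illustration under that assumption: `Layer`, `run` and the scalar weights are hypothetical stand-ins for convolutional blocks, not the authors' implementation.

```python
class Layer:
    """Toy layer with a single scalar weight (stand-in for a conv/residual block)."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, xs):
        return [self.weight * x for x in xs]

def run(layers, xs):
    """Apply a list of layers in order."""
    for layer in layers:
        xs = layer(xs)
    return xs

shared_enc = Layer(0.5)   # tied high-level layer at the END of both encoders
shared_gen = Layer(2.0)   # tied high-level layer at the START of both generators

E1_layers = [Layer(2.0), shared_enc]   # domain-specific layer, then shared layer
E2_layers = [Layer(3.0), shared_enc]
G1_layers = [shared_gen, Layer(1.0)]   # shared layer, then domain-specific layer
G2_layers = [shared_gen, Layer(4.0)]

# Updating the shared layer once changes both encoders at the same time.
shared_enc.weight = 0.25
out1 = run(E1_layers, [1.0])
out2 = run(E2_layers, [1.0])
```

The low-level layers stay domain-specific, while the tied high-level layers push both domains toward the same latent code.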

## 2.3. GAN

- There are two generative adversarial networks: **GAN1 = {*D*1, *G*1}** and **GAN2 = {*D*2, *G*2}**.
- For real images sampled from the first domain, *D*1 should output true, while for images generated by *G*1, it should output false.
- The images generated by *G*1 can come from 2 sources: one is the reconstruction (VAE), the other is the translation (GAN).
- The image-to-image translation, i.e. *X*1 → *X*2 and *X*2 → *X*1 through *Z*, is called the image translation stream.
- In contrast to the self reconstruction, **this image translation stream is adversarially trained**.

## 2.4. Cycle-Consistency (CC)

- **The cycle-consistency constraint can be enforced** in the proposed framework to **further regularize** the ill-posed unsupervised image-to-image translation problem.
- The resulting information processing stream is called the **cycle-reconstruction stream.**
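The three streams (reconstruction, translation and cycle-reconstruction) are just different compositions of the same encoders and generators. Here is a small sketch that assumes toy scalar functions in place of the networks and ignores the random perturbation of the latent code; all function bodies are made up for illustration.

```python
# Hypothetical toy encoders/generators acting on a single number.
def E1(x):
    return x - 1.0   # domain 1 -> latent

def G1(z):
    return z + 1.0   # latent -> domain 1

def E2(x):
    return x / 2.0   # domain 2 -> latent

def G2(z):
    return z * 2.0   # latent -> domain 2

def reconstruct_1(x1):
    """Image reconstruction stream: X1 -> Z -> X1."""
    return G1(E1(x1))

def translate_1_to_2(x1):
    """Image translation stream: X1 -> Z -> X2."""
    return G2(E1(x1))

def translate_2_to_1(x2):
    """Image translation stream: X2 -> Z -> X1."""
    return G1(E2(x2))

def cycle_1(x1):
    """Cycle-reconstruction stream: X1 -> X2 -> X1."""
    return translate_2_to_1(translate_1_to_2(x1))

x1 = 3.0
x1_to_2 = translate_1_to_2(x1)
x1_cycle = cycle_1(x1)
```

In this toy setup the cycle returns exactly the input; in the real model the cycle-consistency loss only encourages the twice-translated image to be close to the input.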

(Since there are VAE and GAN, more mathematical expressions are used in the paper, e.g. how VAE draws the latent vector based on the input image. If interested, please read the paper.)

# 3. **Training Loss**

- The image reconstruction streams, the image translation streams, and the cycle-reconstruction streams of VAE1, VAE2, GAN1 and GAN2 are **jointly trained**.
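As far as I can reconstruct it from the paper, the joint training problem is a single minimax objective that sums the six loss terms described in Sections 3.1 to 3.3 (arguments of each term are omitted here; please check the paper for the exact form):

```latex
\min_{E_1, E_2, G_1, G_2} \; \max_{D_1, D_2} \;
  \mathcal{L}_{\mathrm{VAE}_1} + \mathcal{L}_{\mathrm{GAN}_1} + \mathcal{L}_{\mathrm{CC}_1}
  + \mathcal{L}_{\mathrm{VAE}_2} + \mathcal{L}_{\mathrm{GAN}_2} + \mathcal{L}_{\mathrm{CC}_2}
```

The encoders and generators minimize the sum, while the discriminators maximize it (only the GAN terms depend on the discriminators).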

## 3.1. Image Reconstruction Streams

- The loss function of the **VAE** is the **negative log-likelihood with a regularizer**.
- **The KL divergence terms penalize deviation of the distribution of the latent code from the prior distribution.**
- It is quite a standard VAE loss. (Please feel free to read Tutorial - What is a variational autoencoder?)
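For reference, here is the VAE objective as I understand it from the paper, written as a sketch. The notation is an assumption on my side: q1 and q2 are the encoder distributions, p_η(z) is the zero-mean, unit-variance Gaussian prior, and p_G1, p_G2 are the generator likelihoods.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{VAE}_1} &= \lambda_1 \, \mathrm{KL}\bigl(q_1(z_1 \mid x_1) \,\|\, p_\eta(z)\bigr)
  - \lambda_2 \, \mathbb{E}_{z_1 \sim q_1(z_1 \mid x_1)}\bigl[\log p_{G_1}(x_1 \mid z_1)\bigr] \\
\mathcal{L}_{\mathrm{VAE}_2} &= \lambda_1 \, \mathrm{KL}\bigl(q_2(z_2 \mid x_2) \,\|\, p_\eta(z)\bigr)
  - \lambda_2 \, \mathbb{E}_{z_2 \sim q_2(z_2 \mid x_2)}\bigl[\log p_{G_2}(x_2 \mid z_2)\bigr]
\end{aligned}
```

Here λ1 weights the KL regularizer and λ2 weights the negative log-likelihood (reconstruction) term.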

## 3.2. Image Translation Streams

- The GAN objective functions are used to **ensure the translated images resemble images in the target domains**, respectively.
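A sketch of the GAN objective for the first domain, as I read it from the paper (the second domain is symmetric, with the indices 1 and 2 swapped). Note that the fake images scored by *D*1 are translations, i.e. *G*1 applied to codes encoded from *x*2:

```latex
\mathcal{L}_{\mathrm{GAN}_1} = \lambda_0 \, \mathbb{E}_{x_1 \sim P_{X_1}}\bigl[\log D_1(x_1)\bigr]
  + \lambda_0 \, \mathbb{E}_{z_2 \sim q_2(z_2 \mid x_2)}\bigl[\log\bigl(1 - D_1(G_1(z_2))\bigr)\bigr]
```

The hyper-parameter λ0 controls the weight of the GAN terms relative to the VAE and cycle-consistency terms.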

## 3.3. Cycle-Reconstruction Stream

- A **VAE-like objective function** is used to **model the cycle-consistency constraint**.
- The **negative log-likelihood objective term ensures a twice translated image resembles the input one.**
- The **KL terms penalize the latent codes deviating from the prior distribution in the cycle-reconstruction stream** (therefore, there are two KL terms).
- *λ*0 = 10, *λ*3 = *λ*1 = 0.1 and *λ*4 = *λ*2 = 100.
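The cycle-consistency objective for the first domain can be sketched as below, based on my reading of the paper. The notation is an assumption: x1^{1→2} denotes the image *x*1 translated into the second domain, q1 and q2 are the encoder distributions, and p_η(z) is the Gaussian prior. The second domain is symmetric.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{CC}_1} ={}& \lambda_3 \, \mathrm{KL}\bigl(q_1(z_1 \mid x_1) \,\|\, p_\eta(z)\bigr)
  + \lambda_3 \, \mathrm{KL}\bigl(q_2(z_2 \mid x_1^{1 \to 2}) \,\|\, p_\eta(z)\bigr) \\
 &- \lambda_4 \, \mathbb{E}_{z_2 \sim q_2(z_2 \mid x_1^{1 \to 2})}\bigl[\log p_{G_1}(x_1 \mid z_2)\bigr]
\end{aligned}
```

The two KL terms correspond to the two encoding steps of the cycle, and the λ4 term is the negative log-likelihood of recovering *x*1 after translating twice.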

## 3.4. Alternating Gradient Update Scheme

- The first player is a team consisting of the encoders and generators. The second player is a team consisting of the adversarial discriminators.
- In addition to defeating the second player, the first player has to minimize the VAE losses and the cycle-consistency losses.
- An **alternating gradient update scheme** is used.

Specifically, a gradient ascent step is first applied to update *D*1 and *D*2 with *E*1, *E*2, *G*1, and *G*2 fixed. Then a gradient descent step is applied to update *E*1, *E*2, *G*1, and *G*2 with *D*1 and *D*2 fixed.
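The alternating scheme can be illustrated on a tiny two-player game. This is only a toy sketch: the scalar `d` stands in for the discriminators, the scalar `g` stands in for the encoders and generators, and the objective `value` is a made-up function with a saddle point at (0, 0), not the UNIT loss.

```python
def value(g, d):
    """Toy objective: g tries to minimize it, d tries to maximize it."""
    return g * g - d * d + g * d

def grad_g(g, d):
    """Partial derivative of value with respect to g."""
    return 2.0 * g + d

def grad_d(g, d):
    """Partial derivative of value with respect to d."""
    return -2.0 * d + g

def alternating_updates(g, d, lr=0.1, steps=200):
    for _ in range(steps):
        # Step 1: gradient ASCENT on the "discriminator" with g fixed.
        d = d + lr * grad_d(g, d)
        # Step 2: gradient DESCENT on the "encoder/generator" with d fixed.
        g = g - lr * grad_g(g, d)
    return g, d

g_final, d_final = alternating_updates(1.0, -1.0)
```

On this well-behaved toy problem both players settle at the saddle point; real GAN training offers no such guarantee, which is one reason the extra VAE and cycle-consistency terms are useful as regularizers.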

# 4. Image-to-image Translation Results

- Each mini-batch consisted of one image from the first domain and one image from the second domain.
- For the network architecture, the **encoders** consisted of **3 convolutional layers** as the front-end and **4 basic residual blocks** as the back-end.
- The **generators** consisted of **4 basic residual blocks** as the front-end and **3 transposed convolutional layers** as the back-end.
- The **discriminators** consisted of **stacks of convolutional layers**.
- **LeakyReLU** is used for nonlinearity.

## 4.1. Map Dataset for Ablation Study

- **(a)**: Illustration of the **Map dataset**. Left: satellite image. Right: map.
- A pixel translation was counted correct if the color difference was within 16 of the ground truth color value.

The average pixel accuracy is measured.
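The metric can be sketched in a few lines of Python. Two points are assumptions on my side, since the text above does not spell them out: the "color difference" is taken as the largest absolute per-channel difference, and "within 16" is treated as inclusive. The function name `pixel_accuracy` and the sample images are hypothetical.

```python
def pixel_accuracy(pred, truth, tol=16):
    """Fraction of pixels whose color is within `tol` of the ground truth.

    pred, truth: images as lists of rows, each pixel an (r, g, b) tuple.
    Assumption: color difference = max absolute per-channel difference.
    """
    total = 0
    correct = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            if max(abs(pc - tc) for pc, tc in zip(p, t)) <= tol:
                correct += 1
    return correct / total if total else 0.0

# Tiny made-up 2x2 example: two pixels are within tolerance, two are not.
pred = [[(10, 10, 10), (200, 0, 0)],
        [(0, 0, 0), (255, 255, 255)]]
truth = [[(20, 5, 10), (150, 0, 0)],
         [(16, 0, 0), (255, 255, 238)]]
accuracy = pixel_accuracy(pred, truth)
```

Averaging this value over the test images would give the average pixel accuracy reported in the ablation study.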

- **(b)**: The number of weight-sharing layers is changed from 1 to 4, and discriminators with different numbers of layers are used. **The shallowest discriminator** architecture led to the **worst performance**.
- It is found that **the number of weight-sharing layers had little impact.** This was due to the use of the residual blocks.

Based on this result, in the rest of the experiments, VAEs with 1 sharing layer and discriminators of 5 layers are used.

- **(c)**: Translation accuracy versus different hyper-parameter values.
- In general, **a larger weight value on the negative log likelihood terms** yielded a **better translation accuracy**.
- It is also found that setting **the weights of the KL terms to 0.1 resulted in consistently good performance**.

The weights are thus set to *λ*1 = *λ*3 = 0.1 and *λ*2 = *λ*4 = 100.

- **(d)**: Impact of weight-sharing and cycle-consistency constraints on translation accuracy.
- **When removing the weight-sharing constraint** (as a consequence, the reconstruction streams are removed in the framework), **the framework was reduced to the CycleGAN architecture**. The model achieved an **average pixel accuracy of 0.569.**
- (I hope I can review CycleGAN in the coming future.)
- When **removing the cycle-consistency constraint** and using only the weight-sharing constraint, it achieved **0.568 average pixel accuracy**.
- But when using the **full model**, UNIT achieved the best performance of **0.600 average pixel accuracy**.

For the ill-posed joint distribution recovery problem, more constraints are beneficial.

## 4.2. Qualitative Results

- UNIT is applied to several unsupervised street scene image translation tasks including **sunny to rainy, day to night, summery to snowy**, and vice versa.
- For the **real to synthetic** translation, UNIT made the cityscape images cartoon-like. For the **synthetic to real** translation, UNIT achieved better results in the building, sky, road, and car regions than in the human regions.

- Dog images in ImageNet are used to learn to **translate dog images between different breeds**.

- Similarly, cat images in the ImageNet dataset are used to learn to **translate cat images between different species**.

- The CelebA dataset is used for **attribute-based face image translation**.
- The attributes include blond hair, smiling, goatee, and eyeglasses.
- The translated face images were realistic.

# 5. Domain Adaptation Results

UNIT is applied to the problem of adapting a classifier trained using labeled samples in one domain (source domain) to classify samples in a new domain (target domain), where labeled samples in the new domain are unavailable during training.

- The framework is trained to translate images between the source and target domains, and to classify samples in the source domain using the features extracted by the discriminator in the source domain.

The weights of the high-level layers of *D*1 and *D*2 are tied. This allows the classifier trained in the source domain to be adapted to the target domain. Also, for a pair of generated images in different domains, the L1 distance between the features extracted by the highest layer of the discriminators is minimized, which further encourages *D*1 and *D*2 to interpret a pair of corresponding images in the same way.
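The feature-matching term is simply an L1 distance between two feature vectors. A minimal sketch, assuming the highest-layer discriminator features are available as flat lists of floats (the function name and the sample values are hypothetical):

```python
def l1_distance(feat_a, feat_b):
    """Sum of absolute differences between two equally long feature vectors."""
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(feat_a, feat_b))

# Made-up highest-layer features of D1 and D2 for a pair of generated images.
features_d1 = [1.0, 2.0, 3.0]
features_d2 = [1.5, 2.0, 1.0]
distance = l1_distance(features_d1, features_d2)
```

Minimizing this distance during training pulls the two discriminators toward the same interpretation of corresponding images.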

- UNIT is applied to several tasks including **adapting from the Street View House Number (SVHN) dataset to the MNIST dataset** and **adapting between the MNIST and USPS datasets**.
- **UNIT achieved a 0.9053 accuracy for the SVHN→MNIST task**, which was much better than 0.8488 achieved by the previous state-of-the-art method [26].
- UNIT also **achieved better performance for the MNIST↔USPS tasks than CoGAN**, which was the state-of-the-art.
- The digit images had a small resolution. Hence, **a small network** was used.
- It is also found that the **cycle-consistency constraint was not necessary** for this task.

## Reference

[2017 NIPS] [UNIT]

Unsupervised Image-to-Image Translation Networks

## Generative Adversarial Network (GAN)

- **Image Synthesis** [GAN] [CGAN] [LAPGAN] [DCGAN] [CoGAN]
- **Image-to-image Translation** [Pix2Pix] [UNIT]
- **Super Resolution** [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
- **Blur Detection** [DMENet]
- **Camera Tampering Detection** [Mantini’s VISAPP’19]
- **Video Coding** [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]