# Review — MUNIT: Multimodal Unsupervised Image-to-Image Translation (GAN)

## Using MUNIT, Multi-Style Images Are Generated From a Single Image

In this story, **Multimodal Unsupervised Image-to-Image Translation** (MUNIT), by Cornell University and NVIDIA, is reviewed. In this paper:

- In MUNIT, it is assumed that the **image representation** can be decomposed into a **content code** that is domain-invariant, and a **style code** that captures domain-specific properties.
- To translate an image to another domain, its **content code is recombined with a random style code** sampled from the style space of the target domain.
- Finally, MUNIT allows users to control the style of translation outputs by providing an example style image.

This is a paper in **2018 ECCV** with over **1100 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Introduction & Assumptions**
2. **Overview**
3. **Loss Function**
4. **Theoretical Analysis**
5. **MUNIT: Network Architecture**
6. **Experimental Results**

# 1. **Introduction & Assumptions**

- It is assumed that **the latent space of images can be decomposed into a content space** *C* **and a style space** *S*, as shown in the above figure.
- It is then further assumed that **images in different domains share a common content space but not the style space.**
- **To translate** an image to the target domain, **its content code is recombined with a random style code in the target style space**.

The content code encodes the information that should be preserved during translation, while the style code represents remaining variations that are not contained in the input image.

**By sampling different style codes**, the model is able to produce **diverse and multimodal outputs.**

# 2. **Overview**

- The translation model consists of an **encoder** *Ei* and a **decoder** *Gi* for each **domain** *Xi* (*i* = 1, 2).
- **(a)** The latent code of each autoencoder is factorized into a **content code** *ci* and a **style code** *si*, where (*ci*, *si*) = (*Eci*(*xi*), *Esi*(*xi*)) = *Ei*(*xi*).
- **(b) Image-to-image translation** is performed by swapping encoder-decoder pairs. For example, to **translate an image** *x*1 ∈ *X*1 to *X*2, its **content latent code** *c*1 = *Ec*1(*x*1) is extracted and a **style latent code** *s*2 is randomly drawn from the prior distribution *q*(*s*2) ~ *N*(0, *I*).
- Then *G*2 is used to produce the final output image *x*1→2 = *G*2(*c*1, *s*2).
- It is noted that although the prior distribution is unimodal, the output image distribution can be multimodal thanks to the nonlinearity of the decoder.
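The encode-swap-decode flow above can be sketched with toy stand-ins. The linear maps `W_c` and `W_g` below are hypothetical placeholders for the learned content encoder *Ec*1 and decoder *G*2 (real MUNIT uses deep convolutional networks); the sketch only illustrates the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for the learned networks.
W_c = rng.standard_normal((8, 16))    # "content encoder" E_c1 for domain X1
W_g = rng.standard_normal((16, 12))   # "decoder" G2 for domain X2

def encode_content(x1):
    # c1 = E_c1(x1): extract the domain-invariant content code
    return W_c @ x1

def decode(c, s):
    # G2(c1, s2): combine content with a target-domain style code
    return W_g @ np.concatenate([c, s])

x1 = rng.standard_normal(16)          # an image from X1 (flattened)
c1 = encode_content(x1)               # 8-dim content code
s2 = rng.standard_normal(4)           # style drawn from the prior q(s2) ~ N(0, I)
x1_to_2 = decode(c1, s2)              # translated image in X2

# A different style sample yields a different translation of the same content.
x1_to_2_alt = decode(c1, rng.standard_normal(4))
```

Sampling several `s2` values from the prior is exactly how the model produces multimodal outputs for one input.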

# 3. **Loss Function**

## 3.1. Bidirectional Reconstruction Loss

- To learn pairs of encoders and decoders that are inverses of each other, objective functions are used that encourage reconstruction in both **image → latent → image** and **latent → image → latent** directions:

## 3.1.1. Image Reconstruction

- Given **an image sampled from the data distribution**, we should be able to **reconstruct it after encoding and decoding**:
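A minimal numpy sketch of this term, assuming flattened images as arrays (the paper uses the L1 distance):

```python
import numpy as np

def image_recon_loss(x, x_recon):
    # L1 image reconstruction loss: distance between an image and its
    # encode-then-decode reconstruction G1(E_c1(x1), E_s1(x1)).
    return np.mean(np.abs(x_recon - x))

x = np.array([0.2, -0.5, 1.0])
perfect = image_recon_loss(x, x)            # -> 0.0
off_by_half = image_recon_loss(x, x + 0.5)  # -> 0.5
```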

## 3.1.2. Latent Reconstruction

- Given **a latent code (style and content) sampled from the latent distribution** at translation time, we should be able to **reconstruct it after decoding and encoding.**

- where *q*(*s*2) is the prior *N*(0, *I*), and *p*(*c*1) is given by *c*1 = *Ec*1(*x*1) with *x*1 ~ *p*(*x*1).
- The other loss terms *Lx*2_*recon*, *Lc*2_*recon*, and *Ls*1_*recon* are defined in a similar manner.
- The **L1 reconstruction loss** is used as it **encourages sharp output images**.
- **The style reconstruction loss** *Lsi_recon* has the effect of **encouraging diverse outputs** given different style codes.
- **The content reconstruction loss** *Lci_recon* encourages the translated image to **preserve the semantic content** of the input image.
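Both latent terms share the same shape, an L1 distance between a sampled code and its decode-then-encode reconstruction; a sketch with hypothetical code values:

```python
import numpy as np

def latent_recon_loss(code, code_recon):
    # Shared form of L^s_recon and L^c_recon: L1 distance between a sampled
    # latent code and its reconstruction after decoding and re-encoding.
    return np.mean(np.abs(code_recon - code))

s2 = np.array([0.1, -0.3])              # style sampled from the prior q(s2)
s2_recon = np.array([0.1, -0.1])        # E_s2(G2(c1, s2)), hypothetical values
loss = latent_recon_loss(s2, s2_recon)  # -> 0.1
```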

## 3.2. Adversarial Loss

- GANs attempt to match the distribution of translated images to the target data distribution. **Images generated by the model should be indistinguishable from real images in the target domain.**

- where
*D*2 is a discriminator that tries to distinguish between translated images and real images in*X*2. The discriminator*D*1 and loss*Lx*1_*GAN*are defined similarly. - This is quite a standard adversarial loss in GAN.
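As a sketch, the standard adversarial objective can be written as a discriminator loss and a generator loss over scalar discriminator outputs in [0, 1]; the function names here are illustrative, not from the paper:

```python
import numpy as np

def d_loss(d_real, d_fake, eps=1e-8):
    # Discriminator side: maximize E[log D2(x2)] + E[log(1 - D2(x_1->2))],
    # written here as a loss to minimize (eps guards against log(0)).
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def g_loss(d_fake, eps=1e-8):
    # Generator side: push D2's scores on translated images toward "real".
    return -np.mean(np.log(d_fake + eps))

# A confident, correct discriminator incurs a lower loss than a guessing one.
confident = d_loss(np.array([0.99]), np.array([0.01]))
guessing = d_loss(np.array([0.5]), np.array([0.5]))
```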

## 3.3. Total Loss

**The encoders, decoders, and discriminators are jointly trained** to optimize the final objective, which is a **weighted sum of the adversarial loss and the bidirectional reconstruction loss terms:**

- where *λx*, *λc*, *λs* are weights that control the importance of the reconstruction terms.
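The weighted sum can be sketched as below; the λ defaults shown are illustrative placeholders, not necessarily the values used in the paper:

```python
def total_loss(l_gan_x1, l_gan_x2,
               l_recon_x1, l_recon_x2,
               l_recon_c1, l_recon_c2,
               l_recon_s1, l_recon_s2,
               lambda_x=10.0, lambda_c=1.0, lambda_s=1.0):
    # Weighted sum of the two adversarial losses and the six
    # bidirectional reconstruction terms (lambda values illustrative).
    return (l_gan_x1 + l_gan_x2
            + lambda_x * (l_recon_x1 + l_recon_x2)
            + lambda_c * (l_recon_c1 + l_recon_c2)
            + lambda_s * (l_recon_s1 + l_recon_s2))
```

The encoders and decoders minimize this objective while the discriminators maximize the adversarial part, giving the usual min-max game.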

# 4. **Theoretical Analysis**

- Some propositions are established.

## 4.1. Proposition 1 (Optimized Encoders and Generators Found When Loss is Minimized)

- Minimizing the proposed loss function leads to 1) matching of latent distributions during encoding and generation, 2) matching of two joint image distributions induced by our framework, and 3) enforcing a weak form of cycle consistency constraint.
- Suppose there exist *E*1\*, *E*2\*, *G*1\*, *G*2\* such that:

- Then:

## 4.2. Proposition 2 (Latent Distribution Matching)

- **The autoencoder training would not help GAN training if the decoder received a very different latent distribution during generation.** Although the loss function does not contain terms that explicitly encourage the match of latent distributions, it has the effect of matching them implicitly.
- When optimality is reached, we have:

- The above proposition shows that **at optimality, the encoded style distributions match their Gaussian priors.**
- A similar result holds for the encoded content distributions, which suggests that the content space becomes domain-invariant.

## 4.3. Proposition 3 (Joint Distribution Matching)

- The model learns two conditional distributions

- which, together with the data distributions, define two joint distributions:

- Since both of them are designed to **approximate the same underlying joint distribution** *p*(*x*1, *x*2), it is desirable that **they are consistent with each other**, i.e.:

- Joint distribution matching provides an
**important constraint**for unsupervised image-to-image translation and is behind the success of many recent methods. The proposed model matches the joint distributions at optimality. When optimality is reached, we have:

## 4.4. Proposition 4 (Style-augmented Cycle Consistency)

- Joint distribution matching can be realized via a **cycle consistency constraint**, as in CycleGAN. However, this constraint is **too strong for multimodal image translation**: the translation model will **degenerate to a deterministic function** if cycle consistency is enforced.
- The **MUNIT framework admits a weaker form of cycle consistency**, termed **style-augmented cycle consistency**, between the image-style joint spaces, which is **more suited for multimodal image translation**.
- When optimality is achieved, we have:

- Intuitively, style-augmented cycle consistency implies that if we translate an image to the target domain and translate it back using the original style, we should obtain the original image. **Style-augmented cycle consistency is implied by the proposed bidirectional reconstruction loss**, but explicitly enforcing it could be useful for some datasets:
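The style-augmented cycle can be traced with toy stand-ins for the encoders and decoders. These split/concatenate maps are hypothetical, chosen only so the cycle closes exactly; they are not the trained networks:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: each "encoder" splits a 12-dim vector into an
# 8-dim content code and a 4-dim style code; each "decoder" concatenates.
def E(x):
    return x[:8], x[8:]

def G(c, s):
    return np.concatenate([c, s])

x1 = rng.standard_normal(12)      # image in domain X1
c1, s1 = E(x1)                    # keep the original style s1
s2 = rng.standard_normal(4)       # random style drawn from the X2 prior
x12 = G(c1, s2)                   # translate X1 -> X2
c2, _ = E(x12)                    # re-encode the translation
x121 = G(c2, s1)                  # translate back WITH the original style s1

cycle_loss = np.mean(np.abs(x121 - x1))  # -> 0.0 for these stand-ins
```

Note the cycle only closes because the original style *s*1 is reused on the way back; a plain CycleGAN-style constraint would have to ignore *s*2 entirely, collapsing the mapping to a deterministic one.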

- (If interested, there are more details about these propositions in the paper.)

# 5. **MUNIT: Network Architecture**

## 5.1. Style Encoder

- The style encoder includes several **strided convolutional layers**, followed by a **global average pooling** (GAP) layer and a **fully connected** (FC) layer.
- **Instance Normalization (IN) layers are NOT used** in the style encoder, since IN removes the original feature mean and variance, which represent important style information.

## 5.2. Decoder

- The decoder **reconstructs the input image from its content and style code.**
- It processes the content code with **a set of residual blocks** and finally produces the reconstructed image with **several upsampling and convolutional layers.**
- Inspired by recent works that use affine transformation parameters in normalization layers to represent styles [54, 72-74], the residual blocks are equipped with **Adaptive Instance Normalization (AdaIN)** [54] layers whose **parameters are dynamically generated by a multilayer perceptron (MLP) from the style code.**

- AdaIN(*z*, *γ*, *β*) = *γ* ((*z* − *μ*(*z*)) / *σ*(*z*)) + *β*
- where *z* is the activation of the previous convolutional layer, *μ* and *σ* are the channel-wise mean and standard deviation, and *γ* and *β* are parameters generated by the MLP.
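A minimal numpy sketch of an AdaIN layer for a single (C, H, W) activation; `gamma` and `beta` stand in for the per-channel parameters the MLP would produce from the style code:

```python
import numpy as np

def adain(z, gamma, beta, eps=1e-5):
    # z: (C, H, W) activation; gamma, beta: (C,) scale/shift from the MLP.
    mu = z.mean(axis=(1, 2), keepdims=True)       # channel-wise mean
    sigma = z.std(axis=(1, 2), keepdims=True)     # channel-wise std
    z_norm = (z - mu) / (sigma + eps)             # instance-normalize
    return gamma[:, None, None] * z_norm + beta[:, None, None]

rng = np.random.default_rng(0)
z = rng.standard_normal((3, 4, 4))
gamma = np.array([1.0, 2.0, 0.5])
beta = np.array([0.0, 1.0, -1.0])
out = adain(z, gamma, beta)
# After AdaIN, each channel's statistics match (gamma, beta), so the style
# code directly controls the feature statistics of the decoder.
```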

## 5.3. Discriminator

- The **LSGAN objective** and **multi-scale discriminators** are used to guide the generators to produce both realistic details and correct global structure.
- (LSGAN uses a least-squares loss function; hope I can write a story about it in the future.)

## 5.4. Domain-Invariant Perceptual Loss

- The perceptual loss is often computed as a distance in the VGG feature space between the output and the reference image.
- A modified version of this loss is used, which is more domain-invariant.
- Specifically, **before computing the distance, Instance Normalization is performed on the VGG features** in order to remove the original feature mean and variance, which contain much domain-specific information.
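A sketch of this normalize-then-compare idea on hypothetical (C, H, W) feature maps standing in for VGG features; the IN step makes the distance nearly invariant to per-channel affine changes of the features, a crude proxy for a domain shift:

```python
import numpy as np

def instance_norm(f, eps=1e-5):
    # Remove per-channel mean/variance from a (C, H, W) feature map;
    # these statistics carry domain-specific appearance information.
    mu = f.mean(axis=(1, 2), keepdims=True)
    sigma = f.std(axis=(1, 2), keepdims=True)
    return (f - mu) / (sigma + eps)

def perceptual_loss(feat_out, feat_ref):
    # Squared distance between IN-normalized feature maps.
    return np.mean((instance_norm(feat_out) - instance_norm(feat_ref)) ** 2)

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 5, 5))
# An affine per-channel change barely moves the normalized distance:
shifted = 2.0 * f + 3.0
```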

# 6. Experimental Results

## 6.1. **Qualitative Comparison**

- Each of the following columns shows 3 random outputs from a method.
- Both UNIT and CycleGAN (with or without noise) fail to generate diverse outputs, despite the injected randomness.
- Without *Lx_recon* or *Lc_recon*, the image quality of MUNIT is unsatisfactory. Without *Ls_recon*, the model suffers from partial mode collapse, with many outputs being almost identical (e.g., the first two rows).

The full model produces images that are both diverse and realistic, similar to BicycleGAN but without needing supervision.

## 6.2. **Quantitative** Comparison

- Human preference is used to measure quality and LPIPS distance is used to evaluate the diversity.
- UNIT and CycleGAN produce very little diversity according to LPIPS distance.
- Removing
*Lx_recon*or*Lc_recon*from MUNIT leads to significantly worse quality. Without*Ls_recon*, both quality and diversity deteriorate.

The full model obtains quality and diversity comparable to the fully supervised BicycleGAN, and significantly better than all unsupervised baselines.

- **Inception Score (IS)** measures the **diversity of all output images,** while **Conditional IS (CIS)** measures the **diversity of outputs conditioned on a single input image.**
- The MUNIT model obtains the highest scores according to both CIS and IS.
- In particular, the baselines all obtain a very low CIS, indicating their failure to generate multimodal outputs from a given input.

## 6.3. Other Datasets

- The model is able to generate SYNTHIA images with
**diverse renderings (e.g., rainy, snowy, sunset)**from a given Cityscape image, and generate Cityscape images with**different lighting, shadow, and road textures**from a given SYNTHIA image.

- Similarly, it generates winter images with different amounts of snow from a given summer image, and summer images with different amounts of leaves from a given winter image.

## 6.4. Example-Guided Image Translation

- Instead of sampling the style code from the prior, it is also possible to **extract the style code from a reference image.**
- Each row has the same content while each column has the same style. The color of the generated shoes and the appearance of the generated cats can be specified by providing example style images.

- Classical style transfer algorithms are also compared.
- MUNIT produces results that are
**significantly more faithful and realistic,**since it learns the distribution of target domain images using GANs.

## Reference

[2018 ECCV] [MUNIT]

Multimodal Unsupervised Image-to-Image Translation

## Generative Adversarial Network (GAN)

**Image Synthesis** [GAN] [CGAN] [LAPGAN] [AAE] [DCGAN] [CoGAN] [SimGAN] [BiGAN] [ALI] **Image-to-image Translation** [Pix2Pix] [UNIT] [CycleGAN] [MUNIT] **Super Resolution** [SRGAN & SRResNet] [EnhanceNet] [ESRGAN] **Blur Detection** [DMENet] **Camera Tampering Detection** [Mantini’s VISAPP’19] **Video Coding** [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]