# Review — MUNIT: Multimodal Unsupervised Image-to-Image Translation (GAN)

## Using MUNIT, Multi-Style Images Are Generated From a Single Image

In this story, **Multimodal Unsupervised Image-to-Image Translation** (MUNIT), by Cornell University and NVIDIA, is reviewed. In this paper:

- In MUNIT, it is assumed that the **image representation** can be decomposed into a **content code** that is domain-invariant, and a **style code** that captures domain-specific properties.
- To translate an image to another domain, its **content code is recombined with a random style code** sampled from the style space of the target domain.
- Finally, MUNIT allows users to control the style of translation outputs by providing an example style image.

This is a paper in **2018 ECCV** with over **1100 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Introduction & Assumptions**
2. **Overview**
3. **Loss Function**
4. **Theoretical Analysis**
5. **MUNIT: Network Architecture**
6. **Experimental Results**

# 1. **Introduction & Assumptions**

- It is assumed that **the latent space of images can be decomposed into a content space** *C* **and a style space** *S*, as shown in the above figure.
- It is then further assumed that **images in different domains share a common content space but not the style space.**
- **To translate** an image to the target domain, **its content code is recombined with a random style code in the target style space**.

The content code encodes the information that should be preserved during translation, while the style code represents remaining variations that are not contained in the input image.

**By sampling different style codes**, the model is able to produce **diverse and multimodal outputs.**

# 2. **Overview**

- The translation model consists of an **encoder** *Ei* and a **decoder** *Gi* for each **domain** *Xi* (*i* = 1, 2).
- **(a)** The latent code of each autoencoder is factorized into a **content code** *ci* and a **style code** *si*, where (*ci*, *si*) = (*Eci*(*xi*), *Esi*(*xi*)) = *Ei*(*xi*).
- **(b) Image-to-image translation** is performed by swapping encoder-decoder pairs. For example, to **translate an image** *x*1 ∈ *X*1 to *X*2, its **content latent code** *c*1 = *Ec*1(*x*1) is extracted and a **style latent code** *s*2 is randomly drawn from the prior distribution *q*(*s*2) ~ *N*(0, *I*).
- Then *G*2 is used to produce the final output image *x*1→2 = *G*2(*c*1, *s*2).
- It is noted that although the prior distribution is unimodal, the output image distribution can be multimodal thanks to the nonlinearity of the decoder.
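The encode-swap-decode flow above can be sketched with toy stand-ins. The linear maps `W_c` and `W_g` below are hypothetical placeholders for the learned content encoder *Ec*1 and decoder *G*2 (real MUNIT uses deep convolutional networks); the sketch only illustrates the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for the learned networks.
W_c = rng.standard_normal((8, 16))    # "content encoder" E_c1 for domain X1
W_g = rng.standard_normal((16, 12))   # "decoder" G2 for domain X2

def encode_content(x1):
    # c1 = E_c1(x1): extract the domain-invariant content code
    return W_c @ x1

def decode(c, s):
    # G2(c1, s2): combine content with a target-domain style code
    return W_g @ np.concatenate([c, s])

x1 = rng.standard_normal(16)          # an image from X1 (flattened)
c1 = encode_content(x1)               # 8-dim content code
s2 = rng.standard_normal(4)           # style drawn from the prior q(s2) ~ N(0, I)
x1_to_2 = decode(c1, s2)              # translated image in X2

# A different style sample yields a different translation of the same content.
x1_to_2_alt = decode(c1, rng.standard_normal(4))
```

Sampling several `s2` values from the prior is exactly how the model produces multimodal outputs for one input.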

# 3. **Loss Function**

## 3.1. Bidirectional Reconstruction Loss

- To learn pairs of encoders and decoders that are inverses of each other, objective functions are used that encourage reconstruction in both **image → latent → image** and **latent → image → latent** directions:

## 3.1.1. Image Reconstruction

- Given **an image sampled from the data distribution**, we should be able to **reconstruct it after encoding and decoding**:
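A minimal numpy sketch of this term, assuming flattened images as arrays (the paper uses the L1 distance):

```python
import numpy as np

def image_recon_loss(x, x_recon):
    # L1 image reconstruction loss: distance between an image and its
    # encode-then-decode reconstruction G1(E_c1(x1), E_s1(x1)).
    return np.mean(np.abs(x_recon - x))

x = np.array([0.2, -0.5, 1.0])
perfect = image_recon_loss(x, x)            # -> 0.0
off_by_half = image_recon_loss(x, x + 0.5)  # -> 0.5
```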

## 3.1.2. Latent Reconstruction

- Given **a latent code (style and content) sampled from the latent distribution** at translation time, we should be able to **reconstruct it after decoding and encoding.**

- where *q*(*s*2) is the prior *N*(0, *I*), and *p*(*c*1) is given by *c*1 = *Ec*1(*x*1) with *x*1 ~ *p*(*x*1).
- The other loss terms *Lx*2_*recon*, *Lc*2_*recon*, and *Ls*1_*recon* are defined in a similar manner.
- The **L1 reconstruction loss** is used as it **encourages sharp output images**.
- **The style reconstruction loss** *Lsi_recon* has the effect of **encouraging diverse outputs** given different style codes.
- **The content reconstruction loss** *Lci_recon* encourages the translated image to **preserve the semantic content** of the input image.
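Both latent terms share the same shape, an L1 distance between a sampled code and its decode-then-encode reconstruction; a sketch with hypothetical code values:

```python
import numpy as np

def latent_recon_loss(code, code_recon):
    # Shared form of L^s_recon and L^c_recon: L1 distance between a sampled
    # latent code and its reconstruction after decoding and re-encoding.
    return np.mean(np.abs(code_recon - code))

s2 = np.array([0.1, -0.3])              # style sampled from the prior q(s2)
s2_recon = np.array([0.1, -0.1])        # E_s2(G2(c1, s2)), hypothetical values
loss = latent_recon_loss(s2, s2_recon)  # -> 0.1
```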

## 3.2. Adversarial Loss

- GANs attempt to match the distribution of translated images to the target data distribution. **Images generated by the model should be indistinguishable from real images in the target domain.**

- where
*D*2 is a discriminator that tries to distinguish between translated images and real images in*X*2. The discriminator*D*1 and loss*Lx*1_*GAN*are defined similarly. - This is quite a standard adversarial loss in GAN.
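As a sketch, the standard adversarial objective can be written as a discriminator loss and a generator loss over scalar discriminator outputs in [0, 1]; the function names here are illustrative, not from the paper:

```python
import numpy as np

def d_loss(d_real, d_fake, eps=1e-8):
    # Discriminator side: maximize E[log D2(x2)] + E[log(1 - D2(x_1->2))],
    # written here as a loss to minimize (eps guards against log(0)).
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def g_loss(d_fake, eps=1e-8):
    # Generator side: push D2's scores on translated images toward "real".
    return -np.mean(np.log(d_fake + eps))

# A confident, correct discriminator incurs a lower loss than a guessing one.
confident = d_loss(np.array([0.99]), np.array([0.01]))
guessing = d_loss(np.array([0.5]), np.array([0.5]))
```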

## 3.3. Total Loss

**The encoders, decoders, and discriminators are jointly trained** to optimize the final objective, which is a **weighted sum of the adversarial loss and the bidirectional reconstruction loss terms:**

- where *λx*, *λc*, *λs* are weights that control the importance of the reconstruction terms.
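The weighted sum can be sketched as below; the λ defaults shown are illustrative placeholders, not necessarily the values used in the paper:

```python
def total_loss(l_gan_x1, l_gan_x2,
               l_recon_x1, l_recon_x2,
               l_recon_c1, l_recon_c2,
               l_recon_s1, l_recon_s2,
               lambda_x=10.0, lambda_c=1.0, lambda_s=1.0):
    # Weighted sum of the two adversarial losses and the six
    # bidirectional reconstruction terms (lambda values illustrative).
    return (l_gan_x1 + l_gan_x2
            + lambda_x * (l_recon_x1 + l_recon_x2)
            + lambda_c * (l_recon_c1 + l_recon_c2)
            + lambda_s * (l_recon_s1 + l_recon_s2))
```

The encoders and decoders minimize this objective while the discriminators maximize the adversarial part, giving the usual min-max game.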

# 4. **Theoretical Analysis**

- Some propositions are established.

## 4.1. Proposition 1 (Optimized Encoders and Generators Found When Loss is Minimized)

- Minimizing the proposed loss function leads to 1) matching of latent distributions during encoding and generation, 2) matching of two joint image distributions induced by our framework, and 3) enforcing a weak form of cycle consistency constraint.
- Suppose there exist *E*1\*, *E*2\*, *G*1\*, *G*2\* such that:

- Then:

## 4.2. Proposition 2 (Latent Distribution Matching)

- **The autoencoder training would not help GAN training if the decoder received a very different latent distribution during generation.** Although the loss function does not contain terms that explicitly encourage the match of latent distributions, it has the effect of matching them implicitly.
- When optimality is reached, we have:

- The above proposition shows that **at optimality, the encoded style distributions match their Gaussian priors.**
- A similar result holds for the encoded content distributions, which suggests that the content space becomes domain-invariant.

## 4.3. Proposition 3 (Joint Distribution Matching)

- The model learns two conditional distributions

- which, together with the data distributions, define two joint distributions:

- Since both of them are designed to **approximate the same underlying joint distribution** *p*(*x*1, *x*2), it is desirable that **they are consistent with each other**, i.e.:

- Joint distribution matching provides an
**important constraint**for unsupervised image-to-image translation and is behind the success of many recent methods. The proposed model matches the joint distributions at optimality. When optimality is reached, we have:

## 4.4. Proposition 4 (Style-augmented Cycle Consistency)

- Joint distribution matching can be realized via a **cycle consistency constraint**, as in CycleGAN. However, this constraint is **too strong for multimodal image translation**: the translation model will **degenerate to a deterministic function** if cycle consistency is enforced.
- The **MUNIT framework admits a weaker form of cycle consistency**, termed **style-augmented cycle consistency**, between the image-style joint spaces, which is **more suited for multimodal image translation**.
- When optimality is achieved, we have:

- Intuitively, style-augmented cycle consistency implies that if we translate an image to the target domain and translate it back using the original style, we should obtain the original image. **Style-augmented cycle consistency is implied by the proposed bidirectional reconstruction loss**, but explicitly enforcing it could be useful for some datasets:
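The style-augmented cycle can be traced with toy stand-ins for the encoders and decoders. These split/concatenate maps are hypothetical, chosen only so the cycle closes exactly; they are not the trained networks:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: each "encoder" splits a 12-dim vector into an
# 8-dim content code and a 4-dim style code; each "decoder" concatenates.
def E(x):
    return x[:8], x[8:]

def G(c, s):
    return np.concatenate([c, s])

x1 = rng.standard_normal(12)      # image in domain X1
c1, s1 = E(x1)                    # keep the original style s1
s2 = rng.standard_normal(4)       # random style drawn from the X2 prior
x12 = G(c1, s2)                   # translate X1 -> X2
c2, _ = E(x12)                    # re-encode the translation
x121 = G(c2, s1)                  # translate back WITH the original style s1

cycle_loss = np.mean(np.abs(x121 - x1))  # -> 0.0 for these stand-ins
```

Note the cycle only closes because the original style *s*1 is reused on the way back; a plain CycleGAN-style constraint would have to ignore *s*2 entirely, collapsing the mapping to a deterministic one.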

- (If interested, there are more details about these propositions in the paper.)

# 5. **MUNIT: Network Architecture**

## 5.1. Style Encoder

- The style encoder includes several **strided convolutional layers**, followed by a **global average pooling** (GAP) layer and a **fully connected** (FC) layer.
- **Instance Normalization (IN) layers are NOT used** in the style encoder, since IN removes the original feature mean and variance, which represent important style information.

## 5.2. Decoder

- The decoder **reconstructs the input image from its content and style code.**
- It processes the content code with **a set of residual blocks** and finally produces the reconstructed image with **several upsampling and convolutional layers.**
- Inspired by recent works that use affine transformation parameters in normalization layers to represent styles [54, 72-74], the residual blocks are equipped with **Adaptive Instance Normalization (AdaIN)** [54] layers whose **parameters are dynamically generated by a multilayer perceptron (MLP) from the style code.**

- AdaIN(*z*, *γ*, *β*) = *γ* ((*z* − *μ*(*z*)) / *σ*(*z*)) + *β*
- where *z* is the activation of the previous convolutional layer, *μ* and *σ* are the channel-wise mean and standard deviation, and *γ* and *β* are parameters generated by the MLP.
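A minimal numpy sketch of an AdaIN layer for a single (C, H, W) activation; `gamma` and `beta` stand in for the per-channel parameters the MLP would produce from the style code:

```python
import numpy as np

def adain(z, gamma, beta, eps=1e-5):
    # z: (C, H, W) activation; gamma, beta: (C,) scale/shift from the MLP.
    mu = z.mean(axis=(1, 2), keepdims=True)       # channel-wise mean
    sigma = z.std(axis=(1, 2), keepdims=True)     # channel-wise std
    z_norm = (z - mu) / (sigma + eps)             # instance-normalize
    return gamma[:, None, None] * z_norm + beta[:, None, None]

rng = np.random.default_rng(0)
z = rng.standard_normal((3, 4, 4))
gamma = np.array([1.0, 2.0, 0.5])
beta = np.array([0.0, 1.0, -1.0])
out = adain(z, gamma, beta)
# After AdaIN, each channel's statistics match (gamma, beta), so the style
# code directly controls the feature statistics of the decoder.
```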

## 5.3. Discriminator

- The **LSGAN objective** and **multi-scale discriminators** are used to guide the generators to produce both realistic details and correct global structure.
- (LSGAN uses a least-squares loss function; hope I can write a story about it in the future.)

## 5.4. Domain-Invariant Perceptual Loss

- The perceptual loss is often computed as a distance in the VGG feature space between the output and the reference image.
- A modified version of this loss is used, which is more domain-invariant.
- Specifically, **before computing the distance, Instance Normalization is performed on the VGG features** in order to remove the original feature mean and variance, which contain much domain-specific information.
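A sketch of this normalize-then-compare idea on hypothetical (C, H, W) feature maps standing in for VGG features; the IN step makes the distance nearly invariant to per-channel affine changes of the features, a crude proxy for a domain shift:

```python
import numpy as np

def instance_norm(f, eps=1e-5):
    # Remove per-channel mean/variance from a (C, H, W) feature map;
    # these statistics carry domain-specific appearance information.
    mu = f.mean(axis=(1, 2), keepdims=True)
    sigma = f.std(axis=(1, 2), keepdims=True)
    return (f - mu) / (sigma + eps)

def perceptual_loss(feat_out, feat_ref):
    # Squared distance between IN-normalized feature maps.
    return np.mean((instance_norm(feat_out) - instance_norm(feat_ref)) ** 2)

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 5, 5))
# An affine per-channel change barely moves the normalized distance:
shifted = 2.0 * f + 3.0
```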

# 6. Experimental Results

## 6.1. **Qualitative Comparison**

- Each of the following columns shows 3 random outputs from a method.
- Both UNIT and CycleGAN (with or without noise) fail to generate diverse outputs, despite the injected randomness.
- Without *Lx_recon* or *Lc_recon*, the image quality of MUNIT is unsatisfactory. Without *Ls_recon*, the model suffers from partial mode collapse, with many outputs being almost identical (e.g., the first two rows).

The full model produces images that are both diverse and realistic, similar to BicycleGAN but without needing supervision.

## 6.2. **Quantitative** Comparison

- Human preference is used to measure quality and LPIPS distance is used to evaluate the diversity.
- UNIT and CycleGAN produce very little diversity according to LPIPS distance.
- Removing
*Lx_recon*or*Lc_recon*from MUNIT leads to significantly worse quality. Without*Ls_recon*, both quality and diversity deteriorate.

The full model obtains quality and diversity comparable to the fully supervised BicycleGAN, and significantly better than all unsupervised baselines.

- **Inception Score (IS)** measures the **diversity of all output images,** while **Conditional IS (CIS)** measures the **diversity of outputs conditioned on a single input image.**
- The MUNIT model obtains the highest scores according to both CIS and IS.
- In particular, the baselines all obtain a very low CIS, indicating their failure to generate multimodal outputs from a given input.

## 6.3. Other Datasets

- The model is able to generate SYNTHIA images with
**diverse renderings (e.g., rainy, snowy, sunset)**from a given Cityscape image, and generate Cityscape images with**different lighting, shadow, and road textures**from a given SYNTHIA image.

- Similarly, it generates winter images with different amounts of snow from a given summer image, and summer images with different amounts of leaves from a given winter image.

## 6.4. Example-Guided Image Translation

- Instead of sampling the style code from the prior, it is also possible to **extract the style code from a reference image.**
- Each row has the same content while each column has the same style. The color of the generated shoes and the appearance of the generated cats can be specified by providing example style images.

- Classical style transfer algorithms are also compared.
- MUNIT produces results that are
**significantly more faithful and realistic,**since it learns the distribution of target domain images using GANs.

## Reference

[2018 ECCV] [MUNIT]

Multimodal Unsupervised Image-to-Image Translation

## Generative Adversarial Network (GAN)

**Image Synthesis** [GAN] [CGAN] [LAPGAN] [AAE] [DCGAN] [CoGAN] [SimGAN] [BiGAN] [ALI] **Image-to-image Translation** [Pix2Pix] [UNIT] [CycleGAN] [MUNIT] **Super Resolution** [SRGAN & SRResNet] [EnhanceNet] [ESRGAN] **Blur Detection** [DMENet] **Camera Tampering Detection** [Mantini’s VISAPP’19] **Video Coding** [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]