# How to train Multimodal Image-to-Image Translation using MISO

Image-to-image translation is the task of translating an image from one domain to another while preserving certain semantic content. The task largely splits into two settings: learning from paired data and learning from unpaired data. Collecting paired datasets can be impractical in real-world problems, so learning algorithms based on unpaired data, such as the popular CycleGAN, have gained more interest.

Another property of image-to-image translation is the inherent multimodal nature of the problem. A single image can be translated into multiple, equally realistic images as illustrated in the figure above. Research on *unpaired multimodal image-to-image translation* such as this paper aims to generate diverse images from a single image in an unpaired dataset.

This paper …

- Uses a *content representation* from the source domain conditioned on a *style representation* from the target domain (the MISO pipeline).
- Proposes the Mutual Information LOss (MILO) as the loss function.
- Improves the performance of unpaired multimodal image-to-image translation to surprising levels.

## Previous Approaches

We will briefly discuss the ideas of previous work on unpaired multimodal image-to-image translation. This section is based on the MISO paper; for more detail, refer to each paper.

Multimodal mappings can be learned by mapping a pair of (noise, source image) to the target image. BicycleGAN proposes a two-phase training scheme that translates between images and *features* for multimodal paired translation. Precisely, the training consists of translating X → Z → X (Image-Feature-Image, IFI) and Z → X → Z (Feature-Image-Feature, FIF), each phase trained with different loss functions.
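The two training phases can be sketched with toy numpy stand-ins. `E` and `G` below are hypothetical placeholders, not BicycleGAN's actual convolutional networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for an encoder E (image -> feature) and generator G
# (feature -> image); the real networks are deep CNNs.
E = lambda x: x.mean(axis=-1, keepdims=True)   # "encoder"
G = lambda z: np.repeat(z, 4, axis=-1)         # "generator"

x = rng.normal(size=(2, 4))   # a batch of "images"
z = rng.normal(size=(2, 1))   # latent feature codes

# IFI phase: X -> Z -> X, trained with an image reconstruction loss.
x_rec = G(E(x))
loss_ifi = np.abs(x - x_rec).mean()   # L1 loss in image space

# FIF phase: Z -> X -> Z, trained with a latent reconstruction loss.
z_rec = E(G(z))
loss_fif = np.abs(z - z_rec).mean()   # L1 loss in feature space
```

The point of the two phases is that each reconstruction path is supervised in its own space: images in IFI, features in FIF.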

Work on unpaired multimodal image-to-image translation such as MUNIT and DRIT extends the two-phase training by disentangling *style* and *content*: domain-invariant features (content) such as the background or the angle of the face, and domain-specific features (style) such as long hair or a beard that distinguish each domain. The IFI stage of both methods uses a self-reconstruction loss, the L1 loss between the source and the reconstructed image (similar to a cycle-consistency loss?).

## Mutual Information with StOchastic Style Representation (MISO)

*StOchastic…?*

To summarize, our goal is to learn a one-to-many mapping from domain A to domain B, or from the source domain S to the target domain T. Strictly speaking, the one-to-many mapping is implemented by learning p(t|s, z) where t ∈ T, s ∈ S, and z ∼ *N*(0, I).

The pipeline consists of two **style encoders** and **discriminators** for each domain, and two **generators** and **conditional encoders** for each direction. In the figure above, E_A and E_B are the style encoders, E_BA and E_AB the conditional encoders, D_A and D_B the discriminators, and G_AB and G_BA the generators, each for the domain or direction indicated by its subscript.

The z vector conceptually represents the desired style of the image because it directly influences the multimodality of the mapping. The style encoders (not the conditional encoders) receive an image from either domain A or B and predict a corresponding z vector. To avoid a single deterministic mapping, the encoders are VAEs that inject noise into the latent space.
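This stochastic encoding can be illustrated with the standard VAE reparameterization trick. The sketch below is minimal numpy; the 8-dimensional code and the shapes are assumptions, not the paper's actual architecture:

```python
import numpy as np

def stochastic_style_code(mu, log_var, rng):
    """Sample a style code z = mu + sigma * eps (VAE reparameterization).

    mu and log_var stand in for the style encoder's outputs for one image;
    the noise eps makes the image-to-style mapping non-deterministic.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.zeros(8)        # encoder's predicted mean (hypothetical dimension)
log_var = np.zeros(8)   # encoder's predicted log-variance

# Encoding the same image twice yields different style codes.
z1 = stochastic_style_code(mu, log_var, rng)
z2 = stochastic_style_code(mu, log_var, rng)
```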

The key is that the *content* is encoded from the source domain while the *style* is encoded from the target domain. The pipeline is seemingly similar to the BicycleGAN pipeline.

## Mutual Information LOss (MILO)

Next, the authors point out that the self-reconstruction (SR) loss widely used in multimodal translation can be problematic. Previous work suggests that the SR loss fails to capture detailed features because it can encourage averaging pixel values. The MILO loss is proposed as an alternative to the SR loss.

Multimodal translation aims to learn the conditional distribution p(t|s, z). The authors view z as a random variable with a posterior of p(z|x), x ∈ X. This also gives randomness to the features extracted by the conditional encoder. Conceptually, MILO is designed to better utilize this randomness when measuring the loss.

The MILO loss maximizes the *mutual information* between the style code z_a = E_A(a) and the image generated from it, G_BA(b, z_a). The mutual information is approximated as in the equation below, based on InfoGAN.

This is rewritten as the formula below after approximating distributions based on various statistical properties of the components of MISO. The formula below denoted as L_info can be calculated straightforwardly, where µ_out and σ_out are outputs of the encoder.
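As a rough numeric illustration, an InfoGAN-style bound of this kind boils down to maximizing the log-likelihood of the style code under a Gaussian parameterized by the encoder outputs µ_out and σ_out. The sketch below is a generic version of such a term, not the paper's exact formula:

```python
import numpy as np

def gaussian_log_likelihood(z, mu_out, sigma_out):
    """log N(z; mu_out, sigma_out^2), summed over latent dimensions.

    Maximizing this with respect to the generator and encoder tightens a
    variational lower bound on the mutual information (InfoGAN-style).
    """
    var = sigma_out ** 2
    return np.sum(
        -0.5 * np.log(2.0 * np.pi * var) - (z - mu_out) ** 2 / (2.0 * var)
    )

# Sanity check against the standard normal:
# log N(0; 0, 1) = -0.5 * log(2*pi) per dimension.
ll = gaussian_log_likelihood(np.zeros(1), np.zeros(1), np.ones(1))
```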

For full detail, refer to the original paper. I admit that I wasn’t able to interpret the full process due to the heavy math load.

The full objective function involves a combination of the MILO loss, a KL-divergence loss, a cycle-consistency loss, and an adversarial loss. These losses follow the standard formulations commonly used in the literature, and the figure describing the training pipeline shows when each loss is applied. The generator is trained with the equation below, a weighted sum of the individual losses.
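The weighted sum itself is straightforward; a sketch with hypothetical values (the loss values and λ weights below are illustrative, not the paper's):

```python
# Hypothetical per-batch loss values (placeholders, not real measurements).
losses = {"adv": 1.3, "milo": 0.7, "cycle": 0.2, "kl": 0.05}

# Hypothetical lambda weights balancing the terms.
weights = {"adv": 1.0, "milo": 1.0, "cycle": 10.0, "kl": 0.1}

# Total generator objective: weighted sum of the individual losses.
total_loss = sum(weights[k] * losses[k] for k in losses)
```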

## Experiments

The method is evaluated on 4 unpaired image-to-image translation datasets: Male ↔ Female, Art ↔ Photo, Summer ↔ Winter, and Cat ↔ Dog. MISO was able to outperform other unpaired multimodal translation models in many metrics.

MISO achieves the best performance on CelebA sex-conditional image generation when compared via classification accuracy with other multimodal and non-multimodal techniques. This suggests that MISO generates images that successfully contain domain-specific features.

MISO was the most preferred method in user studies and on perceptual metrics (LPIPS) compared with other unpaired translation methods. Lower LPIPS between I ↔ O means the content is preserved, while higher LPIPS between O ↔ O means the outputs are more diverse. For example, CycleGAN seems to generate realistic but not diverse images.
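The O ↔ O diversity measurement can be sketched as the average pairwise distance over several translations of the same input. Plain L2 stands in for LPIPS here only to keep the example self-contained; the actual metric is a learned perceptual distance:

```python
import numpy as np
from itertools import combinations

def mean_pairwise_distance(outputs, dist):
    """Average distance over all unordered pairs of outputs (O <-> O).

    `dist` is a stand-in for a perceptual metric such as LPIPS.
    Higher values indicate more diverse translations.
    """
    pairs = list(combinations(outputs, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

l2 = lambda a, b: float(np.linalg.norm(a - b))

rng = np.random.default_rng(0)
# Five "translations" of the same input with different style codes z.
outputs = [rng.normal(size=16) for _ in range(5)]
diversity = mean_pairwise_distance(outputs, l2)

# Identical outputs (a mode-collapsed generator) score zero diversity.
identical = [np.zeros(16)] * 5
```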

The results are somewhat obvious when we actually compare examples. In the figure below, MISO generates images of incomparable variety and quality.

More qualitative analysis on the latent space and examples of generated images are provided in the original paper.

## Summary

- This paper proposes an improved pipeline for unpaired multimodal image-to-image translation and improves the perceptual quality in various settings.
- This paper proposes MILO, a mutual information loss inspired by viewing z as a random variable, which replaces the problematic self-reconstruction loss.

I learned that image-to-image translation is an inherently multimodal problem. The MISO framework proposed in this paper was interesting in that it models the abstract concepts of style and content.