How to train Multimodal Image-to-Image Translation using MISO

Examples from the paper

Image-to-image translation is the task of translating an image from one domain to another while preserving certain semantic content. There are largely two settings in image-to-image translation: learning from paired data and learning from unpaired data. Collecting paired datasets can be difficult in real-world problems, so learning algorithms based on unpaired data, such as the popular CycleGAN, have gained more interest.

Another property of image-to-image translation is the inherent multimodal nature of the problem. A single image can be translated into multiple, equally realistic images as illustrated in the figure above. Research on unpaired multimodal image-to-image translation such as this paper aims to generate diverse images from a single image in an unpaired dataset.

This paper …

  • Uses a content representation from the source domain conditioned on a style representation from the target domain (the MISO pipeline).
  • Proposes Mutual Information LOss (MILO) as the loss function.
  • Improves the performance of unpaired multimodal image-to-image translation, outperforming other unpaired multimodal models on many of the reported metrics.

Original Paper: MISO: Mutual Information Loss with Stochastic Style Representations for Multimodal Image-to-Image Translation

Previous Approaches

We will briefly discuss the ideas behind previous work on unpaired multimodal image-to-image translation. This section is written based on the MISO paper. For more detail, refer to each paper.

Multimodal mappings can be learned by mapping a pair of (noise, source image) to the target image. BicycleGAN proposes a two-phase training that translates between image and features for multimodal paired translation. Precisely, the training consists of translating X → Z → X (Image-Feature-Image, IFI) and Z → X → Z (Feature-Image-Feature, FIF), each phase trained with different loss functions.

Work on unpaired multimodal image-to-image translation such as MUNIT and DRIT extends the two-phase training by disentangling style and content. Precisely, these methods separate domain-invariant features (content), such as the background and the angle of the face, from domain-specific features (style), such as long hair and beards, that distinguish each domain. The IFI stage of both methods uses a self-reconstruction loss, which is the L1 loss between the source and reconstructed image (similar to cycle-consistency loss?).
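To make the self-reconstruction loss concrete, here is a minimal sketch of an L1 loss between a source image and its reconstruction. The flat pixel lists and the helper name `l1_loss` are my own illustration, not code from MUNIT, DRIT, or MISO.

```python
def l1_loss(source, reconstructed):
    """Mean absolute error between two equally sized flat pixel lists."""
    if len(source) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(s - r) for s, r in zip(source, reconstructed)) / len(source)


# Toy 4-pixel "images" with values in [0, 1] (made-up numbers).
source = [0.0, 0.5, 1.0, 0.25]
reconstructed = [0.0, 0.25, 1.0, 0.75]

sr_loss = l1_loss(source, reconstructed)
print(sr_loss)
```

In a real model the same quantity would be computed over image tensors, but the idea is identical: the loss is zero only when the reconstruction matches the source pixel by pixel.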

Mutual Information with StOchastic Style Representation (MISO)


To summarize, our goal is to learn a one-to-many mapping from domain A to B, or from the source domain S to the target domain T. Strictly speaking, the one-to-many mapping is implemented by learning p(t|s, z) where t ∈ T, s ∈ S, and z ∼ N(0, I).

Training MISO w/ two-stages

The pipeline consists of a style encoder and a discriminator for each domain, and a generator and a conditional encoder for each direction. In the figure above, E_A and E_B represent the style encoders, E_BA and E_AB represent the conditional encoders, D_A and D_B represent the discriminators, and G_AB and G_BA represent the generators, each for the domain or direction given by its subscript.

The z vector conceptually represents the desired style of the image because it directly influences the multimodality of the mapping. Style encoders (not conditional encoders) receive an image from either domain A or B and predict a corresponding z vector. To avoid single deterministic mappings, the encoders are VAEs that inject noise into the latent space.
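As a rough sketch of how a VAE-style encoder injects noise, the snippet below samples a style vector with the reparameterization trick and computes the usual KL term against N(0, I). The encoder outputs `mu` and `log_var` are made-up numbers and the function names are hypothetical; the paper's actual encoder is a neural network.

```python
import math
import random


def sample_style(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)                   # fixed seed for repeatability
mu, log_var = [0.3, -1.2], [-0.5, 0.1]   # hypothetical encoder outputs

z1 = sample_style(mu, log_var, rng)
z2 = sample_style(mu, log_var, rng)      # same image, different style sample
kl = kl_to_standard_normal(mu, log_var)
```

Because each call draws fresh noise, the same input image yields a different z every time, which is what keeps the mapping from collapsing to a single deterministic output.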

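The key is that the style is encoded from an image of the target domain while the content comes from the source image. For example, when translating B → A with G_BA(b, z_a), the style code z_a is encoded from an image of domain A and the content comes from the image b of domain B. The pipeline looks similar to the BicycleGAN pipeline.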

Examples from the paper

Mutual Information LOss (MILO)

Next, the authors point out that the self-reconstruction (SR) loss widely used in multimodal translation can be problematic. Previous work suggests that the SR loss fails to capture detailed features because it can encourage averaging pixel values. MILO is suggested as an alternative to the SR loss.

Multimodal translation aims to learn the conditional distribution p(t|s, z). The authors view z as a random variable with a posterior of p(z|x), x ∈ X. This also gives randomness to the features extracted by the conditional encoder. Conceptually, MILO is designed to better utilize this randomness when measuring the loss.

The MILO loss maximizes the mutual information between the style feature z_a = E_A(a) and the image generated from that feature, G_BA(b, z_a). The mutual information is approximated as in the equation below, based on InfoGAN.
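For reference, the variational lower bound on mutual information introduced by InfoGAN is commonly written as follows. The notation here (latent code c, noise z, auxiliary distribution Q) is InfoGAN's rather than MISO's, so the version in the MISO paper will differ in symbols.

```latex
I\bigl(c;\, G(z, c)\bigr) \;\ge\; \mathbb{E}_{c \sim P(c),\; x \sim G(z, c)}\bigl[\log Q(c \mid x)\bigr] + H(c)
```

Roughly, Q tries to recover the code c from the generated image x, and the bound is tight when Q matches the true posterior of c given x.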

This is rewritten as the formula below after approximating distributions based on various statistical properties of the components of MISO. The formula below, denoted L_info, can be calculated directly, where µ_out and σ_out are outputs of the encoder.
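The exact form of L_info is given in the paper. As an illustration only, assuming µ_out and σ_out parameterize a diagonal Gaussian over the style code, the snippet below computes a generic Gaussian negative log-likelihood of a style vector z under that Gaussian. This is a standard formula I am using as a stand-in, not necessarily the paper's L_info, and all numbers are made up.

```python
import math


def gaussian_nll(z, mu_out, sigma_out):
    """Negative log-likelihood of z under a diagonal Gaussian N(mu_out, sigma_out^2)."""
    total = 0.0
    for zi, m, s in zip(z, mu_out, sigma_out):
        if s <= 0.0:
            raise ValueError("sigma_out must be positive")
        total += (0.5 * math.log(2.0 * math.pi)
                  + math.log(s)
                  + (zi - m) ** 2 / (2.0 * s * s))
    return total


z = [0.5, -1.0]                                     # style code fed to the generator
near = gaussian_nll(z, [0.4, -0.9], [0.5, 0.5])     # prediction close to z
far = gaussian_nll(z, [2.0, 1.0], [0.5, 0.5])       # prediction far from z
```

The loss is small when the predicted mean lands near the original style code, so minimizing it pushes the generated image to stay informative about the style that produced it.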

For full detail, refer to the original paper. I admit that I wasn’t able to interpret the full process due to the heavy math load.

The full objective function combines this MILO loss with a KL-divergence term, a cycle-consistency loss, and an adversarial loss. These losses follow the standard formulations we commonly use, and details about when each loss is used are described in the figure describing the training pipeline. The generator is trained using the equation below, which computes a weighted sum of the losses.
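The weighted sum can be sketched as below. The loss values and the weights are placeholders I made up for illustration; the paper's actual coefficients are in its training details.

```python
def generator_objective(losses, weights):
    """Weighted sum of named loss terms; every loss needs a weight."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical per-batch loss values and weights (not from the paper).
losses = {"adversarial": 0.70, "cycle": 0.20, "kl": 0.05, "milo": 0.40}
weights = {"adversarial": 1.0, "cycle": 10.0, "kl": 0.01, "milo": 1.0}

total = generator_objective(losses, weights)
```

Keeping the terms in a dictionary makes it easy to log each one separately, which helps when one loss dominates the others during training.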


The method is evaluated on 4 unpaired image-to-image translation datasets: Male ↔ Female, Art ↔ Photo, Summer ↔ Winter, and Cat ↔ Dog. MISO was able to outperform other unpaired multimodal translation models on many metrics.

MISO achieves the best classification accuracy on CelebA sex-conditional image generation when compared with other multimodal and non-multimodal techniques. This suggests that MISO generates images that successfully contain domain-specific features.

The classifier trained on CelebA attributes can successfully identify the intended sex of the generated image. (F and M denote female and male.)
LPIPS distance during translation (I: input, O: output)

MISO was the most preferred method in user studies and on perceptual metrics (LPIPS) compared with other unpaired translation methods. Lower LPIPS between I ↔ O means the content is preserved, and higher LPIPS between O ↔ O means that the outputs are more diverse. For example, CycleGAN seems to generate realistic but not diverse images.
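The two comparisons can be sketched as follows. Real LPIPS needs a pretrained network, so this example plugs in a simple mean absolute difference as a stand-in distance; the function names and toy "images" are hypothetical.

```python
from itertools import combinations


def mean_abs_diff(a, b):
    """Stand-in distance; real LPIPS compares deep network features."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def content_distance(inp, outputs, dist=mean_abs_diff):
    """Average I <-> O distance: lower means content is better preserved."""
    return sum(dist(inp, o) for o in outputs) / len(outputs)


def diversity(outputs, dist=mean_abs_diff):
    """Average O <-> O distance over all output pairs: higher means more diverse."""
    pairs = list(combinations(outputs, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


inp = [0.2, 0.4, 0.6]                                   # toy input image
outputs = [[0.2, 0.5, 0.6], [0.3, 0.4, 0.6], [0.2, 0.4, 0.9]]

io_score = content_distance(inp, outputs)
oo_score = diversity(outputs)
```

A model that returns the same image for every style sample would get an O ↔ O score of zero under any distance, which is the failure mode the diversity metric is meant to expose.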

User study results

The results are fairly clear when we actually compare examples. In the figure below, the images generated by MISO are noticeably more varied and of higher quality.

More qualitative analysis on the latent space and examples of generated images are provided in the original paper.


  • This paper proposes an improved pipeline for unpaired multimodal image-to-image translation and improves the perceptual quality in various settings.
  • This paper proposes the mutual information loss (MILO), motivated by viewing z as a random variable, to replace the problematic self-reconstruction loss.

I learned that image-to-image translation is an inherently multimodal problem. The MISO framework proposed in this paper was interesting, in that it modeled the abstract concept of style and content.




Sieun Park
