
Review — CoGAN: Coupled Generative Adversarial Networks (GAN)

With Weight Sharing, Generates Correlated Outputs in Different Domains for the Same Input, Outperforms CGAN

Sik-Ho Tsang

Published in

CodeX

6 min read · Mar 28, 2021

**Face Generation With and Without Smiling**

In this story, Coupled Generative Adversarial Networks (CoGAN), by Mitsubishi Electric Research Labs (MERL), is reviewed.

The paper concerns the problem of learning a joint distribution of multi-domain images from data.

In this paper:

A single input vector can generate correlated outputs in different domains through multiple GANs with weight sharing.
Possible applications: producing a color image and a depth image that are highly correlated (i.e., describing the same scene), or images of the same face with different attributes (smiling and non-smiling).

This is a paper in 2016 NIPS with over 1100 citations. (Sik-Ho Tsang @ Medium)

Outline

Coupled Generative Adversarial Network (CoGAN)
Experimental Results

1. Coupled Generative Adversarial Network (CoGAN)

CoGAN, as illustrated in the figure above, is designed for learning a joint distribution of images in two different domains.
It consists of a pair of GANs, GAN1 and GAN2; each is responsible for synthesizing images in one domain.

With weight sharing, a trained CoGAN can be used to synthesize pairs of corresponding images — pairs of images sharing the same high-level abstraction but having different low-level realizations.

1.1. Generators

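Both g1 and g2 are realized as multilayer perceptrons (MLP):

$$
g_1(z) = g_1^{(m_1)}\Big(g_1^{(m_1-1)}\big(\cdots g_1^{(2)}\big(g_1^{(1)}(z)\big)\big)\Big), \qquad
g_2(z) = g_2^{(m_2)}\Big(g_2^{(m_2-1)}\big(\cdots g_2^{(2)}\big(g_2^{(1)}(z)\big)\big)\Big)
$$

where g1^(i) and g2^(i) are the i-th layers of g1 and g2, m1 and m2 are the numbers of layers in g1 and g2, and z is the input vector.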
Through layers of perceptron operations, the generative models gradually decode information from more abstract concepts to more material details.
The first layers decode high-level semantics and the last layers decode low-level details.
No constraints are enforced on the last layers.

The idea is to force the first layers of g1 and g2 to have identical structure and share the weights.
With weight sharing, the pair of images shares the same high-level abstraction while having different low-level realizations.
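To make the weight-sharing idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's architecture: the layer sizes, the random initialization, the activations, and the helper names (`make_layer`, `dense`) are all assumptions for illustration. The only point is that both generators call the same first-layer weights while keeping separate last layers.

```python
import math
import random

def make_layer(n_out, n_in, rng):
    """Random weight matrix stored as a list of rows (toy initialization)."""
    return [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def dense(weights, x, activation):
    """One perceptron layer: activation(W x); biases omitted for brevity."""
    return [activation(sum(w * v for w, v in zip(row, x))) for row in weights]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

rng = random.Random(0)
Z_DIM, HIDDEN, PIXELS = 4, 8, 6  # hypothetical toy sizes

# First layer: ONE weight matrix used by both generators (weight sharing).
shared_first = make_layer(HIDDEN, Z_DIM, rng)
# Last layers: separate weights per domain, no constraint between them.
last_1 = make_layer(PIXELS, HIDDEN, rng)
last_2 = make_layer(PIXELS, HIDDEN, rng)

def g1(z):
    """Generator for domain 1: shared first layer, own last layer."""
    return dense(last_1, dense(shared_first, z, math.tanh), sigmoid)

def g2(z):
    """Generator for domain 2: same shared first layer, own last layer."""
    return dense(last_2, dense(shared_first, z, math.tanh), sigmoid)

z = [rng.uniform(-1.0, 1.0) for _ in range(Z_DIM)]
x1, x2 = g1(z), g2(z)  # a pair of toy "images" decoded from the same z
```

Because `shared_first` is a single object, any update to it would affect both generators at once, which is what ties the high-level content of the two outputs together.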

1.2. Discriminators

The discriminative models map an input image to a probability score, estimating the likelihood that the input is drawn from a true data distribution.
The first layers of the discriminative models extract low-level features, while the last layers extract high-level features.
Similar to the generators, the discriminators share weights, but in their last layers, which extract the high-level features.

It was later found that this sharing does not noticeably improve the quality of the synthesized images. It is still used because the weight-sharing constraint in the discriminators reduces the total number of parameters in the network, though it is not essential for learning a joint distribution.
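The parameter saving can be illustrated with simple arithmetic. The sketch below counts the parameters of two fully connected discriminators that share their last k layers; the layer sizes (784, 256, 64, 1) are hypothetical and are not taken from the paper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Hypothetical discriminator: 784 -> 256 -> 64 -> 1 (not the paper's architecture).
layer_sizes = [(784, 256), (256, 64), (64, 1)]
per_layer = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]

def total_params(num_shared_last_layers):
    """Total parameters of two discriminators sharing their last k layers."""
    split = len(per_layer) - num_shared_last_layers
    unshared = 2 * sum(per_layer[:split])  # one separate copy per domain
    shared = sum(per_layer[split:])        # stored once, used by both
    return unshared + shared

for k in range(len(per_layer) + 1):
    print(k, total_params(k))
```

In this toy setting the total shrinks each time one more last layer is shared, since a shared layer is stored only once.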

1.3. Learning

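CoGAN training is a constrained minimax game with two teams of two players each: the two generators form one team and work together to synthesize image pairs that confuse the discriminators, while the two discriminators form the other team.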
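As a sketch, the game can be written by applying the standard GAN objective to each domain and summing the two. The symbols f1 and f2 for the discriminators, and the sign convention, are assumptions made here; the paper may state the same objective in an equivalent form.

$$
\min_{g_1, g_2} \max_{f_1, f_2} \;
\mathbb{E}_{x_1 \sim p_{X_1}}\big[\log f_1(x_1)\big]
+ \mathbb{E}_{z \sim p_Z}\big[\log\big(1 - f_1(g_1(z))\big)\big]
+ \mathbb{E}_{x_2 \sim p_{X_2}}\big[\log f_2(x_2)\big]
+ \mathbb{E}_{z \sim p_Z}\big[\log\big(1 - f_2(g_2(z))\big)\big]
$$

subject to the weight-sharing constraints: the first layers of g1 and g2 use the same weights, and so do the last layers of f1 and f2. Note that both generators receive the same z, which is what couples the two GANs.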

Similar to GAN, CoGAN can be trained by back propagation with the alternating gradient update steps.

In each iteration, the two discriminators are updated one after the other, and then the two generators are updated one after the other.
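The update order and the quantity being optimized can be sketched as follows. This is only an illustration under assumptions: `value` is a single-sample estimate of the game value in the standard GAN form, the player names `f1`, `f2`, `g1`, `g2` are labels chosen here, and `update` is a hypothetical callback standing in for one gradient step.

```python
import math

def value(p1_real, p1_fake, p2_real, p2_fake):
    """Single-sample estimate of the game value (standard GAN form).

    p*_real: discriminator output on a real image of that domain.
    p*_fake: discriminator output on a generated image of that domain.
    The discriminators try to increase it; the generators try to decrease it.
    """
    return (math.log(p1_real) + math.log(1.0 - p1_fake)
            + math.log(p2_real) + math.log(1.0 - p2_fake))

def cogan_iteration(update):
    """One alternating round; `update(name)` applies one gradient step."""
    for player in ("f1", "f2"):  # discriminators first, one by one
        update(player)
    for player in ("g1", "g2"):  # then generators, one by one
        update(player)

order = []
cogan_iteration(order.append)  # record the order instead of training
print(order)
```

With shared weights, a step on one player also moves the parameters it shares with its teammate, so the four updates are not fully independent.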

The network architectures are different for different applications such as for digit generation and face generation, as below.

**Network architecture for digit generation**

**Network architecture for face generation**

(The network architectures and details of training are in the supplementary material of the paper. Please feel free to visit the paper.)

2. Experimental Results

2.1. Digit Generation

**Left: Edge MNIST, Right: Negative MNIST**

Left (Task A): As seen, with the same input vector, CoGAN can generate the same digit in both its normal form and its edge form.
Right (Task B): Similar results for positive and negative MNIST.

The figures plot the average pixel agreement ratios of CoGANs with different weight-sharing configurations for Tasks A and B. The larger the pixel agreement ratio, the better the pair generation performance.
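As I understand the metric, the image generated for the 1st domain is transformed into the 2nd domain with the known transform and then compared pixel by pixel with the image generated for the 2nd domain. Below is a minimal sketch for Task B (negative images). The binarization threshold and the helper names are assumptions for illustration, not details from the paper.

```python
def to_negative(image):
    """Task B transform: invert an image whose pixel values lie in [0, 1]."""
    return [[1.0 - p for p in row] for row in image]

def binarize(image, threshold=0.5):
    """Map each pixel to 0 or 1 so that 'same value' is well defined."""
    return [[1 if p >= threshold else 0 for p in row] for row in image]

def pixel_agreement_ratio(image_a, image_b, threshold=0.5):
    """Fraction of corresponding pixels with the same binarized value."""
    a, b = binarize(image_a, threshold), binarize(image_b, threshold)
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total

# Toy 2x2 "images" standing in for the two generator outputs for one z.
generated_1 = [[0.9, 0.1], [0.2, 0.8]]
generated_2 = [[0.1, 0.9], [0.8, 0.7]]  # last pixel is not a proper negative

ratio = pixel_agreement_ratio(to_negative(generated_1), generated_2)
print(ratio)  # 3 of 4 pixels agree
```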

It is found that the performance was positively correlated with the number of weight-sharing layers in the generative models, but was uncorrelated with the number of weight-sharing layers in the discriminative models.

For comparison, Conditional GAN (CGAN) is implemented. With 0 as the condition input, the CGAN generator produces images resembling the 1st domain; otherwise, it generates images in the 2nd domain.
For Task A, CoGAN achieved an average ratio of 0.952, outperforming 0.909 achieved by the CGAN.
For Task B, CoGAN achieved a score of 0.967, which was much better than 0.778 achieved by the CGAN.

2.2. Face Generation

**Generation of face images with different attributes using CoGAN.**

From top to bottom, the figure shows pair face generation results for the blond-hair, smiling, and eyeglasses attributes.
For each pair, the 1st row contains faces with the attribute, while the 2nd row contains corresponding faces without the attribute.

When traveling through the input space, the faces gradually change from one person to another, and these deformations are consistent across both domains.
Note that it is difficult to create a dataset with corresponding images for some attributes, such as blond hair, since the subjects would have to dye their hair.
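Traveling in the input space can be sketched as linear interpolation between two sampled input vectors, with each intermediate vector fed to both generators. The function name `interpolate`, the input dimension, and the uniform sampling range below are assumptions for illustration; the trained generators themselves are not defined here.

```python
import random

def interpolate(z_start, z_end, steps):
    """Evenly spaced points on the line from z_start to z_end (steps >= 2)."""
    path = []
    for k in range(steps):
        t = k / (steps - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

rng = random.Random(0)
Z_DIM = 100  # assumed input dimension
z_a = [rng.uniform(-1.0, 1.0) for _ in range(Z_DIM)]
z_b = [rng.uniform(-1.0, 1.0) for _ in range(Z_DIM)]

path = interpolate(z_a, z_b, steps=8)
# With trained generators g1 and g2 (not defined in this sketch), the pairs
# would be rendered as: pairs = [(g1(z), g2(z)) for z in path]
```

Since every point on the path goes through both generators, each face with the attribute has a counterpart without it at the same position along the path.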

2.3. Color and Depth Images Generation

**Generation of color and depth images using CoGAN.**

The top figure shows the results for the RGBD dataset: the 1st row contains the color images, the 2nd row contains the depth images, and the 3rd and 4th rows visualize the depth profile from different viewpoints.
The bottom figure shows the results for the NYU dataset.

CoGAN recovered the appearance–depth correspondence in an unsupervised manner.

2.4. Potential Applications

**Unsupervised domain adaptation performance comparison.**

Unsupervised Domain Adaptation (UDA): UDA concerns adapting a classifier trained in one domain to classify samples in a new domain where there is no labeled example in the new domain for re-training the classifier.
(Domain Adaptation is not the main contribution in this paper. So, I don’t take a deep look into it. If interested, please feel free to read the paper.)

Cross-domain image transformation: For each pair, left is the input; right is the transformed image.
(The authors only introduce the potential applications in this part. If interested in the details, please refer to the paper.)

Later, the authors extended CoGAN to image-to-image translation, and that work was published in 2017 NIPS. I hope to review it in the future.

Reference

[2016 NIPS] [CoGAN]
Coupled Generative Adversarial Networks

Generative Adversarial Network (GAN)

Image Synthesis [GAN] [CGAN] [LAPGAN] [DCGAN] [CoGAN]
Image-to-image Translation [Pix2Pix]
Super Resolution [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
Blur Detection [DMENet]
Camera Tampering Detection [Mantini’s VISAPP’19]
Video Coding [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]