Review — CoGAN: Coupled Generative Adversarial Networks (GAN)

With Weight Sharing, Generates Correlated Outputs in Different Domains for the Same Input, Outperforms CGAN

Face Generation With and Without Smiling

In this story, Coupled Generative Adversarial Networks (CoGAN), by Mitsubishi Electric Research Labs (MERL), is reviewed.

The paper concerns the problem of learning a joint distribution of multi-domain images from data.

In this paper:

  • A single input vector can generate correlated outputs in different domains through multiple GANs with weight sharing.
  • Possible applications: producing a color image and a depth image that are highly correlated, i.e. describing the same scene, or producing images of the same face with different attributes (smiling and non-smiling).

This is a paper in 2016 NIPS with over 1100 citations. (Sik-Ho Tsang @ Medium)


  1. Coupled Generative Adversarial Network (CoGAN)
  2. Experimental Results

1. Coupled Generative Adversarial Network (CoGAN)

Coupled Generative Adversarial Network (CoGAN)
  • CoGAN, as illustrated in the above figure, is designed for learning a joint distribution of images in two different domains.
  • It consists of a pair of GANs — GAN1 and GAN2; each is responsible for synthesizing images in one domain.

With weight sharing, a trained CoGAN can be used to synthesize pairs of corresponding images — pairs of images sharing the same high-level abstraction but having different low-level realizations.

1.1. Generators

  • Both g1 and g2 are realized as multilayer perceptrons (MLPs): g1(z) = g(m1)1(g(m1−1)1(…g(1)1(z))), and similarly for g2.
  • Here, g(i)1 and g(i)2 are the ith layers of g1 and g2, and m1 and m2 are the numbers of layers in g1 and g2.
  • Through layers of perceptron operations, the generative models gradually decode information from more abstract concepts to more material details.
  • The first layers decode high-level semantics and the last layers decode low-level details.
  • No constraints are enforced to the last layers.

The idea is to force the first layers of g1 and g2 to have identical structure and share the weights.

With weight sharing, the pair of images can share the same high-level abstraction but have different low-level realizations.
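The generator weight sharing above can be sketched in a few lines. This is an illustrative toy only: the actual CoGAN generators are deep (de)convolutional networks trained adversarially, and all layer sizes and weight values below are made up.

```python
# Toy sketch of CoGAN generator weight sharing: a shared first layer decodes
# the high-level abstraction; domain-specific last layers render different
# low-level realizations. All weights here are invented for illustration.

def affine(w, b, x):
    """Apply y = W x + b for a small weight matrix given as nested lists."""
    return [sum(w[i][j] * x[j] for j in range(len(x))) + b[i]
            for i in range(len(w))]

# Shared first layer (identical for g1 and g2).
w_shared, b_shared = [[1.0, 0.5], [-0.5, 1.0]], [0.1, -0.1]

# Domain-specific last layers (not shared).
w1, b1 = [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]    # domain 1 (e.g. normal)
w2, b2 = [[-1.0, 0.0], [0.0, -1.0]], [1.0, 1.0]  # domain 2 (e.g. negative)

def g1(z):
    return affine(w1, b1, affine(w_shared, b_shared, z))

def g2(z):
    return affine(w2, b2, affine(w_shared, b_shared, z))

z = [0.3, -0.2]
h = affine(w_shared, b_shared, z)  # same high-level code feeds both domains
x1, x2 = g1(z), g2(z)              # correlated outputs, different renderings
```

Because both outputs are deterministic functions of the same shared code h, they are correlated by construction, which is exactly what the weight-sharing constraint is meant to enforce.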

1.2. Discriminators

  • The discriminative models map an input image to a probability score, estimating the likelihood that the input is drawn from a true data distribution.
  • The first layers of the discriminative models extract low-level features, while the last layers extract high-level features.
  • Analogous to the generators, weight sharing is applied, but here to the last layers, since it is these that extract the high-level features.

However, it was later found that weight sharing in the discriminators does not noticeably improve the quality of the synthesized images. It is still used, because the weight-sharing constraint in the discriminators reduces the total number of parameters in the network, even though it is not essential for learning a joint distribution.
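The mirrored sharing in the discriminators can be sketched the same way. Again a toy only: the real discriminators are convolutional networks, and every weight below is invented for illustration.

```python
import math

# Toy sketch of CoGAN discriminator weight sharing: domain-specific first
# layers extract low-level features; the shared last layer maps high-level
# features to a realness score. All weights here are made up.

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

w_low1 = [0.5, -0.5]   # domain-1 low-level feature extractor (not shared)
w_low2 = [-0.5, 0.5]   # domain-2 low-level feature extractor (not shared)
w_last = 1.0           # shared last-layer weight (features -> score)

def f1(x):
    """Discriminator for domain 1: probability that x is a real sample."""
    return sigmoid(w_last * dot(w_low1, x))

def f2(x):
    """Discriminator for domain 2: shares the last layer with f1."""
    return sigmoid(w_last * dot(w_low2, x))
```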

1.3. Learning

  • Training corresponds to a game with two teams, each having two players: the generators g1 and g2 form one team, playing against the team of discriminators f1 and f2.
  • Similar to GAN, CoGAN can be trained by back propagation with the alternating gradient update steps.

Basically, the alternating gradient update steps train the two discriminators one by one, then the two generators one by one, alternating between the two phases.
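The update schedule can be written out as a skeleton loop. This is a sketch only: the `step_*` functions are hypothetical stand-ins for real gradient steps, each of which would backpropagate a GAN loss and jointly update any shared weights.

```python
# Schematic of CoGAN's alternating gradient updates. The step functions
# merely record the update order here; in a real implementation each would
# perform one gradient step on the corresponding model's objective.

log = []

def step_discriminator(name):
    # Stand-in for a gradient ascent step on one discriminator's objective.
    log.append(name)

def step_generator(name):
    # Stand-in for a gradient descent step on one generator's objective.
    log.append(name)

def train(num_iters):
    for _ in range(num_iters):
        # First the two discriminators, one by one...
        step_discriminator("f1")
        step_discriminator("f2")
        # ...then the two generators, one by one.
        step_generator("g1")
        step_generator("g2")

train(2)
# log records the update order: f1, f2, g1, g2, f1, f2, g1, g2
```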

  • The network architectures are different for different applications such as for digit generation and face generation, as below.
Network architecture for digit generation
Network architecture for face generation
  • (The network architectures and details of training are in the supplementary material of the paper. Please feel free to visit the paper.)

2. Experimental Results

2.1. Digit Generation

Left: Edge MNIST, Right: Negative MNIST
  • Left (Task A): As seen, given the same input vector, CoGAN generates the same digit in both its normal and edge-based forms.
  • Right (Task B): Similar results for positive and negative MNIST.
  • The figures plot the average pixel agreement ratios of the CoGANs with different weight-sharing configurations for Tasks A and B. The larger the pixel agreement ratio, the better the pair generation performance.

It is found that the performance was positively correlated with the number of weight-sharing layers in the generative models, but uncorrelated with the number of weight-sharing layers in the discriminative models.

  • For comparison, Conditional GAN (CGAN) is implemented. With 0 as the conditioning input, the CGAN generator synthesizes images in the 1st domain; otherwise, it generates images in the 2nd domain.
  • For Task A, CoGAN achieved an average ratio of 0.952, outperforming 0.909 achieved by the CGAN.
  • For Task B, CoGAN achieved a score of 0.967, which was much better than 0.778 achieved by the CGAN.
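One plausible formulation of the pixel agreement ratio is sketched below. Hedged: the paper's exact thresholding and image-transformation details (e.g. how the edge/negative counterpart is computed before comparison) are in its supplementary material; this toy only illustrates the metric's shape.

```python
# Illustrative pixel agreement ratio: the fraction of pixel positions where
# two (binarized) images agree, averaged over many generated pairs. Images
# are represented here as flat lists of pixel intensities in [0, 1].

def pixel_agreement_ratio(img_a, img_b, threshold=0.5):
    """Fraction of positions where the binarized images agree."""
    assert len(img_a) == len(img_b) > 0
    agree = sum((a > threshold) == (b > threshold)
                for a, b in zip(img_a, img_b))
    return agree / len(img_a)

def average_pixel_agreement(pairs):
    """Average the ratio over a collection of generated image pairs."""
    return sum(pixel_agreement_ratio(a, b) for a, b in pairs) / len(pairs)

# Toy data: a perfectly corresponding pair and a half-agreeing pair.
pair_perfect = ([0.9, 0.1, 0.8, 0.2], [0.8, 0.2, 0.9, 0.1])
pair_half    = ([0.9, 0.1, 0.8, 0.2], [0.1, 0.2, 0.9, 0.9])
```

Under this formulation, CoGAN's reported 0.952 (Task A) and 0.967 (Task B) mean that over 95% of pixel positions agree on average across generated pairs.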

2.2. Face Generation

Generation of face images with different attributes using CoGAN.
  • From top to bottom, the figure shows pair face generation results for the blond-hair, smiling, and eyeglasses attributes.
  • For each pair, the 1st row contains faces with the attribute, while the 2nd row contains corresponding faces without the attribute.

As we travel through the latent space, the faces gradually change from one person to another. These deformations are consistent across both domains.

Note that it is difficult to create a dataset with corresponding images for some attributes, such as blond hair, since the subjects would have to dye their hair.

2.3. Color and Depth Images Generation

Generation of color and depth images using CoGAN.
  • The top figure shows the results for the RGBD dataset: the 1st row contains the color images, the 2nd row contains the depth images, and the 3rd and 4th rows visualize the depth profiles under different viewpoints.
  • The bottom figure shows the results for the NYU dataset.

The CoGAN recovered the appearance–depth correspondence without supervision.

2.4. Potential Applications

Unsupervised domain adaptation performance comparison.
  • Unsupervised Domain Adaptation (UDA): UDA concerns adapting a classifier trained in one domain to classify samples in a new domain where there is no labeled example in the new domain for re-training the classifier.
  • (Domain Adaptation is not the main contribution in this paper. So, I don’t take a deep look into it. If interested, please feel free to read the paper.)
Cross-domain image transformation.
  • Cross-domain image transformation: For each pair, left is the input; right is the transformed image.
  • (Authors just want to introduce the potential applications in this part. If interested in the details, please refer to the paper.)

Later on, the authors extended CoGAN to image-to-image translation, published in 2017 NIPS. Hope I can review it in the near future.


[2016 NIPS] [CoGAN]
Coupled Generative Adversarial Networks

Generative Adversarial Network (GAN)

Image Synthesis [GAN] [CGAN] [LAPGAN] [DCGAN] [CoGAN]
Image-to-image Translation [Pix2Pix]
Super Resolution [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
Blur Detection [DMENet]
Camera Tampering Detection [Mantini’s VISAPP’19]
Video Coding
[VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]

My Other Previous Paper Readings
