MSG-GAN: Multi Scale Gradients-GAN

Samples generated by the MSG-GAN at 1024 x 1024 resolution for CelebA dataset

The code for this experiment is available at my GitHub repository here

Trained model checkpoints are available at


An excerpt from the Progressive Growing of GANs paper:

When we measure the distance between the training distribution and the generated distribution, the gradients can point to more or less random directions if the distributions do not have substantial overlap, i.e., are too easy to tell apart.
The generation of high-resolution images is difficult because higher resolution makes it easier to tell the generated images apart from training images thus drastically amplifying the gradient problem.

This highlights the core difficulty of generating high-resolution images with GANs. The paper addresses it with layer-wise training of the GAN: by first bringing the lower-resolution generated distribution close to the real distribution, it makes the gradients at the higher resolutions meaningful.

I realised that there could be another solution to this problem besides layer-wise training. My proposal is: “Gradients from the discriminator should reach all the different scales (resolutions) in the Generator”. In the following section, I explain how we can achieve this.


Instead of progressively growing the GAN, we feed the generated samples at every scale, together with the original samples at the matching scales, to the Discriminator simultaneously. This creates connections from the intermediate layers of the Generator to the intermediate layers of the Discriminator, giving an architecture that resembles a “U-Net”.

Architecture of the proposed MSG-GAN

As shown in the above diagram, I down-sample the original images to the appropriate resolutions so that they can be concatenated with the activations of the normal convolutional layers. Since there are connections directly from the layers of the Generator to the Discriminator, gradients flow to all the layers simultaneously. In the beginning, as expected, the gradients in the lower layers are more sensible than those in the higher layers, and over time they make the gradients of the higher layers more and more appropriate for matching the required distribution.
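To make the down-sampling step concrete, here is a minimal, dependency-free sketch of building the multi-scale stack of a real image. It is not the code from the repository: the post does not say which down-sampling method is used, so 2x2 average pooling is an assumption, and `avg_pool_2x2`, `build_pyramid` and the 16 x 16 toy image are hypothetical names and data chosen for illustration. In practice this would be a tensor operation in the training framework.

```python
def avg_pool_2x2(image):
    """Halve the height and width of a 2D grid (list of rows) by averaging 2x2 blocks."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def build_pyramid(image, min_size=4):
    """Return [full resolution, half, quarter, ...] down to min_size x min_size.

    Assumes a square, single-channel image whose side is a power of two.
    Average pooling is an assumption; the original post does not name the method.
    """
    scales = [image]
    while len(scales[-1]) > min_size:
        scales.append(avg_pool_2x2(scales[-1]))
    return scales


# Hypothetical 16x16 single-channel "real image" with a simple gradient pattern.
real = [[float(r + c) for c in range(16)] for r in range(16)]
pyramid = build_pyramid(real)
sizes = [len(level) for level in pyramid]  # one entry per scale, largest first
```

Each level of `pyramid` would then be paired with the Discriminator layer that operates at the same resolution, alongside the Generator output at that resolution.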

Process of synchronization of the multiple scales across the Generator and the Discriminator.

The figure above shows how meaningful gradients penetrate the Generator from the bottom up. Initially, only the lower-resolution gradients are meaningful, so the Generator starts producing good images at those resolutions first; eventually, all the scales synchronize and start producing good images. This results in more stable training at the higher resolutions.

CelebA Experiment

Explanatory video for the MSG-GAN technique. Best viewed in 1080p 60fps

I ran an experiment training the proposed network on the CelebA dataset. The video above shows the time-lapse recorded during training. As expected, the images generated at the various scales of the Generator are in fact synchronised: the lower-resolution outputs are down-sampled versions of the highest-resolution generated image.

You can find the details of the architecture in the above diagram. I used the Relativistic version of the Hinge-GAN Loss for training the network. The following diagram plots the Generator and Discriminator losses recorded during training.
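For readers unfamiliar with the loss, the sketch below shows one common formulation, the relativistic average hinge loss, on plain lists of Discriminator scores. Whether the experiment uses exactly this averaged variant is an assumption, and the function names (`ra_hinge_d_loss`, `ra_hinge_g_loss`) are hypothetical rather than taken from the repository.

```python
def mean(values):
    return sum(values) / len(values)


def hinge(x):
    """max(0, x): zero once the margin is satisfied, linear otherwise."""
    return max(0.0, x)


def relative_scores(real_scores, fake_scores):
    """Score each sample relative to the average score of the opposite class."""
    real_rel = [r - mean(fake_scores) for r in real_scores]
    fake_rel = [f - mean(real_scores) for f in fake_scores]
    return real_rel, fake_rel


def ra_hinge_d_loss(real_scores, fake_scores):
    """Discriminator: push real above average fake by a margin of 1, and fake below average real."""
    real_rel, fake_rel = relative_scores(real_scores, fake_scores)
    return mean([hinge(1.0 - x) for x in real_rel]) + mean([hinge(1.0 + x) for x in fake_rel])


def ra_hinge_g_loss(real_scores, fake_scores):
    """Generator: the same two terms with the roles of real and fake swapped."""
    real_rel, fake_rel = relative_scores(real_scores, fake_scores)
    return mean([hinge(1.0 + x) for x in real_rel]) + mean([hinge(1.0 - x) for x in fake_rel])


# Hypothetical scores from a Discriminator that separates the two classes well.
d_loss = ra_hinge_d_loss([2.0, 3.0], [-2.0, -3.0])
g_loss = ra_hinge_g_loss([2.0, 3.0], [-2.0, -3.0])
```

With these toy scores the Discriminator's margins are already satisfied, so `d_loss` is zero while `g_loss` is large; when the Discriminator cannot tell the classes apart (all scores equal), both losses evaluate to 2.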

Loss plots of training.

There are some aberrations in the beginning, but after that the training stays smooth.

I was able to generate images of resolution 64 x 64 on my GPU with a relatively small network, but I’d like to see what the results are at higher resolutions. You are welcome to try this out on your GPU and open a PR with your results. Just increase the Depth in the code from 5 (which corresponds to 64 x 64) to 9 (for 1024 x 1024) and the latent_size from 256 to 512 to match the architecture of ProGAN.
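The relationship between depth and output resolution can be sketched as follows. It assumes a 4 x 4 first block with a 2x up-sampling for every further block, which is consistent with the two values quoted above (5 gives 64 x 64, 9 gives 1024 x 1024) but is not confirmed by the post itself; `resolution_for_depth` and `scales_for_depth` are hypothetical helpers, not functions from the repository.

```python
def resolution_for_depth(depth, base_resolution=4):
    """Side length of the final output for a given depth.

    Assumes the first block emits base_resolution x base_resolution images
    and each additional block doubles the resolution.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return base_resolution * 2 ** (depth - 1)


def scales_for_depth(depth):
    """All intermediate resolutions that would receive gradients, smallest first."""
    return [resolution_for_depth(d) for d in range(1, depth + 1)]


small = resolution_for_depth(5)   # the 64 x 64 setting described above
large = resolution_for_depth(9)   # the 1024 x 1024 setting described above
all_scales = scales_for_depth(5)
```

Under this assumption, a depth of 5 involves five scales (4, 8, 16, 32 and 64 pixels per side), each with its own connection from the Generator to the Discriminator.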

One more detail I’d like to mention here is that in my architecture, I do not use any of the stability techniques proposed in the ProGAN paper, namely: “Pixel-norm”, “equalized learning rate” and “Exponential moving weight average for generator”. I have only used the MinibatchStd layer from the ProGAN paper and have applied Spectral Normalization to all the convolutional weights in the network.

Final Thoughts

I believe that this approach of using multi-scale gradients from the discriminator can be used for generating higher-resolution images, since the training of all the scales is in sync and shows traits that are remarkably similar to those of ProGAN.

I will be working on using the Full-attention layer from my previous FAGAN blog (which is a variant of SAGAN) in this architecture for further improvement. Please let me know what you think about this technique.

Feel free to provide any feedback / improvements / suggestions. Contributions to the code / technique are most welcome.

Thank you for reading!