Day 5: Conditional Image Generation with PixelCNN Decoders

Francisco Ingham
A paper a day avoids neuron decay
5 min read · Mar 16, 2019

[Jun 16, 2016] Creating images from a vector

Full-blast to image generation, pixel-by-pixel. It’s not rocket science.

TL;DR

This paper explores the potential for conditional image modelling by adapting and improving a convolutional variant of the PixelRNN architecture.

PixelCNN can be conditioned on a vector to control what it generates. This vector can be either a one-hot encoding of an ImageNet class or an embedding produced by a convolutional network trained on face images.

If you haven’t already, please refer to my PixelRNN post to learn about the original PixelCNN architecture.

Gated PixelCNN

p(x) = ∏ᵢ p(xᵢ | x₁, …, xᵢ₋₁)

The conditional distribution is factorized the same way as in PixelRNN

PixelCNNs model the probability of a pixel’s color given the colors of the previous pixels. In particular, for each pixel and each channel (R, G, B) they estimate the probability of each value of the channel given the previous pixels and the previous channels of the current pixel (e.g. when predicting G they consider all previous pixels plus the R channel of the current pixel). To prevent a pixel from seeing later pixels, the positions that cannot be used are masked in the following fashion:

Example mask for the prediction of the middle pixel
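
To make the masking concrete, here is a minimal PyTorch sketch of a masked convolution (my own illustration, not the authors’ code). It only masks spatial positions; the per-channel R/G/B masking described above is omitted for brevity:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    # Sketch only: mask type 'A' also hides the centre pixel (first layer),
    # mask type 'B' allows the centre pixel (later layers).
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        mask = torch.ones_like(self.weight)                    # (out, in, kH, kW)
        _, _, kH, kW = self.weight.shape
        # Hide pixels to the right of the centre in the centre row...
        mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0
        # ...and every row below the centre.
        mask[:, :, kH // 2 + 1:] = 0
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask                          # zero the 'future' weights
        return super().forward(x)

# Usage: the first layer must not see the pixel it is predicting ('A' mask).
layer = MaskedConv2d('A', in_channels=3, out_channels=64, kernel_size=7, padding=3)
out = layer(torch.randn(1, 3, 32, 32))                         # same spatial size as input
```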

Gated Convolutional Layer

LSTMs have generally outperformed CNNs as generative models. In fact, the same authors released the paper introducing PixelRNN, where they presented an LSTM approach to image generation. This is mainly due to two advantages of RNNs, which the authors counteracted so that CNNs could be used instead:

  1. RNN advantage: recurrent connections allow every layer to access the information from all of the previous pixels. Workaround: stacking enough convolutional layers makes the receptive field large enough to cover all the previous pixels.
  2. RNN advantage: RNNs contain multiplicative units (gates) which allow the model to build highly complex representations of the input. Workaround: replace the ReLU activation function with the following gated activation unit (a minimal sketch follows the legend below):

y = tanh(W_{k,f} * x) ⦿ σ(W_{k,g} * x)

*: Convolution operator, ⦿: element-wise multiplication operator, σ: sigmoid function, k: layer index, f, g: filter and gate, W: learnable convolution filter
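
Concretely (my own illustration, with the two filters fused into one convolution and without masking, for brevity):

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    # Sketch of y = tanh(W_f * x) ⦿ σ(W_g * x): one convolution with 2*channels
    # outputs plays the role of the two filters W_{k,f} and W_{k,g}.
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        f, g = self.conv(x).chunk(2, dim=1)     # split into filter and gate branches
        return torch.tanh(f) * torch.sigmoid(g)

y = GatedActivation(channels=64)(torch.randn(1, 64, 32, 32))
```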

Blind spot in the receptive field

When traditional stacked masked convolutions are used, there is a significant area which the layers never get to see or use (the blind spot in the following image). To solve this, the authors generate the pixels with two stacks: a horizontal stack (which conditions only on the pixels to the left in the current row) and a vertical stack (which conditions on all the rows above). Each layer in the horizontal stack takes as input the output of the previous horizontal-stack layer as well as the output of the corresponding vertical-stack layer.

Blind spot and vertical stack

In the following figure we can see the architecture which has the vertical stack on the left and the horizontal stack on the right. There are three main differences between the two:

  1. The vertical stack ‘feeds’ its output into the horizontal stack
  2. The horizontal stack has a skip-connection (1)
  3. The masked convolutions are different: the vertical stack uses a mask which excludes the rows below the current one, while the horizontal stack uses a mask which excludes the pixels to the right of the current pixel (a sketch of such a layer follows the figure below)

Layer of the Gated PixelCNN. Convolution operations are shown in green, element-wise multiplications and additions are shown in red.
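
Here is a rough sketch of one such layer (my own simplified interpretation, not the reference implementation): it uses shifted/cropped convolutions instead of explicit masks, fuses the filter and gate convolutions, and omits the stricter first-layer mask:

```python
import torch
import torch.nn as nn

class GatedPixelCNNLayer(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        # Vertical stack: a kernel covering only the rows above the current pixel.
        self.v_conv = nn.Conv2d(channels, 2 * channels, (k // 2 + 1, k),
                                padding=(k // 2 + 1, k // 2))
        # 1x1 convolution that 'feeds' the vertical features into the horizontal stack.
        self.v_to_h = nn.Conv2d(2 * channels, 2 * channels, 1)
        # Horizontal stack: a kernel covering the pixels to the left in the current row.
        self.h_conv = nn.Conv2d(channels, 2 * channels, (1, k // 2 + 1),
                                padding=(0, k // 2))
        self.h_out = nn.Conv2d(channels, channels, 1)

    @staticmethod
    def _gate(t):
        f, g = t.chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)

    def forward(self, v, h):
        H, W = v.size(2), v.size(3)
        # Vertical stack: cropped so each position only sees the rows above it.
        v_feat = self.v_conv(v)[:, :, :H, :]
        v_out = self._gate(v_feat)
        # Horizontal stack: pixels to the left (plus the current position, as in
        # the later layers), combined with the vertical stack's features.
        h_feat = self.h_conv(h)[:, :, :, :W] + self.v_to_h(v_feat)
        h_out = h + self.h_out(self._gate(h_feat))   # skip-connection (1)
        return v_out, h_out

v = h = torch.randn(1, 64, 32, 32)
v, h = GatedPixelCNNLayer(channels=64)(v, h)
```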

Conditional PixelCNN

A conditional PixelCNN is similar to a conditional WaveNet (see my Day 3 blog post). Basically, it models the same distribution, but conditioned on a new input h:

p(x | h) = ∏ᵢ p(xᵢ | x₁, …, xᵢ₋₁, h)

Distribution for the conditional PixelCNN

What could h be? h could be, for example, a one-hot encoding of a class, or an embedding describing an object we want the generated image to contain in a specific pose or with a specific color. In other words, we can condition the generated image.
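
In the paper, h enters every gated layer by adding a learned linear projection of it to both branches, i.e. y = tanh(W_{k,f} * x + V_{k,f}ᵀ h) ⦿ σ(W_{k,g} * x + V_{k,g}ᵀ h). A minimal sketch (my own, with an unmasked convolution for brevity):

```python
import torch
import torch.nn as nn

class ConditionalGate(nn.Module):
    # Sketch of the conditional gated unit: the projection of h is broadcast
    # over all spatial positions of the feature map.
    def __init__(self, channels, h_dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)
        self.proj = nn.Linear(h_dim, 2 * channels)   # plays the role of V_f and V_g

    def forward(self, x, h):
        f, g = (self.conv(x) + self.proj(h)[:, :, None, None]).chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)

# Usage: condition on a one-hot ImageNet class vector (1000 classes).
h = torch.zeros(1, 1000); h[0, 42] = 1.0
y = ConditionalGate(channels=64, h_dim=1000)(torch.randn(1, 64, 32, 32), h)
```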

PixelCNN Auto-Encoders

This potential for conditional generation can be applied to auto-encoders. You can use the PixelCNN as a decoder, and the result should be quite good since an unconditional PixelCNN is already a great image generator.
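
A rough sketch of that idea, reusing the ConditionalGate sketch from above (all names here are my own placeholders; a real decoder would use masked, two-stack gated layers and many more of them):

```python
import torch
import torch.nn as nn

class PixelCNNAutoEncoder(nn.Module):
    def __init__(self, m=10, channels=64, n_layers=4):
        super().__init__()
        # Encoder: compress the image into an m-dimensional embedding h.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, m),
        )
        # Decoder: conditional PixelCNN-style layers, all conditioned on h.
        self.inp = nn.Conv2d(3, channels, 1)
        self.layers = nn.ModuleList(
            [ConditionalGate(channels, h_dim=m) for _ in range(n_layers)])
        self.out = nn.Conv2d(channels, 3 * 256, 1)   # a 256-way softmax per colour channel

    def forward(self, x):
        h = self.encoder(x)               # the embedding of size m
        y = self.inp(x)
        for layer in self.layers:
            y = layer(y, h)
        return self.out(y)                # trained with a per-pixel negative log-likelihood
```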

Results

On CIFAR-10 the model did much better than the original PixelCNN and only slightly underperformed PixelRNN.

Results on CIFAR-10

On the ImageNet dataset the model achieved state-of-the-art performance, beating PixelRNN. The authors argue this happened because the dataset is larger and previous models were underfitting, whereas they used a large model with 20 layers.

Results on ImageNet

Conditioning on an ImageNet class

The authors conditioned on several ImageNet classes (a one-hot encoding of the class) and recorded the results. The log-likelihood was not better when conditioning, but the images look much better and clearly relate to the class in question.

Conditioning on Portrait Embeddings

The authors also conditioned on the top-layer feature map of a network trained on faces. The image below shows the portraits the model generated from a series of input embeddings. As we can see, the embeddings captured the important facial features, so the model could create similar faces in different poses, with different lighting, etc.

New portraits generated from latent representations

PixelCNN Auto-Encoder

The PixelCNN was also used as a decoder in an auto-encoder. The results were compared to a standard convolutional auto-encoder trained to minimize MSE. As we can see, the PixelCNN auto-encoder creates similar images but with the objects in other positions and with other sizes. This suggests that the learned encoding (embedding) is probably quite different.

From left to right: original image, reconstruction by the standard auto-encoder, reconstruction with the PixelCNN. m: size of the embedding.

Notes

(1) The authors do mention that they tried adding a skip-connection to the vertical stack but performance was not affected.

References

Conditional Image Generation with PixelCNN Decoders; van den Oord et al., Google DeepMind, 2016

Image source: pexels.com
