Generating Images

Boda Vijay
Published in Analytics Vidhya · Jun 19, 2021 · 10 min read

Introduction

This blog deals with the problem of generating images, together with one application. The idea is to make the model reproduce a given sample image, and the application is to use the same architecture to complete an occluded (half-filled) image. A basic encoder-decoder model and a deep CNN encoder-decoder model are implemented from scratch, trained, and analysed on three datasets. The analysis also looks for a good hidden-representation size of the image for each dataset, which can then be used for the application.

Related Work

Some well-known approaches for image generation are Autoencoders, Generative Adversarial Networks (GANs), Auto-Regressive models (PixelRNN, PixelCNN), and DRAW.

Auto-Regressive models attempt to model the data distribution by estimating the data density. Intuitively, they predict the next element of a sequence given the previous elements.

Generative Adversarial Networks (GANs) are generative models. Two models are trained simultaneously by an adversarial process: a generator learns to produce realistic images, while a discriminator learns to distinguish between real and fake images. The discriminator’s role is to decide whether an image comes from the real dataset or from the generator.

The Deep Recurrent Attentive Writer (DRAW) architecture reflects a step toward a more natural mode of image creation, in which parts of a scene are created independently of one another and approximate sketches are refined over time. A pair of recurrent neural networks forms the foundation of the DRAW architecture: an encoder network compresses the actual images presented during training, and a decoder network reconstructs images after receiving the codes. The combined system is trained end to end with stochastic gradient descent, with the loss function being a variational upper bound on the negative log-likelihood of the data. Because both the encoder and the decoder are recurrent networks, DRAW exchanges a sequence of code samples between them: the previous outputs of the decoder are fed back to the encoder, which lets it send codes based on the decoder’s behaviour so far. The network can decide what to write, where to write, and where to read.

Filling in an occluded image depends on the previously seen pixels; in other words, it is a sequence-based problem. A plain CNN model may not capture this sequence, whereas PixelRNN is a model that predicts the next pixel based on the previous pixels.

We will be using the deep CNN encoder-decoder model. Since filling the occluded image is a sequence problem, the model could also be modified to keep a CNN encoder and use an LSTM as the decoder; refer to Image Denoising and Restoration with CNN-LSTM Encoder-Decoder with Direct Attention [1].

Dataset Description:

The three datasets used are MNIST digits, Labelled Faces in the Wild (LFW) deep-funnelled, and CIFAR-100. We are not concerned with the labels of any of these datasets. The MNIST data contains 60,000 training samples. The LFW dataset contains 13,233 images of 5,749 people’s faces. The CIFAR-100 data contains 60,000 samples, with 50,000 for training and 10,000 for testing. These datasets are used for various image-recognition tasks, but we will focus mainly on recreating the images after storing them in a smaller dimension.
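The post does not show the data-loading code. A minimal sketch, assuming the tensorflow.keras built-in loaders for MNIST and CIFAR-100 and scikit-learn’s fetch_lfw_people for the LFW deep-funnelled images, could look like this:

```python
# Hypothetical loading sketch; labels are discarded since we only reconstruct images.
from tensorflow.keras.datasets import mnist, cifar100
from sklearn.datasets import fetch_lfw_people

(x_train_mnist, _), (x_test_mnist, _) = mnist.load_data()      # 28x28 grayscale digits
(x_train_cifar, _), (x_test_cifar, _) = cifar100.load_data()   # 32x32x3 colour images

lfw = fetch_lfw_people(funneled=True, resize=0.5, color=True)  # 13,233 face images
x_lfw = lfw.images.astype("float32")

# Scale pixel values to [0, 1] before training.
x_train_mnist = x_train_mnist.astype("float32") / 255.0
x_train_cifar = x_train_cifar.astype("float32") / 255.0
if x_lfw.max() > 1.0:   # fetch_lfw_people may already return scaled values
    x_lfw /= 255.0
```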

Model:

For the analysis, a basic encoder-decoder model is created along with a deep CNN-based model.

Model Architecture of Encoder and Decoder

Basic Encoder-Decoder Model:

The basic encoder-decoder model (basic model) consists of an encoder and a decoder. The encoder has an input layer, a Flatten layer, and a Dense layer. We feed the image to the input layer, flatten it, and pass it to the Dense layer, whose size specifies the latent dimension used for the internal representation of the image. The decoder has an input layer, a Dense layer, and a Reshape layer. The size of its Dense layer equals the product of the dimensions of the input image. The decoder takes the latent vector in its input layer, feeds it to the Dense layer, and finally reshapes the output back into a three-dimensional image.

The Encoder plot of the basic model is

Fig 1.1 Encoder of Basic Model

Decoder plot of the basic model

Fig 1.2 Decoder of the basic model

Code for the basic model is as follows
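(A minimal Keras sketch of the basic model described above, assuming a 28x28x1 MNIST-style input; the original gist is not reproduced here.)

```python
import numpy as np
from tensorflow.keras import layers, Model

image_shape = (28, 28, 1)   # e.g. MNIST; use (32, 32, 3) for CIFAR
latent_dim = 32             # size of the hidden representation

# Encoder: Input -> Flatten -> Dense(latent_dim)
enc_input = layers.Input(shape=image_shape)
enc_output = layers.Dense(latent_dim)(layers.Flatten()(enc_input))
encoder = Model(enc_input, enc_output, name="basic_encoder")

# Decoder: Input(latent_dim) -> Dense(H*W*C) -> Reshape back to the image shape
dec_input = layers.Input(shape=(latent_dim,))
x = layers.Dense(int(np.prod(image_shape)))(dec_input)
dec_output = layers.Reshape(image_shape)(x)
decoder = Model(dec_input, dec_output, name="basic_decoder")
```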

Deep CNN Model:

The architecture of this model is the same as the basic model, i.e. encoder-decoder. In the encoder, we stack four convolutional layers with pooling layers, flatten the result, and finish with a Dense layer whose size is the latent dimension of the representation. The activation function used is ‘relu’. The kernel size for each convolutional layer is (3, 3), with padding set to ‘same’. The numbers of output channels of the convolutional layers are 32, 64, 128, and 256, respectively.

In the decoder, we have to undo the convolution operations, that is, apply transpose convolutions. Intuitively, a convolutional layer takes a patch of the image and produces a number, whereas a transpose convolution takes a number and produces a patch of the image. The decoder starts with an input layer whose shape matches the latent dimension specified in the encoder. Next comes a Dense layer with the number of units equal to the product of the original image dimensions, whose output is reshaped and fed to the transpose convolutional layers. We stack four Conv2DTranspose layers with 128, 64, 32, and 3 output channels. ‘relu’ is used as the activation function for the first three layers; since the last layer produces the final image, we do not apply an activation function there.

Code and plots (Fig. 2) for the deep CNN model are as follows:
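(A sketch of the deep CNN model as described above, assuming a 32x32x3 input; the original gist is not reproduced here.)

```python
import numpy as np
from tensorflow.keras import layers, Model

image_shape = (32, 32, 3)
latent_dim = 64

# Encoder: four Conv2D(3x3, 'same', relu) + MaxPooling2D blocks, then Flatten and Dense(latent_dim).
enc_input = layers.Input(shape=image_shape)
x = enc_input
for filters in (32, 64, 128, 256):
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
enc_output = layers.Dense(latent_dim)(layers.Flatten()(x))
encoder = Model(enc_input, enc_output, name="deep_encoder")

# Decoder: Dense back to the number of pixels, Reshape, then four Conv2DTranspose layers.
dec_input = layers.Input(shape=(latent_dim,))
x = layers.Dense(int(np.prod(image_shape)), activation="relu")(dec_input)
x = layers.Reshape(image_shape)(x)
for filters in (128, 64, 32):
    x = layers.Conv2DTranspose(filters, (3, 3), padding="same", activation="relu")(x)
dec_output = layers.Conv2DTranspose(image_shape[-1], (3, 3), padding="same")(x)  # no activation
decoder = Model(dec_input, dec_output, name="deep_decoder")
```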

The encoder of the deep CNN model

Fig 2.1 The encoder of the deep CNN model

The decoder of the deep CNN model

Fig 2.2 The decoder of the deep CNN model

Since we are reconstructing the images, we pass the training data as both the input and the target of the model while training.
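As a sketch (reusing the names from the model code above; x_train stands for whichever training set is being used), the two parts can be joined and trained like this:

```python
# Join encoder and decoder into a single autoencoder and train it to reproduce its input.
from tensorflow.keras import Model

autoencoder = Model(enc_input, decoder(encoder(enc_input)), name="autoencoder")
autoencoder.compile(optimizer="adam", loss="mse")   # optimizer and metric as in the Results section

history = autoencoder.fit(
    x_train, x_train,          # the same images act as input and target
    validation_split=0.1,      # illustrative hyperparameters
    epochs=20,
    batch_size=128,
)
```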

After training these models, we can recreate the images for the given samples. The application part is to complete the occluded (half-filled) image. For this, we have created a function that generates occluded images, as shown in Fig. 3. The code for this task is as follows:
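(The exact function from the repository is not reproduced here; the sketch below assumes the lower half of every image is blanked out, and the array names are the illustrative ones used above.)

```python
def occlude(images):
    """Return copies of `images` (N, H, W[, C]) with the bottom half set to zero."""
    occluded = images.copy()
    height = occluded.shape[1]
    occluded[:, height // 2:, ...] = 0.0
    return occluded

x_train_occluded = occlude(x_train)
x_test_occluded = occlude(x_test)
```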

Fig. 3, Occluded images of (Top) MNIST data, (Middle) LFW data, (Bottom) CIFAR data

So, while training the model for this task, we give the occluded image as the input and the original image as the target.
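With the names from the sketches above, the corresponding training call would look roughly like:

```python
# Occluded images in, original images out.
autoencoder.fit(x_train_occluded, x_train,
                validation_split=0.1, epochs=20, batch_size=128)
```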

During inference, we first pass the input image to the encoder and then feed the encoder’s output to the decoder. The decoder predicts the required output for the problem we are solving, and we plot those outputs.
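A small inference-and-plotting sketch, again reusing the hypothetical names from above, could be:

```python
import numpy as np
import matplotlib.pyplot as plt

# Encode, then decode, the occluded test images.
latent_codes = encoder.predict(x_test_occluded)
completed = np.clip(decoder.predict(latent_codes), 0.0, 1.0)

# Show occluded input, model output, and original side by side.
n = 5
fig, axes = plt.subplots(3, n, figsize=(2 * n, 6))
for i in range(n):
    axes[0, i].imshow(x_test_occluded[i].squeeze(), cmap="gray")
    axes[1, i].imshow(completed[i].squeeze(), cmap="gray")
    axes[2, i].imshow(x_test[i].squeeze(), cmap="gray")
for ax in axes.ravel():
    ax.axis("off")
plt.show()
```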

Results:

The analysis is performed on the three datasets mentioned above. We want to find the smallest latent dimension that still gives good image generation. Below we include reconstructed images, completions of half-filled images, and plots of training and validation errors. The error metric is the mean squared error, and “adam” is the optimizer. We show all of these for different latent sizes and different datasets, for both the basic and the deep CNN model.

We will start with MNIST data; some samples are

Fig 4, MNIST data samples.

Results of Basic Model for MNIST

The recreated images (Fig. 5) for the basic model with latent size 32 have MSE = 0.017 on test data.

Reconstructed, Latent size = 32

Fig 5

Occluded image, Latent size: 32

Completing the occluded images (Fig. 6) with the basic model and latent size 32 gives MSE = 0.023 on test data.

Fig 6

Results of Deep Model for MNIST

Reconstructed, Latent size: 32

The recreated images (Fig. 7) for the deep model with latent size 32 have MSE = 0.0030 on test data.

Fig 7

Occluded, Latent size: 32

Completing the occluded images (Fig. 8) with the deep model and latent size 32 gives MSE = 0.015 on test data.

Fig 8

Occluded, Latent size: 128

Completing the occluded images (Fig. 9) with the deep model and latent size 128 gives MSE = 0.0147 on test data.

Fig 9

Occluded, Latent size: 512

Completing the occluded images (Fig. 10) with the deep model and latent size 512 gives MSE = 0.0146 on test data.

Results of Basic Model for LFW

Now we show images from the LFW dataset; some samples are

Fig 11. Original Samples of LFW data

Reconstructed, Latent size: 32

The recreated images (Fig. 12) for the basic model with latent size 32 have MSE = 0.0056 on test data.

Fig 12

Occluded, Latent size: 32

Completing the occluded images (Fig. 13) with the basic model and latent size 32 gives MSE = 0.0078 on test data.

Fig 13. (LEFT) (Top) Occluded samples, (Middle) filled samples (Bottom) Original Samples for the basic model, (RIGHT) Training and Validation error curves

Results of Deep Model for LFW

Reconstructed, Latent size: 16

The recreated images (Fig. 14) for the deep model with latent size 16 have MSE = 0.0068 on test data; as you can see, the basic model performs better than the deep model when the latent dimension of the deep model is smaller than that of the basic model.

Fig 14

Reconstructed, Latent size: 32

The recreated images (Fig. 15) for the deep model with latent size 32 have MSE = 0.0052 on test data.

Fig 15

Reconstructed, Latent size: 64

The recreated images (Fig. 16) for the deep model with latent size 64 have MSE = 0.0038 on test data.

Fig 16

Occluded, Latent size: 64, 128, 512

Completing the occluded images (Fig. 17) with the deep model gives MSE = 0.0080 for latent size 64, 0.0076 for size 128, and 0.0078 for size 512 on test data.

Fig 17. (LEFT) (Top) Occluded samples, (Middle) completely filled samples (Bottom) Original Samples for the deep model, for latent sizes 64, 128, 512 respectively

Finally, we show results for CIFAR data; some samples are

Fig. 18 Examples of CIFAR-100 data

Results of Basic Model for CIFAR

Reconstructed, Latent size: 32

The recreated images (Fig. 19) for the basic model with latent size 32 have MSE = 0.0132 on test data.

Occluded, Latent size: 32

Completing the occluded images (Fig. 21) with the basic model and latent size 32 gives MSE = 0.0232 on test data.

Results of Deep Model for CIFAR

Reconstructed, Latent size: 32

The recreated images (Fig. 20) for the deep model with latent size 32 have MSE = 0.011 on test data.

Occluded, Latent size: 64

Completing the occluded images (Fig. 22) with the deep model and latent size 64 gives MSE = 0.0234 on test data.

Occluded, Latent size: 128

Completing the occluded images (Fig. 23) with the deep model and latent size 128 gives MSE = 0.0231 on test data.

Conclusion: For the MNIST dataset, the image recreated by the deep model with latent size 32 is better than that of the basic model. The differences between the MSE values of the deep model for the tested latent sizes are very small, so the smallest useful representation size of MNIST data for this model is 32, i.e., we can complete the occluded image almost perfectly with this size. For the LFW dataset, recreation is best with the deep model and latent size 64, while for the occluded-image problem the basic model with latent size 32 is a good choice. Finally, for the CIFAR data, recreation should be done with the deep model with latent size 32, and the basic model performs about as well as the deep model for completing the occluded images. Since completion is a sequence problem, this work can be extended by using sequence models such as an LSTM in the decoder to get even better results.

All the code is available at https://github.com/bodavijay24/Generating-Images

References:

  1. Image Denoising and Restoration with CNN-LSTM Encoder-Decoder with Direct Attention (https://arxiv.org/pdf/1801.05141.pdf)
  2. Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. DRAW: A Recurrent Neural Network for Image Generation, 2015.
  3. Radford, A., Metz, L., and Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2016.
  4. van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. Conditional Image Generation with PixelCNN Decoders, 2016.
  5. van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel Recurrent Neural Networks (http://proceedings.mlr.press/v48/oord16.pdf)
