Generative modeling using PixelCNN with codes explained

Autoregressive models for generating images

Mehul Gupta
Data Science in your pocket
7 min read · Jan 16, 2023


Resuming my generative modeling blog series with my 101st post, after covering

Basics of generative modeling

Naive Bayes as a Generative model

Variational Autoencoders

Basic GAN architecture, and

CycleGans

My debut book “LangChain in your Pocket” is out now

I will be exploring PixelCNN for generating images on the Fashion MNIST dataset, alongside the code.

PixelCNN belongs to the family of autoregressive models.

Wait a minute, you must have heard the term ‘autoregression’ elsewhere if you have worked with time series. Try to recollect the statistical time series models (if you have read about them): AR (AutoRegression), or ARMA, a combination of AutoRegression and Moving Average.

So what actually is AutoRegression?

So, assuming you already know regression, where a dependent variable is modeled using one or more independent variables:

y = m*A + m1*B + m2*D ... + C

where m, m1, … mn are the coefficients and A, B, D, etc. are the independent variables.

Now in the case of AutoRegression

y = m*yₜ₋₁ + m1*yₜ₋₂ + m2*yₜ₋₃ ... + C

where yₜ₋₁, yₜ₋₂, yₜ₋₃ are past values of y. Hence, we are trying to predict the current value using the past values.
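To make this concrete, here is a toy sketch of one AR(3) prediction step. The coefficients below are made up purely for illustration, not fitted to any data:

y_past = [1.0, 1.2, 1.5]            # y(t-3), y(t-2), y(t-1)
m, m1, m2, C = 0.5, 0.3, 0.1, 0.05  # hypothetical coefficients and constant

# predict the current value from the three previous values
y_next = m * y_past[-1] + m1 * y_past[-2] + m2 * y_past[-3] + C
print(y_next)  # 0.5*1.5 + 0.3*1.2 + 0.1*1.0 + 0.05 = 1.26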

Coming back to generative modeling, can the idea of autoregression be applied to image generation as well? That means:

  • We start considering images as a sequence of pixels (flatten the image, and this makes sense)
  • Using the past pixels, predict the next pixel (something similar to what LSTMs do)

This is what the idea of PixelCNN is based on.

PixelCNN vs CNN?

So, in a conventional CNN, we don't consider the image as a sequence of pixels; we take the whole image as input, apply filters in the Conv layers, max/avg pooling, and other operations, and get to the output.

In the case of PixelCNN, we do the same things but use a masked version of the kernel in the Conv layers.

What is Masking?

We apply each filter in the Conv layer not to all values but only to a few, hence masking certain values from the model.

So if we,

  • Flatten the image (append each row one behind the other to form a 1D array).

Note: This is just to help you visualize better; we won't actually be flattening the image.

  • Predict one pixel at a time (similar to an LSTM) using the previously predicted pixels. So if you are predicting the 25th pixel in the 1D array, use only the 24 pixels already predicted.

How do you implement convolution filters in PixelCNNs?

If you notice, while applying a filter on a particular pixel in a conventional CNN, we use both the pixels behind it and the pixels ahead of it (visualize the image as a 1D array). In the case of PixelCNN, we set the filter value to 0 (hence masked) for any pixel ahead of the current pixel. This is how masking takes place.

Also, we will be using residual connections in PixelCNN.

Residual connections are nothing but adding the input of a particular layer to the output of that layer. This is done to avoid information loss.

general residual connection
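In Keras, a minimal sketch of such a connection looks like this (illustrative only, not the exact block we build later; the shapes must match for the add, hence padding="same" and equal filter counts):

from tensorflow import keras

# A minimal residual connection: the layer's input is added back to its output
x_in = keras.Input(shape=(28, 28, 64))
x = keras.layers.Conv2D(64, kernel_size=3, padding="same", activation="relu")(x_in)
x_out = keras.layers.add([x_in, x])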

So, now that we are clear on PixelCNN, let's get down to the code and a demo.

We will be using the sample code provided in the Keras documentation, with the Fashion MNIST data.

1. Importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tqdm import tqdm

Download the Fashion MNIST data from here and extract it before moving ahead.

2. Preparing data

num_classes = 10
input_shape = (28, 28, 1)
n_residual_blocks = 5

# load the CSV (one row per image) and reshape into 28x28x1 images
train = pd.read_csv('fashion-mnist_train.csv')
data = train.drop(['label'], axis=1).to_numpy().reshape(-1, 28, 28, 1)

# binarize: pixels below 0.33*256 ≈ 84 become 0, the rest become 1
data = np.where(data < (0.33 * 256), 0, 1)
data = data.astype(np.uint8)

A brief about the data

  • Fashion MNIST has 10 classes and 60k images, each of shape 28x28
  • As we have the data stored in a CSV, it requires reshaping
  • As pixel values range from 0–255, we binarize each pixel to 0 or 1 based on a threshold (0 if below 0.33 × 256 ≈ 84, else 1) and set the datatype to unsigned int to save memory.

The samples may look something like this
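If you want to check on your own machine, here is a quick (illustrative) way to plot a few of the binarized samples:

import matplotlib.pyplot as plt

# show the first 4 binarized training images
fig, ax = plt.subplots(1, 4, figsize=(10, 3))
for i in range(4):
    ax[i].imshow(data[i].squeeze(), cmap="gray")
    ax[i].axis("off")
plt.show()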

3. PixelCNN architecture

We will declare 2 classes: one for the masked convolution and the other for the residual connection.

A. PixelConvLayer

class PixelConvLayer(layers.Layer):
    def __init__(self, mask_type, **kwargs):
        super().__init__()
        self.mask_type = mask_type
        self.conv = layers.Conv2D(**kwargs)

    def build(self, input_shape):
        # Build the conv2d layer to initialize kernel variables
        self.conv.build(input_shape)
        # Use the initialized kernel to create the mask
        kernel_shape = self.conv.kernel.get_shape()
        self.mask = np.zeros(shape=kernel_shape)
        # all rows above the center are visible
        self.mask[: kernel_shape[0] // 2, ...] = 1.0
        # in the center row, only the columns left of the center are visible
        self.mask[kernel_shape[0] // 2, : kernel_shape[1] // 2, ...] = 1.0
        # mask type "B" also lets the layer see the center pixel itself
        if self.mask_type == "B":
            self.mask[kernel_shape[0] // 2, kernel_shape[1] // 2, ...] = 1.0

    def call(self, inputs):
        # zero out the "future" part of the kernel before convolving
        self.conv.kernel.assign(self.conv.kernel * self.mask)
        return self.conv(inputs)

Summarizing the above code block

  • A mask is nothing but a 0/1 matrix that we multiply with the kernel used in the Conv layer
  • We set some values of the mask to 0 and others to 1 based on the kernel size and the mask type.
  • In the call() function, we assign the Conv2D kernel as actual_kernel × mask so as to hide some values of the kernel, leading to masking.

Wait a minute, what is a mask type?

As you must have observed, there is something called mask_type being used. What's that? Two types of masks are used in PixelCNN:

Type A: While applying masking to the kernel, exclude the center-most pixel (hence center most value is 0)

Type B: While applying masking to the kernel, include the center-most pixel (hence center most value is 1)

For example, for a 7x7 kernel, Mask A and Mask B look like the images below.

Observe how Mask A has a value of 0 at the center, but Mask B does not.
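If you want to see the two mask types for yourself, here is a small sketch that mirrors the masking logic of the layer above over the spatial dimensions only (the real kernel mask also spans the input/output channel dimensions):

import numpy as np

def make_mask(size, mask_type):
    mask = np.zeros((size, size))
    mask[: size // 2, :] = 1.0            # all rows above the center
    mask[size // 2, : size // 2] = 1.0    # center row, left of the center
    if mask_type == "B":
        mask[size // 2, size // 2] = 1.0  # mask B also keeps the center pixel
    return mask

print(make_mask(7, "A"))
print(make_mask(7, "B"))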

4. Residual blocks

class ResidualBlock(keras.layers.Layer):
    def __init__(self, filters, **kwargs):
        super().__init__(**kwargs)
        self.conv1 = keras.layers.Conv2D(
            filters=filters, kernel_size=1, activation="relu"
        )
        self.pixel_conv = PixelConvLayer(
            mask_type="B",
            filters=filters // 2,
            kernel_size=3,
            activation="relu",
            padding="same",
        )
        self.conv2 = keras.layers.Conv2D(
            filters=filters, kernel_size=1, activation="relu"
        )

    def call(self, inputs):
        x = self.conv1(inputs)
        x = self.pixel_conv(x)
        x = self.conv2(x)
        # skip connection: add the block's input back to its output
        return keras.layers.add([inputs, x])

For the residual connection,

  • We pass the input through a Conv layer → masked Conv → Conv layer
  • The final output is then added to the original input of the 1st Conv layer
  • We have 5 such blocks in PixelCNN (visible in the final architecture summary below)
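As a quick sanity check (assuming the ResidualBlock class above has been defined), you can confirm that the block preserves the input shape, which is exactly what makes the final add() valid:

# the block keeps spatial size and channel count, so input + output can be added
block = ResidualBlock(filters=128)
print(block(tf.zeros((1, 28, 28, 128))).shape)  # expected: (1, 28, 28, 128)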

Now, as we have defined both the masked Conv and the residual connection, it's time to integrate everything to form the PixelCNN.

5. PixelCNN architecture

inputs = keras.Input(shape=input_shape)
x = PixelConvLayer(
    mask_type="A", filters=128, kernel_size=7, activation="relu", padding="same"
)(inputs)

for _ in range(n_residual_blocks):
    x = ResidualBlock(filters=128)(x)

for _ in range(2):
    x = PixelConvLayer(
        mask_type="B",
        filters=128,
        kernel_size=1,
        strides=1,
        activation="relu",
        padding="valid",
    )(x)

out = keras.layers.Conv2D(
    filters=1, kernel_size=1, strides=1, activation="sigmoid", padding="valid"
)(x)

pixel_cnn = keras.Model(inputs, out)
adam = keras.optimizers.Adam(learning_rate=0.0005)
pixel_cnn.compile(optimizer=adam, loss="binary_crossentropy")

pixel_cnn.summary()

The architecture is pretty self-explanatory: a masked Conv layer (Mask A), followed by 5 residual blocks, followed by 2 masked Conv layers (Mask B). The final output layer is a regular Conv layer with ‘sigmoid’ activation (why? because we have pixel values from 0–1), so as to generate an image as output.

6. Fitting

pixel_cnn.fit(
    x=data, y=data, batch_size=128, epochs=50, validation_split=0.1, verbose=1
)

Both the input and the label are the same image. As training a PixelCNN takes a long time, I have trained my dummy model for just 2 epochs; we will see the results shortly.
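If you plan to run the full 50 epochs, here is a minimal sketch for saving weights during training using Keras' ModelCheckpoint callback (the filename below is hypothetical):

checkpoint = keras.callbacks.ModelCheckpoint(
    "pixel_cnn_weights.h5",  # hypothetical path, change as needed
    save_weights_only=True,  # saves after every epoch by default
)
# then pass callbacks=[checkpoint] to pixel_cnn.fit(...)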

7. Visualize results

batch = 4
pixels = np.zeros(shape=(batch,) + (pixel_cnn.input_shape)[1:])
batch, rows, cols, channels = pixels.shape

# generate one pixel at a time, row by row, column by column
for row in tqdm(range(rows)):
    for col in range(cols):
        for channel in range(channels):
            # feed the image generated so far; keep only the probability
            # of the current pixel being 1
            probs = pixel_cnn.predict(pixels)[:, row, col, channel]
            # sample 0/1: ceil(p - u) is 1 with probability p
            pixels[:, row, col, channel] = tf.math.ceil(
                probs - tf.random.uniform(probs.shape)
            )

def deprocess_image(x):
    # stack the single channel 3 times to get an RGB image
    x = np.stack((x, x, x), 2)
    # rescale from 0-1 to 0-255
    x *= 255.0
    x = np.clip(x, 0, 255).astype("uint8")
    return x

This part is a little tricky. As you remember, PixelCNN is autoregressive, hence we predict one pixel at a time, given the previously predicted pixels.

  • Initialize the image array with all 0 values
  • Iterate over each pixel (nested for loops over rows, columns, and channels)
  • Predict the probability of the current pixel. Observe that we feed the entire image predicted so far to pixel_cnn.predict(pixels) but keep only the value for the current pixel in ‘probs’.
  • Set the pixel value to 0 or 1 using tf.math.ceil (see the sketch after this list)
  • Loop until every pixel is predicted
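The ceil trick is just Bernoulli sampling in disguise; here is a small numpy sketch to convince yourself (illustrative only):

import numpy as np

# ceil(p - u) with u ~ Uniform(0, 1) is 1 with probability p, else 0
p = np.array([0.2, 0.8, 0.5])
u = np.random.uniform(size=p.shape)
sample = np.ceil(p - u).astype(int)
print(sample)  # same distribution as np.random.binomial(1, p)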

The deprocess_image function does nothing but 1) convert the single-channel image to RGB and 2) rescale the image from 0–1 to 0–255.

8. Visualize images

fig, ax = plt.subplots(2, 2, figsize=(5, 5))
for x in range(2):
    for y in range(2):
        ax[x, y].imshow(deprocess_image(np.squeeze(pixels[2 * x + y], -1)))

As you can see, the results are poor, but that is justified given I trained the model for just 2 epochs. This was just for the sake of a demo, hence my end goal was not to generate great results but to get a working pipeline.

With this, let's call it a wrap. See you next time with Conditional GANs!
