Autoencoders Explained

om pramod · Jun 16, 2024

Part 2: Convolutional Autoencoder (CAE)

There are several types of autoencoders, each designed for a particular kind of input data or task. This part focuses on one of the most commonly used: the Convolutional Autoencoder.

A Convolutional Autoencoder (CAE) is an autoencoder designed specifically for image data. Instead of dense layers, both the encoder and the decoder are built from convolutional neural networks (CNNs), which are particularly good at capturing the spatial relationships between the pixels of an image. The encoder layers are convolutional layers: they perform convolution operations on the input image to extract features. The decoder layers are often called deconvolution layers, because they use deconvolution (transpose convolution) operations to reconstruct the output image from the compressed representation.


The encoder typically consists of a series of convolutional layers, each followed by a pooling layer. The convolutional layers apply filters to the input image to extract features, while the pooling layers reduce the spatial dimensions of the feature maps.
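
As a rough sketch, an encoder of this kind might look as follows in Keras; the input shape, filter counts, and kernel sizes here are illustrative assumptions, not values fixed by the article:

from keras.layers import Input, Conv2D, MaxPooling2D
from keras.models import Model

# A minimal convolutional encoder for 28x28 grayscale images.
inputs = Input(shape=(28, 28, 1))
x = Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)  # extract features
x = MaxPooling2D((2, 2))(x)  # 28x28 -> 14x14
x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2))(x)  # 14x14 -> 7x7

encoder = Model(inputs=inputs, outputs=x)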

The output of the final convolutional layer is flattened and fed into a dense layer that maps it to the lower-dimensional latent space. On the decoder side, the latent vector is mapped back up by another dense layer and reshaped into a 3D tensor that the upsampling layers can process; the shape of this tensor depends on the chosen encoder architecture and the size of the latent space. For example, if the final convolutional layer outputs feature maps of size 3x3x128 and the latent space has dimension 10, the flattened tensor has 3 x 3 x 128 = 1152 values, and a dense layer maps these 1152 values down to the 10-dimensional latent code. The decoder then uses a dense layer to map the 10-dimensional latent vector back to 1152 values, which are reshaped into a tensor of shape (3, 3, 128), where the first two dimensions correspond to the spatial dimensions (height and width) of the feature maps and the last dimension to the number of channels. This 3D tensor is passed through the upsampling layers of the decoder to reconstruct the original image.
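
A minimal sketch of this bottleneck in Keras, reusing the 3x3x128 feature maps and latent dimension 10 from the example above (the relu activations are an illustrative choice):

from keras.layers import Input, Flatten, Dense, Reshape
from keras.models import Model

# Bottleneck matching the example: 3 * 3 * 128 = 1152 values.
feature_maps = Input(shape=(3, 3, 128))
flat = Flatten()(feature_maps)  # shape: (batch, 1152)
latent = Dense(10, activation='relu')(flat)  # 10-dimensional latent code

# Decoder side: map the code back up and restore the spatial layout.
expanded = Dense(3 * 3 * 128, activation='relu')(latent)  # shape: (batch, 1152)
restored = Reshape((3, 3, 128))(expanded)  # shape: (batch, 3, 3, 128)

bottleneck = Model(inputs=feature_maps, outputs=restored)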

The decoder is the reverse of the encoder and typically consists of a series of upsampling layers followed by convolutional layers. The upsampling layers increase the spatial dimensions of the feature maps, while the convolutional layers apply filters to reconstruct the original image.

The process of upsampling can also be achieved using other techniques such as nearest neighbor interpolation or bilinear interpolation, but transpose convolution is commonly used in CAEs because it allows the network to learn the upsampling process directly from the data during training. Additionally, transpose convolution can learn to fill in missing data or generate new data based on the learned patterns, which can be useful for tasks such as image inpainting or generation.
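
To make the two options concrete, here is a sketch of both upsampling styles in Keras, starting from a hypothetical 7x7x64 feature map (all shapes and filter counts are assumptions for illustration):

from keras.layers import Input, Conv2D, Conv2DTranspose, UpSampling2D
from keras.models import Model

features = Input(shape=(7, 7, 64))

# Option 1: learned upsampling with transpose convolutions.
x = Conv2DTranspose(32, (3, 3), strides=2, activation='relu', padding='same')(features)  # 7x7 -> 14x14
x = Conv2DTranspose(1, (3, 3), strides=2, activation='sigmoid', padding='same')(x)  # 14x14 -> 28x28
decoder_transpose = Model(inputs=features, outputs=x)

# Option 2: fixed interpolation followed by ordinary convolutions.
y = UpSampling2D((2, 2), interpolation='nearest')(features)  # 7x7 -> 14x14
y = Conv2D(32, (3, 3), activation='relu', padding='same')(y)
y = UpSampling2D((2, 2), interpolation='bilinear')(y)  # 14x14 -> 28x28
y = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(y)
decoder_upsample = Model(inputs=features, outputs=y)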

Reconstruction losses: To train an autoencoder, we need to define a loss function that measures how well the decoder can reconstruct the input from the hidden representation. The reconstruction error is a measure of the difference between the input data and the reconstructed data and is typically quantified using a loss function. There are different types of loss functions that can be used for training autoencoders, depending on the nature of the input data and the desired output.

Mean squared error (MSE): The mean squared error (MSE) loss is the most commonly used reconstruction loss for autoencoders. It computes the average squared difference between the input and output pixels, and the goal of training is to minimize it by adjusting the weights of the encoder and decoder networks. The autoencoder is trained using backpropagation: the gradient of the MSE loss with respect to the weights is computed and used to update the network. For a single example, the MSE loss is calculated by taking the difference between each corresponding pixel in the input and output, squaring it, and averaging over all pixels:

MSE = (1/N) * sum((x - x')^2)

where x represents an input pixel, x' the corresponding reconstructed output pixel, and N the total number of pixels in the input. A low MSE loss indicates that the autoencoder reconstructs the input data accurately; a high MSE loss indicates poor reconstruction performance.
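
A quick sanity check of the formula in NumPy on a toy four-pixel "image" (the pixel values are made up for illustration):

import numpy as np

x = np.array([0.0, 0.5, 1.0, 0.25])  # input pixels
x_hat = np.array([0.1, 0.4, 0.9, 0.25])  # reconstructed pixels
mse = np.mean((x - x_hat) ** 2)  # (1/N) * sum((x - x')^2)
print(mse)  # 0.0075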

Binary cross-entropy (BCE): This is another common loss function for autoencoders, used especially when the input data is binary (such as images with black and white pixels) or normalized to [0, 1]. The formula for binary cross-entropy (BCE) loss in an autoencoder is:

BCE = -(1/N) * sum(x * log(y) + (1-x) * log(1-y))

where x is the actual pixel value of the input data, y is the corresponding reconstructed output pixel, and N is the total number of pixels in the input data. For example, if the input is an image of size 28 x 28, then N equals 784 (i.e., 28 x 28). The formula sums the binary cross-entropy losses over every pixel in the input, which is why it includes the summation term. For instance, if the actual value of a pixel is 0.7 and the autoencoder reconstructs it as 0.4, that pixel contributes -(0.7 * log(0.4) + 0.3 * log(0.6)) ≈ 0.79 to the sum.
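
The same per-pixel arithmetic can be checked in NumPy (x = 0.7 and y = 0.4 are the example values from above):

import numpy as np

x, y = 0.7, 0.4  # actual pixel value and reconstructed value
bce = -(x * np.log(y) + (1 - x) * np.log(1 - y))
print(bce)  # ~0.7946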

Note — If the input data is continuous or real-valued, MSE might be more suitable than BCE.

Python implementation of a basic autoencoder:

import numpy as np
from keras.datasets import mnist
from keras.layers import Input, Dense
from keras.models import Model

# Load the MNIST data and scale pixel values to [0, 1]
(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.

# Reshape the data into a 784-dimensional vector
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))

# Define the input layer
input_image = Input(shape=(784,))

# define encoder layers
encoder_layer1 = Dense(128, activation='relu')(input_image)
encoder_layer2 = Dense(64, activation='relu')(encoder_layer1)
encoder_layer3 = Dense(32, activation='relu', name='encoder_layer3')(encoder_layer2)  # bottleneck

# define decoder layers
decoder_layer1 = Dense(64, activation='relu')(encoder_layer3)
decoder_layer2 = Dense(128, activation='relu')(decoder_layer1)
decoder_outputs = Dense(784, activation='sigmoid')(decoder_layer2)

# define autoencoder model
autoencoder = Model(inputs=input_image, outputs=decoder_outputs)

# compile model
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))

# Visualizing encoded representation:
import matplotlib.pyplot as plt

# extract encoder output
encoder = Model(inputs=autoencoder.input, outputs=autoencoder.get_layer("encoder_layer3").output)
encoded_imgs = encoder.predict(x_test)

# reshape each 32-dimensional code into an 8x4 grid for display
encoded_imgs_reshaped = encoded_imgs.reshape((len(encoded_imgs), 8, 4))

# display some encoded images
n = 10 # number of images to display
plt.figure(figsize=(20, 4))
for i in range(n):
    # display original image
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # display encoded representation
    ax = plt.subplot(2, n, i + n + 1)
    plt.imshow(encoded_imgs_reshaped[i])
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
# Evaluate the model on the test set:
score = autoencoder.evaluate(x_test, x_test, verbose=0)
print('Test loss:', score)

Output:

Test loss: 0.08174165338277817
# Making predictions and comparing generated images with original images:

# Generate reconstructed images from the test set
reconstructed_imgs = autoencoder.predict(x_test)

# Visualize some of the original and reconstructed images
import matplotlib.pyplot as plt

n = 10 # number of images to display
plt.figure(figsize=(20, 4))
for i in range(n):
    # Original image
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # Reconstructed image
    ax = plt.subplot(2, n, i + n + 1)
    plt.imshow(reconstructed_imgs[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()

Note — Minimizing the MSE alone may not be enough to ensure that the learned representation is meaningful. If the input data is noisy, the autoencoder may struggle to separate the noise from the actual features of the data and may learn to reproduce the noise along with the important features. The result is a poor-quality reconstruction that includes the noise, making it harder to extract meaningful features from the encoded representation. To prevent this, regularization techniques can force the autoencoder to focus on the most important features in the input data while filtering out the noise. We can add regularization terms to the loss function, such as:

  • Sparsity: Sparsity regularization encourages the hidden representation to be sparse, meaning that only a small subset of the neurons is active for any given input. The idea is that sparsity helps the model learn more meaningful features by forcing it to focus on the most important information in the data. To achieve this, we add a penalty term to the loss function that pushes the hidden representation toward a small number of active neurons. Two common ways to define this penalty are an L1 penalty on the activations and a KL-divergence penalty; a sketch of the L1 variant follows below.
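
As a minimal sketch, the L1 variant can be added to the dense autoencoder above through Keras's activity_regularizer argument; the 1e-5 penalty coefficient is an illustrative assumption:

from keras import regularizers
from keras.layers import Input, Dense
from keras.models import Model

# Same bottleneck width as before, but with an L1 penalty on the
# hidden activations to encourage a sparse code.
input_image = Input(shape=(784,))
encoded = Dense(32, activation='relu',
                activity_regularizer=regularizers.l1(1e-5))(input_image)
decoded = Dense(784, activation='sigmoid')(encoded)

sparse_autoencoder = Model(inputs=input_image, outputs=decoded)
sparse_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')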

Closing note — Join us in the next part as we uncover the concept of KL-divergence and its significance in training autoencoders. Keep up the great work; you’re on the right track!
