Pre-processing and dimensionality reduction with autoencoders for image classification

Kalia Barkai
Apr 11, 2020 · 10 min read


Introduction

In this tutorial, we will be working through the Kaggle Aerial Cactus Identification challenge from 2019, with the aim of completing the first steps of a data science pipeline: pre-processing the data and reducing its dimensionality with an autoencoder for image classification.

The Data

We are working with 17,500 thumbnail images of size 32 x 32. The filename of each image corresponds to its id in the train.csv file, which includes the image labels. Each image is labelled 1 if it has been categorised as containing a columnar cactus (Neobuxbaumia tetetzo), and 0 otherwise.

Below, you can see an example of the image data:

Image 1. Example of cacti image data. The top row images are labelled as 1 (includes a columnar cactus in the image) and the bottom row are images labelled as 0 (no columnar cacti in the images).

Downloading the Data with Google Colab

For this challenge, we are working in a Google Colab Python 3 notebook, and we have imported the data into a folder in our Google Drive.

We do this by 1) mounting the notebook to our Google Drive, 2) accessing the relevant folder within the Drive, and 3) downloading the data from Kaggle using our unique Kaggle API token.

Listing 1. Importing the Kaggle image data into our Google Drive on Google Colab
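The original listing is not reproduced here, but the steps above can be sketched as follows. The project folder name ("cactus"), the api_token placeholder values and the downloaded file names are assumptions; replace them with your own details and with whatever files Kaggle actually delivers.

# Sketch: mount Google Drive, move into a (hypothetical) project folder,
# register the Kaggle API token and download the competition data.
from google.colab import drive
drive.mount('/content/gdrive')

import json, os
os.chdir('/content/gdrive/My Drive/cactus')   # hypothetical project folder

# Paste the contents of your own kaggle.json file here
api_token = {"username": "your_username", "key": "your_key"}
os.makedirs('/root/.kaggle', exist_ok=True)
with open('/root/.kaggle/kaggle.json', 'w') as f:
    json.dump(api_token, f)
os.chmod('/root/.kaggle/kaggle.json', 0o600)

# Download and unzip the challenge data (file names and folder layout may differ)
!kaggle competitions download -c aerial-cactus-identification
!unzip -q train.zip -d train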

In order to download the data from Kaggle, you need to download the API token unique to your account (the kaggle.json file). You can do this by going to “My Account” on Kaggle, scrolling down to “API” and clicking on “Create New API Token.” This will automatically download a kaggle.json file to your computer. You can then open this file with a code editor, such as Sublime Text or Atom, and copy its contents into the notebook as shown in Listing 1.

You will also need to accept the competition rules of the challenge. You do this by going to the challenge page on Kaggle, clicking on “Join Competition,” and then “I Understand and Accept” to agree to be bound by the competition rules.

Only once you have completed these two actions will you be able to download the challenge data.

Importing and Pre-processing the Data

Before we import the data, we have to understand how a computer sees images. A computer does not process images the same way we do: through colours and shapes, for example. Rather, the input we give the computer is made up of an array of pixel values: numbers ranging from 0 to 255.

For example, in a grayscale image, each pixel would have a value from 0 to 255, with 0 referring to black, and 255 referring to white.

Image 2. Pixel value representation of a grayscale image (“Tutorial 1: Image Filtering”, n.d.)

When working with colour images, we have three values for each pixel, representing the combination of Red, Green and Blue (RGB), each ranging from 0 to 255. Therefore, our cacti images, which are 32 x 32 pixels, should each be processed as arrays of size 32 x 32 x 3, where the 3 refers to the three image channels (RGB).
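As a quick (hypothetical) check, assuming the images were unzipped into a train folder as in the download sketch above, we can open one of them and confirm this shape and value range:

import glob
import numpy as np
from PIL import Image

path = glob.glob('train/*.jpg')[0]          # any one of the training images
arr = np.array(Image.open(path))
print(arr.shape)                            # (32, 32, 3): height, width, RGB channels
print(arr.min(), arr.max())                 # pixel values lie between 0 and 255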

Now we are ready to import the image data into the notebook and to pre-process it for training our autoencoder, and later the classification model itself.

First, we need to import the relevant libraries for importing the data and image pre-processing:

Listing 2. Libraries imported for image pre-processing
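The exact libraries in Listing 2 are not shown here; a plausible set for the steps that follow (using PIL for image loading and scikit-learn for the data split) would be:

import os
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split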

We can easily import the label data with Pandas, since it is a csv file, and we can view the first 5 lines with the Pandas head() function.

Listing 3. Importing and viewing the data labels
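A minimal sketch of Listing 3, assuming train.csv sits in the current working directory:

# Read the label file and inspect the first five rows
train_label = pd.read_csv('train.csv')
train_label.head()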

From the output shown below, we can see that as described in The Data section, each image ID has a corresponding label 1 or 0 depending on its classification:

Image 3. Output from train_label.head() function

However, when we import our images, they will not necessarily be in the same order as the train_label dataframe. Therefore, we use the following code to create a new dataframe based on the order of the image filenames in our training data folder, and we save this as a csv.

Listing 4. Code used to sort the labels according to the image data that will be imported
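A sketch of this step, assuming the images ended up in a folder called train and that the id column of train.csv matches the image filenames:

# Re-order the labels to match the order in which the image files will be read
filenames = os.listdir('train')
train_label = train_label.set_index('id').loc[filenames].reset_index()
train_label.to_csv('train_labels_sorted.csv', index=False)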

Finally, we can use the code described in Listing 5 below to import and pre-process our image data, ensuring that all our images are the same size (32 x 32 x 3, since we are working with colour images) and appending each image array to our training list.

Listing 5. Code for importing and pre-processing the image data
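A sketch of the same idea, resizing each image to 32 x 32 with PIL (the original listing may use a different image library):

# Read every image in the same order as the sorted labels,
# force it to 32 x 32 x 3 and collect the arrays in a list
train_images = []
for fname in filenames:
    img = Image.open(os.path.join('train', fname)).convert('RGB')
    img = img.resize((32, 32))
    train_images.append(np.array(img))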

We then run the following code to format our data so that it can be used in our model:

  • turning the list into an array,
  • making sure that all the data is of the same type (float32) and standardised,
  • ensuring our array is the right shape, and
  • splitting our image and label data into a train, validation and test set so that we can measure the generalisability of our model to unseen data.
Listing 6. Code for formatting and splitting our image data
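A sketch of these four steps; the 80/10/10 split proportions and the random seed are assumptions, and the validation arrays are named eval_data/eval_labels to match the fitting code referenced later:

# Array of type float32, scaled to the range 0-1, with shape (n, 32, 32, 3)
X = np.array(train_images, dtype='float32') / 255.0
X = X.reshape(-1, 32, 32, 3)
y = train_label['has_cactus'].values

# Train / validation / test split (80% / 10% / 10%)
x_train, x_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
eval_data, x_test, eval_labels, y_test = train_test_split(x_tmp, y_tmp, test_size=0.5, random_state=42)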

Lastly, since 17,500 images can take a substantial amount of time to load into our notebook, we save the numpy arrays into our folder, so that if we return to the code later, we can quickly import these pre-processed arrays instead of rerunning all the above code. We do this with the np.save() and np.load() functions, which work with .npy files.

Listing 7. Code for saving and loading our pre-processed image arrays
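For example (the file names are arbitrary):

# Save the pre-processed arrays to the Drive folder...
np.save('x_train.npy', x_train)
np.save('eval_data.npy', eval_data)
np.save('x_test.npy', x_test)

# ...and load them back in a later session instead of re-running the pre-processing
x_train = np.load('x_train.npy')
eval_data = np.load('eval_data.npy')
x_test = np.load('x_test.npy')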

Now we are finally ready to start working with this data.

Autoencoders for Dimensionality Reduction

Autoencoders are a self-supervised learning technique that uses neural networks for data compression. The general idea is that the autoencoder takes the image data as both input and target, encodes the data into a smaller (bottleneck) representation, and then decodes this representation into a reconstruction of the original image.

Image 4. Example of the autoencoder process on a handwritten image of a 2 (Chollet, 2016)

The image below shows an example of the autoencoder architecture, where the input and output layers are the same size, but the hidden layer is a “bottleneck” of the image information.

Image 5. Representation of a simple autoencoder architecture with neural networks (Jordan, 2018)

Chollet (2016) highlights three main characteristics of autoencoders: they are 1) data-specific, 2) lossy, and 3) “learned automatically from examples rather than engineered by a human.”

  • Data-specific: autoencoders only succeed in compressing images similar to those they were trained on. For example, our autoencoder will be able to reconstruct unseen aerial images of cacti, but it will not be able to reconstruct unseen images of human faces.
  • Lossy: the output images from our decoder will be “degraded compared to the original inputs” (Chollet, 2016).
  • Learned automatically from data examples: there is no manual feature engineering required when creating an autoencoder; we just need to provide it with relevant training data to learn from.

The aim of the autoencoder is to constrain the amount of information passed through the bottleneck while still producing successful reconstructions of the input data. Ideally, we want our autoencoder to “learn the most important attributes of the input data” (Jordan, 2018) in the hidden-layer bottleneck; this is what allows the autoencoder to be used for dimensionality reduction.

For example, the attribute we want to learn from our dataset is the presence of a columnar cactus in an image, but the images also contain noise, perhaps in the form of blurriness, camera dust, or other items obscuring the image. Ideally, our autoencoder captures the relevant patterns of our data (i.e. the presence of a cactus in the image) while ignoring the noise, which would only confuse our classifier.

I also want to highlight the trade-off that we would like our autoencoder to balance:

  • “sensitive to the inputs enough to accurately build a reconstruction” (Jordan, 2018), and
  • “insensitive enough to the inputs so that the model doesn’t simply memorise or overfit the training data” (Jordan, 2018).

The first point ensures that our model actually succeeds in reconstructing our images, and the second ensures that our model is generalisable to unseen images. If our model just memorises the input, then it cannot capture the most important attributes of unseen data in its bottleneck.

Autoencoders use objective functions to try to achieve this balance: one term in the function, the reconstruction loss, measures how well our model reproduces the original images (sensitivity to the inputs), while the other term, the regulariser, discourages overfitting (insensitivity to the inputs). We can also “add a scaling parameter in front of the regularisation term so that we can adjust the trade-off between the two objectives” (Jordan, 2018).

Equation 1. Example objective function for an autoencoder, with the reconstruction loss term on the left, and the regulariser term on the right.

The exact terms used for the reconstruction loss and the regulariser are specific to the chosen model implementation.
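As one concrete illustration (not necessarily the form pictured in Equation 1), a sparse autoencoder combines a reconstruction loss with an L1 penalty on the bottleneck activations a_i, scaled by a parameter lambda:

\underbrace{\lVert x - \hat{x} \rVert^{2}}_{\text{reconstruction loss}} \;+\; \underbrace{\lambda \sum_{i} \lvert a_i \rvert}_{\text{regulariser}}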

Returning to what we can use autoencoders for, they currently serve two main functions:

  • Denoising data: we input noisy data into the autoencoder, with the clean (denoised) data as its target, and we can then train the autoencoder to clean up new noisy images of the same type.
  • Dimensionality reduction: images are feature-heavy data points; each of our 32 x 32 colour cacti images is a 32 x 32 x 3 array, i.e. 3,072 values. Dimensionality reduction can therefore be used to reduce the number of features we train our model on. If we train an autoencoder on our images, we can produce a reduced representation (the encoded bottleneck) on which to train a classifier. This is helpful because 1) it might make our model more generalisable to new data, 2) training the model will be more efficient, and 3) as stated above, the bottleneck might store the most important attributes of our images, which might be the features most helpful for classifying them.

The focus in this tutorial is on using our autoencoder for dimensionality reduction. Therefore, creating the model where we classify this data falls outside of the scope of this article.

Autoencoder Implementation

For our model we are implementing a convolutional neural network (CNN) autoencoder. We won't delve into how CNNs work for now; just know that our decision to use a CNN for this autoencoder implementation was based on its greater ability to preserve the spatial information in the images compared to a simpler, fully connected autoencoder.

An autoencoder includes three main parts: the encoder, the decoder, and a distance function which records “the amount of information loss between the compressed representation of your data and the decompressed representation (i.e. a “loss” function)” (Chollet, 2016).

We will want to use the following libraries for our Keras implementation of the autoencoder:

Listing 8. Code for importing the relevant Keras libraries for the neural network of the autoencoder
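These are likely to be along the lines of the following (whether they come from keras or tensorflow.keras depends on the setup):

from keras.models import Model, load_model
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D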

The code below shows the full architecture of our autoencoder using a neural network where we have three layers each for the encoder and the decoder.

As you can see, we use a binary cross-entropy loss function. This type of loss function is used when our output is a probability between 0 and 1 (in this case, how probable it is that our reconstruction matches our original image), and we can expect our cross-entropy loss to increase as our reconstruction diverges from the original image.

Additionally, we compile our autoencoder with an optimiser (Adadelta), which both updates our model parameters between iterations in order to minimise our loss, and defines the rate (the learning rate) at which our model parameters are updated between iterations.

Listing 9. Autoencoder code, showing the encoder, decoder, and then compiling the autoencoder from the input layer and the output layer.
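The original listing is not reproduced here; below is a minimal sketch of a convolutional autoencoder that produces the layer shapes printed further down. The filter counts follow those shapes, but the kernel sizes, activations and padding are assumptions:

input_img = Input(shape=(32, 32, 3))

# Encoder: three convolution + max-pooling blocks
x = Conv2D(256, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)          # (?, 16, 16, 256)
x = Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)          # (?, 8, 8, 128)
x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)    # (?, 4, 4, 64) - the bottleneck

# Decoder: mirror the encoder with up-sampling
x = Conv2D(64, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)                          # (?, 8, 8, 64)
x = Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)                          # (?, 16, 16, 128)
x = Conv2D(256, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)                          # (?, 32, 32, 256)
decoded = Conv2D(3, (3, 3), activation='sigmoid', padding='same')(x)   # (?, 32, 32, 3)

# Build and compile the full autoencoder
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')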

We used a print function to see how the dimensions of our image data change throughout the autoencoder, with the following results:

(?, 16, 16, 256)
(?, 8, 8, 128)
(?, 4, 4, 64)
(?, 8, 8, 64)
(?, 16, 16, 128)
(?, 32, 32, 256)
(?, 32, 32, 3)

We then fit this autoencoder to our training data, using the images both as input and as target. We have set the number of epochs (full passes over the training data) to 1000, and the batch size (the number of training examples used per iteration) to 256.

Listing 10. Fitting the autoencoder to the training data and using eval_data for validation
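A sketch of the fitting call, with the images used both as input and as target:

history = autoencoder.fit(x_train, x_train,
                          epochs=1000,
                          batch_size=256,
                          shuffle=True,
                          validation_data=(eval_data, eval_data))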

We can save our model for future use, using the Keras save and load_model functions:

Listing 11. Saving and loading our model architecture and weights for future use
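For example:

# Save the architecture, weights and optimiser state to a single file...
autoencoder.save('autoencoder.h5')

# ...and restore it in a later session
autoencoder = load_model('autoencoder.h5')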

The importance of training the full autoencoder, even though we will only use the encoder to reduce the dimensionality of our images for the classification problem, is that the reconstructions let us judge how successful our autoencoder is at capturing the most important attributes.

To measure whether our autoencoder is improving over the iterations, we can plot how our loss changes over time for the training set and the validation set. The lower the loss, the smaller the difference between our original images and our reconstructed images, and therefore the better our model.

Listing 12. Code used to plot our loss over the iterations for the training and validation during fitting
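A sketch of such a plot, reading the per-epoch losses from the History object returned by fit():

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, label='Training loss')
plt.plot(epochs, val_loss, label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Binary cross-entropy loss')
plt.legend()
plt.show()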
Graph 1. Plot of the loss over 1000 epochs for the training and validation data when fitting our autoencoder. This plot shows that during the training of our autoencoder, the loss calculated on our training data and on our validation data are both minimised to below 0.66. We can interpret this to mean that our model is not overfitting on our training data, as then the loss on our validation data would not reduce as much. Additionally, as we near 1000 epochs, the loss is minimised at a much slower rate, meaning we might be converging on the model’s best parameters.

As we can see from the plot above, the loss is reduced to 0.6562 on our training data and to around 0.6570 on our validation set. We can tell from this plot that our autoencoder is not overfitting too much, as the validation loss (the loss calculated on unseen data) decreases during training. It also seems as though the validation loss is still decreasing, albeit at a much slower rate, by 1000 epochs; therefore, with more iterations, our model could improve. However, since it can take many hours, even days, to train this model for thousands of epochs on Google Colab’s free GPU, for the purpose of this tutorial we will stop at 1000 iterations.

Additionally, we can use the validation set to reconstruct some of the images through our autoencoder and compare them to the original images to manually analyse how well our autoencoder works.

Listing 13. Code used to reconstruct images from validation set and compare to original images in a plot
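A sketch of this comparison, plotting a handful of validation images above their reconstructions:

# Reconstruct the validation images by passing them through the autoencoder
decoded_imgs = autoencoder.predict(eval_data)

n = 10                                   # number of images to compare
plt.figure(figsize=(20, 4))
for i in range(n):
    # original image on the top row
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(eval_data[i])
    ax.axis('off')
    # reconstruction on the bottom row
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i])
    ax.axis('off')
plt.show()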
Image 6. The top row shows an original subset of the cacti images from the dataset and the bottom row shows the reconstructed images after using the autoencoder to predict these images, i.e. passing the images through the autoencoder’s bottleneck and reconstructing them from the reduced information stored there.

As we can see, the autoencoder is unable to capture all the details of the original images in its reconstructions. However, it does reconstruct the major visual features, such as the contrasting shapes in some of the images and the general colour differences between them.

The next step would now be to test a classification model on both the encoded data, and on the original dataset, and see whether our autoencoder helps improve the efficiency and/or the accuracy of our classification model.
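To obtain that encoded data, the encoder half can be pulled out of the trained autoencoder on its own; a minimal sketch, reusing the input_img and encoded tensors from the architecture sketch above:

# The encoder shares its (trained) layers with the full autoencoder
encoder = Model(input_img, encoded)

# Reduced representation of the images: shape (n, 4, 4, 64) instead of (n, 32, 32, 3)
encoded_train = encoder.predict(x_train)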

References

Chollet, F. (2016). Building Autoencoders in Keras. The Keras Blog. https://blog.keras.io/building-autoencoders-in-keras.html

Jordan, J. (2018). Introduction to autoencoders. https://www.jeremyjordan.me/autoencoders/

Tutorial 1: Image Filtering. (n.d.).