Pre-processing and dimensionality reduction with autoencoders for image classification
Introduction
In this tutorial, we will be working through the Kaggle Aerial Cactus Identification challenge from 2019, with the aim of completing the first step of a data science pipeline: pre-processing the data, and dimensionality reduction with autoencoders for image classification.
The Data
We are working with 17500 thumbnail images of size 32 x 32. The filename for each image corresponds to its id in the train.csv file, which includes the image labels. Each image id is labelled 1 if it has been categorised as a columnar cactus (Neobuxbaumia tetetzo), and 0 otherwise.
Below, you can see an example of the image data:
Downloading the Data with Google Colab
For this challenge, we are working in a Google Colab Python 3 notebook, and we have imported the data into a folder in our Google Drive.
We do this by 1) mounting our Google Drive in the notebook, 2) accessing the relevant folder within the Drive, and 3) downloading the data from Kaggle using the API token unique to our Kaggle account.
In order to download the data from Kaggle, you need the API token unique to your account (the kaggle.json file). You can get it by going to “My Account” in Kaggle, scrolling down to “API” and clicking on “Create New API Token.” This will automatically download a kaggle.json file to your computer. You can then open this file with a code editor, such as Sublime or Atom, and copy its contents as shown in line 18 of Listing 1.
You will also need to accept the competition rules of the challenge. You do this by going to the challenge page on Kaggle, clicking on “Join Competition,” and then “I Understand and Accept” to agree to being bound by the competition rules.
Only once you have completed these two actions will you be able to download the challenge data.
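The three steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the original listing: the helper names, the competition slug aerial-cactus-identification and the folder paths are our own choices for illustration, and the Drive-mounting helper only works inside Colab. The download helper relies on the kaggle command-line tool being installed.

```python
import json
import os
import subprocess
from pathlib import Path

# Assumed competition slug; check it against the challenge URL on Kaggle.
COMPETITION = "aerial-cactus-identification"


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive. Only works inside a Colab notebook."""
    from google.colab import drive  # imported here because it exists only in Colab
    drive.mount(mount_point)


def write_kaggle_token(username, key, config_dir):
    """Write the two fields of kaggle.json into config_dir and restrict access."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    token_path = config_dir / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))
    token_path.chmod(0o600)  # the token is a credential, so keep it private
    return token_path


def download_competition_data(data_dir, config_dir):
    """Download the challenge files with the kaggle command-line tool.

    Requires network access and that the competition rules have been accepted.
    """
    env = dict(os.environ, KAGGLE_CONFIG_DIR=str(config_dir))
    subprocess.run(
        ["kaggle", "competitions", "download", "-c", COMPETITION, "-p", str(data_dir)],
        check=True,
        env=env,
    )
```

In a notebook, we would call mount_drive() first, then write_kaggle_token() with the username and key copied from our own kaggle.json, and finally download_competition_data() pointing at a folder inside the mounted Drive.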
Importing and Pre-processing the Data
Before we import the data, we have to understand how a computer sees images. A computer does not process images the same way we do: through colours and shapes, for example. Rather, the input we give the computer is made out of an array of pixel values: numbers ranging from 0 to 255.
For example, in a grayscale image, each pixel would have a value from 0 to 255, with 0 referring to black, and 255 referring to white.
When working with colour images, we would now have three values for each pixel, representing the combination of Red, Green and Blue (RGB) each ranging from 0 to 255. Therefore, our cacti images which are of size 32 x 32 pixels, should each be processed as arrays of size 32 x 32 x 3, where the 3 refers to the three image channels (RGB).
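To make this concrete, here is a small NumPy sketch of how a grayscale and a colour image of our size look as arrays. The images are synthetic and built by hand purely for illustration.

```python
import numpy as np

# Grayscale image: one value per pixel, 0 = black and 255 = white.
gray = np.zeros((32, 32), dtype=np.uint8)
gray[8:24, 8:24] = 255  # a white square on a black background

# Colour image: three values (red, green, blue) per pixel.
rgb = np.zeros((32, 32, 3), dtype=np.uint8)
rgb[..., 1] = 255  # fill the green channel, giving a pure green image

print(gray.shape)   # (32, 32)
print(rgb.shape)    # (32, 32, 3)
print(rgb.size)     # 3072 values describe a single colour thumbnail
```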
Now we are ready to import the image data into the notebook, and to pre-process the data to be used for training our autoencoder, and then the model itself.
First, we need to import the relevant libraries for importing the data and image pre-processing:
We can easily import the label data with Pandas, since it is a csv file, and we can view the first 5 lines with the Pandas df.head() function.
From the output shown below, we can see that, as described in The Data section, each image ID has a corresponding label of 1 or 0, depending on its classification:
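A minimal sketch of this step is given here. It reads a small in-memory stand-in for train.csv so that it runs on its own; the file names in it are invented, and the column names id and has_cactus are an assumption about the layout of the real file.

```python
import io

import pandas as pd

# Stand-in for train.csv with invented file names (assumed columns: id, has_cactus).
csv_text = """id,has_cactus
img_001.jpg,1
img_002.jpg,1
img_003.jpg,0
img_004.jpg,1
img_005.jpg,0
img_006.jpg,1
"""

# With the real data this would be pd.read_csv("train.csv").
train_label = pd.read_csv(io.StringIO(csv_text))

# Show the first 5 rows: one image id and its 0/1 label per row.
print(train_label.head())
```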
However, when we import our images, they will not necessarily be in the same order as the train_label dataframe. Therefore, we use the following code to create a new dataframe based on the order of the image filenames from our training data folder, and we save this as a csv file.
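One way to do this reordering is sketched here, assuming the labels have columns named id and has_cactus. The file names, the output file name and the variable names other than train_label are hypothetical; with the real data the filenames list would come from listing the training folder.

```python
import os
import tempfile

import pandas as pd

# Labels in the order they appear in the csv file (toy data).
train_label = pd.DataFrame(
    {"id": ["a.jpg", "b.jpg", "c.jpg"], "has_cactus": [1, 0, 1]}
)

# Order in which the image files are read from disk, e.g. from os.listdir(train_dir).
filenames = ["c.jpg", "a.jpg", "b.jpg"]

# Look up each filename in turn; a missing id raises a KeyError instead of
# silently misaligning images and labels.
ordered_label = train_label.set_index("id").loc[filenames].reset_index()

# Save the reordered labels so that this step does not need to be repeated.
out_path = os.path.join(tempfile.mkdtemp(), "train_label_ordered.csv")
ordered_label.to_csv(out_path, index=False)

print(ordered_label)
```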
Finally, we can use the code in Listing 5 below to import and pre-process our image data, ensuring that all our images are the same size (32 x 32 x 3, since we are working with colour images) and appending each image array to our training list.
We then run the following code to format our data so that it can be used in our model:
- turning the list into an array,
- making sure that all the data is of the same type, float32, and standardised,
- ensuring our array is the right shape, and
- splitting our image and label data into a train, validation and test set so that we can measure the generalisability of our model to unseen data.
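The four steps in this list can be sketched with NumPy alone. This is an illustration under assumptions rather than the original listing: the images are random stand-ins, "standardised" is taken to mean scaling pixel values to the range 0 to 1, and the 70/15/15 split is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the list of loaded image arrays and their labels (synthetic data).
train_images = [
    rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(100)
]
train_labels = rng.integers(0, 2, size=100)

# 1) list -> array, 2) one common type (float32), scaled to [0, 1] (assumed scaling).
X = np.array(train_images).astype("float32") / 255.0

# 3) make the expected shape explicit: (n_images, height, width, channels).
X = X.reshape(-1, 32, 32, 3)
y = np.asarray(train_labels)

# 4) shuffle once, then cut into train / validation / test (assumed 70/15/15).
order = rng.permutation(len(X))
n_train = len(X) * 70 // 100
n_val = len(X) * 15 // 100
train_idx, val_idx, test_idx = np.split(order, [n_train, n_train + n_val])

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_val.shape, X_test.shape)
```

Using the same index arrays for the images and the labels keeps each image paired with its own label after shuffling.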
Lastly, since 17500 images can take a substantial amount of time to load into our notebook, we save the numpy arrays into our folder, so that if we return to the code later, we can just import these pre-processed arrays quickly, instead of rerunning all the above code. We save each array to an npy file with the np.save() function, and reload it later with the np.load() function.
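A short sketch of this round trip is shown here, using a small random array and a temporary folder in place of our real data and Drive folder. The file name X_train.npy is a hypothetical choice.

```python
import os
import tempfile

import numpy as np

# Small stand-in for a pre-processed image array.
X_train = np.random.default_rng(0).random((10, 32, 32, 3), dtype=np.float32)

# Temporary folder in place of the Drive folder used in the notebook.
save_dir = tempfile.mkdtemp()
path = os.path.join(save_dir, "X_train.npy")

np.save(path, X_train)      # write the array to disk in npy format
X_restored = np.load(path)  # read it back later without re-processing the images

print(X_restored.shape, X_restored.dtype)
```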
Now we are finally ready to start working with this data.
Autoencoders for Dimensionality Reduction
Autoencoders are a self-supervised learning technique that uses neural networks for data compression. The general idea is that the autoencoder takes the image data as both its input and its target, encodes the data into a smaller (bottleneck) representation, and then decodes this representation into a reconstruction of the original image.
The image below shows an example of the autoencoder architecture, where the input and output layers are the same size, but the hidden layer is a “bottleneck” of the image information.
Chollet (2016) highlights three main characteristics of autoencoders: they are 1) data-specific, 2) lossy, and 3) “learned automatically from examples rather than engineered by a human.”
- Data-specific: autoencoders only succeed in compressing images similar to those they are trained on. For example, our autoencoder will be able to reconstruct unseen aerial images of cacti, but it will not be able to reconstruct unseen images of human faces.
- Lossy: the output images from our decoder will be “degraded compared to the original inputs” (Chollet, 2016).
- Learned automatically from data examples: we do not need to hand-engineer the compression for a new type of input; we just need to provide the autoencoder with relevant training data to learn from.
The aim of the autoencoder is to constrain the amount of information passed through the bottleneck, but still output successful reconstructions of the input data. Ideally, we want our autoencoder to “learn the most important attributes of the input data” (Jordan, 2018) in the hidden layer bottleneck; this is how our autoencoder can be used for dimensionality reduction.
For example, the images in our dataset that are labelled 1 all contain a cactus, the attribute we want to learn, but they also contain noise, perhaps in the form of blurriness, camera dust, or other items obscuring the image. Therefore, ideally our autoencoder is able to capture the relevant patterns of our data (i.e. the presence of a cactus in the image), while ignoring the noise which would just confuse our classifier.
I also want to highlight the trade-off that we would like our autoencoder to balance:
- “sensitive to the inputs enough to accurately build a reconstruction” (Jordan, 2018), and
- “insensitive enough to the inputs so that the model doesn’t simply memorise or overfit the training data” (Jordan, 2018).
The first point ensures that our model actually succeeds in reconstructing our images, and the second ensures that our model is generalisable to unseen images. If our model just memorises the input, then it cannot capture the most important attributes of unseen data in its bottleneck.
Autoencoders use objective functions to try to achieve this balance. One term in the function, the reconstruction loss, calculates how well our model reproduces the original images (sensitivity to the inputs), and the other term, the regulariser, attempts to reduce overfitting. We can also “add a scaling parameter in front of the regularisation term so that we can adjust the trade-off between the two objectives” (Jordan, 2018).
The exact terms used for the reconstruction loss and the regulariser are specific to the chosen model implementation.
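As an illustration of the two-term structure, here is a minimal NumPy sketch. The specific choices are assumptions made for the example only: mean squared error as the reconstruction loss, an L1 penalty on the bottleneck activations as the regulariser, and the names objective and lam. They are not the terms used by the model later in this tutorial.

```python
import numpy as np


def objective(x, x_hat, code, lam=0.001):
    """Reconstruction loss plus a scaled regulariser (illustrative choices)."""
    reconstruction_loss = np.mean((x - x_hat) ** 2)  # how well x_hat reproduces x
    regulariser = np.mean(np.abs(code))              # penalises large bottleneck activations
    return reconstruction_loss + lam * regulariser


rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))                              # an input image
x_hat = np.clip(x + rng.normal(0, 0.05, x.shape), 0, 1)  # an imperfect reconstruction
code = rng.normal(0, 1, (4, 4, 64))                      # a bottleneck representation

# A larger scaling parameter puts more weight on the regulariser.
print(objective(x, x_hat, code, lam=0.001))
print(objective(x, x_hat, code, lam=0.1))
```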
Returning to what we can use autoencoders for, they currently serve two main functions:
- Denoising data: we input noisy data into the autoencoder, with the denoised data as its target, and we can train the autoencoder to clean up new noisy images of the same type.
- Dimensionality reduction: images are feature-heavy data points, since each image is a 32 x 32 x 3 array in our cacti example (3,072 values). Dimensionality reduction can therefore be used to reduce the number of features we train our model on. If we train an autoencoder on our images, we can use its reduced representation (the encoded bottleneck) as the input to a classification model. This is helpful because 1) it might make our model more generalisable to new data, 2) training the model will be more efficient, and 3) as stated above, the bottleneck might store the most important attributes of our images, which might be the features most helpful for classifying the images.
The focus in this tutorial is on using our autoencoder for dimensionality reduction. Therefore, creating the model where we classify this data falls outside of the scope of this article.
Autoencoder Implementation
For our model we are implementing a convolutional neural network (CNN) autoencoder. We won’t delve into how CNNs work for now, but just know that we chose a CNN for this autoencoder implementation because it is better at preserving the spatial information in an image than a simpler, fully connected autoencoder.
An autoencoder includes three main parts: the encoder, the decoder and a distance function which records “the amount of information loss between the compressed representation of your data and the decompressed representation (i.e. a “loss” function)” (Chollet, 2016).
We will want to use the following libraries for our Keras implementation of the autoencoder:
The code below shows the full architecture of our autoencoder using a neural network where we have three layers each for the encoder and the decoder.
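A minimal Keras sketch of such an architecture is given here. It is a reconstruction under assumptions, not the original listing: the filter counts and pooling steps are chosen so that the layer output shapes match the dimensions reported further down, while the 3 x 3 kernels, ReLU activations, sigmoid output and the function name build_autoencoder are our own choices. The loss and optimiser are the ones named in this section.

```python
from tensorflow import keras

layers = keras.layers


def build_autoencoder(input_shape=(32, 32, 3)):
    """Return (autoencoder, encoder) for 32 x 32 colour images."""
    inputs = keras.Input(shape=input_shape)

    # Encoder: each 2 x 2 max pooling halves the height and width.
    x = layers.Conv2D(256, (3, 3), activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D((2, 2), padding="same")(x)         # (16, 16, 256)
    x = layers.Conv2D(128, (3, 3), activation="relu", padding="same")(x)
    x = layers.MaxPooling2D((2, 2), padding="same")(x)         # (8, 8, 128)
    x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
    encoded = layers.MaxPooling2D((2, 2), padding="same")(x)   # (4, 4, 64) bottleneck

    # Decoder: each 2 x 2 upsampling doubles the height and width.
    x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(encoded)
    x = layers.UpSampling2D((2, 2))(x)                         # (8, 8, 64)
    x = layers.Conv2D(128, (3, 3), activation="relu", padding="same")(x)
    x = layers.UpSampling2D((2, 2))(x)                         # (16, 16, 128)
    x = layers.Conv2D(256, (3, 3), activation="relu", padding="same")(x)
    x = layers.UpSampling2D((2, 2))(x)                         # (32, 32, 256)
    # Sigmoid keeps every output pixel between 0 and 1.
    decoded = layers.Conv2D(3, (3, 3), activation="sigmoid", padding="same")(x)

    autoencoder = keras.Model(inputs, decoded)
    encoder = keras.Model(inputs, encoded)  # shares its layers with the autoencoder
    autoencoder.compile(optimizer="adadelta", loss="binary_crossentropy")
    return autoencoder, encoder


autoencoder, encoder = build_autoencoder()
print(encoder.output_shape)      # (None, 4, 4, 64)
print(autoencoder.output_shape)  # (None, 32, 32, 3)
```

Because the encoder shares its layers with the autoencoder, training the autoencoder also trains the encoder, which can then be used on its own to produce the reduced representation.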
As you can see, we use a binary cross-entropy loss function. This type of loss function is used when each output is a value between 0 and 1. In this case, assuming our pixel values are scaled to that range, the loss compares each reconstructed pixel with the corresponding original pixel, so we can expect our cross-entropy loss to increase when our reconstruction diverges from the original image.
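To make the behaviour of this loss concrete, here is a minimal NumPy sketch of binary cross-entropy averaged over all pixels, assuming pixel values scaled to the range 0 to 1. The function and the synthetic images are ours for illustration; Keras uses its own implementation internally.

```python
import numpy as np


def binary_crossentropy(target, prediction, eps=1e-7):
    """Mean per-pixel binary cross-entropy for values in [0, 1]."""
    prediction = np.clip(prediction, eps, 1 - eps)  # avoid log(0)
    per_pixel = -(target * np.log(prediction) + (1 - target) * np.log(1 - prediction))
    return float(np.mean(per_pixel))


rng = np.random.default_rng(0)
original = rng.random((32, 32, 3))                                    # a synthetic image
close = np.clip(original + rng.normal(0, 0.02, original.shape), 0, 1)  # slightly off
far = rng.random((32, 32, 3))                                         # unrelated image

perfect_loss = binary_crossentropy(original, original)
close_loss = binary_crossentropy(original, close)
far_loss = binary_crossentropy(original, far)

# The loss grows as the reconstruction moves away from the original.
print(perfect_loss, close_loss, far_loss)
```

Note that when the targets are not exactly 0 or 1, this loss does not reach zero even for a perfect reconstruction, which is worth bearing in mind when reading the loss values reported later.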
Additionally, we compile our autoencoder with an optimiser (adadelta), which both updates our model parameters in order to minimise our loss between iterations, and determines the rate (the learning rate) at which those parameters are updated.
We used a print function to see how the dimensions of our image data change throughout the autoencoder, with the following results:
(?, 16, 16, 256)
(?, 8, 8, 128)
(?, 4, 4, 64)
(?, 8, 8, 64)
(?, 16, 16, 128)
(?, 32, 32, 256)
(?, 32, 32, 3)
We then fit this autoencoder to our training data. We have set the number of training iterations (epochs, each being one full pass over the training data) to 1000, and the batch size (the number of training examples used per parameter update) to 256.
We can save our model for future use, using the Keras save and load_model functions:
The reason for building the full autoencoder, even though we will only use the encoder to reduce the dimensionality of our images for the classification problem, is that the quality of the reconstructions tells us how successful our autoencoder is at capturing the most important attributes.
To measure whether our autoencoder is improving over the iterations, we can plot how our loss changes over time for the training set and the validation set. The lower the loss, the smaller the difference between our original images and our reconstructed images, and therefore the better our model.
As we can see from the plot above, the loss is reduced to 0.6562 for our training data, and to around 0.6570 for our validation set. We can tell from this plot that our autoencoder is not overfitting too much, as the validation loss (the loss calculated on unseen data) keeps falling during training and stays close to the training loss. It also seems as though the validation loss is still falling at 1000 epochs, although at a much slower rate, so our model could improve with more iterations. However, since it can take many hours, even days, to train this model for thousands of epochs on Google Colab’s free GPU, for the purpose of this tutorial, we will stop at 1000 iterations.
Additionally, we can use the validation set to reconstruct some of the images through our autoencoder and compare them to the original images to manually analyse how well our autoencoder works.
As we can see, the autoencoder is unable to capture all the details of the original images in its reconstruction. However, it does reconstruct the major visual features, such as strongly contrasting shapes in some of the images, and the general colour differences in the images.
The next step would now be to test a classification model on both the encoded data, and on the original dataset, and see whether our autoencoder helps improve the efficiency and/or the accuracy of our classification model.
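As a small sketch of what the two inputs to such a classifier would look like, the snippet below flattens a batch of original images and a batch of encoded images into feature matrices. The arrays are zero-filled stand-ins, and the encoded shape is taken from the (4, 4, 64) bottleneck dimensions printed earlier.

```python
import numpy as np

n_images = 8  # arbitrary batch size for the illustration

# Stand-ins for the original images and for the encoder output.
original = np.zeros((n_images, 32, 32, 3), dtype="float32")
encoded = np.zeros((n_images, 4, 4, 64), dtype="float32")

# Flatten each image into one row of features.
original_features = original.reshape(n_images, -1)
encoded_features = encoded.reshape(n_images, -1)

print(original_features.shape)  # (8, 3072)
print(encoded_features.shape)   # (8, 1024)
```

With this bottleneck, each image is described by a third as many features as in the original data.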
References
- Jordan, J. (2018). Introduction to autoencoders. Retrieved from https://www.jeremyjordan.me/autoencoders/
- Chollet, F. (2016). Building autoencoders in keras. The Keras Blog. Retrieved from https://blog.keras.io/building-autoencoders-in-keras.html
- “Tutorial 1: Image Filtering.” (n.d.). Introduction to Computer Vision. Retrieved from https://ai.stanford.edu/~syyeung/cvweb/tutorial1.html