Auto Encoder with Practical Implementation

Amir Ali
The Art of Data Science
18 min read · May 26, 2019

In this chapter of Deep Learning, we will discuss autoencoders, an unsupervised deep learning technique. We will cover both the theory and a practical implementation from scratch.

This chapter spans 5 parts:

  1. What are Auto-Encoders?
  2. Structure of Auto-Encoders.
  3. How do Auto-Encoders work?
  4. Different Types of Auto-Encoders.
  5. Practical Implementation of Auto-Encoders.

1. What are Auto-Encoders?

An autoencoder is a neural network that tries to reconstruct its input.

So if you feed the autoencoder the vector (1,0,0,0), it will try to output (1,0,0,0). Of course, I still have to explain why this is useful and how it works.

The trick is the hidden layer. Say you have inputs in 4 dimensions, as in our example. If we use 2 neurons in the hidden layer, the autoencoder receives 4 features and “encodes” them into 2 features in such a way that it can reconstruct the same 4-dimensional input.

So we go from (1,0,0,0) to (x,y) and from (x,y) to (1,0,0,0).

We train the autoencoder with many data points, maybe thousands or millions, and it finds the weights that minimize the reconstruction error. These weights are what we use to transform (1,0,0,0) into (x,y) and (x,y) back into (1,0,0,0).

Imagine our data has a vector for each user, where each dimension corresponds to a movie the user likes or not (1 or 0 for each movie). The internal representation will then “compress” the movies each user likes or dislikes into a feature vector whose first dimension might capture, for example, whether a movie is an action movie. In other words, the autoencoder learns the features hidden in the preferences the users gave us.

Let’s see a quick example:

The data is the famous MNIST dataset. Each MNIST image is a 28x28 scan of a handwritten digit, so our “inputs” have 28x28 = 784 dimensions. We train an autoencoder using just 25 hidden neurons:

On the left are some of our original input points; on the right is what the autoencoder can reconstruct from the 25 dimensions in the middle layer. It’s not perfect, but it is pretty awesome, isn’t it?

This shows that we can represent each MNIST digit as a vector in 25 dimensions, and this is where the utility of an autoencoder becomes clear: it is a feature extraction algorithm. It helps us find a representation for our data, and we can feed that representation to other algorithms, for example a classifier.

In some cases, the features generated by the autoencoder represent the data points better than the points themselves, that’s the key!

Autoencoders can be stacked and trained progressively: we train an autoencoder, take the middle layer it generates, and use it as input for another autoencoder, and so on. This is a first step towards deep learning. The stacked autoencoders learn how to represent data: the first level learns a basic representation, the second level combines that representation to create a higher-level one, and so on. Think about images: a first-level autoencoder learns to detect edges as features, the second level combines those edges to learn strokes and patterns, etc.

So to summarize:

- Autoencoders are neural networks trained to reconstruct their original input.

- An Autoencoder is a form of feature extraction algorithm.

- Autoencoders can be stacked.

- The useful output of an autoencoder is the middle layer: the representation of each data point.

- We can use the features generated by an AE in any other algorithm, for example for classification.

An autoencoder has a lot of freedom, and that usually means it can overfit the data because it has too many ways to represent it. To constrain this, we can use sparse autoencoders, where a penalty on non-sparse hidden activations is added to the cost function. In practice, many of the autoencoders people talk about are sparse autoencoders.

Autoencoders can be magical but they need to be fed some hyper-parameters and finding the optimal values for those hyper-parameters can be a time-consuming operation.

2. Structure of Auto-Encoders.

The basic structure of an autoencoder contains three kinds of layers: an input layer, an output layer, and one or more hidden layers connecting them. The output layer has the same number of nodes as the input layer, because the purpose of the network is to reconstruct its own inputs instead of predicting a target value Y given inputs X. When trained, the output should match the input as closely as possible, so if we feed the network a picture of a cat, it should give back (nearly) the same picture. Because no labels are needed, autoencoders are unsupervised learning models. An autoencoder learns to compress data from the input layer into a short code that sits between the input and output layers, and then to uncompress that code into something that closely matches the original data. This forces the autoencoder to engage in dimensionality reduction.

An autoencoder always consists of two parts: the encoder and the decoder. The first half, which converts the information into the narrow region, is called the encoding portion; the second half, which converts it back into the original information, is called the decoding portion.

3. How do Auto-Encoders work?

We take the input and encode it to obtain a latent feature representation. We then decode the latent feature representation to recreate the input. We calculate the loss by comparing the input and the output. To reduce the reconstruction error, we backpropagate and update the weights. Each weight is updated based on how much it is responsible for the error.

Let’s break it down step by step.

In our example, we use a dataset of products bought by customers.

Step 1: Take the first row from the customer data for all products bought in an array as the input. 1 represents that the customer bought the product. 0 represents that the customer did not buy the product.

Step 2: Encode the input into another vector h. h is a lower dimension vector than the input. We can use the sigmoid activation function for h as it ranges from 0 to 1. W is the weight applied to the input and b is the bias term.

h=f(Wx+b)

Step 3: Decode the vector h to recreate the input. The output will have the same dimension as the input.

Step 4: Calculate the reconstruction error L. The reconstruction error measures the difference between the input and output vectors. Our goal is to minimize it so that the output is similar to the input vector.

Reconstruction error = input vector - output vector

The loss function is typically a scalar built from this difference, for example the squared error L = ||x - x'||^2, where x is the input and x' is the reconstructed output.

Step 5: Backpropagate the error from the output layer to the input layer to update the weights. Weights are updated based on how much they were responsible for the error.

The learning rate decides by how much we update the weights.

Step 6: Repeat steps 1 through 5 for each observation in the dataset. Weights are updated after each observation (stochastic gradient descent).

Step 7: Repeat for more epochs. An epoch is one pass of all the rows in the dataset through the neural network.
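The seven steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used later in this chapter: the tiny one-hot dataset, the 4-2-4 layer sizes, the learning rate, the number of epochs, and the squared-error loss are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 4 customers x 4 products (1 = bought, 0 = not bought).
X = np.eye(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 4 inputs -> 2 hidden units -> 4 outputs (sizes are assumptions).
n_in, n_hidden = 4, 2
W1 = rng.normal(0.0, 0.5, (n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, (n_in, n_hidden))
b2 = np.zeros(n_in)
lr = 0.5  # learning rate: how much we update the weights

def reconstruct(x):
    h = sigmoid(W1 @ x + b1)      # Step 2: encode, h = f(Wx + b)
    x_hat = sigmoid(W2 @ h + b2)  # Step 3: decode back to 4 dimensions
    return h, x_hat

def mean_loss():
    # Step 4: squared reconstruction error, averaged over all rows.
    return float(np.mean([np.sum((x - reconstruct(x)[1]) ** 2) for x in X]))

loss_before = mean_loss()
for epoch in range(2000):         # Step 7: repeat for more epochs
    for x in X:                   # Steps 1 and 6: one observation at a time
        h, x_hat = reconstruct(x)
        # Step 5: backpropagate the error through both sigmoid layers.
        delta_out = 2.0 * (x_hat - x) * x_hat * (1.0 - x_hat)
        delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
        W2 -= lr * np.outer(delta_out, h)
        b2 -= lr * delta_out
        W1 -= lr * np.outer(delta_hid, x)
        b1 -= lr * delta_hid
loss_after = mean_loss()

print(f"mean reconstruction error: {loss_before:.3f} -> {loss_after:.3f}")
```

The weights are updated after every single row, which is what makes this stochastic gradient descent rather than batch gradient descent.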

3.1 Let’s Solve an Example

As we already know, images contain lots of raw information (pixels), but also easily identifiable features that represent the image better. So let’s work through an example that extracts useful features which actually represent the image. In the encoding portion, we add a convolutional layer (which we already discussed in chapter 3) to the autoencoder and then pool the identifiable feature patterns from the image pixels. In the latter part, called the decoding portion, we need a deconvolutional layer and then apply unpooling to reconstruct the image. This process is called upsampling.

Unfortunately, exact unpooling is impossible on its own, because the pooling process throws away the information about the original location of each pixel. In the decoding portion, if we are unpooling a 1x1 region to a 2x2 region (reversing 2x2 pooling), we first have to move the 1x1 region to the right location in the 2x2 box. Simply put, we have to store the pixel back in the location it was pulled out of during the pooling process. The problem of remembering the original location of each pixel is overcome by using switches, which store the original location from the pooling. Each pooled region keeps not only the value but also the location it occupied before the pool, which is then used in the unpooling. The remaining pixels are filled with 0s.
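To make the idea of switches concrete, here is a small NumPy sketch of 2x2 max pooling that records where each maximum came from, followed by unpooling that puts the values back and fills the rest with zeros. The function names and the 4x4 values are made up for illustration; they are not the numbers from the figures in this chapter.

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Non-overlapping max pooling (stride == size) that also records switches."""
    rows, cols = x.shape[0] // size, x.shape[1] // size
    pooled = np.zeros((rows, cols), dtype=x.dtype)
    switches = np.zeros((rows, cols, 2), dtype=int)  # (row, col) of each maximum
    for i in range(rows):
        for j in range(cols):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[i, j] = window[r, c]
            switches[i, j] = (i * size + r, j * size + c)
    return pooled, switches

def unpool(pooled, switches, out_shape):
    """Place each pooled value back at its recorded location; fill the rest with 0."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out

# Hypothetical 4x4 feature map.
feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 5],
                        [0, 1, 7, 2],
                        [6, 2, 3, 1]])

pooled, switches = max_pool_with_switches(feature_map)
restored = unpool(pooled, switches, feature_map.shape)

print(pooled)    # the 2x2 compressed matrix
print(restored)  # 4x4 matrix with the maxima back in place, zeros elsewhere
```

Note that unpooling alone does not recover the values that were discarded by the pooling; it only restores the maxima to their original positions.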

3.1.1 Encoding Portion

Suppose we have a convolutional layer of an image in the form of a 4x4 pixel matrix, with some simple patterns highlighted with colors in the image given below. Each box represents a pixel and its value.

Now we apply the pooling process to the convolutional layer to convert the image into a compressed form with useful extracted features. We apply a 2x2 pooling window with a stride of 2 to the convolutional layer to compress the image.

We keep the important high-level feature values, each representing a specific pattern in the convolutional layer of the image, and so the pooling process compresses the convolutional layer from a 4x4 matrix to a 2x2 matrix.

Note: The convolutional step is discussed in detail in chapter 3.

Our filter, also called a kernel, is:

3.1.2 Decoding Portion

In the decoding portion, we will unpool the 2x2 compressed matrix extracted from the convolutional layer to reconstruct the image. To do this, take the kernel and multiply it by each pixel value of the 2x2 compressed matrix that we are deconvolving, placing each result at the corresponding location. Where there are overlaps, we sum the values.

Step 1:

After multiplication:

We use the switches to remember the original location of each pixel; the remaining positions are filled with 0.

Our first unpooled matrix:

Step 2:

Our second unpooled matrix:

Step 3:

Our third unpooled matrix:

Step 4:

Our fourth unpooled matrix:

Because of the overlaps, we sum all the unpooled matrices to reconstruct the final image.

So our final output matrix is exactly the same as the input matrix.

This is how encoding and decoding work.
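The "multiply the kernel by each value and sum the overlaps" operation described in the decoding portion is usually called a transposed convolution. Below is a rough NumPy sketch of it. The kernel, the compressed matrix, and the stride of 1 are hypothetical values picked so the overlap is easy to follow by hand; they are not the ones shown in the figures, and the output here is 3x3 rather than 4x4.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=1):
    """Scale the kernel by each input value, place it at the matching
    location in the output, and sum wherever the placements overlap."""
    k_h, k_w = kernel.shape
    out_h = (x.shape[0] - 1) * stride + k_h
    out_w = (x.shape[1] - 1) * stride + k_w
    out = np.zeros((out_h, out_w))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            top, left = i * stride, j * stride
            out[top:top + k_h, left:left + k_w] += x[i, j] * kernel
    return out

# Hypothetical 2x2 compressed matrix and 2x2 kernel.
compressed = np.array([[1.0, 2.0],
                       [3.0, 4.0]])
kernel = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

upsampled = transposed_conv2d(compressed, kernel, stride=1)
print(upsampled)
```

The centre element of the result receives contributions from two placements (1 from the first and 4 from the fourth), which is the overlap-and-sum behaviour described in the steps above.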

4. Different Types of Auto-Encoders.

Here we discuss 5 different types of Auto Encoders.

4.1 Sparse Autoencoders

· Sparse autoencoders can have more hidden nodes than input nodes. They can still discover important features from the data.

· A sparsity constraint is introduced on the hidden layer. This prevents the output layer from simply copying the input data.

· Sparse autoencoders have a sparsity penalty, Ω(h), which pushes the hidden activations towards values close to zero but not exactly zero. The sparsity penalty is applied to the hidden layer in addition to the reconstruction error. This helps prevent overfitting.

· One variant of sparse autoencoder keeps the highest activation values in the hidden layer and zeroes out the rest of the hidden nodes. This prevents the autoencoder from using all of the hidden nodes at once and forces it to use only a reduced number of hidden nodes.

· Because different hidden nodes are activated and deactivated for each row in the dataset, each hidden node learns to extract a feature from the data.
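The two sparsity mechanisms in the list above can be sketched as follows. This is an illustrative NumPy snippet, not code from this chapter: the helper names k_sparse and l1_penalty, the choice of an L1 penalty as Ω(h), and the activation values are all assumptions.

```python
import numpy as np

def k_sparse(h, k):
    """Keep the k largest activations in each row of h and zero out the rest."""
    out = np.zeros_like(h)
    top = np.argsort(h, axis=1)[:, -k:]    # column indices of the k largest values
    rows = np.arange(h.shape[0])[:, None]
    out[rows, top] = h[rows, top]
    return out

def l1_penalty(h, weight=1e-3):
    """One common choice of sparsity penalty: weight * sum of |h|.
    It is added to the reconstruction error in the loss."""
    return weight * np.abs(h).sum()

# Hypothetical hidden activations for two rows of data (4 hidden nodes).
h = np.array([[0.1, 0.9, 0.3, 0.7],
              [0.8, 0.2, 0.6, 0.4]])

sparse_h = k_sparse(h, k=2)
print(sparse_h)       # only the two strongest nodes per row stay active
print(l1_penalty(h))  # value that would be added to the loss
```

Different rows keep different hidden nodes active, which is why each node ends up specializing in a particular feature.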

4.2 Denoising Autoencoders

· A denoising autoencoder intentionally adds noise to the raw input before providing it to the network. The noise can be added using a stochastic mapping.

· Denoising autoencoders create a corrupted copy of the input by introducing some noise. This keeps the autoencoder from copying the input to the output without learning features of the data.

· The input can be corrupted randomly by setting some of the input values to zero. The remaining values are copied unchanged into the noised input.

· Denoising autoencoders must remove the corruption to generate an output that is similar to the original input. The output is compared with the original input and not with the noised input. We keep minimizing the loss function until convergence.

· In other words, denoising autoencoders minimize the loss function between the output and the original, uncorrupted input.

· Denoising helps the autoencoder learn the latent representation present in the data. For a denoising autoencoder, a good representation is one that can be derived robustly from a corrupted input and that is useful for recovering the corresponding clean input.

· A denoising autoencoder is a stochastic autoencoder, as we use a stochastic corruption process to set some of the inputs to zero.
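As a rough illustration of the corruption step, the snippet below zeroes out a random subset of the inputs (often called masking noise) and defines a loss that compares the reconstruction with the clean input. The function names, the drop probability of 0.3, and the sample data are assumptions made for this sketch.

```python
import numpy as np

def corrupt(x, drop_prob, rng):
    """Masking noise: each input value is set to zero with probability drop_prob;
    the remaining values are copied unchanged."""
    keep_mask = rng.random(x.shape) >= drop_prob
    return x * keep_mask

def denoising_loss(x_clean, x_reconstructed):
    """The reconstruction is compared with the CLEAN input, not the noised one."""
    return float(np.mean((x_clean - x_reconstructed) ** 2))

rng = np.random.default_rng(42)

# Hypothetical clean inputs (2 rows, 4 features).
x = np.array([[1.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])

x_noisy = corrupt(x, drop_prob=0.3, rng=rng)
print(x_noisy)

# During training the network would see x_noisy as input,
# while the loss would be denoising_loss(x, network_output).
```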

4.3 Contractive Auto Encoders

· The objective of a contractive autoencoder (CAE) is to learn a robust representation that is less sensitive to small variations in the data.

· Robustness of the representation is achieved by adding a penalty term to the loss function. The penalty term is the squared Frobenius norm of the Jacobian matrix of the hidden layer, calculated with respect to the input. The squared Frobenius norm of a matrix is the sum of the squares of all its elements.

Loss function with penalty term (squared Frobenius norm of the Jacobian matrix): L = ||x - x'||^2 + λ ||J_h(x)||_F^2, where x' is the reconstruction, J_h(x) is the Jacobian of the hidden layer h with respect to the input x, and λ controls the strength of the penalty.

· A contractive autoencoder is another regularization technique, like sparse autoencoders and denoising autoencoders.

· CAEs have been reported to equal or surpass the results obtained by regularizing autoencoders with weight decay or by denoising, which can make a CAE a better choice than a denoising autoencoder for learning useful features.

· The penalty term produces a mapping that strongly contracts the data, hence the name contractive autoencoder.
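For a sigmoid hidden layer h = f(Wx + b), the Jacobian of h with respect to x has entries h_j (1 - h_j) W_ji, so the squared Frobenius norm can be computed in closed form. The NumPy sketch below assumes a sigmoid encoder with made-up sizes and random weights, and checks the closed form against a finite-difference Jacobian; it illustrates the penalty term only, not a full contractive autoencoder.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(W, b, x):
    """Squared Frobenius norm of dh/dx for h = sigmoid(W x + b)."""
    h = sigmoid(W @ x + b)
    dh = h * (1.0 - h)  # derivative of the sigmoid at each hidden node
    return float(np.sum(dh ** 2 * np.sum(W ** 2, axis=1)))

def numeric_jacobian(W, b, x, eps=1e-6):
    """Finite-difference Jacobian, used here only to check the closed form."""
    J = np.zeros((W.shape[0], x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        J[:, i] = (sigmoid(W @ (x + step) + b) - sigmoid(W @ (x - step) + b)) / (2 * eps)
    return J

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))  # hypothetical encoder: 4 inputs -> 3 hidden nodes
b = rng.normal(size=3)
x = rng.random(4)

penalty = contractive_penalty(W, b, x)
penalty_check = float(np.sum(numeric_jacobian(W, b, x) ** 2))
print(penalty, penalty_check)  # the two values should agree closely
```

In training, this penalty would be multiplied by λ and added to the reconstruction error.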

4.4 Stacked Auto Encoders

· A stacked autoencoder is a neural network with multiple layers of sparse autoencoders.

· Adding more than one hidden layer to an autoencoder helps reduce high-dimensional data to a smaller code representing the important features.

· Each hidden layer is a more compact representation than the previous hidden layer.

· We can also denoise the input and then pass the data through the stacked autoencoders; these are called stacked denoising autoencoders.

· In stacked denoising autoencoders, input corruption is used only for the initial denoising. This helps learn the important features present in the data. Once the mapping function f(θ) has been learned, further layers use the uncorrupted input from the previous layers.

· After training a stack of encoders as explained above, we can use the output of the stacked denoising autoencoders as input to a stand-alone supervised machine learning algorithm such as a support vector machine or multi-class logistic regression.

4.5 Deep Auto Encoders

· Deep autoencoders consist of two symmetrical deep belief networks: one network for encoding and another for decoding.

· Typically, deep autoencoders have 4 to 5 layers for encoding and the next 4 to 5 layers for decoding. We use unsupervised layer-by-layer pre-training.

· The Restricted Boltzmann Machine (RBM) is the basic building block of the deep belief network. We will cover RBMs in a different post.

· In the above figure, we take an image with 784 pixels, train using a stack of 4 RBMs, unroll them, and then fine-tune with backpropagation.

· The final encoding layer is compact and fast.

Note: If you want this article, check out my academia.edu profile.

5. Practical Implementation of Auto-Encoders.

Recommender System

From Amazon product suggestions to Netflix movie recommendations, good recommender systems are very valuable in today’s world. And the specialists who can create them are some of the top-paid data scientists on the planet.

We will work on a dataset that has exactly the same features as the Netflix dataset: plenty of movies, thousands of users, who have rated the movies they watched. The ratings go from 1 to 5, exactly like in the Netflix dataset, which makes the Recommender System more complex to build than if the ratings were simply “Liked” or “Not Liked”.

Your final Recommender System will be able to predict the ratings of the movies the customers didn’t watch. Accordingly, by ranking the predictions from 5 down to 1, your Deep Learning model will be able to recommend which movies each user should watch.

Our model will be a powerful autoencoder (in the previous chapter we applied the RBM model). You will even be able to apply it to yourself or your friends. The list of movies will be explicit, so you simply need to rate the movies you have already watched, input your ratings into the dataset, execute your model, and voilà! The Recommender System will tell you exactly which movies you would love on a night when you are out of ideas of what to watch on Netflix!

Let’s solve the problem

Part 1: Data Preprocessing

In this part, we are doing Data Preprocessing.

1.1 Import the Libraries

In this step, we import the libraries needed for the data preprocessing part. A library is a tool that you can use to do a specific job. First we import numpy, used for multidimensional arrays, then pandas, used to import the dataset. Then we import torch, the PyTorch library, along with several of its packages: torch.nn as nn for building the neural network, torch.nn.parallel for parallel computations, torch.optim as optim for the optimizer, torch.utils.data for data loading and processing, and torch.autograd for automatic differentiation.

1.2 Import the dataset

In the next step, we import the users, ratings, and movies datasets. In our case, the fields are separated by double colons. The dataset does not have any headers, so we pass header as None. We then set the engine to Python to ensure the dataset is imported correctly.

We use the Latin-1 encoding since some of the movies have special characters in their titles. The first column of the ratings dataset is the user ID, the second column is the movie ID, the third column is the rating, and the fourth column is the timestamp.
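The import step described above might look like the sketch below. Because the real dataset files are not part of this text, the example writes two made-up movie rows to a temporary file first; the file name movies.dat and the row contents are assumptions for illustration only. The read_csv arguments are the ones described in the text.

```python
import os
import tempfile

import pandas as pd

# Two made-up rows in the "id::title::genres" layout, one with a special character.
sample_rows = "1::Toy Story (1995)::Animation|Comedy\n2::Misérables, Les (1995)::Drama\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "movies.dat")  # hypothetical file name
    with open(path, "w", encoding="latin-1") as f:
        f.write(sample_rows)

    movies = pd.read_csv(
        path,
        sep="::",             # fields are separated by double colons
        header=None,          # the file has no header row
        engine="python",      # the Python engine supports multi-character separators
        encoding="latin-1",   # titles may contain special characters
    )

print(movies)
```

The users and ratings files would be read the same way, changing only the path.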

1.3 Preparing the training set and test set

Let’s now prepare our training set and test set. Our training and test sets are tab-separated, so we pass the delimiter argument as \t. pandas imports the data as a data frame, but we need to convert it to an array so we can use it with PyTorch tensors. We do that using the np.array function from NumPy. We also specify that the array should hold integers, since we’re dealing with integer data.
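Here is a small sketch of that step. An in-memory buffer with three made-up rows stands in for the real training file, and header=None is an assumption on my part (the text above does not say whether the files have a header row).

```python
import io

import numpy as np
import pandas as pd

# Made-up tab-separated rows: user ID, movie ID, rating, timestamp.
sample_training = io.StringIO(
    "1\t1\t5\t874965758\n"
    "1\t2\t3\t876893171\n"
    "2\t1\t4\t888550871\n"
)

training_df = pd.read_csv(sample_training, delimiter="\t", header=None)
training_set = np.array(training_df, dtype="int")  # data frame -> integer array

print(training_set)
```

The test set would be converted in exactly the same way.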

1.4 Getting the Number of Users and Movies

In order to build the autoencoder, we need a matrix with the users’ ratings. This matrix will have the users as the rows and the movies as the columns. Each entry will contain a user’s rating of a specific movie. Zeros will represent cases where a user didn’t rate a specific movie.

In order to create this matrix, we need to obtain the number of movies and the number of users in our dataset. For no_users we use column zero, since it’s the index of the user ID column. We obtain the number of users by getting the maximum user ID in the training set and in the test set, and then using the max utility to get the larger of the two. We then force the result to be an integer by wrapping the entire expression in int(). The number of movies is obtained the same way from column one, the movie ID column.
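A minimal sketch of this computation, using tiny made-up arrays in place of the real training and test sets (the name no_movies is my own, chosen to match no_users):

```python
import numpy as np

# Made-up rows: user ID, movie ID, rating, timestamp.
training_set = np.array([[1, 1, 5, 0],
                         [1, 3, 3, 0],
                         [2, 2, 4, 0]])
test_set = np.array([[3, 4, 2, 0]])

# Column 0 holds the user IDs, column 1 holds the movie IDs.
no_users = int(max(max(training_set[:, 0]), max(test_set[:, 0])))
no_movies = int(max(max(training_set[:, 1]), max(test_set[:, 1])))

print(no_users, no_movies)
```

Taking the maximum over both sets matters because the highest user ID or movie ID may appear only in the test set, as it does in this toy example.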

1.5 Converting the data into an array with users in lines and movies in columns

Next, we create a function that will build the matrix. The reason for doing this is to set up the dataset in the form the autoencoder expects as input. We create a function called convert, which takes our data as input and converts it into the matrix.

First, we create an empty list called new_data. We then create a for loop that goes through the dataset and fetches all the movies rated by a specific user, together with the ratings by that same user. Notice that we loop up to no_users + 1 to include the last user ID, since the range function doesn’t include the upper bound.

Since there are movies that the user didn’t rate, we first create an array of zeros. We then replace the zeros with the user’s ratings. When filling in the movie ratings, we use id_movies - 1 because indices in Python start from zero, so subtracting one maps the first movie ID to index zero. We append the ratings to new_data as a list, which creates a list of lists. Now let’s use our function to convert our training and test data into matrices.
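The convert function described above could be written roughly as follows. In this sketch no_users and no_movies are passed in as arguments to keep the example self-contained (that is my choice, not necessarily how the original code does it), and the input rows are made up.

```python
import numpy as np

def convert(data, no_users, no_movies):
    """Turn (user ID, movie ID, rating, ...) rows into a users x movies list of lists."""
    new_data = []
    for id_user in range(1, no_users + 1):       # + 1 so the last user is included
        user_rows = data[:, 0] == id_user
        id_movies = data[:, 1][user_rows]         # movies rated by this user
        id_ratings = data[:, 2][user_rows]        # the corresponding ratings
        ratings = np.zeros(no_movies)             # 0 means "not rated"
        ratings[id_movies - 1] = id_ratings       # movie ID 1 goes to index 0
        new_data.append(list(ratings))
    return new_data

# Made-up rows: user ID, movie ID, rating, timestamp.
training_set = np.array([[1, 1, 5, 0],
                         [1, 3, 3, 0],
                         [2, 2, 4, 0]])

ratings_matrix = convert(training_set, no_users=3, no_movies=4)
for row in ratings_matrix:
    print(row)
```

User 3 has no rows in this toy training set, so that user's line stays all zeros.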

1.6 Converting the data into Torch tensors

Since we’re using PyTorch, we need to convert the data into Torch tensors. We do this using the FloatTensor utility, which converts the list of lists into PyTorch tensors of floats.

Part 2: Building our Model

In this second part, we will build our model, which is an autoencoder.

2.1 Creating the Autoencoder Architecture

Now we need to create a class to define the architecture of the autoencoder. Inside the class, we define two functions. In the first function, we create the basic architecture of the autoencoder: fc1 and fc2 do the encoding, and fc3 and fc4 do the decoding. In the second function, we apply the activation function to the first three layers; the last layer, fc4, produces the output.
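A class along these lines might look like the sketch below. The layer names fc1 to fc4 come from the description above; the class name, the hidden sizes (20 and 10), and the choice of sigmoid as the activation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, no_movies, hidden_1=20, hidden_2=10):
        super().__init__()
        # Encoding: compress the ratings vector step by step.
        self.fc1 = nn.Linear(no_movies, hidden_1)
        self.fc2 = nn.Linear(hidden_1, hidden_2)
        # Decoding: expand back to one value per movie.
        self.fc3 = nn.Linear(hidden_2, hidden_1)
        self.fc4 = nn.Linear(hidden_1, no_movies)
        self.activation = nn.Sigmoid()  # assumed activation function

    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        return self.fc4(x)  # no activation on the output layer

# Usage with a hypothetical catalogue of 8 movies and a batch of 3 users.
model = AutoEncoder(no_movies=8)
predictions = model(torch.zeros(3, 8))
print(predictions.shape)
```

Leaving the last layer without an activation lets the output take any value, which is convenient when the targets are ratings from 1 to 5 rather than values between 0 and 1.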

Part 3: Training the Autoencoder Model

The first step in training the AE is to define the number of epochs. We then define a loop that passes the whole training set through the network. In each epoch, the weights are adjusted in order to improve the predictions. Finally, we obtain the output nodes, which include predicted ratings for the movies that were not rated by the users.
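The training loop could be sketched as below. Several things here are assumptions rather than details given in the text: the toy ratings matrix, the small nn.Sequential stand-in for the model, the mean squared error loss computed only on the movies a user actually rated, the RMSprop optimizer, and the learning rate and epoch count.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy data: 4 users x 5 movies, 0 means "not rated".
training_set = torch.FloatTensor([[5, 0, 3, 0, 1],
                                  [4, 0, 0, 2, 1],
                                  [0, 0, 0, 0, 0],
                                  [1, 5, 0, 4, 0]])
no_users, no_movies = training_set.shape

# Small stand-in model (an assumption, kept short for the example).
model = nn.Sequential(nn.Linear(no_movies, 3), nn.Sigmoid(), nn.Linear(3, no_movies))
criterion = nn.MSELoss()
optimizer = optim.RMSprop(model.parameters(), lr=0.01)

no_epochs = 200
epoch_losses = []
for epoch in range(no_epochs):
    total, counted = 0.0, 0
    for id_user in range(no_users):
        x = training_set[id_user].unsqueeze(0)  # add a batch dimension
        target = x.clone()
        rated = target > 0
        if rated.sum() == 0:                     # skip users with no ratings
            continue
        output = model(x)
        # Only the movies the user rated contribute to the loss.
        loss = criterion(output[rated], target[rated])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
        counted += 1
    epoch_losses.append((total / counted) ** 0.5)  # root of the mean loss

print(f"first epoch: {epoch_losses[0]:.3f}, last epoch: {epoch_losses[-1]:.3f}")
```

After training, model(x) returns a value for every movie, including the ones the user never rated, and those values are the predicted ratings.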

Part 4: Testing the AE Model

Next, we test our model. In this stage, we feed the training set data through the network to activate the hidden neurons and obtain the output. This gives us the predicted ratings, which we compare with the ratings in the test set. We then use the mean absolute difference to compute the test loss.
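A test loop in this spirit is sketched below. The function name is hypothetical, and nn.Identity is used as a stand-in for a trained model purely so that the arithmetic can be checked by hand; a real run would pass in the trained autoencoder instead.

```python
import torch
import torch.nn as nn

def mean_absolute_loss(model, training_set, test_set):
    """Feed each user's training ratings through the model and compare the
    predictions with that user's test ratings (only where a test rating exists)."""
    total, counted = 0.0, 0
    model.eval()
    with torch.no_grad():
        for id_user in range(training_set.shape[0]):
            x = training_set[id_user].unsqueeze(0)
            target = test_set[id_user].unsqueeze(0)
            rated = target > 0
            if rated.sum() == 0:  # this user has no ratings in the test set
                continue
            output = model(x)
            total += torch.mean(torch.abs(output[rated] - target[rated])).item()
            counted += 1
    return total / counted if counted else float("nan")

# Hypothetical toy data: 2 users x 3 movies, 0 means "not rated".
training_set = torch.FloatTensor([[5, 0, 3],
                                  [0, 4, 0]])
test_set = torch.FloatTensor([[0, 4, 0],
                              [0, 0, 0]])

stand_in_model = nn.Identity()  # stand-in only; it just echoes its input
loss = mean_absolute_loss(stand_in_model, training_set, test_set)
print(loss)
```

With the stand-in model, user 1 gets a "prediction" of 0 for the movie rated 4 in the test set, so the loss is 4; user 2 has no test ratings and is skipped.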

If you want the dataset and code, you can also check my GitHub profile.

End Notes

If you liked this article, be sure to click ❤ below to recommend it and if you have any questions, leave a comment and I will do my best to answer.

To stay up to date with the world of machine learning, follow me. It’s the best way to find out when I write more articles like this.

You can also follow me on GitHub for the code and dataset, on Academia.edu for this article, and on Twitter, email me directly, or find me on LinkedIn. I’d love to hear from you.

That’s all folks, Have a nice day :)
