What Happens in a Sparse Autoencoder

How L1 regularization affects autoencoders

Syoya Zhou
5 min read · Dec 4, 2018

What is an Autoencoder?

Autoencoders are an important family of unsupervised learning models in deep learning. They aim to learn compressed representations that preserve the information essential for reconstructing the input data, and they are often used for dimensionality reduction or feature learning.

The general architecture of autoencoders

A basic autoencoder can simply be regarded as the kind of neural network we learned in our first lesson on deep learning, and of course it is optimized using gradient-descent-based methods.

The difference between a basic autoencoder and an ordinary neural network is that an autoencoder is composed of two symmetric parts, an encoder and a decoder, with the dimension of the compressed representation smaller than the dimension of the original input. For example, the input of the MNIST dataset can have shape (SAMPLE_NUM, 784) while the compressed representation has shape (SAMPLE_NUM, 64). This bottleneck ensures that we are learning a compressed representation that approximates the input data, not an identity function that merely memorizes the original input (figure below).
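A minimal sketch of that shape contract in NumPy (the 64-unit code size matches the MNIST example above; the random weights, batch size, and sigmoid nonlinearity are illustrative assumptions, and the model is untrained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the MNIST example above; the batch size is arbitrary.
SAMPLE_NUM, INPUT_DIM, CODE_DIM = 16, 784, 64

# Randomly initialized weights -- an untrained sketch of the two halves.
W_enc = rng.normal(0.0, 0.01, size=(INPUT_DIM, CODE_DIM))
W_dec = rng.normal(0.0, 0.01, size=(CODE_DIM, INPUT_DIM))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.random((SAMPLE_NUM, INPUT_DIM))  # stand-in for a batch of MNIST images
code = sigmoid(x @ W_enc)                # encoder: (16, 784) -> (16, 64)
x_hat = sigmoid(code @ W_dec)            # decoder: (16, 64) -> (16, 784)

print(code.shape, x_hat.shape)
```

The bottleneck is visible in the shapes: the code is 64-dimensional, so the decoder must reconstruct 784 pixels from far less information than it was given.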

How does it work?

A reliable autoencoder must trade off two competing goals:

  • Being sensitive enough to the inputs to reconstruct them accurately
  • Generalizing well when evaluated on unseen data

As a result, the loss function of an autoencoder is composed of two parts: the first is a reconstruction term (e.g. mean squared error) measuring the difference between the input and the output, while the second is a regularization term that prevents the autoencoder from overfitting.
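As a sketch, the two parts might be combined like this (the choice of an L2 weight penalty and the value of `lam` are illustrative assumptions, not prescribed by the article):

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    # First part: mean squared error between input and reconstruction.
    return np.mean((x - x_hat) ** 2)

def regularized_loss(x, x_hat, weights, lam=1e-4):
    # Second part: a generic penalty on the weights (L2 here, purely as
    # an illustration); lam balances reconstruction vs. regularization.
    penalty = sum(np.sum(w ** 2) for w in weights)
    return reconstruction_loss(x, x_hat) + lam * penalty

x = np.array([[1.0, 0.0], [0.0, 1.0]])
x_hat = np.array([[0.9, 0.1], [0.2, 0.8]])
w = [np.array([[0.5, -0.5]])]
print(regularized_loss(x, x_hat, w, lam=0.1))
```

Shrinking `lam` toward zero recovers the plain reconstruction objective; increasing it trades reconstruction accuracy for smaller weights.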

Loss function of autoencoder

SGD and other gradient-descent-based optimization methods can be used to minimize this loss function.

Sparse Autoencoder

A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty. In most cases, we construct the loss function by penalizing the activations of the hidden layers, so that only a few nodes are encouraged to activate when a single sample is fed into the network.
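A minimal sketch of such an activity penalty (the hyperparameter name `lam` and the example activation vectors are my own, not from the article):

```python
import numpy as np

def l1_activity_penalty(activations, lam=1e-3):
    # Sum of absolute hidden activations per sample, averaged over the
    # batch and scaled by the sparsity hyperparameter lam.
    return lam * np.mean(np.sum(np.abs(activations), axis=1))

dense = np.array([[0.5, 0.5, 0.5, 0.5]])    # many weakly active units
sparse = np.array([[0.9, 0.0, 0.0, 0.0]])   # one strongly active unit
print(l1_activity_penalty(dense), l1_activity_penalty(sparse))
```

The sparse code pays a smaller penalty than the dense one, so minimizing the total loss pushes the network toward codes with few active units.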

The intuition behind this method is that if a man claims to be an expert in mathematics, computer science, psychology, and classical music, he might have only shallow knowledge of each subject. However, if he claims to be devoted solely to mathematics, we would anticipate some genuinely useful insights from him. The same holds for the autoencoders we train: if fewer nodes activate while performance is maintained, we have better assurance that the autoencoder is learning latent representations rather than redundant information in the input data.

Sparse Autoencoder

There are two common ways to construct the sparsity penalty: L1 regularization and KL-divergence. Here we will only talk about L1 regularization.

Why Does L1 Regularization Lead to Sparsity?

L1 regularization and L2 regularization are widely used in machine learning and deep learning. L1 regularization adds the “absolute value of magnitude” of the coefficients as a penalty term, while L2 regularization adds their “squared magnitude” as a penalty term.

Although L1 and L2 can both be used as regularization terms, the key difference between them is that L1 regularization tends to shrink coefficients exactly to zero, while L2 regularization moves coefficients towards zero but they never reach it. Thus L1 regularization is often used as a method of feature selection. But why does L1 regularization lead to sparsity?

Consider the two penalty terms L1(w) = |w| and L2(w) = w², representing L1 regularization and L2 regularization respectively.

Gradient descent is typically used to optimize neural networks. If we plot these two penalties and their derivatives, they look like this:

L1 regularization and its derivative
L2 regularization and its derivative

Notice that for L1 regularization the gradient is either 1 or -1, except at w = 0. This means that L1 regularization always moves w towards zero with the same step size, regardless of the value of w; and once w = 0, the gradient becomes zero and no further update is made. For L2 regularization, things are different: it also moves w towards zero, but the step size becomes smaller and smaller as w shrinks, which means that w never actually reaches zero.
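The argument above can be checked numerically. This is a sketch under my own assumptions (step size, starting point, and the `sign(0) = 0` convention), running plain gradient descent on |w| and on w²:

```python
import numpy as np

def descend(grad, w0=1.0, lr=0.1, steps=200):
    # Plain gradient descent on a single scalar parameter.
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# d|w|/dw = sign(w): constant-magnitude steps toward zero.
# d(w^2)/dw = 2w: steps shrink proportionally as w shrinks.
w_l1 = descend(lambda w: np.sign(w))
w_l2 = descend(lambda w: 2.0 * w)

# Naive sign-gradient steps overshoot and oscillate within one step of
# zero (practical solvers use soft-thresholding to land exactly at 0),
# while the L2 iterate decays geometrically and never reaches zero.
print(abs(w_l1), w_l2)
```

After 200 steps the L1 iterate sits within one step size of zero, while the L2 iterate is tiny (about 0.8²⁰⁰ of its starting value) but still strictly positive, matching the intuition above.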

This is the intuition behind L1 regularization’s sparsity. More mathematical details can be found here.

Loss Function

Finally, after the above analysis, we arrive at the idea of using L1 regularization in a sparse autoencoder, with the loss function below:

In addition to the first two terms, we add a third term that penalizes the absolute value of the vector of activations a in layer h for sample i, and we use a hyperparameter to control its effect on the whole loss function. In this way, we build a sparse autoencoder.
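A sketch of this combined loss in code (the term weights `lam_w` and `lam_a` and the toy inputs are assumptions; the article's exact equation may group or scale the terms differently):

```python
import numpy as np

def sparse_ae_loss(x, x_hat, hidden, weights, lam_w=1e-4, lam_a=1e-3):
    recon = np.mean((x - x_hat) ** 2)                  # first term: MSE
    weight_reg = sum(np.sum(w ** 2) for w in weights)  # second term: weight penalty
    l1_act = np.mean(np.sum(np.abs(hidden), axis=1))   # third term: |a| of layer h
    return recon + lam_w * weight_reg + lam_a * l1_act

x = np.array([[1.0, 0.0]])
x_hat = np.array([[0.5, 0.5]])
hidden = np.array([[1.0, -1.0, 0.0]])
weights = [np.ones((2, 2))]
loss = sparse_ae_loss(x, x_hat, hidden, weights)
print(loss)
```

Note that the third term penalizes activations, not weights: it is evaluated on the hidden code produced for each sample, which is what pushes individual units toward zero output.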

Visualization

Wait, but how does it behave? To test its performance, I built a deep autoencoder and trained it on the MNIST dataset both with and without L1 regularization. The structure of this deep autoencoder is shown below:

The structure of autoencoder

After 100 epochs of training with batch size 128 and the Adam optimizer, I obtained the results below:

Experiment Results

As we can see, the sparse autoencoder with L1 regularization (best MSE loss 0.0301) actually performs better than the plain autoencoder (best MSE loss 0.0318). Although it is only a slight improvement, it suggests that the sparse autoencoder learns a better representation.

And what about sparsity? We can simply extract the weights of the first hidden layer and reshape them for visualization, to check whether the activations of the sparse autoencoder are actually more “sparse” than those of the original autoencoder.
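That extraction step might look like this (the 128-unit layer width is a hypothetical choice, and in practice `W1` would come from the trained model rather than a random generator):

```python
import numpy as np

# Hypothetical first-hidden-layer weights: 784 input pixels x 128 units.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(784, 128))

# Each column holds one hidden unit's weights over all 784 pixels;
# transpose and reshape so each row becomes a 28x28 image to plot.
filters = W1.T.reshape(-1, 28, 28)
print(filters.shape)
```

Each 28x28 slice can then be rendered as a grayscale image, which is the standard way to eyeball what individual hidden units respond to.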

Here comes the conclusion: thanks to the sparsity induced by L1 regularization, the sparse autoencoder learns better representations and its activations are sparser, which allows it to outperform the original autoencoder trained without L1 regularization.
