Review of Unsupervised Feature Learning

Joel Chao · Oct 3, 2017
Eigenfaces (Image by Gunnar Grimnes)

This is a brief review of the Stanford University tutorial, Unsupervised Feature Learning and Deep Learning. Note that we skip the details and focus only on the insights behind the different algorithms.

  • Autoencoder

An autoencoder extracts the most informative features of the data in order to reconstruct it. There are two strategies to achieve this. First, a nonlinear activation enhances the encoder's capacity, while an L2 reconstruction loss guides the network to reproduce its input. Second, a sparsity penalty on the mean activation level regularizes the autoencoder so that it encodes information efficiently and overfits less. As a result, the autoencoder lets us adjust the number of hidden units and the target activation level to train the best model for a proper representation.
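
As a concrete illustration, here is a minimal sparse autoencoder sketch in PyTorch; the layer sizes, the KL-style penalty on the mean activation, and the loss weight are my own illustrative choices rather than values from the tutorial.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, n_input=784, n_hidden=64):   # illustrative sizes
        super().__init__()
        self.encoder = nn.Linear(n_input, n_hidden)
        self.decoder = nn.Linear(n_hidden, n_input)

    def forward(self, x):
        h = torch.sigmoid(self.encoder(x))           # hidden activations in (0, 1)
        return self.decoder(h), h

def sparsity_penalty(h, rho=0.05):
    """KL divergence between target activation rho and mean activation rho_hat."""
    rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)
    return (rho * torch.log(rho / rho_hat)
            + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()

model = SparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784)                             # stand-in for a batch of real data

for _ in range(100):
    x_hat, h = model(x)
    # L2 reconstruction loss plus sparsity penalty on the mean activation level
    loss = F.mse_loss(x_hat, x) + 1e-2 * sparsity_penalty(h)
    opt.zero_grad()
    loss.backward()
    opt.step()
```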

  • Principal Component Analysis (PCA)

Similar to the autoencoder, PCA extracts the most important part of the data. However, PCA focuses more on the mathematical structure of the data. It applies SVD to the covariance matrix and selects the top k eigenvectors as the coordinates onto which the data are projected.

If our goal is to remove irrelevant information, the eigenvalues can be used to judge the importance of each coordinate (eigenvector). In practice, we keep the top k eigenvectors whose eigenvalues account for up to 99% of the variance, or around 95% for image data.
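
A minimal NumPy sketch of this selection rule; the random data and the 99% threshold are purely illustrative:

```python
import numpy as np

X = np.random.randn(1000, 50)             # illustrative data: 1000 samples, 50 dimensions
X = X - X.mean(axis=0)                    # zero-mean each dimension first

sigma = X.T @ X / X.shape[0]              # covariance matrix
U, S, _ = np.linalg.svd(sigma)            # columns of U: eigenvectors, S: eigenvalues (sorted)

retained = np.cumsum(S) / np.sum(S)       # fraction of variance kept by the top-k eigenvalues
k = int(np.searchsorted(retained, 0.99) + 1)

X_rot = X @ U[:, :k]                      # data projected onto the top-k eigenvectors
print(k, X_rot.shape)
```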

For image data, PCA can be a good normalization tool. First, select the top k eigenvectors as the coordinates and project the data onto them. Then, divide each dimension of the projected (rotated) data by the square root of the corresponding eigenvalue. As a result, every dimension of the data becomes (1) uncorrelated and (2) of unit variance.
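
And a small, self-contained sketch of PCA whitening; k and the epsilon guarding against division by tiny eigenvalues are my own illustrative choices:

```python
import numpy as np

X = np.random.randn(1000, 50)                   # illustrative data, zero-meaned below
X = X - X.mean(axis=0)
U, S, _ = np.linalg.svd(X.T @ X / X.shape[0])   # eigenvectors U, eigenvalues S

k, eps = 20, 1e-5                               # top-k projection; eps avoids division by ~0
X_rot = X @ U[:, :k]                            # rotate onto the top-k eigenvectors
X_white = X_rot / np.sqrt(S[:k] + eps)          # divide by sqrt(eigenvalue): unit variance

# each whitened dimension is uncorrelated with the others and has ~unit variance
print(np.round(X_white.T @ X_white / X.shape[0], 2))
```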

In addition, PCA is well suited to dimensionality reduction. We can use the matrix of eigenvectors as a projection matrix and control the output dimension by adjusting the number of eigenvectors kept.

  • Sparse Coding

The difference between Sparse Coding and other unsupervised feature learning methods is that it starts from a set of over-complete bases and seeks to reconstruct the data with as few bases as possible. Like the autoencoder, it has an L2 distance objective for reconstruction plus a sparsity penalty (usually L1). In my opinion, the most attractive part of Sparse Coding is its degree of freedom: we can add extra regularization terms to turn it into semi-supervised learning. (Ref: Scalable Object Detection by Filter Compression with Regularized Sparse Coding)

It is worth noting that the L1 sparsity penalty is applied to the codes in the generative mapping (feature → data), not to an encoder activation. This sparse code vector can therefore be used as a representative feature in many applications, and its sparse nature is perfect for efficient distance calculation.
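
As a rough sketch of the inference step, the NumPy code below encodes one data point against a fixed, randomly initialized over-complete dictionary using ISTA (a standard soft-thresholding solver for the L2-plus-L1 objective). A full sparse coding algorithm would alternate this with a dictionary update, and all sizes and weights here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dim, n_bases = 64, 128                      # over-complete: more bases than dimensions
A = rng.normal(size=(n_dim, n_bases))
A /= np.linalg.norm(A, axis=0)                # unit-norm basis vectors
x = rng.normal(size=n_dim)                    # one data point to encode

lam = 0.1                                     # weight of the L1 sparsity penalty
eta = 1.0 / np.linalg.norm(A, 2) ** 2         # step size from the Lipschitz constant
s = np.zeros(n_bases)

for _ in range(200):
    grad = A.T @ (A @ s - x)                  # gradient step on 0.5 * ||x - A s||^2
    s = s - eta * grad
    s = np.sign(s) * np.maximum(np.abs(s) - eta * lam, 0.0)   # soft-thresholding (L1)

print("non-zero codes:", np.count_nonzero(s), "of", n_bases)
```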

  • Independent Component Analysis (ICA)

Unlike Sparse Coding, ICA puts the L1 penalty on the data → feature mapping and requires linearly independent bases. The data need to be whitened before running ICA, and a set of under-complete bases is learned in the end.

To deal with the hard optimization problem that comes with ICA's hard constraints, RICA (Reconstruction ICA) introduces a reconstruction error term into the objective function. Overall, RICA is like a sparse autoencoder without the nonlinearity, and it is able to learn over-complete bases. As a result, RICA is more flexible than ICA to apply to different applications.
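
A minimal PyTorch sketch of the RICA objective, lambda * ||W x||_1 + ||W^T W x - x||^2, minimized over an over-complete W by gradient descent; the sizes, lambda, and learning rate are illustrative, and in practice x would be whitened patches:

```python
import torch

torch.manual_seed(0)
n_dim, n_features, lam = 64, 128, 0.1        # over-complete: more features than dimensions
X = torch.randn(1000, n_dim)                 # stand-in for whitened data (rows = samples)

W = torch.randn(n_features, n_dim) * 0.01    # feature-extraction matrix (data -> feature)
W.requires_grad_(True)
opt = torch.optim.Adam([W], lr=1e-2)

for _ in range(500):
    Z = X @ W.T                              # features W x for every sample
    X_hat = Z @ W                            # reconstruction W^T W x
    loss = lam * Z.abs().mean() + ((X_hat - X) ** 2).mean()   # sparsity + reconstruction
    opt.zero_grad()
    loss.backward()
    opt.step()
```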

Here is a good article about the difference between ICA and PCA.

  • PCA vs. Autoencoder

PCA focuses more on the structure of the data. However, it can run into memory limits and sometimes lacks scalability. The autoencoder focuses more on reconstruction and scales well. Both algorithms have their own pros and cons and fit different kinds of tasks. For example, Eigenfaces uses PCA because it cares more about the underlying bases than about reconstruction.
