Part 2: Comparing dimensionality reduction between autoencoders and Principal Component Analysis (PCA)

Sandra Jirongo
7 min read · Apr 24, 2020


Introduction

In this four-part series, we will be working through the Kaggle Aerial Cactus Identification challenge from 2019.

This series aims to show an example implementation of the full data science pipeline for an image classification problem.

All the code for this series is available in this GitHub repository.

In the previous part, we looked at how to use autoencoders for dimensionality reduction, a common technique in image classification problems. In this part, we will compare a different dimensionality reduction technique, Principal Component Analysis (PCA), with the autoencoder from Part 1.

Principal Component Analysis

The goal of principal component analysis (PCA) is to identify patterns in high-dimensional data and project the data into a lower-dimensional space without losing important information. At a high level, PCA identifies variables that are highly correlated and projects them onto a smaller subspace, thereby reducing the dimensionality of the dataset. Working with lower-dimensional data improves computational efficiency, which matters for large datasets such as image data.

So how does PCA work? The first step is to compute the covariance (or correlation) matrix of the variables in your dataset and obtain its eigenvectors and eigenvalues. Each eigenvector represents a “principal component” of the data, and its associated eigenvalue measures how much of the dataset’s variance that component explains, in other words, how informative it is. Components with small eigenvalues are less informative and can therefore be dropped to reduce the dimensionality of the dataset. In PCA, the projection onto the lower-dimensional subspace is a linear transformation. The reduced dataset retains as much of the variance as possible and can therefore serve as an accurate, compact representation of the data.
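To make those steps concrete, here is a minimal from-scratch sketch of the procedure on a small random matrix. The toy array X_toy and its dimensions are purely illustrative, not part of the cactus dataset.

```python
import numpy as np

# Toy data: 100 samples with 5 features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
X_centered = X_toy - X_toy.mean(axis=0)

# Covariance matrix and its eigendecomposition.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort by eigenvalue (largest = most informative) and keep the top 2 components.
order = np.argsort(eigenvalues)[::-1]
top2 = eigenvectors[:, order[:2]]

# Linear projection of the data onto the lower-dimensional subspace.
X_reduced = X_centered @ top2
print(X_reduced.shape)  # (100, 2)
```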

One big difference between principal component analysis and autoencoders is that autoencoders can use non-linear activation functions in the layers of the neural network, whereas PCA performs dimensionality reduction through a purely linear transformation. An autoencoder with three layers, an encoding layer, a hidden layer, and a decoding layer, using only linear activations would give you a dimensionality reduction similar to PCA. It is the use of non-linear activation functions, such as the sigmoid, that makes autoencoders so much more powerful for learning patterns in data.
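To illustrate that point, a single-bottleneck autoencoder with purely linear activations could be sketched as follows. Keras is used here only for illustration; the layer sizes are arbitrary and this is not the model from Part 1.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A linear encoder/decoder pair: the 10-unit bottleneck learns a subspace
# comparable to the one found by a 10-component PCA.
inputs = keras.Input(shape=(96,))
encoded = layers.Dense(10, activation="linear")(inputs)   # encoding layer
decoded = layers.Dense(96, activation="linear")(encoded)  # decoding layer

linear_autoencoder = keras.Model(inputs, decoded)
linear_autoencoder.compile(optimizer="adam", loss="mse")

# Swapping "linear" for "sigmoid" or "relu" gives the non-linear variant
# that makes autoencoders more expressive than PCA.
```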

Implementation

Before comparing the dimensionality reduction of autoencoders and principal component analysis, we will explore how two different PCA configurations perform at retaining variance. We will use the PCA implementation from scikit-learn and pre-process the data before fitting any model. To do that, we first import the necessary packages.
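The imports used in the remainder of this part might look roughly like this; the exact set in the original notebook may differ.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
```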

Reshaping and Normalizing the Data

It is important to normalize your data before running any implementation of PCA. Because PCA maximizes variance, features measured on larger scales dominate the covariance matrix. That disproportionate scale can lead to a disproportionate selection of components when projecting onto a lower-dimensional subspace, dropping important information and reducing your model's ability to learn general patterns down the line.

Normalizing your data is as simple as initializing your scaler of choice, in this case the standard scaler available in scikit-learn, and fitting it to your data.

Listing 1. Reshaping and Normalizing the data
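A minimal sketch of this step, assuming the training images are loaded into a NumPy array named X_train of shape (num_images, 32, 32, 3). Treating each image row as a sample gives 32 × 3 = 96 features, which matches the 96 features mentioned below; the variable names are illustrative.

```python
# Hypothetical starting point: X_train holds the cactus images with pixel
# values in [0, 255] and shape (num_images, 32, 32, 3).
X = X_train.reshape(-1, 32 * 3)   # each image row becomes a sample with 96 features

# Standardize every feature to zero mean and unit variance before PCA.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```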

Implementing Two PCA Models

We will compare two configurations of PCA, one retaining 95% of the variance and one with a fixed number of 10 components, and then compare the dimensionality reduction of both with that of the autoencoder from Part 1 of this series. Before the transformation, the data contains 96 features. Retaining 95% of the variance reduces the data to 25 components.

Listing 2. Implementing two versions of PCA, one retaining 95% of the variance, the other retaining only 10 components.
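A sketch of the two models, assuming the scaled array from Listing 1 is named X_scaled:

```python
# Passing a float in (0, 1) tells scikit-learn to keep just enough
# components to explain that fraction of the variance.
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_scaled)

# Passing an integer keeps exactly that many components.
pca_10 = PCA(n_components=10)
X_10 = pca_10.fit_transform(X_scaled)

print(X_scaled.shape[1], X_95.shape[1], X_10.shape[1])  # 96 features before, 25 and 10 after
```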

Comparing the Two PCA Models

The effect of the dimensionality reduction in both implementations can be visualized clearly by performing an inverse transformation on the reduced data and plotting it against the original dataset.

Listing 3. Comparing the effect of dimensionality reduction between the two implementations.
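One way to produce such a comparison, continuing from the sketch in Listing 2; the plotted feature indices and colours are illustrative.

```python
# Project the reduced data back into the original 96-dimensional feature space.
X_95_back = pca_95.inverse_transform(X_95)
X_10_back = pca_10.inverse_transform(X_10)

# Overlay the original data (red) with each reconstruction (blue) on two features.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, X_back, title in zip(
        axes,
        [X_95_back, X_10_back],
        ["95% of variance retained", "10 components retained"]):
    ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c="red", s=2, label="original")
    ax.scatter(X_back[:, 0], X_back[:, 1], c="blue", s=2, label="reconstructed")
    ax.set_title(title)
    ax.legend()
plt.show()
```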

From the images below, we can see that the original dataset (in red) retains more information than either of the reduced datasets, and the version retaining 95% of the variance preserves more of it than the version that keeps only the 10 “most informative” components.

Figures 1 & 2. On the left, the reduced dataset after a PCA retaining 95% of the variance. On the right, the reduced dataset after a PCA retaining 10 components.

The overall relationship between points in the dataset is retained, which tells us that this dimensionality reduction does not misrepresent the information contained in the data, so we can confidently use it to train our models.

Analyzing Image Outputs

Now, we look at the differences in the images produced after dimensionality reduction. To do that, we visualize the reduced images, using only the retained principal components. We do this for both implementations and compare the results.

Listing 4. Analyzing the reduced images obtained from the two different PCA models
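A possible way to rebuild and display the images from the reconstructions above. The helper show_reconstructions is hypothetical and assumes the reshaping convention from Listing 1.

```python
def show_reconstructions(X_back, n_images=5, title=""):
    # Undo the standardization, then reassemble rows of 96 values into 32x32x3 images.
    pixels = scaler.inverse_transform(X_back).reshape(-1, 32, 32, 3)
    pixels = np.clip(pixels, 0, 255).astype("uint8")

    fig, axes = plt.subplots(1, n_images, figsize=(2 * n_images, 2))
    fig.suptitle(title)
    for i, ax in enumerate(axes):
        ax.imshow(pixels[i])
        ax.axis("off")
    plt.show()

show_reconstructions(X_95_back, title="PCA, 95% of variance retained")
show_reconstructions(X_10_back, title="PCA, 10 components retained")
```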

The results obtained from the two PCA models are shown below. The images produced by the two reductions are nearly identical; there is no clearly visible difference in the amount of variance captured by either set of reduced images.

Figure 3. Above, the images obtained from the reduced dataset after a PCA retaining 95% of the variance.
Figure 4. Above, the images obtained from the reduced dataset after a PCA retaining 10 components.

Analyzing the Variance Retained

This tells us that a majority of the variance in our dataset can most likely be captured with 10 or fewer of the most informative components. We can confirm this by retrieving the explained variance ratio from each of our models and examining this information, which is done by reading the explained_variance_ratio_ attribute as shown below. The results show that almost 50% of the variance is captured by the first principal component alone.

Listing 5. Retrieving the information on the explained variance from each of the PCA models to determine the variance retained by both models
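Reading the fitted attribute is enough, for example:

```python
# explained_variance_ratio_ is populated after fitting; each entry is the
# fraction of total variance explained by the corresponding component.
print(pca_95.explained_variance_ratio_[:5])
print(pca_10.explained_variance_ratio_)

# Variance captured by the first principal component alone
# (close to 0.5 for this dataset, as noted above).
print(pca_95.explained_variance_ratio_[0])
```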

We can also fit our data with a PCA initialized with no parameters and plot the number of components against the cumulative explained variance ratio; from that graph we can see how many components are needed to capture different levels of variance. From the graph below, we can conclude that fewer than 5 components capture ~70% of the variance.
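A sketch of that plot, reusing X_scaled from the earlier listings:

```python
# Fit PCA with no arguments so every component is kept.
pca_full = PCA()
pca_full.fit(X_scaled)

# Cumulative explained variance as a function of the number of components.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cumulative) + 1), cumulative)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()
```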

Figure 5. A plot of the number of components against the cumulative explained variance. It shows that 70% of the variance is explained by fewer than 10 components and that 100% of the variance in this dataset is explained by ~35 components.

PCA vs Autoencoder for Dimensionality Reduction

When we compare the results of the autoencoder obtained in Part 1, shown below, with the images reconstructed from the PCA-reduced data, we can see that, to the naked eye, PCA captures less information per pixel than the autoencoder.

Figure 6. Representation of a simple autoencoder architecture with neural networks (Jordan, 2018)

The three-layer autoencoder takes advantage of non-linear activation functions and can therefore produce feature maps superior to those of the linear PCA implementations, as shown below.

Figures 7 & 8. From top to bottom: the original images of cacti, the images obtained using an autoencoder for dimensionality reduction, and, in the bottom two rows, the images obtained after dimensionality reduction with PCA.

Conclusion

In this part of the tutorial, we implemented two Principal Component Analysis models, one retaining 95% of the variance, the other retaining only 10 components. We analyzed the output of these models and compared the reduced images. We then compared the reduced images from the PCA models with the reduced images from the autoencoder in Part 1 of this series.

Both Principal Component Analysis and autoencoders have their benefits and drawbacks and should be chosen based on their suitability to the problem at hand. PCA is suitable for problems where the image data is not complex, as its transformation is linear and therefore cannot capture as much of the structure. When working with more detailed images made up of varied shapes, an autoencoder may be more suitable, as its non-linear activation functions allow it to capture more of the variance.

The code for Part 2 of this series is available in this Github gist.

References

Barkai, K. (2020, April 23). Part 1: Pre-processing and dimensionality reduction with autoencoders for image classification. Retrieved from https://medium.com/@kalia_65609/part-1-pre-processing-and-dimensionality-reduction-with-autoencoders-for-image-classification-82e4d273619e

Brems, M. (2019, June 10). A One-Stop Shop for Principal Component Analysis. Retrieved from https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c

Prabhakaran, S. (2020, March 22). Principal Component Analysis (PCA) — Better Explained: ML. Retrieved from https://www.machinelearningplus.com/machine-learning/principal-components-analysis-pca-better-explained/

sklearn.decomposition.PCA. (n.d.). Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
