Journey From Principal Component Analysis To Autoencoders

Sep 10, 2018 · 5 min read


We are aware of the power of PCA and the wonders it has achieved in domains such as image compression, data analysis and supervised machine learning. PCA explains correlated multivariate data with a smaller number of linearly uncorrelated variables, each a linear combination of the original variables. In simple terms, it removes redundant information while capturing the most important features first, i.e. the ones with the highest variance. Here are interesting examples by Victor to get started on the importance of PCA; you can follow along with his series for the mathematics behind PCA. First look at the two images, answer the question, and then read the explanation.

For some population, two properties are measured. What is the dimension of the data?
Now take a 20x20 image digit dataset, leading to 400 grid boxes. What is the dimension of the data?
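To make the second question concrete, here is a minimal sketch: each 20x20 image is flattened into a single feature vector, so every digit becomes one point in a 400-dimensional space.

```python
import numpy as np

img = np.zeros((20, 20))   # one 20x20 digit image
x = img.reshape(-1)        # flatten the grid into a single feature vector
print(x.shape)             # (400,) -> each image is a 400-dimensional point
```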

Power of PCA Explained

Let’s say you have a dataset with a huge number of features to work with, as is usually the case with image, video or audio data. You can’t apply regular machine learning models directly to it; you will definitely look for some meaningful preprocessing step that reduces your training time while preserving the representational power of the data, i.e. the accuracy/error trade-off must be small.
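As a sketch of that preprocessing step (assuming scikit-learn is installed; the data here is synthetic, just to show the shapes), PCA can shrink a thousand raw features down to a handful of components before any model is trained:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))    # 200 samples, 1,000 raw features

pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)    # project onto the top 50 components

print(X_reduced.shape)                      # (200, 50)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

Any downstream classifier now trains on 50 features instead of 1,000, while `explained_variance_ratio_` tells you how much information the projection kept.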

[Image: 100 x 100 images, i.e. 10,000 dimensions — the data forms an m x 10,000 matrix]
[Image: reconstruction from the first 16 eigenvectors]

How are autoencoders linked to PCA?

Well, an autoencoder is a special type of feed-forward neural network that encodes the input x into a hidden representation h and then decodes x back from that hidden representation. The model is trained to minimize the loss between the input and the output layer.
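Written out (with f and g as the encoder and decoder activations, and W, W*, b, c as their weights and biases — notation assumed here for illustration), the squared-error training objective is:

```latex
h_i = f(W x_i + b), \qquad \hat{x}_i = g(W^{*} h_i + c), \qquad
\min_{W,\, W^{*},\, b,\, c} \;\; \frac{1}{m} \sum_{i=1}^{m} \lVert \hat{x}_i - x_i \rVert^{2}
```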

An autoencoder where dim(h) < dim(x_i) is called an undercomplete autoencoder.
How do we achieve this equivalence? Can this equivalence be of any use?

Which conditions make an autoencoder equivalent to PCA?

The encoder part will be equivalent to PCA if a linear encoder, a linear decoder and the squared-error loss function are used with normalized inputs. This also means PCA is restricted to linear maps, whereas autoencoders are not.
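Here is a minimal numpy sketch of that claim (synthetic data, plain gradient descent — an illustration, not a proof): a linear encoder/decoder trained with squared error on centred data ends up matching PCA's reconstruction error for the same bottleneck size.

```python
import numpy as np

rng = np.random.default_rng(0)
# correlated synthetic data, centred ("normalized inputs")
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X -= X.mean(axis=0)

# PCA reconstruction with k components via SVD
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pca = X @ Vt[:k].T @ Vt[:k]
err_pca = np.mean((X - X_pca) ** 2)

# linear autoencoder: encoder We (5 -> k), decoder Wd (k -> 5),
# trained by plain gradient descent on the squared-error loss
We = rng.normal(scale=0.1, size=(5, k))
Wd = rng.normal(scale=0.1, size=(k, 5))
lr = 0.01
for _ in range(10000):
    H = X @ We                    # encode
    R = H @ Wd                    # decode
    G = 2 * (R - X) / len(X)      # gradient of the mean squared error w.r.t. R
    grad_Wd = H.T @ G
    grad_We = X.T @ (G @ Wd.T)
    Wd -= lr * grad_Wd
    We -= lr * grad_We

err_ae = np.mean((X - X @ We @ Wd) ** 2)
print(err_pca, err_ae)  # the two errors end up very close
```

The trained encoder need not equal the principal components themselves — it can span the same subspace in a rotated basis — but the reconstruction it achieves is the PCA one.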

A Comparative Illustration

Here are comparative code snippets for PCA and an autoencoder trained on the MNIST dataset with mean squared error as the loss function. The architecture 784→512→128→2→128→512→784 is used for the autoencoder in this illustration. In Python 3.x you need to install TensorFlow and Keras to run this code.

import numpy as np
import keras
from keras.datasets import mnist
from keras.models import Sequential, Model
from keras.layers import Dense
from keras.optimizers import Adam

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784) / 255
x_test = x_test.reshape(10000, 784) / 255
mu = x_train.mean(axis=0)
U, s, V = np.linalg.svd(x_train - mu, full_matrices=False)
Zpca = - mu, V.transpose())  # PCA scores

Rpca =[:, :2], V[:2, :]) + mu  # reconstruction from the first 2 PCs
err = np.sum((x_train - Rpca)**2) / Rpca.shape[0] / Rpca.shape[1]
print('PCA reconstruction error with 2 PCs: ' + str(round(err, 3)))

# autoencoder: 784 -> 512 -> 128 -> 2 -> 128 -> 512 -> 784
m = Sequential()
m.add(Dense(512, activation='elu', input_shape=(784,)))
m.add(Dense(128, activation='elu'))
m.add(Dense(2, activation='linear', name="bottleneck"))
m.add(Dense(128, activation='elu'))
m.add(Dense(512, activation='elu'))
m.add(Dense(784, activation='sigmoid'))
m.compile(loss='mean_squared_error', optimizer = Adam())
history =, x_train, batch_size=128, epochs=5, verbose=1, validation_data=(x_test, x_test))
encoder = Model(m.input, m.get_layer('bottleneck').output)
Zenc = encoder.predict(x_train) # bottleneck representation
Renc = m.predict(x_train) # reconstruction


The theory and example above give an idea of why there was a need to move from PCA to autoencoders: they present the case for non-linear functions as compared to the purely linear transformations of PCA. Vanilla autoencoders are no doubt primitive in their functioning, even on MNIST-like datasets, where sparse or denoising autoencoders simply overpower them, for example in the overcomplete case. Still, the foundation of non-linear activation functions paves the way for deep learning models to outperform other conventional ML models. Thanks!!


I have merely scratched the surface and tried to explain the basics of this concept. For a deep dive into the theory, refer to CS7015 (DL, IITM), and for implementation details, refer to the following thread by amoeba.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals.



Written by

Data Mining | Data Science | Machine Learning | Deep Learning | Artificial Intelligence
