# Introduction

We are aware of the power of PCA and the wonders it has achieved in domains such as image compression, data analysis and supervised machine learning. PCA explains correlated multivariate data with a smaller number of linearly uncorrelated variables, each of which is a linear combination of the original variables. In simple terms, it removes redundant information while capturing the most important features first, i.e. the ones with the highest variance. Here are some interesting examples by Victor to get started on the importance of PCA; you can follow along with his series for the mathematics behind PCA. First consider the two scenarios, answer the question, and then look at the explanation. For some population, two properties are measured: what is the dimension of the data? Now take a 20x20 digit-image dataset, giving 400 grid boxes per image: what is the dimension of the data?
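The two dimension questions above can be answered mechanically: the dimension of the data is the number of features per sample, i.e. the length of each flattened sample vector. A tiny NumPy sketch with made-up sample counts:

```python
import numpy as np

# A population where two properties are measured per individual:
# each row is one sample, so the data is 2-dimensional.
two_props = np.random.rand(100, 2)
print(two_props.shape[1])  # dimension of the data: 2

# A 20x20 digit image flattened into a feature vector: 400 grid boxes,
# so each image is a point in 400-dimensional space.
image = np.random.rand(20, 20)
flat = image.reshape(-1)
print(flat.shape[0])  # dimension of the data: 400
```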

# Power of PCA Explained

Let’s say you have a dataset with a huge number of features to work with, as is often the case with image, video or audio data. You can’t apply regular machine learning models to it directly; you will definitely look for a meaningful preprocessing step that reduces your training time while preserving the representational power of the data, i.e. the accuracy/error trade-off must stay small.

# How are autoencoders linked to PCA?

Well, an autoencoder is a special type of feed-forward neural network that encodes the input x into a hidden representation h and then decodes x back from that hidden representation. The model is trained to minimize the loss between the input and the output layer. An autoencoder where dim(h) < dim(x) is called an undercomplete autoencoder. How do we achieve the equivalence with PCA, and can this equivalence be of any use?

# Which conditions make an autoencoder equivalent to PCA?

The encoder part will be equivalent to PCA if a linear encoder, a linear decoder and a squared-error loss are used with normalized inputs. This also highlights the key difference: PCA is restricted to linear maps, whereas autoencoders are not.
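The reason these conditions suffice can be sketched in NumPy (illustrative sizes and random data): the PCA projection attains the lowest squared reconstruction error among all rank-k linear encode/decode maps, which is exactly the objective a linear autoencoder with squared-error loss on normalized inputs is trained to minimize, so at convergence its bottleneck spans the principal subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X = X - X.mean(axis=0)  # normalized (zero-mean) inputs

# PCA via SVD: the top-k right singular vectors span the principal subspace.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
P = Vt[:k]              # linear "encoder" (k x 5)
X_pca = X @ P.T @ P     # linear encode then linear decode
err_pca = np.sum((X - X_pca) ** 2)

# Any other linear encoder gives a rank-k reconstruction onto a different
# subspace; by the Eckart-Young theorem its squared error cannot be lower.
Q = rng.normal(size=(k, 5))            # a random linear encoder
X_rand = X @ np.linalg.pinv(Q) @ Q     # best decode for that encoder
err_rand = np.sum((X - X_rand) ** 2)
print(err_rand >= err_pca)  # True: PCA is the optimal linear autoencoder
```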

# A Comparative Illustration

Here are comparative code snippets for PCA and an autoencoder trained on the MNIST dataset with mean squared error as the loss function. The architecture 784→512→128→2→128→512→784 is used for the autoencoder in this illustration. In Python 3.x you need TensorFlow and Keras installed to run this code.

```python
import numpy as np
import keras
from keras.datasets import mnist
from keras.models import Sequential, Model
from keras.layers import Dense
from keras.optimizers import Adam

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784) / 255
x_test = x_test.reshape(10000, 784) / 255
```
```python
mu = x_train.mean(axis=0)
U, s, V = np.linalg.svd(x_train - mu, full_matrices=False)
Zpca = np.dot(x_train - mu, V.transpose())
Rpca = np.dot(Zpca[:, :2], V[:2, :]) + mu    # reconstruction from 2 PCs
err = np.sum((x_train - Rpca) ** 2) / Rpca.shape[0] / Rpca.shape[1]
print('PCA reconstruction error with 2 PCs: ' + str(round(err, 3)))
```
```python
m = Sequential()
m.add(Dense(512, activation='elu', input_shape=(784,)))
m.add(Dense(128, activation='elu'))
m.add(Dense(2,   activation='linear', name="bottleneck"))
m.add(Dense(128, activation='elu'))
m.add(Dense(512, activation='elu'))
m.add(Dense(784, activation='sigmoid'))
m.compile(loss='mean_squared_error', optimizer=Adam())
history = m.fit(x_train, x_train, batch_size=128, epochs=5, verbose=1,
                validation_data=(x_test, x_test))

encoder = Model(m.input, m.get_layer('bottleneck').output)
Zenc = encoder.predict(x_train)  # bottleneck representation
Renc = m.predict(x_train)        # reconstruction
```
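To compare like with like, the autoencoder reconstruction `Renc` should be scored with the same per-entry error metric used for PCA above. A minimal self-contained sketch of that metric, with synthetic arrays standing in for `x_train` and `Renc`:

```python
import numpy as np

# Per-entry mean squared reconstruction error: the sum of squared
# differences divided by the total number of matrix entries, matching
# the PCA error computation above.
def reconstruction_error(x, r):
    return np.sum((x - r) ** 2) / (r.shape[0] * r.shape[1])

# Illustrative check: a "reconstruction" that is off by 0.1 everywhere
# has a per-entry squared error of exactly 0.1**2 = 0.01.
rng = np.random.default_rng(0)
x = rng.random((100, 784))
r = x + 0.1
print(round(reconstruction_error(x, r), 3))  # -> 0.01
```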

# Conclusion

The theory and example above give an idea of why there was a need to move from PCA to autoencoders: they make the case for non-linear functions over the linear transformations of PCA. Vanilla autoencoders are no doubt primitive, even on MNIST-like datasets, where sparse or denoising autoencoders will simply outperform them in the overcomplete case, for example. Still, the foundation of non-linear activation functions paves the way for deep learning models to outperform conventional ML models. Thanks!!

# Acknowledgements

I have merely scratched the surface and tried to explain the basics of this concept. For a deep dive into the theory, refer to CS7015 (DL, IITM), and for implementation details, refer to the following thread by amoeba.

## Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by

## ASHISH RANA

Data Mining | Data Science | Machine Learning | Deep Learning | Artificial Intelligence
