noBS Learning: Principal Component Analysis

Anupam Vashist
Published in Karma and Eggs · 5 min read · May 14, 2018

In a previous article we learned about predicting numbers from a 28x28 image. We used the MNIST dataset, whose features were the intensities of these 28x28 (784) pixels and whose labels were the actual digits they represent.

I have a question. What if the image were 280x280 pixels? Do you think our model would have learnt as fast as before? Or if the image were 5x5 pixels, would our model be as good at prediction?

Let’s go over a simple exercise here. Below is a picture of Abraham Lincoln in pretty good resolution. You can look at almost every feature of his in detail and decide that yeah, this indeed is Lincoln.

Abraham Lincoln, as found on Atlanta Black Star

Now we pixelate this image and remove some, well, a lot of features. The image is blurry AF, but do one thing: look at the pixelated image with your eyes open only as narrow as a slit. I bet you can still see Lincoln below despite the resolution being turned down to bricks.

Clap if you identify it, but why is it so?

Well, the first image contains a lot of features, most of which are evidently correlated with each other or unnecessary. The pixelated image captures the essence of the high-resolution one by combining or dropping the unnecessary features, just to the point where we, as predictors, can still make a decision.

What happens if we drop more and more features? Let’s see if you can predict what’s in the image below.

Here we have shamelessly pixelated the same image to the point where decisive features are getting lost. Now it looks like a snapshot from an 8-bit game and, as predictors, we get puzzled even though it is the same image.

NoBS Dimensionality Reduction

Leaving the technicalities and maths aside for a moment, a simple definition of dimensionality reduction is finding out how our dataset can be represented by some smaller combination of the original features, such that our model can still predict the outcome with the required efficiency. Let’s look at this with our image recognition example, using a method called Principal Component Analysis (PCA).

As before, we could fetch the MNIST data from this source or download it directly with the following code.
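Something along these lines should do the trick. This is a minimal sketch using scikit-learn’s fetch_openml; the helper and the ‘mnist_784’ dataset name are my assumptions here (older tutorials used fetch_mldata, which has since been removed):

```python
from sklearn.datasets import fetch_openml

# Download the 70,000-image MNIST dataset from openml.org.
# as_frame=False keeps data and target as plain NumPy arrays.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
```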

Once downloaded, let’s peek at what this data looks like.
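For example, continuing with the mnist object from the sketch above:

```python
# Peek at what the downloaded bunch contains.
print(mnist.data.shape)    # (70000, 784): one row of 784 pixel intensities per image
print(mnist.target.shape)  # (70000,): one label per image
print(mnist.target[:10])   # the first few labels, stored as strings like '5', '0', '4', ...
```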

Here it seems we have data as the pixel matrix and target as the corresponding digit.

We know from before that there are 70,000 images and 70,000 labels in the dataset, each image being stored as 28 x 28 pixels (784 values).

Let’s split the dataset into test and train sets:
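A sketch of the split with scikit-learn’s train_test_split; the 60,000/10,000 ratio mirrors the conventional MNIST split, and the variable names are just the ones reused in the sketches below:

```python
from sklearn.model_selection import train_test_split

# Hold out 10,000 of the 70,000 images for testing.
X_train, X_test, y_train, y_test = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)
```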

The next step is to normalise our features, which is quite necessary for PCA.

This is because the original predictors may be on very different scales. For example, imagine a dataset whose variables are measured in gallons, kilometers and nanometers. The variances of these variables will inevitably differ hugely in scale.

Performing PCA on un-normalized variables will lead to insanely large loadings for the variables with high variance. In turn, the leading principal components will depend mostly on those high-variance variables, which is undesirable.
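Here is a minimal sketch of that scaling step with StandardScaler, continuing with the X_train/X_test arrays from the split above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training set only...
scaler.fit(X_train)

# ...then transform both sets using that same fit.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```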

(Pardon the angry red warning text in the output; it is completely harmless.) Note that we fit the scaler only on the training set and transform both the training and test sets using that same fit. Why? Because the test data is practically unknown to the model, and we do not want any kind of fitting to have anything to do with it. The test set should be a surprise, ain’t it?

Let’s move on to PCA now. Below we set up our PCA instance to retain just enough features that the total loss of variance is not above 10%.
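In scikit-learn, passing a float between 0 and 1 as n_components tells PCA to keep just enough components to explain that fraction of the variance, so 0.90 here:

```python
from sklearn.decomposition import PCA

# Keep enough principal components to retain at least 90% of the variance.
pca = PCA(n_components=0.90)

# Fit on the training set only, then transform both sets.
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
```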

Note the same fit-and-transform logic.

Remember how our initial train and test data had 784 features? Let’s see how many features we have boiled down to.
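The fitted PCA object tells us directly:

```python
# Number of principal components kept to retain 90% of the variance.
print(pca.n_components_)  # 233 on the run described in this article; yours may differ slightly
```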

Et voilà! We now have 233 features instead of 784. That makes our calculations a whole lot easier and our model more efficient. We will see if it is as effective.

Here’s a sneak peek at the final training and test sets post-PCA.
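For instance:

```python
# Same number of rows as before, but far fewer columns than the original 784.
print(X_train_pca.shape)
print(X_test_pca.shape)

# A quick look at the first transformed training row: these are component scores,
# not pixel intensities any more.
print(X_train_pca[0])
```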

Time to test our PCA decomposition.

We will use the same logistic regression (lbfgs solver) model as we did in noBS Logistic Regression.
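A sketch of that model, trained on the PCA-transformed features (the max_iter bump is my own addition, just to keep the lbfgs solver from hitting its iteration limit):

```python
from sklearn.linear_model import LogisticRegression

# Logistic regression with the lbfgs solver, as in the noBS Logistic Regression article.
logreg = LogisticRegression(solver='lbfgs', max_iter=1000)
logreg.fit(X_train_pca, y_train)
```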

Let’s see if it can predict correctly. I can randomly pick the 303rd test X row, predict it, and check it against its Y label.
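Something like this, remembering that the 303rd row sits at zero-based index 302:

```python
# Predict the 303rd test row and compare against the actual label.
print(logreg.predict(X_test_pca[302].reshape(1, -1)))  # model's guess
print(y_test[302])                                     # true label
```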

Seems legit.

We are almost done, but before we conclude anything, let’s compare the accuracy of our model to that of a simple logistic regression WITHOUT PCA (91.3%).
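scikit-learn’s score method gives the mean accuracy on the test set:

```python
# Mean accuracy of the PCA + logistic regression model on the held-out test set.
print(logreg.score(X_test_pca, y_test))
```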

Well, our PCA + logistic regression model has a similar accuracy to logistic regression alone, even after boiling 784 features down to 233. The real impact is in the difference between the time taken by the two approaches.

The logistic-regression-only model takes ~45 seconds to train, whereas after applying PCA it took ~12 seconds, without any impact on accuracy.
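If you want to reproduce the comparison, here is a rough timing sketch; the exact numbers will of course depend on your machine.

```python
import time

from sklearn.linear_model import LogisticRegression

# Time training on the raw (scaled) 784-feature data...
start = time.time()
LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_train, y_train)
print('without PCA: %.1f seconds' % (time.time() - start))

# ...and on the PCA-reduced features.
start = time.time()
LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_train_pca, y_train)
print('with PCA:    %.1f seconds' % (time.time() - start))
```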

Cool. Isn’t it?
