Guide to Principal Component Analysis

Mathanraj Sharma
Aug 15, 2019 · 4 min read

PCA is a buzzword that pops up at many stages of data analysis and for various uses. But what is really behind Principal Component Analysis, and what is it used for? Let’s look at these one by one.

First, let us understand what happens behind the scenes in PCA. I have attached a Jupyter Notebook with examples at the bottom of the article.

Step-1: Decorrelating Data Points

  • What is the dimensionality of this graph? (Fig-1)

Yes, you are correct, it is a 2D graph.

  • What do you think about this (Fig-2) graph?

Yes, it is a 1D graph, because the y value is the same for every x, so only x carries information.

Let’s move to a tricky question. If you can answer this one, you have successfully passed the first step in PCA.

(Fig-3) What might be the Dimensionality of this Graph?

If you say 2D, nope, that is wrong. It is 1D. What? But how?

Fig-6 (Rotated Axis/data)

Yes, if we rotate the axes a bit, the data points will line up just as in our previous case (Fig-2). This is done in the following sub-steps:

  • Rotate the data samples so they are aligned with the axes
  • Shift the data samples so they have mean 0 (this moves the origin of the axes to the center of the data)
  • Put the first (principal) axis along the direction in which the data varies most, and the second axis along the direction in which it varies less
  • No information is lost in this step, but the rotated features are no longer linearly correlated
Fig-5, PC Datacamp
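
To make the sub-steps above concrete, here is a minimal sketch using only NumPy. The toy data and all variable names are my own assumptions for illustration and are not taken from the attached notebook. It centers the data, rotates it onto the eigenvectors of the covariance matrix (one common way to compute PCA), and then compares the correlation before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y depends on x, so the two features are correlated.
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.3, size=500)
data = np.column_stack([x, y])

# Shift the samples so that every feature has mean 0.
mean = data.mean(axis=0)
centered = data - mean

# The eigenvectors of the covariance matrix give the new (rotated) axes.
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order. Put the largest first so the
# first axis is the direction in which the data varies most.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Rotate the centered data onto the new axes.
rotated = centered @ eigenvectors

# Undo the rotation and the shift to check that no information was lost.
restored = rotated @ eigenvectors.T + mean

corr_before = np.corrcoef(data, rowvar=False)[0, 1]
corr_after = np.corrcoef(rotated, rowvar=False)[0, 1]
print(f"correlation before: {corr_before:.3f}")
print(f"correlation after:  {corr_after:.3f}")
print("data fully recoverable:", np.allclose(restored, data))
```

The correlation after the rotation should be zero up to floating-point error, while the original data can still be rebuilt from the rotated samples.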

Step-2: Dimensionality Reduction

One of the major uses of PCA is dimensionality reduction. Assume you have dozens of features in your dataset. Unfortunately, that many features cause several problems:

  • We can only visualize up to 3D data using matplotlib
  • Storing more data that adds little information consumes a lot of storage
  • Computing on less informative data is wasteful
  • Such features may also cause problems for prediction tasks, i.e. make predictions meaningless

So we need to reduce the noise in the data and keep the features that are meaningful for our task. This is where PCA comes in handy.

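  1. When we apply PCA to a dataset, it first decorrelates the dataset by rotating it to find the principal directions.
  2. Linear correlation between features can be measured with the Pearson correlation, which takes values between -1 and 1. A value of 0 means no linear correlation, and that is what we get between the PCA features.
  3. PCA aligns the principal components (the directions in which the data varies most) with the rotated axes.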
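
As a rough illustration of the points above, the sketch below applies scikit-learn's `PCA` to a made-up dataset. The data, the choice of two components, and the variable names are assumptions for this example only. It reduces five features to two and checks the Pearson correlation between the resulting PCA features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical dataset: 200 samples with 5 features that are really
# mixtures of only 2 underlying signals, plus a little noise.
signals = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
features = signals @ mixing + rng.normal(scale=0.05, size=(200, 5))

# Keep only the two PCA features with the highest variance.
pca = PCA(n_components=2)
reduced = pca.fit_transform(features)

print(features.shape)  # (200, 5)
print(reduced.shape)   # (200, 2)

# Pearson correlation between the two PCA features.
corr = np.corrcoef(reduced, rowvar=False)[0, 1]
print(f"correlation between PCA features: {corr:.3f}")

# Share of the total variance kept by the two components.
print(f"variance kept: {pca.explained_variance_ratio_.sum():.3f}")
```

Here `fit_transform` learns the rotation and applies it in one call; the same fitted `pca` object could later be reused with `pca.transform` on new samples.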

Step-3: Intrinsic dimension

Intrinsic dimension is the number of features needed to approximate a dataset. This is the key idea behind dimension reduction. To reduce the dimension, we need to know which features to keep and which to discard.

PCA in scikit-learn does not pick the intrinsic dimension by default: unless told otherwise, it keeps all components. The intrinsic dimension can be read off as the number of PCA features with significant variance. We can explicitly specify the number of components to keep with n_components, or pass a fraction between 0 and 1 to keep just enough components to explain that share of the variance. When reducing the dimension, PCA:

  • Discards low variance PCA features
  • Assumes the high variance features are informative
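
The sketch below shows one way to look for the intrinsic dimension with scikit-learn. The dataset is invented for illustration (six features built from three underlying signals), and the 0.99 threshold is an arbitrary assumption, not a recommended value.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)

# Hypothetical dataset: 6 features built from 3 underlying signals plus
# a small amount of noise, so the intrinsic dimension is at most 3.
signals = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 6))
features = signals @ mixing + rng.normal(scale=0.01, size=(300, 6))

# With no n_components given, PCA keeps all components, which lets us
# inspect how much variance each PCA feature explains.
pca_full = PCA()
pca_full.fit(features)
print(pca_full.explained_variance_ratio_.round(4))

# A float between 0 and 1 asks scikit-learn to keep just enough
# components to explain that fraction of the total variance.
pca_auto = PCA(n_components=0.99)
pca_auto.fit(features)
print("components kept:", pca_auto.n_components_)
```

Plotting `explained_variance_ratio_` as a bar chart is a common way to spot where the variance drops off, which is where the low-variance PCA features begin.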

Hope you now have a basic idea of what happens behind the scenes in PCA. Please go through this Jupyter Notebook for examples to get a clearer picture. Feel free to ping me if you have anything to discuss.

Sometimes Python notebooks do not load properly on GitHub. In such cases, download the notebook and open it in your local Jupyter environment.

Analytics Vidhya
