PCA is a buzzword that pops up at many stages of data analysis, for many different purposes. But what is really behind Principal Component Analysis, and what is it used for? Let's go through it step by step.
First, let us understand what PCA does behind the scenes. I have attached a Jupyter Notebook with examples at the bottom of the article.
Step-1: Decorrelating Data Points
- What is the dimensionality of this graph? (Fig-1)
Yes, you are correct, it is a 2D graph.
- What do you think about this (Fig-2) graph?
Yes, it is a 1D graph, because, for every x, the y value is the same.
Let’s move to a tricky question: if you can guess this one, you have successfully passed the first step in PCA.
If you said 2D, nope, that is wrong. It is 1D. What? But how?
Yes, if we rotate the axes a bit, the data points align just as in the previous case (Fig-2). This is done in the following sub-steps:
- Rotate the data samples so they are aligned with the axes
- Shift the data samples so they have mean 0 (this moves the origin of the axes to the center of the data)
- Place the principal axis along the direction where the data varies most, and the second axis along the direction of least variance
- No information is lost in this step, but the correlation between the features is removed
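The sub-steps above can be sketched with scikit-learn. This is a minimal, hypothetical example: two strongly correlated features are fed to `PCA`, and after the transform their correlation is essentially zero while the data is centered at the origin.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated 2-D data: y is roughly 0.5 * x plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.1, size=500)
data = np.column_stack([x, y])

# The original features are strongly correlated (close to 1)
print(np.corrcoef(data.T)[0, 1])

# PCA centers the data and rotates it so the axes align
# with the directions of greatest variance
pca = PCA()
decorrelated = pca.fit_transform(data)

# After the rotation, the correlation between the features is ~0
# and each transformed feature has mean 0
print(np.corrcoef(decorrelated.T)[0, 1])
print(decorrelated.mean(axis=0))
```

Note that the transformed data still contains all of the original information; only the basis has changed.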
Step-2: Dimensionality Reduction
One of the major uses of PCA is dimensionality reduction. Assume you have dozens of features in your dataset. Unfortunately,
- We can only visualize up to 3D data using matplotlib
- Storing many redundant, weakly informative features consumes a lot of storage
- Computing on uninformative features is wasted effort
- Uninformative features may also harm prediction tasks, i.e. make predictions meaningless
So we need to reduce the noise in the data and keep the features that are meaningful for our task. This is where PCA comes in handy.
- When we apply PCA to a dataset, it first decorrelates the data by rotating it to find the principal directions.
- It then measures the linear (Pearson) correlation between features, which takes values between -1 and 1; a value of 0 means no correlation.
- PCA aligns the principal components with the rotated axes.
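As a quick sketch of dimensionality reduction in practice, the example below uses scikit-learn's built-in iris dataset (4 features) and keeps only the 2 highest-variance principal components; the exact variance figure is specific to this dataset.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

# Keep only the 2 highest-variance principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (150, 4)
print(X_reduced.shape)  # (150, 2)

# Fraction of the total variance retained by the 2 components
print(pca.explained_variance_ratio_.sum())
```

For iris, the two retained components explain well over 95% of the total variance, so very little information is lost in the reduction.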
Step-3: Intrinsic Dimension
Intrinsic dimension is the number of features needed to approximate a dataset. This is the key idea behind dimensionality reduction: to reduce the dimension, we need to know which features to keep and which to discard.
Scikit-learn's PCA can find the number of intrinsic dimensions for us, keeping only the PCA features with significant variance; we can also explicitly specify the number of components to retain. In short, PCA
- Discards low-variance PCA features
- Assumes the high-variance features are the informative ones
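One way to let scikit-learn pick the number of components is to pass a float to `n_components`: it then keeps just enough components to explain that fraction of the variance. The dataset below is synthetic and hypothetical: three measured features, where the third is a noisy combination of the first two, so the intrinsic dimension is 2.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 3 measured features, but the third is a
# near-deterministic combination of the first two, so the data
# really lives in a 2-D plane
rng = np.random.default_rng(1)
a = rng.normal(size=(300, 2))
third = a @ np.array([1.0, -2.0]) + rng.normal(scale=0.01, size=300)
X = np.column_stack([a, third])

# Passing a float asks PCA to keep just enough components
# to explain 95% of the variance
pca = PCA(n_components=0.95)
pca.fit(X)

# The low-variance third direction is discarded
print(pca.n_components_)  # 2
```

Here PCA discovers the intrinsic dimension itself: the first two components carry almost all of the variance, and the tiny-noise third direction is dropped.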
I hope this has given you a basic idea of what happens behind the scenes in PCA. Please go through the Jupyter Notebook for examples to get a clearer picture. Feel free to ping me if you have anything to discuss.
Sometimes Python notebooks do not load properly on GitHub. In that case, download the notebook and open it in your local Jupyter environment.