Principal Component Analysis — An Excellent Dimension Reduction Technique

Souvik Majumder · Published in Analytics Vidhya · 6 min read · Apr 7, 2020

This article discusses the basic significance of Principal Component Analysis (PCA) for reducing the dimensions of a dataset. I'll keep the explanation as simple as possible so that everyone can follow along, without digging too deep into the mathematics.

Before we start, let us briefly understand what Dimension Reduction actually means.

Let us assume that we have a large dataset, on which we need to apply some Machine Learning algorithm to predict an outcome. By large, I mean that the dataset might contain a huge number of variables, or what we call features. These features are also called dimensions.

Let us assume that we need to apply a simple Linear Regression to that dataset. Considering that the dataset has 35 independent variables (x), the equation of the linear regression model looks like this:
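y = β₀ + β₁x₁ + β₂x₂ + … + β₃₅x₃₅ + ε

where y is the predicted outcome, β₀ is the intercept, β₁ through β₃₅ are the coefficients of the 35 features, and ε is the error term.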

The model would take all 35 features and put them into the equation to predict the output.

But wouldn't taking all 35 features to predict the output every time create a problem?

It definitely would. There can be unnecessary columns or features which do not actually provide any vital information. The reasons could be as follows:

  • A large number of missing values in a variable or column. In such cases, the variable is not of much use.
  • An extremely low variance in a variable. This means that, for that particular variable, there is hardly any variation in the data across the rows.
  • Two or more variables or columns showing similar trends, and therefore likely carrying similar information. This is called high correlation.

Normally, in such cases, we can directly drop those variables. Removing some of the variables from the dataset is called Dimension Reduction.

But what if a majority of the variables are highly correlated? What if, by imputing a missing value, we are basically tampering with the actual information? Wouldn't that lead to a loss of information in the data?

So what do we do to avoid that ?

The solution is Principal Component Analysis.

What is Principal Component Analysis?

It is a technique that extracts a new set of variables from an existing large set of variables, so that we keep only a smaller set that captures a considerably high amount of the information, let's say 80%.

The newly extracted variables are called principal components. These components are always linear combinations of the original independent variables.

All k principal components together (where k is the number of original features) capture 100% of the information in the original data. However, we do not know in advance how many principal components would be sufficient to select.

In other words, the number of principal components to start with is unknown, so we initially keep it as None (the default, which retains all components).

Every principal component captures a different amount of the information: PC1 captures the most variance, PC2 the next most, and so on.

Let's say my business requirement permits me to capture 80% of the vital information. In that case, a collection of PC1, PC2 and PC3 would be sufficient.

Therefore, we set the number of components to 3. As a result, the number of features needed before applying ML reduces from 35 to 3.

Programmatic Explanation

Let us take the Breast Cancer dataset as an example.
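A minimal sketch of this step, assuming the copy of the dataset that ships with scikit-learn:

```python
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer dataset (569 rows, 30 numeric features)
data = load_breast_cancer()
X, y = data.data, data.target
```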

We transform all the original x-features to the same scale.
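One way to do this, assuming scikit-learn's StandardScaler (z-score scaling) is the scaler used:

```python
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
```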

We print the current shape of the original x-features.
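Continuing with the variables defined above:

```python
print(X_scaled.shape)
# (569, 30): 569 rows, 30 feature columns
```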

The above tells us that the total number of columns or independent features in the dataset is 30.

Now we start applying Principal Component Analysis. However, as stated earlier, we do not know up front how many principal components to consider, so we keep the default value of n_components, which is None.
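A sketch of this step with scikit-learn's PCA:

```python
from sklearn.decomposition import PCA

# n_components=None (the default) keeps all 30 components
pca = PCA(n_components=None)
pca.fit(X_scaled)
```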

We print the shape of the principal component object.

It should be 30, since we initially feed in all the columns together.
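The article's exact print statement isn't shown, but inspecting the fitted object's components_ attribute illustrates the point:

```python
print(pca.components_.shape)
# (30, 30): 30 components, each a linear combination of the 30 features
```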

We figure out the variance, or the information content as a percentage, for all the components fitted in the PCA object.
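In scikit-learn this is exposed as explained_variance_ratio_; its cumulative sum shows how much information the first k components retain together:

```python
import numpy as np

# Fraction of total variance captured by each component, largest first
print(pca.explained_variance_ratio_)

# Running total: variance captured by the first k components together
print(np.cumsum(pca.explained_variance_ratio_))
# the cumulative ratio reaches roughly 0.84 by the fifth component
```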

We can observe that the first 5 principal components sum up to around 84 % of the information.

This tells us that we can consider the optimum number of principal components to be 5.

So we execute Principal Component Analysis again, this time setting the number of components to 5.

We transform the original 30 x-features into 5 variables with the help of the PCA object.
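Continuing the sketch:

```python
# Refit PCA, this time keeping only the first 5 components
pca_5 = PCA(n_components=5)
X_pca = pca_5.fit_transform(X_scaled)
```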

Let us now print the shape of the transformed x-features.
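```python
print(X_pca.shape)
# (569, 5): the 30 original features are reduced to 5 components
```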

Now we can proceed to apply Machine Learning, using the transformed PCA data as the input. Since this is a classification problem, let us choose the Logistic Regression algorithm to train the model.

To see the difference in model performance, we will apply Logistic Regression to both the original and the transformed features.

Let us consider the ROC AUC score as the evaluation metric for both scenarios.
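A sketch of the comparison; the split ratio and random_state here are illustrative assumptions, not values from the article:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def evaluate(features, labels):
    # Train Logistic Regression and score ROC AUC on a held-out test set
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print("Original features:", evaluate(X_scaled, y))
print("5 PCA components: ", evaluate(X_pca, y))
```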

Surprisingly, we find that the ROC AUC score with the transformed features is lower than that with the original features. This can happen when the discarded components still carried information that was useful for classification.

So, in order to improve the score, we perform PCA again, gradually increasing the number of principal components, i.e., n_components.
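One way to run this sweep, reusing the evaluate helper sketched above:

```python
# Track ROC AUC as the number of retained components grows
for k in range(5, 16):
    X_k = PCA(n_components=k).fit_transform(X_scaled)
    print(k, evaluate(X_k, y))
```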

At n_components = 11 and 12, we find that the ROC AUC score with the transformed features matches that with the original features. Beyond this number, the score starts decreasing again.

We check the printed scores at n_components = 10, 11, 12 and 13 to confirm this.

From the above observations, it is evident that the ROC AUC score can be kept the same as with the original features by setting the number of principal components to 11.

Therefore, we can drop 30 - 11 = 19 unnecessary features from the dataset through dimension reduction with Principal Component Analysis. This also improves model training and prediction time.

Another advantage of PCA is that it reduces noise in the data.
