PCA: The right way

dataLabb
SFU Professional Computer Science

--

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/mpcs.

Authors: Tanvi, Aman Purohit, Sambhav Rakhe, Elavarasan Murthy

PCA (Principal Component Analysis) is certainly one of the first data pre-processing and transformation techniques that comes to mind for dimensionality reduction, and we generally tend to apply it as if it were some kind of magic. While PCA is a very useful and powerful tool, can we really apply it any time we need to reduce dimensions? What do its results represent? How should we use them? Are they logical?

In this blog, we will delve deeper and intuitively explore PCA, with the goal of understanding exactly where PCA is a friend and where it is a foe. You be the judge!

PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. (Jolliffe, 2002)

Let’s simplify it. Intuitively, PCA is like capturing a photograph: we are trying to capture most of a higher-dimensional world in a lower-dimensional photo. How well we do depends on the angle we take the photo from. Taking a photo of a house from the side is practically useless, while the front and corner views capture the most, including windows, doors and, to some extent, the depth too. This is exactly what PCA does: finding the best lower-dimensional image that captures the most of our data. But is PCA really that simple? With all the eigenvectors and covariance concepts involved, isn’t there more to it?


Let’s go through the math of PCA with the new understanding and see if things make sense.

Coming back to our previous example, we want to capture an image (projection) of our data in such a way that the maximum possible information is retained in the lower dimension. Our task reduces to an optimization problem: find a unit vector 𝒖 such that the variance of the projection of our data onto it is as large as possible. Hence, we get the optimization problem

maximize (1/N) Σᵢ (𝒖ᵀ(𝒙ᵢ − 𝒙̄))²  subject to  𝒖ᵀ𝒖 = 1,

where 𝒙̄ is the mean of our training samples. We can rewrite the above optimization problem using the covariance matrix 𝛴 as

maximize 𝒖ᵀ𝛴𝒖  subject to  𝒖ᵀ𝒖 = 1.

If we think about it for a second, variance is all about the spread of the data, and covariance denotes how the spread in one dimension changes with respect to another. The covariance matrix therefore captures the shape/spread of our entire dataset in terms of variance, and it helps us obtain the direction that carries the most variance.

Since 𝛴 is symmetric, maximizing 𝒖ᵀ𝛴𝒖 comes down to finding the eigenvector of 𝛴 with the largest eigenvalue.

Eigenvectors denote the axes of a matrix transformation. During the transformation, eigenvectors do not get knocked off their path; they only get stretched or squashed in their original direction by some factor, which is called the eigenvalue.

Since the eigenvector with the largest eigenvalue denotes the direction where the maximum stretching/squashing occurs, it gives us the direction of maximum variance. We select mutually perpendicular eigenvectors in decreasing order of their eigenvalues and project our data onto them, giving us the corresponding principal components.
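To make this concrete, here is a minimal NumPy sketch of the eigenvalue step, using a small, made-up 2×2 covariance matrix purely for illustration: the eigenvector with the largest eigenvalue is the direction of maximum variance.

```python
import numpy as np

# Hypothetical 2-D covariance matrix with strong spread along a diagonal direction
cov = np.array([[4.0, 1.5],
                [1.5, 1.0]])

# eigh is appropriate here because a covariance matrix is symmetric
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort by decreasing eigenvalue: the first column then points along
# the direction of maximum variance (the first principal axis)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)         # variance captured along each principal axis, largest first
print(eigenvectors[:, 0])  # the unit vector u that maximizes uᵀΣu
```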

Now let’s take a step back and see where PCA fits into the data-processing journey, to better understand its role.

Where does PCA help?

Know the enemy and know yourself; in a hundred battles you will never be in peril. — Sun Tzu

Generally, while dealing with data, there are two issues we need to take care of:

1. Curse Of Dimensionality

Did you know that you can be cursed in the field of data science? Well, this happens when you work with a high-dimensional dataset. To put it in simple terms, too much of something is good for nothing. You might think that having a higher number of features is useful because it carries more information, but in the real world it is rarely helpful due to the possibility of noise and redundancy. To be precise, a large number of columns/features makes it difficult to analyse the data or identify patterns in it, and it also leads to high computational costs and overfitting during model training. For example, whenever you perform one-hot encoding on a categorical variable with high cardinality, you are most likely to get cursed.
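A quick sketch of how that curse creeps in, using a hypothetical high-cardinality column (the column name and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with thousands of distinct values (e.g. zip codes)
rng = np.random.default_rng(0)
df = pd.DataFrame({"zip_code": rng.integers(10000, 15000, size=20000).astype(str)})

# One-hot encoding explodes one column into thousands of mostly-zero columns
encoded = pd.get_dummies(df["zip_code"], prefix="zip")
print(df.shape)       # (20000, 1)      -> a single column before encoding
print(encoded.shape)  # (20000, ~4900)  -> thousands of sparse columns after
```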


2. Multicollinearity

Generally, in a prediction scenario, we expect the independent variables to have a correlation with the target variable. But what if you were told that two or more independent variables can also be highly correlated with each other?

Multicollinearity is an undesirable situation in which we cannot distinguish the individual effects of the independent variables on the target, and it makes the variables’ coefficients unstable. It may not affect the model fit or prediction accuracy, but we lose the significance of the variables and the reliability of the results. We can check for the presence of multicollinearity in our data through tests like the Variance Inflation Factor (VIF) and pairwise correlation.
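As a rough sketch of such a check, here is the VIF test with statsmodels on made-up predictors (the variable names and the threshold mentioned in the comment are ours, for illustration only):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors where x2 is almost a linear function of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=500),  # highly correlated with x1
    "x3": rng.normal(size=500),                      # independent of the others
})

# Add an intercept column, since VIF is defined for a regression with a constant
X = sm.add_constant(df)

# A VIF well above roughly 5-10 is a common rule of thumb for problematic collinearity
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vif)  # x1 and x2 show very large values; x3 stays near 1
```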

Best ways to handle the above issues

1. Feature Selection

In this process we basically ask the question “Is it worthy enough?” of each feature/column in our data.

This helps us drop the irrelevant columns and retain only a subset of the original features, reducing the dimensionality of the dataset.

There are predominantly three types of Feature Selection techniques: Filter methods, Wrapper methods and Embedded methods.
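As a sketch of what these families look like in practice, here is one scikit-learn example of each (the particular estimators and the choice of keeping 10 features are arbitrary, purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter method: score each feature independently (ANOVA F-test) and keep the top 10
filter_sel = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper method: recursively drop the weakest features according to a model
wrapper_sel = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded method: let a model's own feature importances decide what to keep
embedded_sel = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)

print(filter_sel.get_support().sum(),
      wrapper_sel.get_support().sum(),
      embedded_sel.get_support().sum())
```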

The concept of feature selection might seem simple, but this approach has drawbacks. First, searching over combinations of features to find the right subset, whether greedily or exhaustively, can be computationally expensive. Second, it fails to capture the relationships between features, and hence does not eliminate multicollinearity issues. Finally, we gain no information from the variables that are dropped. In other words, that information is simply lost.

2. Feature Extraction

The primary idea of feature extraction is to derive new features from the existing ones rather than simply dropping columns, essentially compressing the data while keeping only the most important information.

You guessed it right! PCA is a Feature Extraction technique. And it not only helps us deal with the curse of dimensionality, it also produces completely uncorrelated features, removing multicollinearity.
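A small sanity check of the decorrelation claim, using scikit-learn's PCA on its built-in breast cancer data (the choice of dataset and of five components is arbitrary here):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first five principal components
components = PCA(n_components=5).fit_transform(X_scaled)

# Correlation matrix of the new features: off-diagonal entries are ~0,
# i.e. the principal components are uncorrelated with each other
corr = np.corrcoef(components, rowvar=False)
print(np.round(corr, 3))
```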

Quick tips for applying PCA

• If you are in a situation where you cannot decide which columns to remove from the feature set, i.e. you don’t know which features are important but still need to reduce dimensions, PCA is a good choice.

• Do you want to ensure independence among the predictor variables? Then use PCA!

• Do not use PCA if you are not comfortable making your new variables less interpretable.

• PCA can also be used for data compression and for denoising images. It does not eliminate noise but can reduce it.

• By nature, PCA is an unsupervised linear transformation technique, so it works best when there is a linear relationship between the features. For non-linearly correlated data, either apply a transformation such as a log transform to make the relationships more linear, or use Kernel PCA.

• There is no hard rule for the number of principal components to keep after applying PCA. Choosing the first k eigenvectors that explain roughly 85–95% of the variance is usually sufficient; this can be decided using a scree plot, as in the sketch after this list.
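Here is a minimal sketch of that rule of thumb with scikit-learn, again on the built-in breast cancer data and with a 90% threshold chosen purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)  # keep every component so we can inspect the variance profile
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose components together explain at least 90% of the variance
k = int(np.argmax(cumulative >= 0.90)) + 1
print(f"{k} components explain {cumulative[k - 1]:.1%} of the variance")

# Scree plot: variance explained per component, plus the cumulative curve
xs = range(1, len(cumulative) + 1)
plt.plot(xs, pca.explained_variance_ratio_, marker="o", label="per component")
plt.plot(xs, cumulative, marker="s", label="cumulative")
plt.axhline(0.90, linestyle="--", color="grey")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.legend()
plt.show()
```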

Wait, don’t get carried away!!! Does knowing where to use PCA mean you can use it without any worries?

WHAT IF everything is not as important?

PCA relies on an eigenvalue decomposition, and like other unweighted least-squares methods it assumes that every feature in the dataset is of equal importance. But how often do we find that every feature is equally important?

For example, when predicting the price of a house, the number of bedrooms is a more important feature than the number of grocery stores near the house. However, if we apply PCA directly, it will treat both features as equally important.

Many solutions to this problem already exist; the most common is weighted PCA, which addresses the issue by assigning a weight to each feature. For example, by giving more weight to the number of bedrooms than to the number of nearby grocery stores, we can make our house-price model more accurate.
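A very rough sketch of the weighting intuition, not a full weighted-PCA implementation (proper formulations exist in the literature; the data, weights and column meanings below are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical housing-style data: column 0 = bedrooms, column 1 = nearby grocery stores
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(1, 6, size=300),     # bedrooms
                     rng.integers(0, 20, size=300)])   # grocery stores nearby

X_scaled = StandardScaler().fit_transform(X)

# Crude version of the idea: multiply each standardized column by a weight that
# reflects how important we believe it is, then run ordinary PCA on the result
weights = np.array([3.0, 1.0])  # bedrooms weighted 3x more than grocery stores
X_weighted = X_scaled * weights

components = PCA(n_components=1).fit_transform(X_weighted)
print(components[:5])  # first principal component, dominated by the bedrooms column
```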

WHAT IF being important is not enough for PCA?

Let’s assume that the issue of equal importance of features has been taken care of. The core concept used in PCA is variance. But what if the most important features are the ones that do not have the most variance? For example, to predict whether a cricket team will qualify for the semi-finals, the most important feature is the number of wins the team had in the preliminary stage. But this feature would generally have the least variance, while other features like runs scored per match or wickets taken per match have high variance. Mathematically, it is not guaranteed that the eigenvector with the smallest eigenvalue corresponds to the least important feature. As the example shows, the direction with the smallest eigenvalue may carry one of the most important features, making PCA an unfavourable choice in such cases.

WHAT IF the variance is varying?

For now, let’s assume that the feature with the highest variance is indeed the most important one. If that holds, PCA should be a perfect choice for feature extraction, right? Well, there’s more. PCA assumes that every single data point within a feature contributes equally to the variance. However, can data points contribute differently? Here comes the most important part to remember: PCA performs poorly on heteroscedastic data. In the context of this blog, heteroscedastic data is data in which the spread (variance) of the same feature varies across its range. In such cases as well, PCA is not a favourable choice.
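To make the term concrete, here is a tiny simulated example of a heteroscedastic feature (the numbers are invented; the point is only that the spread grows along the feature):

```python
import numpy as np

# Simulated heteroscedastic feature: the noise, and hence the spread,
# grows with x, so different parts of the same feature have very different variance
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 500)
y = 2 * x + rng.normal(scale=0.5 * x, size=x.size)  # noise scale increases with x

# Variance of y over the first vs. the last fifth of the range
print(y[:100].var(), y[-100:].var())  # the second value is far larger
```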


Great! ACHIEVEMENT UNLOCKED — you are now familiar with the intricacies of PCA. Let’s ACTUALLY use PCA.

One thing to always take care of before applying PCA is scaling. Features with highly varying magnitudes and ranges can make PCA biased towards the features with numerically larger values. For example, age ranges at most between 1 and 100, whereas salary generally lies at 10⁴ and above, which makes it critical to scale the features. We can use either normalization or standardization for scaling.

Normalization:

Normalization is a scaling technique that adjusts the values to a scale of 0 to 1. It is especially useful when the distribution of the underlying data is unknown.

Standardization:

Standardization transforms the columns of data to a mean of zero and a standard deviation of one, and is generally useful when the underlying data distribution is Gaussian.
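A small sketch of both scalers with scikit-learn, on invented age/salary-style columns echoing the example above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical columns with very different magnitudes: age (tens) vs salary (tens of thousands)
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(18, 80, size=1000),        # age
                     rng.normal(60000, 15000, size=1000)])   # salary

# Normalization: rescale each column to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: zero mean and unit standard deviation per column
X_std = StandardScaler().fit_transform(X)

print(X_norm.min(axis=0), X_norm.max(axis=0))  # [0, 0] and [1, 1]
print(X_std.mean(axis=0), X_std.std(axis=0))   # ~[0, 0] and [1, 1]
```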

Let’s see a basic implementation of PCA without using any libraries for the decomposition itself, just to see what goes on behind the simple pca.fit_transform().

In our example, we use scikit-learn’s built-in breast cancer dataset, which contains 30 features used to classify whether a cancer is benign or malignant. We perform standard scaling of the data points before applying the PCA algorithm, reduce the scaled dataset to 2 principal components, and visualize the result with a scatter plot.
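The original embedded notebook code is not reproduced here, so the following is a sketch consistent with that description, implementing the PCA step with plain NumPy (the function and variable names are ours):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

def pca_from_scratch(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix, no sklearn.decomposition."""
    X_centered = X - X.mean(axis=0)                   # center each feature
    cov = np.cov(X_centered, rowvar=False)            # covariance matrix Σ
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: Σ is symmetric
    order = np.argsort(eigenvalues)[::-1]             # sort by decreasing variance
    top_vectors = eigenvectors[:, order[:n_components]]
    return X_centered @ top_vectors                   # project onto the top directions

# Breast cancer dataset: 30 features, binary target (benign / malignant)
data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

components = pca_from_scratch(X_scaled, n_components=2)

plt.scatter(components[:, 0], components[:, 1], c=data.target, cmap="coolwarm", s=10)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Breast cancer data projected onto 2 principal components")
plt.show()
```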

Conclusion

PCA is a very important and powerful technique, but understanding the concept and the context before applying it is the only way to ensure that PCA will be a friend and not a foe. We hope this article was helpful in addressing that and will help you navigate the nuances of the data world a little better. Thank you for reading! 😃
