Principal Component Analysis (PCA) — Part 1 — Fundamentals and Applications

Nitish Kumar Thakur
Published in Analytics Vidhya · 8 min read · Feb 17, 2021

Principal Component Analysis is among the most popular, fastest, and easiest-to-interpret Dimensionality Reduction techniques. It exploits the Linear Dependence among variables. Some of its applications are:

  • Decorrelating Variables; Making the features Linearly independent
  • Outlier/Noise Removal
  • Data Visualization
  • Dimensionality Reduction

In this article, we will discuss these applications and why PCA works.

Why does Dimensionality Reduction using PCA Work?

Dimensionality reduction using PCA works because of the presence of Collinearity (or Linear Dependence among features) in the data. Let us see what this means. Imagine the following 2 scenarios:

  1. Case A: Variables x1 and x2 are highly collinear (linearly dependent on each other)
  2. Case B: Variables x1 and x2 are Linearly Independent

Let us now plot the scatterplots of x1 vs x2 for the 2 cases:

Case A

Case A: x1 and x2 are linearly dependent

I have drawn a box at the boundary of the plots — to indicate a bounding box within which the data exists.

Here, x1 and x2 have high linear dependence. With 2 variables, this means they have a large correlation (2 variables have high linear dependence when they have high correlation; when 3 or more variables have high linear dependence, correlation is not always a reliable measure of that dependence, because correlation only captures the linear relationship between 2 variables at a time). We observe the following:

  1. The data is distributed very close to a straight line (in red), such that the spread of the data along the line is maximum and the spread perpendicular to the line is minimum. Thus, by remembering the spread of the data along the diagonal, we retain most of the information. This is exactly what happens when we perform dimensionality reduction using PCA: we represent data using the directions along which it varies most.
  2. The majority of the bounding box is empty: the data occupies a very small portion of the box (close to the diagonal). The remaining part of the box contains no data (shown in yellow in the figure below).

Case B

Case B: No Linear Dependence between x1 and x2

x1 and x2 have no linear dependence — in this case, it means that they have no correlation. We observe the following:

  1. The data DOES NOT lie along a line, unlike Case A where the data stayed close to a line. There is no special direction along which the data varies “more”; all directions are equally likely to contain data. Hence, this case shows no promise for dimensionality reduction.
  2. Most of the area in the box contains data (unlike Case A where we had large empty regions with no data).

Following are some important details:

  1. Linear dependence among variables causes data to lie along lower-dimensional subspaces, or hyperplanes, as in Case A. This leaves a large region with no data. In fact, the stronger the linear relation, the larger the unoccupied space will be.
  2. Dependent variables (including when the relation between variables is non-linear) in general cannot occupy all the space available to them in a bounding box, because they vary together along certain directions (hence the dependence). Variable dependence restricts the regions where data can lie. We say that dependent variables exist primarily along lower-dimensional manifolds. For PCA, we will only be interested in linear manifolds. The sketch after this list makes the contrast between Case A and Case B concrete.
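Here is a minimal sketch on synthetic data (the noise levels and random seed are arbitrary choices for illustration) comparing the two cases: when x2 is nearly a linear function of x1, almost all of the variance falls along the first principal component; when x1 and x2 are independent, the variance splits roughly evenly.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 1000

# Case A: x2 is (almost) a linear function of x1 -> strong linear dependence
x1 = rng.normal(size=n)
case_a = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.1, size=n)])

# Case B: x1 and x2 are independent -> no linear dependence
case_b = rng.normal(size=(n, 2))

for name, data in [('Case A', case_a), ('Case B', case_b)]:
    print(name, PCA().fit(data).explained_variance_ratio_.round(3))

# Expected pattern: Case A concentrates nearly all variance in the first component,
# while Case B splits it roughly 50/50 across the two components.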

What is PCA?

PCA takes a matrix of samples and features as input and returns a new matrix whose features are a linear combination of the features in the original matrix.

  1. These new features generated by PCA are orthogonal (at right angles) to each other.
  2. The new features are sorted in order of decreasing variance. The first PC (Principal Component) explains the most variance and the last PC explains the least. In Case A, the first PC would lie along the diagonal and the second PC would lie perpendicular to the diagonal (as mentioned in Point 1). A short sketch verifying both properties follows this list.
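As a quick check of these two properties, here is a minimal sketch on synthetic data (the correlated features below are arbitrary illustrative choices): the component directions returned by scikit-learn are orthonormal, and the explained variances come out in decreasing order.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Three correlated features built from two latent signals (illustrative data)
latent = rng.normal(size=(500, 2))
x = np.column_stack([latent[:, 0],
                     latent[:, 0] + 0.5 * latent[:, 1],
                     latent[:, 1] + 0.1 * rng.normal(size=500)])

pca = PCA().fit(x)

# Property 1: the component directions are orthonormal, so W @ W.T is the identity
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(3)))   # True

# Property 2: explained variance is sorted in decreasing order (PC1 first)
print(pca.explained_variance_)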

Data Preparation and Interpretation of PCA:

It is important to standardize data before using PCA. PCA measures the variance of data along orthogonal directions. If a feature A takes values in the range 0–10000 with a standard deviation of, say, 200, and another feature B takes values in the range 0–100 with a standard deviation of, say, 20, feature A would naturally contribute more in deciding the direction of maximum variance, simply due to its larger variance. For example, a change of 100 units causes only a 1% change in feature A (1% of its range) while it causes a 100% change in feature B (100% of its range).
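Here is a small sketch of this effect (the ranges and synthetic data are arbitrary illustrations, matching the example above): without standardization the large-scale feature dominates the first component; after StandardScaler both features contribute on an equal footing.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature A spans roughly 0-10000, feature B roughly 0-100
a = rng.uniform(0, 10000, size=1000)
b = rng.uniform(0, 100, size=1000)
x = np.column_stack([a, b])

# Without scaling, the first component is dominated by the large-variance feature A
print(PCA().fit(x).explained_variance_ratio_.round(4))

# After standardization, both features contribute roughly equally
x_scaled = StandardScaler().fit_transform(x)
print(PCA().fit(x_scaled).explained_variance_ratio_.round(4))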

Let us briefly see how to interpret the components produced by PCA. We fit PCA to 3 features chosen from the Boston Housing dataset: LSTAT, RM and AGE.

Figure A: Loadings of the Principal Components

Since the original data has 3 columns(or coordinates), our transformed data(in the Principal component space) will also have 3 columns(or coordinates).

The above table shows the importance/contribution of each feature in forming each coordinate of the transformed data. It means that LSTAT has a weightage of .6564, RM has a weightage of -.5365 and AGE has a weightage of .5304 in the calculation of the first coordinate of the transformed data. The first coordinate corresponds to the first Principal Component.
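For reference, a loadings table like Figure A can be produced from the fitted PCA object itself. The sketch below assumes a scikit-learn version where load_boston is still available (it was removed in scikit-learn 1.2); also, the signs of entire components can flip between library versions, so the values may differ from Figure A by a sign.

import pandas as pd
from sklearn import datasets, decomposition, preprocessing

# load_boston is only available in scikit-learn versions before 1.2
boston = datasets.load_boston()
x = pd.DataFrame(boston.data, columns = boston.feature_names)[['LSTAT', 'RM', 'AGE']]

# Standardize, then fit PCA on the three chosen features
x_scaled = preprocessing.StandardScaler().fit_transform(x)
pca = decomposition.PCA().fit(x_scaled)

# Each row of components_ holds the loadings of one Principal Component
loadings = pd.DataFrame(pca.components_, columns = x.columns, index = ['PC1', 'PC2', 'PC3'])
print(loadings.round(4))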

This can be used as a basis for feature selection, as mentioned in Applied Predictive Analytics: we can retain the feature which has the highest loading in each Principal Component. For example, in the loadings table, LSTAT, AGE and LSTAT have the highest absolute loadings of .6564, .7150 and -.7544 in the 1st, 2nd and 3rd Principal Components respectively. So, we can select LSTAT and AGE as features for modelling purposes. The advantage here is that we do not need to transform the data into the Principal Component space to reduce the number of features, so we retain interpretability (which is great, because performing PCA reduces the interpretability of the ML pipeline).
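Continuing from the loadings DataFrame computed above, here is a minimal sketch of this selection rule (keep the feature with the largest absolute loading in each component, then deduplicate):

# For each Principal Component, keep the feature with the largest absolute loading
top_feature_per_pc = loadings.abs().idxmax(axis = 1)
print(top_feature_per_pc)

# Deduplicate while preserving order to obtain the selected feature subset
selected_features = list(dict.fromkeys(top_feature_per_pc))
print(selected_features)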

Applications of PCA

1. Removing Collinearity and Correlation in Data

Transforming data using PCA de-correlates the variables. In other words, it forces the transformed features to have no correlation among them. Let us use the Boston Housing Data to check the correlation matrix before and after using PCA.

Before PCA:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

# Load Boston Housing Dataset
x = pd.DataFrame(datasets.load_boston().data, columns = datasets.load_boston().feature_names)
# Generate heatmap of the correlation matrix
plt.figure(figsize = (12.5, 7.5))
sns.heatmap(x.corr().round(3), vmax = 1, vmin = -1, fmt = '.2f', annot = True, linecolor = 'white',
            linewidths = .1, annot_kws = {'fontsize': 12, 'weight': 'bold'}, cmap = 'coolwarm')
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.title('Correlation of Variables BEFORE transforming them using PCA', fontsize = 14, weight = 'bold')
Correlation between Variables before transforming them using PCA

We can see multiple correlated features. For example, RAD and TAX have high correlation. Let us transform this data using PCA and plot the correlation matrix.

After PCA:

# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, decomposition, preprocessing

## Load Boston Housing Dataset
x = pd.DataFrame(datasets.load_boston().data, columns = datasets.load_boston().feature_names)
# Scale Data to have zero mean and unit variance
scaler = preprocessing.StandardScaler().fit(x)
x = pd.DataFrame(scaler.transform(x), columns = x.columns)
# Fit PCA
pca = decomposition.PCA().fit(x)
# Get transformed data
x_transformed = pd.DataFrame(pca.transform(x), columns = np.arange(1, x.shape[1] + 1))
# Generate heatmap of the correlation matrix
plt.figure(figsize = (12.5, 7.5))
sns.heatmap(x_transformed.corr().round(3), vmax = 1, vmin = -1, fmt = '.2f', annot = True, linecolor = 'white',
            linewidths = .1, annot_kws = {'fontsize': 12, 'weight': 'bold'}, cmap = 'coolwarm')
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.ylabel('Principal Component', fontsize = 12, weight = 'bold')
plt.xlabel('Principal Component', fontsize = 12, weight = 'bold')
plt.title('Correlation of Variables AFTER transforming them using PCA', fontsize = 14, weight = 'bold')
Correlation matrix after PCA — All the cross-correlations are zero

Transforming the data using PCA removed the collinearity that was previously present in the data. However, since PCA only removes linear dependence among variables, the transformed variables may still be dependent in a non-linear way.
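As a quick numerical check, continuing from the x_transformed DataFrame computed above, the off-diagonal entries of the correlation matrix should be zero up to floating-point error:

import numpy as np

# Off-diagonal entries of the correlation matrix of the PCA-transformed data
corr = x_transformed.corr().values
off_diagonal = corr[~np.eye(corr.shape[0], dtype = bool)]

# The largest cross-correlation should be numerically indistinguishable from zero
print(np.abs(off_diagonal).max())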

2. Noise Removal and Outlier Detection

This is an important application of PCA and a special case of a family of Anomaly Detection methods. It is a multivariate method of Outlier Detection: univariate methods (like the z-score method or Tukey's method) consider each variable independently while detecting outliers, whereas PCA detects outliers by simultaneously considering the values of all the variables.

Let us assume our data has 10 columns. The idea is to do the following:

  1. Transform Data into the Principal Component Space.
  2. Retain the Principal Components which explain the largest variance (say ~99% of the variance). For example, let us say we retain 9 columns (i.e. 9 principal components).
  3. Transform the data back to the original space using only the information contained in the retained Principal Components. Since we eliminated one component in step 2, we cannot hope for a perfect reconstruction of the original data.
  4. Measure the reconstruction error for each observation. Most of the data should be accurately reconstructed, as we retained most of the variance. The observations which were poorly reconstructed are potential outliers.

Following are 2 hyperparameters of the above algorithm:

  1. Percentage of variance to be retained in step 2
  2. Threshold reconstruction error for tagging observations as anomaly/Outlier

As always, it is often better to also analyze the outliers detected using the above method from a domain perspective. A minimal code sketch of the procedure is shown below.
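The sketch uses the same assumptions as the earlier code (load_boston requires scikit-learn < 1.2); the 99% variance level and the 99th-percentile error threshold are the two hyperparameters and are arbitrary choices here.

import numpy as np
import pandas as pd
from sklearn import datasets, decomposition, preprocessing

# Load and standardize the data (load_boston requires scikit-learn < 1.2)
x = pd.DataFrame(datasets.load_boston().data, columns = datasets.load_boston().feature_names)
x_scaled = preprocessing.StandardScaler().fit_transform(x)

# Steps 1-2: keep enough Principal Components to explain ~99% of the variance
pca = decomposition.PCA(n_components = 0.99).fit(x_scaled)

# Step 3: project into the Principal Component space and reconstruct back
x_reconstructed = pca.inverse_transform(pca.transform(x_scaled))

# Step 4: per-observation reconstruction error (mean squared error over features)
reconstruction_error = np.mean((x_scaled - x_reconstructed) ** 2, axis = 1)

# Flag observations whose error exceeds a chosen threshold (here the 99th percentile)
threshold = np.quantile(reconstruction_error, 0.99)
potential_outliers = np.where(reconstruction_error > threshold)[0]
print(potential_outliers)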

3. Data Visualization

Since the first 2 Principal Components explain most variance, we can visualize the data through a scatterplot of the first 2 Principal Components.

The first 2 Principal Components provide the best 2 Dimensional Approximation of data provided we are only allowed to use linear transformations to create the components and our loss is MSE.

# Imports
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, decomposition, preprocessing

## Load Boston Housing Dataset
x = pd.DataFrame(datasets.load_boston().data, columns = datasets.load_boston().feature_names)
# Scale Data to have zero mean and unit variance
scaler = preprocessing.StandardScaler().fit(x)
x = pd.DataFrame(scaler.transform(x), columns = x.columns)
# Fit PCA - Retain only 2 components
pca = decomposition.PCA(n_components = 2).fit(x)
# Get transformed data
x_transformed = pd.DataFrame(pca.transform(x), columns = ['Principal Component 1', 'Principal Component 2'])
# Generate Scatter plot of the first 2 principal components
x_transformed.plot.scatter(x = 'Principal Component 1', y = 'Principal Component 2', grid = True,
                           figsize = (10, 6), fontsize = 14)
plt.xlabel('Principal Component 1', fontsize = 14)
plt.ylabel('Principal Component 2', fontsize = 14)
Visualization of Boston Housing Data containing 13 Features — using the first 2 Principal Components

Such a visualization can be used to visualize clusters. Often, a third variable is used to add color; in Cluster Analysis, the color variable can be the cluster ID (see the sketch below). A 3D plot of the first 3 Principal Components can also be made, although it can occasionally be difficult to interpret.
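Here is one way to color the scatterplot by a cluster ID, continuing from the x and x_transformed objects above. The use of KMeans with 3 clusters is an arbitrary illustrative choice, not part of the original analysis.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Cluster the scaled data (3 clusters chosen arbitrarily for illustration)
cluster_id = KMeans(n_clusters = 3, random_state = 0).fit_predict(x)

plt.figure(figsize = (10, 6))
plt.scatter(x_transformed['Principal Component 1'], x_transformed['Principal Component 2'],
            c = cluster_id, cmap = 'viridis', s = 20)
plt.xlabel('Principal Component 1', fontsize = 14)
plt.ylabel('Principal Component 2', fontsize = 14)
plt.colorbar(label = 'Cluster ID')
plt.title('First 2 Principal Components colored by cluster ID')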

Summary

To summarize, Principal Components can be used to remove noise/detect outliers, decorrelate variables, and visualize high-dimensional data in 2 dimensions. They provide the best p-dimensional approximation of the data (for p < the number of features in the data), provided we are only allowed to use linear transformations to create the components and our loss is MSE.

There are many variants of PCA, each optimal in different circumstances. I will cover the variants of PCA in the next post.
