Published in The Startup · Athul Anish · Nov 16, 2020

Principal Component Analysis

Introduction to PCA

[Image: a prism splitting a beam of light into rainbow colours]

Principal component analysis (PCA) follows a set of statistical procedures to transform a set of correlated variables into a set of uncorrelated variables called principal components.

Traditionally, principal component analysis is performed on a symmetric square matrix, and it is often confused with a similar multivariate procedure called factor analysis.

This matrix can be a pure sum-of-squares-and-cross-products (SSCP) matrix, a scaled SSCP matrix, i.e. the covariance matrix, or the sum of squares and cross products computed from standardized data, i.e. the correlation matrix. The short sketch below shows how these three matrices relate.
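As a quick illustration (a sketch using R's built-in USArrests data; this code is not part of the original article):

    # Center the data, then form the three matrices discussed above
    X <- scale(USArrests, center = TRUE, scale = FALSE)
    sscp <- crossprod(X)          # SSCP matrix: t(X) %*% X on centered data
    sscp / (nrow(X) - 1)          # scaling the SSCP gives the covariance matrix
    cor(USArrests)                # scaled SSCP of standardized data: correlation matrix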

Assumptions

  • Sample size: Ideally, there should be at least 150 cases, with a minimum of five cases per variable
  • Correlations: Some degree of correlation must exist between the variables for PCA to be worthwhile
  • Linearity: A linear relationship must exist between the variables
  • Outliers: There should be no outliers in the data, as they distort the results of PCA

Objectives of PCA

  1. PCA reduces the attribute space from a larger number of variables to a smaller number of factors, and it does not require a dependent variable to be specified; in other words, PCA is a ‘non-dependent’ procedure.
  2. PCA is a data compression or dimensionality reduction technique. Its goal is to reduce the dimensionality of the variable space, and there is no guarantee that the resulting dimensions are interpretable.
  3. PCA essentially creates a new, smaller set of variables (the principal components) chosen so that the original variables have the highest possible correlation with them.

Eigenvalues & Eigenvectors

The principal components are linear combinations of the original variables weighted by their contribution to defining the variance in a particular orthogonal dimension.

Eigenvalues are also known as characteristic roots. An eigenvalue measures how much of the variance in all the variables is accounted for by a given factor.

If the eigenvalue of a certain factor is low, the contribution of that factor to explaining the variance in the variables is low, and such factors may be ignored.

A factor’s eigenvalue is the sum of its squared factor loadings for all the variables.
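These ideas can be verified directly in R (again a sketch on the built-in USArrests data, not code from the original article):

    # Spectral decomposition of a correlation matrix
    R <- cor(USArrests)
    e <- eigen(R)
    e$values                  # eigenvalues: variance accounted for by each factor
    e$vectors                 # eigenvectors: unit-length weight vectors

    # Factor loadings scale each eigenvector by the square root of its eigenvalue;
    # the sum of squared loadings per factor recovers the eigenvalue
    loadings <- e$vectors %*% diag(sqrt(e$values))
    colSums(loadings^2)       # equals e$values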

Implementation of PCA in R

There are two general methods for performing PCA in R:

  • First is spectral decomposition, which examines covariances or correlations between variables.
  • Second is singular value decomposition, which examines covariances or correlations between individuals.

For the spectral decomposition method we use the princomp() function, and for singular value decomposition we use the prcomp() function.

Below is a simplified sketch of how these two functions are called; the exact argument values shown are assumptions based on the parameter notes that follow:

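    # Spectral decomposition (princomp)
    pca1 <- princomp(x, cor = TRUE, scores = TRUE)

    # Singular value decomposition (prcomp)
    pca2 <- prcomp(x, scale. = TRUE)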

where:

  • x: a numeric matrix or data frame
  • scale. (prcomp): whether the variables should be scaled to have unit variance before the analysis
  • cor (princomp): if TRUE, the data will be scaled and centered before the analysis, i.e. the correlation matrix is used instead of the covariance matrix
  • scores (princomp): if TRUE, the coordinates of each observation on the principal components are calculated

Now let’s look at an example.

Step 1: Load the dataset

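A minimal sketch of this step, assuming the built-in iris dataset (the discussion of petal length and width in Step 8 suggests this is the data used); the names iris_num, e, and pca below are illustrative and carry through the remaining steps:

    # Load the iris dataset and keep its four numeric columns
    data(iris)
    head(iris)
    iris_num <- iris[, 1:4]   # Sepal.Length, Sepal.Width, Petal.Length, Petal.Width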

Step 2: Check the covariance of the data

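Under the same assumption, this step might look like:

    # Covariance matrix of the numeric variables
    cov(iris_num)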

Step 3: Calculate eigenvalues and vectors

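A sketch of the decomposition, done on the correlation matrix so it matches a scaled PCA:

    # Eigen-decomposition of the correlation matrix
    e <- eigen(cor(iris_num))
    e$values    # eigenvalues
    e$vectors   # eigenvectors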

Step 4: PCA

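A sketch of the PCA call, using prcomp() with scaling so the results line up with the eigen-decomposition above:

    # PCA via singular value decomposition; scale. = TRUE standardizes the variables
    pca <- prcomp(iris_num, scale. = TRUE)
    pca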

Step 5: Compare the output variances

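The variance of each component is its squared standard deviation, which should match the eigenvalues from Step 3:

    # Component variances vs. eigenvalues
    pca$sdev^2
    e$values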

Step 6: Compare the eigenvectors

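The rotation matrix holds the component loadings; its columns may differ from e$vectors only in sign:

    # Eigenvectors from PCA vs. eigen()
    pca$rotation
    e$vectors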

Step 7: Check the summary

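    # Standard deviation, proportion of variance, and cumulative proportion per component
    summary(pca)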

Step 8: Visualization

The most important part of doing PCA in R is visualization. It gives the user a more intuitive idea of how the variables relate to the principal components.

There are various visualization techniques; here we are going to use a biplot.

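A sketch of the biplot call (base R's biplot() works directly on a prcomp object):

    # Observations as points, variables as arrows, over the first two components
    biplot(pca, scale = 0)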

The x-axis represents the first principal component and the y-axis the second. From the graph, we can see that the petal length and petal width arrows run nearly parallel to the x-axis, which means these two variables load almost entirely on the first principal component.

Conclusion