Principal Component Analysis
Introduction to PCA
When we work with a huge dataset, we often face a large number of variables with widely varying amounts of variance, which makes the data difficult to work with and reduces the efficiency of our model. With such large datasets it is nearly impossible, and exhausting, to engineer every variable individually. That is where principal component analysis comes into play.
Multivariate analysis often starts with a huge number of correlated variables. Principal component analysis is a dimension-reduction tool that reduces the overall size, or dimension, of the variable set while still retaining the most important information in the original set.
Principal component analysis follows a set of statistical procedures to transform a set of correlated variables into a set of uncorrelated variables called principal components.
Traditionally, principal component analysis is performed on a symmetric square matrix and is often confused with a similar multivariate procedure called factor analysis. The input matrix can be:
- a pure sum-of-squares-and-cross-products (SSCP) matrix,
- a scaled sum-of-squares-and-cross-products matrix, i.e. a covariance matrix, or
- the sum of squares and cross products computed from standardized data, i.e. a correlation matrix.
PCA also rests on a few assumptions:
- Sample size: Ideally there should be at least 150 cases, with a minimum of five cases per variable
- Correlations: Some correlation must exist between the variables for PCA to be worthwhile
- Linearity: A linear relationship must exist between the variables
- Outliers: There should be no outliers in the data, as they distort the principal components
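As a quick sanity check on these assumptions, one can inspect the correlation matrix and screen for extreme values before running PCA. A minimal sketch, shown here on R's built-in iris data:

```r
data(iris)
iris_num <- iris[, 1:4]            # keep the numeric columns only

# Correlations: sizeable off-diagonal values suggest PCA is worthwhile
round(cor(iris_num), 2)

# Outliers: a crude screen flags observations beyond +/- 3 standard deviations
z <- scale(iris_num)
which(abs(z) > 3, arr.ind = TRUE)
```

The z-score screen is only a rough heuristic; more careful outlier diagnostics (e.g. Mahalanobis distance) may be preferable on real data.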
Objectives of PCA
- PCA reduces the attribute or characteristic space from a larger set of variables to a smaller set of factors, and it does not require a dependent variable to be specified; in other words, PCA is a ‘non-dependent’ type of procedure.
- PCA is a data-compression or dimensionality-reduction technique. Its goal is to reduce the dimensionality of the variable space, and there is no guarantee that the resulting dimensions are interpretable.
- PCA essentially creates a smaller set of new variables, the principal components, chosen so that the variables in the original set correlate as strongly as possible with them.
Eigenvalues & Eigenvectors
Eigenvectors, which define the principal components, reflect both the unique and common variance of the variables; PCA is generally considered a variance-focused approach, seeking to reproduce both the total variance of the variables with all components and to reproduce the correlations.
The principal components are linear combinations of the original variables weighted by their contribution to defining the variance in a particular orthogonal dimension.
Eigenvalues are also known as characteristic roots. An eigenvalue measures the variance, across all the variables, that is accounted for by a given factor.
If the eigenvalue of a certain factor is low, that factor contributes little to explaining the variance in the variables, and such factors might be ignored.
A factor’s eigenvalue is the sum of its squared factor loadings for all the variables.
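This relationship is easy to verify numerically. A small sketch using R's built-in iris data: a component's loadings are its eigenvector scaled by the square root of its eigenvalue, and the sum of the squared loadings recovers the eigenvalue.

```r
data(iris)
R <- cor(iris[, 1:4])          # correlation matrix of the four measurements
e <- eigen(R)

# Factor loadings of the first component
loadings1 <- e$vectors[, 1] * sqrt(e$values[1])

# The sum of squared loadings equals the first eigenvalue
sum(loadings1^2)
e$values[1]
```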
Implementation of PCA in R
We are going to work with the widely used ‘Iris’ dataset. In R there are essentially two methods to perform PCA:
- First is spectral decomposition which examines covariances or correlation between variables.
- Second is singular value decomposition which examines covariances or correlation between individuals.
For the spectral decomposition method we use the princomp() function, and for singular value decomposition we use the prcomp() function.
Below is the sample simplified code for these two functions:
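A minimal sketch of the two calls, using the default argument names from base R's stats package:

```r
x <- iris[, 1:4]   # any numeric matrix or data frame

# Singular value decomposition approach
pca_svd <- prcomp(x, scale. = FALSE)

# Spectral decomposition approach
pca_spec <- princomp(x, cor = FALSE, scores = TRUE)
```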
Where x = a numeric matrix or data frame,
scale. = whether the variables should be scaled to unit variance (prcomp),
cor = if TRUE, the analysis uses the correlation matrix rather than the covariance matrix (princomp),
scores = if TRUE, the coordinates on each principal component are calculated (princomp)
Now let’s look at an example,
Step 1: Load the dataset
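A sketch of this step, using the iris data that ships with R:

```r
data(iris)
iris_num <- iris[, 1:4]   # keep the four numeric measurements, drop Species
head(iris_num)
```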
Step 2: Check the covariance of the data
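For example:

```r
data(iris)
cov_mat <- cov(iris[, 1:4])   # 4 x 4 covariance matrix of the measurements
round(cov_mat, 3)
```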
Step 3: Calculate eigenvalues and vectors
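The eigen() function returns both at once, with eigenvalues sorted largest first:

```r
data(iris)
e <- eigen(cov(iris[, 1:4]))
e$values    # eigenvalues, in decreasing order
e$vectors   # eigenvectors, one per column
```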
Step 4: PCA
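Running PCA itself, here via prcomp() (which centers the data by default):

```r
data(iris)
pca <- prcomp(iris[, 1:4])
pca$rotation   # loadings of the four principal components
```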
Step 5: Compare the output variances
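The variances of the principal components should match the eigenvalues of the covariance matrix:

```r
data(iris)
pca <- prcomp(iris[, 1:4])
ev  <- eigen(cov(iris[, 1:4]))$values

# The squared standard deviations of the components equal the eigenvalues
cbind(pca_variance = pca$sdev^2, eigenvalue = ev)
```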
Step 6: Compare the eigenvectors
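Similarly, the rotation matrix from prcomp() should agree with the eigenvectors up to an arbitrary sign flip per column:

```r
data(iris)
pca <- prcomp(iris[, 1:4])
vec <- eigen(cov(iris[, 1:4]))$vectors

# Compare absolute values, since the sign of each eigenvector is arbitrary
max(abs(abs(unname(pca$rotation)) - abs(vec)))   # close to zero
```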
Step 7: Check the summary
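The summary reports each component's standard deviation and the proportion of variance it explains:

```r
data(iris)
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)   # standard deviations and proportion of variance explained
```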
Step 8: Visualization
The most important part of doing PCA in R is visualization. It gives the user a more in-depth idea of how the variables relate to the components.
There are various visualization techniques. What we are going to use is biplot.
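A minimal sketch of the biplot call:

```r
data(iris)
pca <- prcomp(iris[, 1:4], scale. = TRUE)
biplot(pca, scale = 0)   # observations as points, variables as arrows
```

With scale = 0 the arrows are drawn on the same scale as the loadings, which makes their directions easier to read.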
The x-axis represents the first principal component and the y-axis the second. From the graph we can see that the petal-length and petal-width arrows lie nearly parallel to the x-axis, meaning these variables are almost entirely captured by the first principal component.
Adaptations of PCA have been used in many settings, such as binary data, ordinal data, compositional data, and even time series or datasets with common covariance matrices.
PCA has also played an important role in other statistical methods, such as linear regression and the simultaneous clustering of both individuals and variables.