Principal Component Analysis: Finding Principal Components and Calculating Their Variance and Standard Deviation (Using RStudio, Wine Dataset)

Link to the program and Datasets is given below

Kshitij Ved
Analytics Vidhya
5 min read · Mar 24, 2020

Picture credit: Thinksprout Infotech

What are the Principal Components?

Principal components capture the underlying structure in the data. They are the directions along which the data varies the most, that is, the directions in which it is most spread out. A dataset has several principal components, each capturing a different share of the total variance. They are arranged in decreasing order of variance: the first PC captures the most variance, i.e. the most information about the data, followed by the second, the third and so on.
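To make the idea of ordered directions concrete, here is a minimal sketch that finds the principal components by hand as the eigenvectors of the covariance matrix. It uses R's built-in iris data as a stand-in, which is an assumption made only so the snippet runs on its own; the variable names are illustrative.

```r
# Stand-in data: the four numeric columns of R's built-in iris dataset
# (an assumption; the article itself works with a wine dataset).
X <- as.matrix(iris[, 1:4])

# The principal components are the eigenvectors of the covariance matrix.
# Each eigenvalue is the variance of the data along that direction.
eig <- eigen(cov(X))

pc_variance   <- eig$values    # eigen() returns these in decreasing order
pc_directions <- eig$vectors   # one column per principal component

print(pc_variance)
print(round(pc_variance / sum(pc_variance), 3))  # share of total variance per PC
```

The first entry of pc_variance is the largest, which is what "the first PC captures the most variance" means in practice.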

What is Principal Component Analysis?

Principal Component Analysis, or PCA, is one of the simplest and most fundamental techniques used in machine learning. It is also one of the oldest techniques for dimensionality reduction, so understanding it is important for any aspiring Data Scientist or Analyst. An in-depth understanding of PCA in R will not only help you implement effective dimensionality reduction but also lay the foundation for developing and understanding more advanced, modern techniques.

PCA aims to achieve two primary goals:

1. Dimensionality Reduction

Real-life data has many features generated from numerous sources. However, many machine learning algorithms do not handle high dimensions efficiently. Feeding in a large number of features all at once often leads to poor results, since the models struggle to learn from so many features at the same time. This is called the “Curse of Dimensionality”. Principal Component Analysis in R helps with this problem by projecting n dimensions onto n-x dimensions (where x is a positive integer smaller than n) while preserving as much variance as possible. In other words, PCA in R reduces the number of features by transforming them into a smaller number of projections of themselves.
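As a rough illustration of projecting n dimensions onto fewer, the sketch below keeps only the first k component scores and reports how much variance survives. The iris data, the choice k = 2 and the cor = TRUE setting are all assumptions made for the example, not values taken from the wine program.

```r
X <- as.matrix(iris[, 1:4])       # stand-in data with n = 4 features
pca <- princomp(X, cor = TRUE)    # cor = TRUE standardises the features first

k <- 2                            # hypothetical choice: keep 2 of the 4 components
reduced <- pca$scores[, 1:k]      # each row is now described by k projections

print(dim(X))        # rows and columns before the reduction
print(dim(reduced))  # same rows, only k columns

# Fraction of the total variance preserved by the k retained components
retained <- sum(pca$sdev[1:k]^2) / sum(pca$sdev^2)
print(retained)
```

How many components to keep is a judgment call; a common approach is to pick the smallest k whose retained fraction is acceptable for the task.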

2. Visualization

We can only visualize data directly in two or three dimensions, which prevents us from forming a visual idea of the high-dimensional features in a dataset. PCA in R addresses this by projecting n dimensions onto a 2-D plane, enabling visualization. These visualizations sometimes reveal a great deal about the data. For instance, the new feature projections may form clusters in the 2-D space that were not perceivable in the higher-dimensional space.
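A small sketch of this kind of 2-D view follows. It again assumes the built-in iris data as a stand-in, and colours the points by species only to show how groups can become visible; the wine dataset would need its own grouping column, if it has one.

```r
X <- as.matrix(iris[, 1:4])        # stand-in data (assumption)
pca <- princomp(X, cor = TRUE)

# Coordinates of every observation on the first two principal components
coords <- pca$scores[, 1:2]

# Scatter plot of the 2-D projection, coloured by the known species label
plot(coords,
     col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Observations projected onto the first two PCs")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```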

Program for Principal Component Analysis using the wine dataset:

Step 1: Load the required dataset.

We can see in the code above how the dataset is imported into RStudio. After the dataset has been imported successfully, the attach() function is used to attach it to R's search path, so that its columns can be referred to by name throughout the program without mentioning the dataset explicitly each time. The image below shows the output of the code above.
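The snippet below sketches what attach() does. The wine file itself is not reproduced here, so the built-in iris data stands in for it, and the file name in the comment is hypothetical.

```r
# For a CSV file on disk the import would look something like:
#   wine <- read.csv("wine.csv")   # hypothetical file name
# Here the built-in iris data is used instead (an assumption).
data(iris)

attach(iris)                          # put the columns on the search path
avg_attached <- mean(Sepal.Length)    # column used by name, no "iris$" prefix
detach(iris)                          # undo attach() once finished

avg_explicit <- mean(iris$Sepal.Length)  # the same value with explicit naming
print(c(avg_attached, avg_explicit))
```

Calling detach() afterwards is a good habit, since attached columns can silently mask other objects with the same names.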

Step 2: Binding the names of the columns in the dataset.

Input:

The cbind() function takes a sequence of vector, matrix or data-frame arguments and combines them by columns (its counterpart, rbind(), combines them by rows). Both are generic functions with methods for other R classes.
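A minimal sketch of this step, assuming the iris columns as stand-ins for the wine attributes, might look like this:

```r
attach(iris)   # stand-in dataset (assumption); makes the columns visible by name

# Combine the individual column vectors side by side into one numeric matrix
X <- cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)

detach(iris)

print(dim(X))        # one row per observation, one column per attribute
print(colnames(X))   # cbind() takes the column names from the variable names
print(head(X, 3))
```

The resulting matrix X is the kind of numeric input that princomp() expects.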

Output:

Step 3: Applying principal component analysis to the matrix

Input:

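The princomp() function performs a principal components analysis on the given numeric data matrix and returns the results as an object of class princomp. Among other things, this object holds the standard deviation of each principal component (sdev); squaring these values gives the variances.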
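The following sketch shows one way to call princomp() and read off the standard deviations and variances. The iris matrix is a stand-in for the wine matrix, and cor = TRUE is an assumption; the original program may have used different arguments.

```r
X <- as.matrix(iris[, 1:4])                    # stand-in data (assumption)

# cor = TRUE runs the analysis on the correlation matrix, i.e. on
# standardised variables; scores = TRUE also stores the projected data.
pc <- princomp(X, cor = TRUE, scores = TRUE)

print(class(pc))      # "princomp"

pc_sd  <- pc$sdev     # standard deviation of each principal component
pc_var <- pc$sdev^2   # variance of each principal component

print(pc_sd)
print(pc_var)
summary(pc)           # standard deviation, proportion and cumulative proportion
print(pc$loadings)    # weight of each original variable in each component
```

With cor = TRUE the variances add up to the number of variables, because each standardised variable contributes a variance of 1.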

Output:

Step 4: Showing the analysis on the graph

Input:

A Scree Plot is a simple line segment plot that shows the fraction of total variance in the data explained by each PC. The PCs are ordered, and numbered accordingly, by decreasing contribution to the total variance, so the PC with the largest contribution comes first (princomp labels it Comp.1). Read left to right along the abscissa, such a plot often shows a clear separation in the fraction of total variance where the ‘most important’ components cease and the ‘least important’ components begin. This point of separation is often called the ‘elbow’.
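A scree plot can be sketched with screeplot() from R's stats package, as below. The iris data and cor = TRUE are assumptions for the example. Note that screeplot() draws the variances themselves rather than fractions, so the fractions are computed separately.

```r
X <- as.matrix(iris[, 1:4])       # stand-in data (assumption)
pc <- princomp(X, cor = TRUE)

# Fraction of the total variance explained by each component
prop_var <- pc$sdev^2 / sum(pc$sdev^2)
print(round(prop_var, 3))

# Line-segment scree plot of the component variances; look for the "elbow"
screeplot(pc, type = "lines", main = "Scree plot")
```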

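A Principal Components Analysis Biplot (or PCA Biplot for short) is a two-dimensional chart that represents the relationship between the rows and columns of a table: the rows (observations) appear as points and the columns (variables) as arrows, both drawn on the first two principal components. In the Q analysis software, PCA biplots can be created using the Maps dialog box, which generates the biplot in Excel or PowerPoint, or by selecting Create > Dimension Reduction > Principal Components Analysis Biplot, which generates an interactive version of the chart using R. In R itself, calling biplot() on a princomp object draws this kind of chart.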
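In plain R, a biplot of a princomp result can be sketched as follows. As before, the iris data and cor = TRUE are assumptions, and the cex value is only a cosmetic choice to keep the labels readable.

```r
X <- as.matrix(iris[, 1:4])       # stand-in data (assumption)
pc <- princomp(X, cor = TRUE)

# Observations are drawn as labelled points and variables as arrows,
# both on the first two principal components.
biplot(pc, cex = 0.6, main = "PCA biplot")
```

Arrows that point in similar directions indicate variables that are positively correlated, and longer arrows indicate variables that are better represented in the first two components.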

Output:

Click here to download the Program and Datasets…
