Dimension reduction using PCA in R

Sam Yang
6 min read · Sep 8, 2021


What is PCA?

Principal Component Analysis (PCA) is one of the most popular methods for reducing the dimensionality of large datasets. PCA projects the original data onto a new set of orthogonal axes called principal components (PCs). The PCs differ from the original dimensions in that only a few leading PCs carry most of the information in the data.

High-level understanding of PCA steps

Step 1: Standardisation

In this step, the initial variables are standardized so that each contributes equally to the analysis. Standardization prevents variables with large ranges from dominating variables with small ranges.
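As a minimal sketch (assuming the raw variables sit in a data frame called data), standardization in R can use the built-in scale() function:

# Center each column to mean 0 and rescale it to standard deviation 1
data.scaled <- scale(data, center = TRUE, scale = TRUE)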

Step 2: Compute covariance matrix

The covariance matrix is computed to identify how the variables vary with respect to one another, i.e. to expose redundant (correlated) information.
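Continuing the sketch above, the covariance matrix of the standardized data comes from cov():

# Covariance matrix of the standardized variables
# (identical to the correlation matrix of the raw variables)
S <- cov(data.scaled)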

Step 3: Compute the principal components of the data.

Each principal component is a new variable formed as a linear combination of the initial variables. The principal components are constructed to be uncorrelated with one another, and the first few usually contain most of the information, so the number of variables can be reduced effectively. The outcome of this step is a matrix of feature vectors (the eigenvectors of the covariance matrix), one per principal component.
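For illustration, the feature vectors can be obtained with eigen(), applied to the covariance matrix S from the previous sketch:

e <- eigen(S)  # eigendecomposition of the covariance matrix
e$vectors      # feature vectors: one column per principal component
e$values       # variance captured along each principal component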

Step 4: Reorient the data

Using the results of the previous steps, the original data are recast along the axes defined by the principal components.
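Continuing the sketch, this recasting is simply the projection of the standardized data onto the feature vectors:

# Project the standardized data onto the principal component axes
scores <- data.scaled %*% e$vectors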

Use PCA for dimensionality reduction

Reducing the number of input variables to a model is called dimensionality reduction; the fewer the input variables, the simpler and more concise the prediction model. PCA performs dimensionality reduction automatically and reveals the main factors in the data. The original features can be approximately reconstructed from the leading principal components, and the new features can be used to train machine learning models more efficiently.
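As a sketch of both points (again assuming the inputs are in a data frame data), R's prcomp() function bundles all of the above steps, and the leading components can approximately reconstruct the original features:

pca <- prcomp(data, scale. = TRUE)  # standardize, then compute the PCs

# Approximate the original (standardized) features from the first k PCs
k <- 2
reconstructed <- pca$x[, 1:k] %*% t(pca$rotation[, 1:k])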

Detailed illustration of how to perform PCA in R

R and Matlab are both sophisticated statistics and graphics environments: Matlab finds more use in industry, while R has a dominant role in academia. Both provide a wide variety of statistical tools for linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and more. The following step-by-step illustration shows how to perform PCA in R; the same workflow carries over to Matlab and Python.

1. Compute statistics of the data

Given any dataset, we are first interested in common summary metrics such as the mean, median, minimum, maximum, and standard deviation. R provides an easy way to compute such statistics with the sapply function, which applies a given function to each column of a data frame. The following script computes the min, max, median, mean and standard deviation of each column:
sapply(data, min)     # minimum of each column
sapply(data, max)     # maximum of each column
sapply(data, median)  # median of each column
sapply(data, mean)    # mean of each column
sapply(data, sd)      # standard deviation of each column

2. Input and preparation of data

R processes data in a structure called a data frame. In this article, we follow the literature convention of using df to denote a data frame. The R code below imports data from a csv file, stores it as a data frame, and tabulates the counts of one of its columns with the table function.
data.df <- read.csv("Initial_raw_data.csv")  # import csv into a data frame
table(data.df$CHAS)                          # tabulate counts of the CHAS column

3. Use aggregate to tabulate counts

aggregate is a base R function that summarizes a data frame by applying the function given in its FUN parameter to each column of the sub-data-frames defined by its by parameter. The by parameter must be a list, and the grouping it defines can be thought of as logical indexing. aggregate always returns a data frame containing the results of applying FUN within each group. The following example illustrates its use:
# Mean of every column of data.df within each level of CHAS
aggregate(x = data.df, by = list(CHAS = data.df$CHAS), FUN = mean)

4. Use functions melt and cast for pivot tables

reshape is an R package that makes it easy to transform data between wide and long formats, and melt and cast are its two main functions. melt stacks a set of columns into a single column of data, and cast reshapes the result into a pivot table. In other words, melt takes wide-format data and melts it into long-format data, while cast takes long-format data and casts it into wide-format data. The names mimic what happens to metal when heated: melted metal drips and becomes long, while metal cast into a mould becomes wide. For example, the following code calls melt(): month and day are the two columns of the original data used as index IDs, and MEDV is the column to be stacked.
library(reshape)  # provides melt() and cast()
melted <- melt(data, id = c("month", "day"), measure = c("MEDV"))
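The matching cast() call then turns the molten data into a pivot table, for example averaging MEDV over days within each month:

# Pivot: one row per month, mean of MEDV across days
cast(melted, month ~ variable, mean)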

5. Use PCA to reduce a set of numerical variables

Often the input data contain more information than necessary, and much of it can be removed without losing the core information. PCA is often employed to create new variables that are linear combinations of the original variables; these combinations are uncorrelated, which means no information overlaps between them. The new variables are called principal components. For example, suppose we are given a data frame containing various categories of information (e.g. manufacturer, cold or hot type, calories, protein, fat, sodium, fiber, sugars, potassium, vitamins, weight, and consumer rating) for different brands of breakfast cereal. PCA first computes the covariance matrix over all features and brands; a larger variance implies that a feature carries more information. Consider the simplest case of two features, calories and customer rating, whose 2x2 covariance matrix has variances 379 (calories) and 197 (rating) on its diagonal.

The total variance is 379 + 197 = 576, so the calories feature accounts for 379 / 576 ≈ 66% of it; using calories as the single feature would lose 34% of the total variation.
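This share can be computed directly from the covariance matrix in R. A sketch, assuming the cereal measurements sit in a hypothetical data frame cereals.df with columns named calories and rating:

# Proportion of the total variance carried by each feature
# (cereals.df is a hypothetical data frame holding the cereal data)
S2 <- cov(cereals.df[, c("calories", "rating")])
diag(S2) / sum(diag(S2))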

Now we perform PCA on these two features using the R function prcomp(), obtaining a new 2x2 weight matrix whose columns are labelled PC1 and PC2. Each column gives the weights that combine the original features (calories and rating) into that principal component.

What makes the new features preferable to the original ones is that the proportion of variance becomes concentrated in the leading PC. For the case above, PC1 accounts for 86% of the total variance and PC2 for the remaining 14%.

In other words, PCA converts the original features (calories and rating) into new features (PC1 and PC2) where PC1 alone retains 86% of the total variance, i.e. using the single feature PC1 would lose only 14% of the total variation. The advantage of PCA becomes even more obvious when the original data contain numerous features. We repeat the same prcomp() call on the full cereal data and compute up to 7 PCs.

The proportion of variance decreases rapidly for the later PCs, and the first two PCs alone account for (53.9 + 38.7)% ≈ 93% of the total variance. Converting the original features into 7 new PCs and keeping only PC1 and PC2 therefore retains about 93% of the total variance, which greatly simplifies the computation and analysis without losing much fidelity.
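These proportions come straight out of prcomp(). A sketch, again assuming the numerical cereal measurements are in the hypothetical cereals.df:

# Fit PCA on all numerical features, standardizing first
pca <- prcomp(cereals.df, scale. = TRUE)
summary(pca)  # proportion of variance and cumulative proportion per PC

# The same cumulative proportions, computed explicitly from the PC variances
cumsum(pca$sdev^2) / sum(pca$sdev^2)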

6. Apply PCA in classification and prediction

Based on the PCA analysis above, we now illustrate how to use PCA for classification and prediction. In practice, we often divide the complete dataset into a training set and a validation set. To perform PCA on such data, we follow three simple steps: (1) apply PCA to the training data, (2) decide how many PCs to keep, and (3) use the resulting variable weights to create the reduced set of predictors, applying the same weights to the validation data.
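A minimal sketch of these three steps, assuming hypothetical data frames train.df and valid.df holding the numerical predictors of the two sets:

# (1) Apply PCA to the training data only
pca <- prcomp(train.df, scale. = TRUE)

# (2) Decide how many PCs to keep, e.g. the first two
k <- 2
train.reduced <- pca$x[, 1:k]

# (3) Apply the training weights to the validation data
valid.reduced <- predict(pca, newdata = valid.df)[, 1:k]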

Summary

For most data analysis projects, data summarization is key; it includes computing numerical metrics (average, median, etc.) and graphical visualization. For both analysis and visualization, dimensionality reduction compresses the information in the data into a smaller subset. Principal components analysis is often used to transform an original set of numerical variables into a smaller set of weighted averages of the originals that retains most of the information in fewer variables.
