Principal Components Analysis

Rishabh Mall
4 min read · Jan 14, 2019


Principal Component Analysis (PCA) is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set. It allows us to take an n-dimensional feature-space and reduce it to a k-dimensional feature-space while maintaining as much information from the original dataset as possible in the reduced dataset. Specifically, PCA will create a new feature-space that aims to capture as much variance as possible in the original dataset.
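As a quick sketch of the n-to-k reduction described above (assuming scikit-learn is available; the dataset here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 observations in a 5-dimensional feature-space
X = rng.normal(size=(200, 5))

# Reduce to a 2-dimensional feature-space
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # fraction of variance each component captures
```

The `explained_variance_ratio_` attribute reports how much of the original variance each new axis retains, which is the criterion PCA optimizes.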

The ultimate goal with dimensionality reduction is to find the most compressed representation possible to accurately describe the data. If all of the features within your dataset were completely independent signals, this would be very hard. However, it’s often the case that you will have redundancies present in your feature-space. You may find that many features are simply indicators of some latent signal that we haven’t directly observed. As such, we would expect that indicators of the same latent signal would be correlated in some manner. For example, if we were trying to reduce the feature-space of a dataset that contained information about housing prices, features such as number of rooms, number of bedrooms, number of floors, and number of bathrooms all might be indicators of the size of the house. One might argue that these latent signals are the principal components which make up our dataset.

Let’s look at the following two-dimensional feature-space.

It appears that most of the points on this scatterplot lie along the following line. This suggests some degree of correlation between x1 and x2.

As such, we could reorient the axes to be centered on the data and parallel to the line above.

We’re still dealing with two dimensions. However, let’s project each observation onto the primary axis.

Now, every observation lies on the primary axis.

We just compressed a two-dimensional dataset into one dimension by translating and rotating our axes! After this transformation, we only really have one relevant dimension, and thus we can discard the second axis.

Comparing the original observations with our new projections, we can see that it’s not an exact representation of our data. However, one could argue it does capture the essence of our data — not exact, but enough to be meaningful.

Let’s take another look at the data. Specifically, look at the spread of the data along the green and orange directions. Notice that there’s much more deviation along the green direction than there is along the orange direction.

Projecting our observations onto the orange vector requires moving them much further than projecting onto the green vector does. It turns out that the vector which captures the maximum variance of the data is also the one that minimizes the distance the observations must move when projected onto it.
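The translate-rotate-project procedure above can be sketched with plain NumPy. The data here is synthetic (two correlated features, mimicking the scatterplot), and the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
# Two correlated features: x2 is roughly a noisy copy of x1
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=300)
X = np.column_stack([x1, x2])

# Center the data (translate the axes onto the data)
X_centered = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix give the new axes;
# the eigenvector with the largest eigenvalue is the primary axis
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
primary_axis = eigvecs[:, np.argmax(eigvals)]

# Project every observation onto the primary axis: 2-D -> 1-D
projections = X_centered @ primary_axis

# Fraction of total variance the primary axis captures
print(eigvals.max() / eigvals.sum())
```

Because the two features are strongly correlated, the primary axis captures nearly all of the variance, which is exactly why discarding the second axis loses so little information.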

Factor Analysis

Factor analysis is a statistical procedure to identify interrelationships that exist among a large number of variables, i.e., to identify how suites of variables are related. Factor analysis can be used for exploratory or confirmatory purposes. As an exploratory procedure, factor analysis is used to search for a possible underlying structure in the variables. In confirmatory research, the researcher evaluates how similar the actual structure of the data, as indicated by factor analysis, is to the expected structure.

The major difference between exploratory and confirmatory factor analysis is that the researcher has already formulated hypotheses about the underlying structure of the variables when using factor analysis for confirmatory purposes. As an exploratory tool, factor analysis doesn’t have many statistical assumptions. The only real assumption is the presence of relatedness among the variables, as represented by the correlation coefficient. If there are no correlations, then there is no underlying structure.

Steps in conducting a factor analysis:

There are five basic factor analysis steps:

  • Data collection and generation of the correlation matrix
  • Partition of variance into common and unique components (unique may include random error variability)
  • Extraction of initial factor solution
  • Rotation and interpretation
  • Construction of scales or factor scores to use in further analyses
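The steps above can be sketched end to end with scikit-learn's `FactorAnalysis` (a minimal sketch, assuming a synthetic dataset generated from two latent factors; the variable names and factor count are illustrative):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Synthetic data: 6 observed variables driven by 2 latent factors plus noise
latent = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.3 * rng.normal(size=(500, 6))

# Step 1: correlation matrix of the observed variables
corr = np.corrcoef(X, rowvar=False)

# Steps 2-4: extract an initial 2-factor solution with varimax rotation
# (common/unique variance partitioning happens inside the model fit)
fa = FactorAnalysis(n_components=2, rotation="varimax")
scores = fa.fit_transform(X)   # Step 5: factor scores for further analyses

print(fa.components_.shape)    # (2, 6) rotated factor loadings
print(scores.shape)            # (500, 2) factor scores
```

Interpretation (part of step 4) is still a human task: one inspects which observed variables load heavily on each factor and names the factors accordingly.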
