Published in Geek Culture

Beginner's Guide to Principal Component Analysis: Dimensionality Reduction


Do you have too many variables in your machine learning problem, and are you confused about how to get the most out of them?

Principal Component Analysis (PCA) is probably the technique you should consider. The basic intuition behind PCA is that it uses mathematics to extract the important variables from a large pool. Essentially, it combines highly correlated variables to form a smaller dataset of new variables, called principal components, which account for the maximum variance in the data.

The first step is to center the data: subtract each variable's mean from its samples.

This yields the normalized (mean-centered) data.
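As a minimal sketch of this step, using a small toy dataset (the numbers are hypothetical, chosen purely for illustration):

    import numpy as np

    # Toy data: five samples of two variables
    X = np.array([[2.5, 2.4],
                  [0.5, 0.7],
                  [2.2, 2.9],
                  [1.9, 2.2],
                  [3.1, 3.0]])

    # Subtract each variable's (column's) mean from its samples
    X_centered = X - X.mean(axis=0)
    print(X_centered)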

The next step is to compute the covariance matrix of the centered data. For two variables X and Y with n samples each, the covariance is

\mathrm{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

and the covariance matrix collects this value for every pair of variables.
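Continuing the toy example, NumPy computes the full covariance matrix in one call:

    # rowvar=False treats each column as a variable; np.cov divides
    # by (n - 1) by default, matching the formula above
    cov_matrix = np.cov(X_centered, rowvar=False)
    print(cov_matrix)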

With the covariance matrix in hand, we compute its eigenvectors and eigenvalues. The eigenvectors give the directions of the new axes (the principal components), and the eigenvalues tell us how much variance lies along each of them. Reorienting the data onto these axes and plotting the values gives us an idea of how closely the variables are related.
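A rough sketch of those steps in NumPy, continuing from the covariance matrix above (the variable names are my own):

    # Eigen-decomposition of the symmetric covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Sort in descending order of eigenvalue so the first
    # component accounts for the most variance
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Reorient the data: project the centered samples onto the eigenvectors
    X_projected = X_centered @ eigenvectors
    print(eigenvalues / eigenvalues.sum())  # share of variance per component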

Although PCA is used, and has sometimes been reinvented, in many disciplines, it is, at heart, a statistical technique, and hence much of its development has been by statisticians.

The central idea of PCA is to reduce the dimensionality of a dataset while preserving as much of its variability as possible. In practice, ‘preserving as much variability as possible’ translates into finding new variables that are linear functions of those in the original dataset, that successively maximize variance and that are uncorrelated with each other. Finding such new variables, the principal components (PCs), reduces to solving an eigenvalue/eigenvector problem. The earliest literature on PCA dates from Pearson [1] and Hotelling [2], but it was not until electronic computers became widely available decades later that it became computationally feasible to use it on datasets that were not trivially small. Since then its use has burgeoned, and many variants have been developed across different disciplines. Substantial books have been written on the subject [3,4]. The main uses of PCA are descriptive rather than inferential; the example below illustrates this.

Too much math? Below I walk through a Python notebook from my GitHub repository that performs PCA on a dataset imported from sklearn.
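The notebook itself is not embedded here, so the sketch below shows what the loading step plausibly looks like. I am assuming sklearn's breast cancer dataset, since the keys printed afterwards match its Bunch object:

    from sklearn.datasets import load_breast_cancer

    # Load the dataset as a Bunch, a dict-like object bundling
    # the samples, labels and metadata
    data = load_breast_cancer()
    print(data.keys())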

The output confirms what the bundle contains:

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

Next, we standardize the data with StandardScaler and initialize PCA with the number of components we need (in this example, I will reduce the variables to two principal components).
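A sketch of this step, continuing from the dataset loaded above:

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    # Standardize every feature to zero mean and unit variance
    X_scaled = StandardScaler().fit_transform(data.data)

    # Keep only the first two principal components
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_scaled)

    # Percentage of variance explained by each principal component
    print(pca.explained_variance_ratio_ * 100)

The two numbers printed are the percentage of variance captured by each principal component; the first component accounts for the larger share by construction.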

Finally, we can plot the two resulting components against each other to see the correlation between the two new variables.
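A sketch of the plotting step; the colour map and styling are my own choices rather than the original figure's:

    import matplotlib.pyplot as plt

    # Scatter the two components against each other,
    # coloured by the class label
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=data.target,
                cmap='coolwarm', alpha=0.6)
    plt.xlabel('First principal component')
    plt.ylabel('Second principal component')
    plt.show()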

References

  1. Pearson K. 1901. On lines and planes of closest fit to systems of points in space. Phil. Mag. 2, 559–572. (doi:10.1080/14786440109462720)
  2. Hotelling H. 1933. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441, 498–520. (doi:10.1037/h0071325)
  3. Jackson JE. 1991. A user's guide to principal components. New York, NY: Wiley.
  4. Jolliffe IT. 2002. Principal component analysis, 2nd edn. New York, NY: Springer-Verlag.
