Principal Component Analysis From Scratch in Python
One of the most important algorithms in data science
Principal Component Analysis (PCA), despite being invented more than a century ago, has proven itself to be one of the most important and widely used algorithms in modern data science. With applications spanning visualization of high-dimensional data, unsupervised learning and dimensionality reduction, it has become a mainstay of numeric computing and AI software libraries alike. While these implementations are ubiquitous, free and efficient, their ease of use means that the intuition behind how the algorithm works is often skipped over in favour of instant gratification. In this article, I aim to revisit that intuition with a simple yet informative implementation in Python and provide an example of how it can be employed in a data science pipeline.
What is it?
PCA is defined as an orthogonal transformation of the data into a series of uncorrelated principal components, such that the first component explains the most variance in the data and each subsequent component explains less.
This technique is particularly useful for processing data where multicollinearity exists between features or where the number of features is high.
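To make the definition concrete, here is a minimal sketch of the three properties it names: an orthogonal transformation, uncorrelated components, and variance that decreases from one component to the next. It assumes NumPy and uses an eigendecomposition of the covariance matrix, which is one common way to compute PCA and not necessarily how any particular library does it. The toy data and all variable names (`x1`, `x2`, `x3`, `components` and so on) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 samples, 3 features. The second feature is
# almost a multiple of the first, so the two are strongly collinear.
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# Centre each feature, then eigendecompose the covariance matrix.
X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reorder them
# (and the matching eigenvector columns) from largest to smallest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the centred data onto the eigenvectors to get the components.
components = X_centred @ eigenvectors

# Share of the total variance explained by each component.
explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Covariance of the components: off-diagonal entries should be ~0.
component_cov = np.cov(components, rowvar=False)

print("Explained variance ratio:", np.round(explained_variance_ratio, 3))
print("Component covariance:\n", np.round(component_cov, 3))
```

Printing `component_cov` should show a matrix that is diagonal up to floating-point error, which is what "uncorrelated" means here, with the diagonal entries equal to the sorted eigenvalues. Because `x1` and `x2` carry nearly the same information, the last component should explain only a small share of the variance, which is why PCA is handy when features are collinear.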