We live in a three-dimensional space: things around us have width, breadth and depth. Movies and paintings are two-dimensional, and some observations have just a single dimension. Lower dimensions are easy for us to visualize, but it gets harder to think about and picture things as the number of dimensions grows. Wrapping our heads around an n-dimensional quantity is a real challenge, and n-dimensional quantities are common when we deal with data. Wouldn’t it be nice if we could somehow squeeze our data from higher dimensions down to lower dimensions without losing much information? Let’s see how we can do that.
Suppose you are a newbie in the art of winemaking and wine appreciation, trying to learn about different types of wines and classify them by their properties. You can use many characteristics of a wine, such as its color, body, odor, age and so on, to describe it.
The wine here is the object we are trying to describe through its features, in terms of n different variables, i.e. n dimensions. Not all of these characteristics define the wine in a unique way: many of them are related to each other and hence redundant. For instance, the age of a wine and its odor may be correlated. If so, shouldn’t we describe each wine with a smaller set of characteristics? That is exactly what Principal Component Analysis, or PCA, does: it is essentially a method for summarizing data.
Dimensionality reduction and PCA
So how do we reduce the set of characteristics or features? We could simply drop some of them: for example, keep only the two or three characteristics that we think describe a wine best. The advantage of feature elimination is simplicity, and the features that remain stay interpretable. The downside is that we lose whatever information the dropped variables carried.
PCA, however, is not about selecting some characteristics and discarding the others. Instead, it constructs new characteristics that turn out to summarize the data (the list of wines in our example) well. Of course, these new characteristics are built from the old ones: a new characteristic might be computed as wine age minus wine acidity level, or some other such linear combination. Principal component analysis is thus a technique for feature extraction: it combines our input variables in a specific way, after which we can drop the least important of the new variables while still retaining the most valuable parts of all of the original ones. As an added benefit, the new variables produced by PCA are all independent of one another. Let’s look at our wine example a bit more closely to understand this better.
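To make the idea of a linear combination concrete, here is a tiny sketch with made-up numbers. The column names and the weights a = 1, b = -1 are purely hypothetical; PCA would choose its own weights.

```python
import numpy as np

# Hypothetical data: each row is a wine, columns are (age, acidity).
wines = np.array([[3.0, 0.6],
                  [5.0, 0.4],
                  [8.0, 0.3]])

# A new feature as a linear combination of the old ones,
# e.g. "age minus acidity" with weights a = 1, b = -1.
a, b = 1.0, -1.0
new_feature = a * wines[:, 0] + b * wines[:, 1]
print(new_feature)  # one number per wine
```

Every characteristic PCA constructs has exactly this shape: a weighted sum of the original variables, with weights chosen to summarize the data well.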
Say you are looking for wine characteristics that strongly differ across wines, something like the tannin level. Now imagine instead that you come up with a property that is the same for most wines. That would not be very useful, would it? Wines are very different, but your new property makes them all look the same, so it would certainly be a bad feature. PCA looks for properties that show as much variation across wines as possible.
The other way to look at it is that you want properties from which you could reconstruct the original wine characteristics. Again, imagine that you come up with a newly constructed property that has no relation to the original characteristics; using only this new property, there would be no way to reconstruct the original ones. This, again, would be a bad feature. So PCA looks for properties that allow us to reconstruct the original characteristics as well as possible.
So our new properties should capture as much variation as possible, and they should let us reconstruct the original properties. These might look like two different goals, but as we will see, they amount to the same thing. Look at the figure below: each point represents a wine, the x-axis gives its acidity level and the y-axis its tannin level.
We don’t know a priori whether the two are related, but let’s assume they are. The plot above suggests some relationship: there is a general trend along which the points are spread. Totally uncorrelated points would show no such trend.
Let us draw an arbitrary black line through the set of points and drop a perpendicular from each point onto this line. The point where each red perpendicular meets the black line is called the projection of that point onto the line. These projections are the new property that PCA will construct: each projection is a linear combination of the original variables x and y, of the form ax + by. How do we find the black line on which these projections have the maximum spread, or variance, and which also lets us reconstruct the original points from the projections as accurately as possible?
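A quick sketch of that projection, with hypothetical acidity and tannin numbers: if the black line points along a unit vector (a, b), the projection of each point is simply its dot product with that vector, which is exactly the linear combination ax + by.

```python
import numpy as np

# Hypothetical 2-D data: x = acidity, y = tannin level for each wine.
points = np.array([[1.0, 2.0],
                   [2.0, 3.1],
                   [3.0, 3.9],
                   [4.0, 5.2]])

# A candidate direction for the "black line", as a unit vector (a, b).
direction = np.array([1.0, 1.0])
direction /= np.linalg.norm(direction)

# The projection of each point onto the line is the dot product with
# the unit vector: exactly the combination a*x + b*y from the text.
projections = points @ direction
print(projections)        # one coordinate per wine, along the line
print(projections.var())  # spread of the projections along the line
```

Rotating the black line just means swapping in a different unit vector and seeing how the variance of the projections changes.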
Watch the makeshift animation above for a while. As the black line rotates, the spread of the projections along it changes. If you observe closely, the spread of the red dots is greatest when the black line is roughly at the 2 o’clock position. The red segments perpendicular to the black line show the distance from each blue point to its projection, which is the error made when reconstructing the original point from the projection alone. Stare at the animation a little longer and you will notice that the sum of these distances is smallest when the black line is at 2 o’clock as well. This sum is the total reconstruction error. PCA finds the black line, and the projections onto it, for which the variance of the projections is maximized and the reconstruction error is minimized at the same time. Two birds with one stone!
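The reason the two goals coincide is Pythagoras: each point’s squared distance from the center splits into the part captured along the line plus the squared reconstruction error, so maximizing one piece minimizes the other. The sketch below, on randomly generated correlated data standing in for the scatter plot, also shows how the best line can be found as the top eigenvector of the covariance matrix (the mathematics the text alludes to but skips).

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D toy data standing in for the scatter plot above.
x = rng.normal(size=200)
points = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=200)])
points -= points.mean(axis=0)  # PCA works on centered data

# The best "black line" is the top eigenvector of the covariance matrix.
cov = np.cov(points.T)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
direction = eigvecs[:, -1]                   # direction of largest spread

projections = points @ direction             # coordinates along the line
reconstructed = np.outer(projections, direction)
errors = ((points - reconstructed) ** 2).sum(axis=1)

# Pythagoras per point: total squared distance from the center equals
# variance captured along the line plus total reconstruction error,
# so maximizing one term minimizes the other.
total = (points ** 2).sum()
print(np.isclose(total, (projections ** 2).sum() + errors.sum()))  # True
```

Because the total is fixed by the data, the direction with the largest projection variance is automatically the one with the smallest reconstruction error.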
You can now use these projections as the new features for your machine learning algorithm. In our wine example, PCA has reduced two-dimensional data to a single dimension. The same technique applies to higher-dimensional data, projecting it onto a lower-dimensional space, which also makes visualization much easier.
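In practice you rarely do this by hand; scikit-learn’s `PCA` does it in two lines. A minimal sketch, with the same hypothetical acidity/tannin numbers as before:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 2-D wine data: columns are acidity and tannin level.
points = np.array([[1.0, 2.0],
                   [2.0, 3.1],
                   [3.0, 3.9],
                   [4.0, 5.2]])

# Keep one principal component: each wine becomes a single number,
# its projection onto the best-fitting line discussed above.
pca = PCA(n_components=1)
features = pca.fit_transform(points)
print(features.shape)                  # (4, 1): one new feature per wine
print(pca.explained_variance_ratio_)  # share of the spread that survives
```

The `explained_variance_ratio_` attribute tells you how much of the original variation your reduced features retain, which is how you decide how many components to keep on real data.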
This introduction merely scratches the surface of a powerful dimensionality reduction technique. PCA is widely used in data science and machine learning to extract orthogonal features from high-dimensional data. I have deliberately avoided the mathematics and rigor of the PCA methodology; I would encourage you to work through the mathematics as well and understand eigenvectors and eigenvalues. Below are some great resources on PCA and related topics.
- I am a big fan of 3blue1brown, and this video series on linear algebra takes your understanding of matrices and their operations to the next level
- This website has a succinct and nice visual description of PCA
- The same website also explains eigenvalues and eigenvectors
- True to its title, Matt Brems’s blog post A One-Stop Shop for Principal Component Analysis beautifully explains everything related to PCA
X8 aims to organize and build a community for AI that is not only open source but also attentive to its ethical and political aspects. More such simplified AI concepts will follow. If you liked this, or have feedback or follow-up questions, please comment below.