Understanding PCA in NGS-based Studies

Faraz Ahmed
Byte Sized Machine Learning
3 min read · May 23, 2019

Building an Intuition for Principal Components Analysis

Principal Components Analysis (PCA) is a very common diagnostic tool that allows us to gain biological insights from our Next Generation Sequencing (NGS) based data sets. If you google PCA, I am sure you will find a plethora of tutorials; however, I have found that these sources rarely convey a basic understanding of dimensionality reduction.

Recently I was trying to explain this to a room full of graduate students who are planning on doing RNA-seq (an NGS method typically used to infer differential gene expression across groups) experiments in the near future, and I gave a rather trivial example of dimensionality reduction that unexpectedly became the highlight of my talk. My example was super simple, yet it was a game changer for the students because it allowed them to view the dots on the PCA plot differently.

My Example: Let’s assume that we have two sequences of numbers of similar length, and our goal is to understand how different these sequences are from one another. We have probably done this a million times in our K-12 classrooms, but it turns out that most of us have forgotten what is actually happening. Of course, in a real-world example you would need more than one representative for each sequence, but let’s assume that each sequence best represents the population it comes from and a sample size of 1 is sufficient.

The answer you have most likely proposed is as follows: take the average of each sequence, then compare the two averages to see how close or different they are from one another.

Now this exercise of taking the average of a sequence is about as simple a way as there is to break down dimensionality reduction. When we take the arithmetic mean of a sequence of numbers, regardless of the length of the sequence, we always end up with one number, and one number only, which is the best single representation of all the values in that sequence.

In other words, we have taken a sequence of n numbers and reduced it to a length of 1 by computing one of the most trivial functions in mathematics, the arithmetic mean.
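The averaging example above can be sketched in a few lines of code. The sequences here are made-up numbers purely for illustration:

```python
# A minimal sketch of the averaging example: two (made-up) sequences
# of measurements, each reduced from n values down to a single number.
seq_a = [4.1, 5.0, 3.8, 4.6, 5.2]
seq_b = [8.9, 9.4, 8.1, 9.0, 8.7]

mean_a = sum(seq_a) / len(seq_a)  # n values -> 1 value
mean_b = sum(seq_b) / len(seq_b)  # n values -> 1 value

# Comparing the two "reduced" representations is now a single subtraction.
difference = mean_b - mean_a
print(mean_a, mean_b, difference)
```

However long the sequences are, each one collapses to a single representative number, which is exactly the spirit of dimensionality reduction.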

Now relaying this back to the PCA plot, let's first look at an example plot:

Here, the number of dots equals the number of samples you started with (in this case, 6 samples). The coordinate of each dot along the PC1 and PC2 axes plays the same role as the average in the example above: each coordinate, also known as a principal component (PC), compresses the variation of all the thousands of genes within a sample, relative to all the other samples, into a single number. By computing the PCs, we have reduced the gene dimensions of the data set and can pinpoint that the main source of variation in this study is the treatment, i.e. 58.24 + 40.83 ≈ 99% of the variation is explained by the first two PCs. Of course, calculating each PC is not as simple as computing an average, but I hope the intuition carries through!
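To make the "thousands of genes down to two coordinates" idea concrete, here is a small sketch using scikit-learn's PCA on a toy expression matrix. All the numbers are simulated, and the group labels and effect size are assumptions for illustration only, not real data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy expression matrix: 6 samples x 1000 genes (all values simulated).
# Samples 0-2 stand in for "control", samples 3-5 for "treated"; the
# treated group gets a shared shift in a subset of genes to mimic a
# treatment effect.
X = rng.normal(size=(6, 1000))
X[3:, :100] += 3.0  # hypothetical treatment effect on the first 100 genes

# Reduce each sample from 1000 gene dimensions to 2 PC coordinates.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)

print(coords.shape)                   # (6, 2): one dot per sample
print(pca.explained_variance_ratio_)  # fraction of variance per PC
```

Plotting the two columns of `coords` against each other gives exactly the kind of PCA plot discussed above: six dots, with the control and treated samples separating along PC1 because the treatment is the dominant source of variation.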
