A Gentle Introduction into the Application of Principal Component Analysis (PCA) in Genomics

Ransford Dimitri Kisten
Published in The Startup · Apr 9, 2020

One of my passions in the broad field of biology and medicine has always been genomics. I strongly believe the best treatments of today and tomorrow will be personalized medications based on one's genetic makeup. I am currently taking the Johns Hopkins Genomic Data Science Specialization, which begins with instruction from Dr. Steven Salzberg, one of the scientists who worked on the Human Genome Project and the current Director of the Center for Computational Biology at Johns Hopkins. One of the methods he gives a glimpse of early on is Principal Component Analysis (PCA), used to recognize genetic differences between populations. I learned PCA at General Assembly's Data Science Immersive program but was not aware of its applications in genomics. In this article I will provide a brief description of what Principal Component Analysis is and how it works, then outline how it is applied in the field of genomics.

Obviously, math and statistics are vital to implementing these techniques, but my hope is that someone without a math background, but with a strong genetics background, might read this and be able to apply their own ideas and explore this topic further. This article covers a basic explanation of PCA and offers resources at the end of the section for those interested in a deeper mathematical dive.

What is Principal Component Analysis (PCA)?

First and foremost, Principal Component Analysis, commonly abbreviated as PCA, is used in data science when we have many variables. For example, we might want to build a model to predict whether a person will develop diabetes or some other disease. The variables we use to predict this can be age, height, weight, ethnicity, family history, blood markers, level of activity, or anything else we deem relevant. It's easy to see how we can arrive at a long list of potential variables; however, this also introduces several problems when modeling.

We face two main problems:

  1. Not understanding the relationships between variables. Variables may be collinear, meaning they are highly correlated with each other; for example, as height goes up, so does weight. Collinearity causes problems in our model by making it very sensitive to minor changes in our variables and by making our variables difficult to interpret. Interpretation, for example saying that for every additional 10 lbs the chance of having X disease goes up 2%, is extremely important in medicine and healthcare; multicollinearity makes these kinds of interpretations unstable.
  2. Overfitting our model. This occurs when our model fits patterns that are too specific to the data we use to build it (the training set). The model will then have trouble making predictions on data outside the training data and cannot be generalized to a population.
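The collinearity problem in point 1 is easy to see in a few lines of Python. This is a minimal sketch using made-up height and weight numbers (all values hypothetical, not real patient data) to show the near-perfect correlation that destabilizes a model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: weight is driven almost entirely by height (collinear).
height = rng.normal(170, 10, 500)                    # cm
weight = 0.9 * height - 80 + rng.normal(0, 5, 500)   # kg

# A correlation coefficient close to 1 is the hallmark of collinearity.
corr = np.corrcoef(height, weight)[0, 1]
print(f"height-weight correlation: {corr:.2f}")
```

With two variables this nearly redundant, a model cannot reliably tell their individual effects apart, which is exactly why coefficient interpretations become unstable.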

To avoid these problems we can do several things. The first is to drop variables/features from our model. This is tricky because then we open up a can of worms regarding choosing which variables to drop. Additionally, this is best done by people who know which variables matter and which don’t, so if you are not an expert on the subject, this might be difficult.

How does PCA solve our problems?

PCA solves these problems by reducing the number of variables, or dimensions, through combining features, a process called feature extraction.

Feature extraction simply converts our old features into new features that are combinations of the old ones. To accomplish this, we summarize how our variables are related to each other, then see which combinations explain the most variability, or variation, in our data; in other words, which combinations are the most important. We then drop the combinations that are least important.
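As a sketch of this idea, the snippet below builds a small, hypothetical dataset of four correlated measurements and uses scikit-learn's `PCA` to combine them into new features; the numbers are invented purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical data: 200 people, 4 measurements that all track one shared signal.
base = rng.normal(size=(200, 1))
X = np.hstack([base + rng.normal(scale=0.1, size=(200, 1)) for _ in range(4)])

# PCA builds new features (components) as weighted combinations of the old ones,
# ordered by how much of the data's variation each one explains.
pca = PCA(n_components=2)
X_new = pca.fit_transform(X)

# The first component captures nearly all the shared variation, so the
# remaining components carry little information and can be dropped.
print(pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute quantifies the "importance" described above: it is the fraction of total variation each new combined feature explains.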

The pros and cons of doing this are:

Pros:

  • Addresses multicollinearity and overfitting.
  • Increases computational speed.

Cons:

  • We can no longer interpret our coefficients. This isn't great for medical models where we want to interpret coefficients; however, if it improves the predictive power of a model, it isn't a bad trade-off, especially since we can always run additional studies to see the effect of individual variables.

To quickly summarize: Principal Component Analysis (PCA) is used to reduce the dimensions (variables) in our model when we have a large number of variables or collinear variables. It accomplishes this by recognizing the important relationships between variables and combining them into new variables. The importance of each combination is then quantified so we keep only the most important transformed variables. This should not be done if you want to maintain the interpretability of the variables within a model.

For those who want a more advanced mathematical explanation, this Medium article provides a great explanation.

How is PCA used in genomics?

The methodology of how PCA is used in genomics is quite simple without diving into the math. A common area of study in genomics is variation among different populations. Although 99.9% of DNA is identical between humans, the 0.1% difference is what makes up each individual's uniqueness between and within populations. If I want to see variation between two humans, I can simply look at their genotypes for a certain trait. I may find that individual one has a 'tall' genotype and individual two has a 'short' genotype. This gets very complex if we want to look at a whole population and many other traits. Mapping out genetic differences between populations over many genotypes/variables would be extremely difficult, as our genes code for thousands of different traits and functions.

How does PCA help organize genetic differences?

From our understanding of PCA and how it groups variables, or in this case genotypes, together, we can reduce and group variables for different populations. We keep our original genetic variation but visualize it in a way that provides meaningful insight. This is extremely useful, especially since population genomics studies, which can span hundreds to thousands of genotypes, can now include millions of individuals as voluntary genetic testing has become more common.
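To sketch how this might look in code, here is a toy example (all allele frequencies, population sizes, and site counts are invented) that simulates genotypes for two populations and projects them onto two principal components, the same kind of reduction behind plots like Figure 1:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical genotype matrix: rows are individuals, columns are variant sites,
# entries count copies of the alternate allele (0, 1, or 2).
n_sites = 300
freq_a = rng.uniform(0.1, 0.9, n_sites)                              # population A
freq_b = np.clip(freq_a + rng.normal(0, 0.2, n_sites), 0.05, 0.95)   # shifted for B

pop_a = rng.binomial(2, freq_a, size=(100, n_sites))
pop_b = rng.binomial(2, freq_b, size=(100, n_sites))
genotypes = np.vstack([pop_a, pop_b])

# Project hundreds of genotype columns onto the first two principal components;
# individuals from the same population cluster together on these two axes.
coords = PCA(n_components=2).fit_transform(genotypes.astype(float))
print(coords.shape)  # one (PC1, PC2) point per individual
```

Plotting `coords` with one color per population would reproduce, in miniature, the country-colored clusters discussed below.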

This can be visualized as shown in Figure 1 from the Johns Hopkins Genomic Data Science course. The study used genome sequences from Europeans to plot genetic differences along two axes: PCA reduced all of the genetic differences to two dimensions.

Figure 1: (Credit: Johns Hopkins Introduction to Genomic Technologies, Week 1: From Genes to Phenotypes, lecture by Dr. Steven Salzberg.)

What does PCA represent biologically?

From the figure above we see that people from the same country (data points are color-coded by country) tend to be genetically similar, as shown by the clustering of colors. This is consistent with evolutionary genetic patterns, as shown in the phylogenetic tree below (Figure 2). Variation among populations can be seen explicitly within our genome. PCA offers a way to visualize this variation and can be used in many ways beyond extracting differences across the entire human population; it would also be a very useful tool for visualizing specific differences in traits or diseases.

Figure 2

In summary, PCA is a quick way to reduce complex datasets with many variables, common in the field of genomics, into visually simple representations of genetic variation. While this can obviously be done painstakingly by hand or by other computer programs, the visualization offered by PCA is very powerful as it can quickly display patterns in the data and guide further exploration.
