PCA for Bioinformaticians
How to use Principal Component Analysis in Computational Biology
Principal Component Analysis (PCA) is one of the most popular dimension reduction techniques in Machine Learning and Statistics. PCA is very useful in order to visualize data in human-readable formats, such as 2D plots for better understanding of high-dimensional data. We can obtain the same set of advantages in the domain of Bioinformatics by applying PCA on our genomic/proteomic datasets. In this article, I will explain how one might use PCA to differentiate between two species using the species’ oligonucleotide frequency vectors.
Simulating Reads
In order to demonstrate the use of PCA, first, we shall download two species. In this article I will be using the two species;
- Staphylococcus aureus (Nuccore ID
NC_007795.1
) - Pseudomonas aeruginosa (Nuccore ID
NC_002516.2
)
In order to vectorize these, we need multiple points so that they can be visualized to see clusters. Otherwise, we’ll only see two points in space with much less information. This can be achieved by the simulation of reads. However, when the reads are short (Illumina lengths) the oligonucleotide frequency vectors carry more errors. Therefore, I will be simulating PacBio reads using SimLoRD to simulate 100 reads from each species with a minimum length of 1500bp which is reasonable for PacBio. Furthermore, I will use the default error rate of SimLoRD to see if we can see two clusters for noisy long reads. I use the following commands.
simlord --no-sam -rr SA.fasta -n 100 -mr 1500 SA
simlord --no-sam -rr PA.fasta -n 100 -mr 1500 PA
This will output two fastq (SA.fastq and PA.fastq)files each containing 100 reads with quality scores.
Obtaining Oligonucleotide Frequency Vectors
There is no ideal value for the number of nucleotides to obtain k-mers, though having a larger k for long reads can increase the effect of noisy bases. Having the noise of errors in mind, we can set k to one of 3, 4 or 5. In this article, I will use the trimer (3-mer) frequency vectors. If you read the following article, you’ll find a complete program to do this vectorization very fast. I will be using this program to obtain the frequency vectors for this experiment.
I compile the above program using the following command;
g++ vectorize.cpp -o vectorize -fopenmp --std=c++17
I run the following commands to get the vectors for each species’ fastq reads.
./vectorize ./SA.fastq 3 8 100 > ./SA.txt
./vectorize ./PA.fastq 3 8 100 > ./PA.txt
Computing PCA and Plotting on 2D Plane
Now that we have vectorized the sequences from long reads, we can use Python’s Sci-Kit learn library to obtain the PCA decomposition. I will use the Seaborn library for Python to visualize them on the 2D plane.
# Imports
import numpy as np
import seaborn as sns
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt# Reading data
sa = np.loadtxt("SA.txt")
pa = np.loadtxt("PA.txt")# Preparing labels
data = np.append(pa, sa, axis=0)
labels = ["Staphylococcus aureus" for x in range(100)]
labels += ["Pseudomonas aeruginosa" for x in range(100)]
labels = np.array(labels)# Run PCA
pca = PCA(n_components=2)
data_2d = pca.fit_transform(data)# Plotting using Seaborn
fig = plt.figure(figsize=(10, 10))
sns.scatterplot(data_2d[:,0], data_2d[:,1], hue=labels)
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.savefig("PCA.png")
The above code will plot us the following diagram.
We can clearly see that the two species are separately identifiable as two clusters. This is an important indicator in the metagenomics binning where we have several unknown species in a sample. It is also worth knowing that despite the higher error rates in long reads, we can still see the separation.
However, this will not be the case for closely related species. Hence, this is not complete and the only solution for species clustering. Yet the PCA analysis will give valuable insights into the composition of a sample.
Possible Interpretations
- There are two different strains/species with distinct compositions.
- By observing the size of the clusters we can say how the abundance varies. Small clusters will show smaller abundance. In this sample, we have equal-sized clusters since we simulate 100 reads from each reference.
- We can observe how the genomic signature patterns vary within the species. For example, Pseudomonas aeruginosa has a less sparse cluster than Staphylococcus aureus. This means that Staphylococcus aureus has a greater variance of the signature pattern along the genome than that of Pseudomonas aeruginosa.
I hope this article would bring some light into your work. Thanks for reading. Cheers!