Microbial Surveillance: The Future of Public Health

The analytical process of studying urban microbial data, an unexplored field brimming with opportunity

Dhruv Khurjekar
The Startup
Aug 19, 2020

--

Urban microbial surveillance can play a crucial role in city planning and design, public health, and the discovery of new species.

Background

While a New York City subway station bustles with businessmen, students, artists, and millions of other city-goers every day, its floors, railings, stairways, toilets, walls, kiosks, and benches are teeming with non-human life. The microbial ecosystem, the complex web of relationships that microorganisms have with one another and with their environment, is omnipresent in shared public transportation spaces. PathoMap, a 2013 project led by Cornell professor Christopher Mason, was the first of an annual series of DNA collection projects in NYC. Sampling from subways found that only about half of the DNA matched known organisms, confirming that the urban microbiome was still a relatively unexplored field, virtually begging for researchers to seize the opportunity. The project's success led to the creation of the MetaSUB international consortium, and ever since, intensive studies have been carried out on microbial samples from urban locations around the globe. Beyond being a relatively new field of work, this research has many applications, and analyses like the ones below are being used to learn crucial information about COVID-19.

But what happens after the samples are collected? As I learned this summer, there is no magical one-step formula that outputs clean, categorized, analyzed, and graphed data. I had the opportunity to work with MetaSUB microbial data and learn the painstaking yet satisfying process of biological data manipulation and visualization.

How it Works

Collecting & Cleaning the Data in Linux Bash Terminal

Before the analysis is run, swab samples are collected from specified locations, and DNA fragment libraries are prepared and paired-end sequenced. Since both ends of the fragment are sequenced, this type of sequencing allows for more downstream analyses, such as better reference-sequence alignment, large structural variant detection, and de novo contig assembly.

While the biological sampling yields a plethora of data, not all of it is relevant for analysis. Adapter sequences, low-quality bases, and human DNA were all extraneous data points in the set. For my project, the first goal was to clean and categorize this data in the Linux Bash Terminal before conducting statistical and visual analyses in R Studio.

We used the Linux Bash Terminal for reading and cleaning the data. Read files are stored in a FASTQ format (as seen in Figure 1), and the following is the structure of this file type:

  • Line 1 contains the sequence identifier, which records where the sequence of bases came from (e.g. the sequencer and flow cell serial number)
  • Line 2 provides the actual raw letters of the nitrogenous base sequence
  • Line 3 starts with a plus sign and is a field where bioinformaticians can add additional metadata of their choosing
  • Line 4, the last line of each record, contains the quality score of each base as an ASCII symbol, with each symbol corresponding to a Phred quality (Q) score

Figure 1
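To make the four-line layout concrete, here is a toy record parsed in Python (the read ID and bases are made up; the real pipeline processes gzipped files of millions of such records):

```python
# A toy FASTQ record (hypothetical read ID and bases), laid out on the
# four lines described above.
record = [
    "@SEQ_ID flowcell:lane:tile",  # line 1: sequence identifier
    "GATTTGGGGTTCAAAGCAGT",        # line 2: raw bases
    "+",                           # line 3: optional extra metadata
    "!''*((((***+))%%%++)",        # line 4: per-base quality symbols
]

header, bases, plus, quality = record

# Phred+33 encoding: quality score = ASCII code of the symbol minus 33
phred = [ord(symbol) - 33 for symbol in quality]

print(bases[:5])   # first five bases
print(phred[:5])   # their Phred scores: '!' decodes to Q0
```

Note that the quality string is exactly as long as the base string: each symbol scores the base in the same position.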

However, the raw files still contain unwanted DNA fragments, including adapters (which are ligated to the template DNA molecules during library preparation) and low-quality bases (which can be identified using their ASCII, or Phred quality, scores). The AdapterRemoval program — executed from the Linux command line — is used to remove these adapter sequences and trim low-quality bases from the data, as seen in Figure 2.

Figure 2

As seen in Figure 2, the sample read files are identified and unzipped (the first line of code), AdapterRemoval is run, the reads are filtered against a minimum quality score, and new FASTQ files are created from the trimmed reads.
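The quality-trimming idea can be sketched in a few lines of Python. This is only an illustration of the concept, not AdapterRemoval's actual algorithm: trailing bases whose Phred score falls below a cutoff are dropped.

```python
# Illustration of quality trimming (the concept behind the AdapterRemoval
# step, not its actual algorithm): drop trailing low-quality bases.
bases = "ACGTACGTAA"
quality = "IIIIIIII#!"  # Phred+33: 'I' = Q40, '#' = Q2, '!' = Q0
cutoff = 20             # hypothetical minimum acceptable Phred score

scores = [ord(symbol) - 33 for symbol in quality]

end = len(bases)
while end > 0 and scores[end - 1] < cutoff:
    end -= 1  # walk back past trailing bases below the cutoff

trimmed_bases = bases[:end]
trimmed_quality = quality[:end]
print(trimmed_bases)  # the two low-quality trailing bases are gone
```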

Although these sequences have been removed, the reads are still not thoroughly cleaned. For this project, only microbial data is needed, yet DNA samples from a public transportation space often contain a substantial fraction (~50–90+%) of human DNA left behind by straphangers. Because a typical bacterial genome is only about 0.1% the size of the human genome, even small amounts of shed human material can dominate the sequenced reads. The bowtie2 aligner was used to identify human DNA sequences and discard them so they would not skew downstream analysis.

Figure 3

The bowtie2 aligner takes reads and tries to match them to a specified reference genome. Instead of checking each read against every position in the reference (which would yield a terrible run time), the algorithm uses an indexed reference sequence for greater efficiency. Several human reference genomes have been assembled for such purposes; the most commonly used one, build 38 (GRCh38), is used here. The way I thought of this was like a book: to find a specific term or sentence, it is far easier to check the index for the page number than to flip through the full text. Likewise, given the index of the human reference genome, bowtie2 knows where to look rather than scanning the whole sequence. Reads that align to the human genome are discarded, and new files are created containing only the cleaned reads.
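The book-index analogy can be made concrete with a toy k-mer lookup table. This is not bowtie2's actual data structure (bowtie2 uses an FM-index built on the Burrows-Wheeler transform), but it shows why building an index once beats scanning the reference for every read:

```python
# Toy k-mer lookup table (NOT bowtie2's FM-index), illustrating the
# book-index analogy: build the index once, then each lookup is cheap.
reference = "ACGTACGGTACGTTACG"  # made-up miniature "genome"
k = 4

index = {}
for i in range(len(reference) - k + 1):
    kmer = reference[i:i + k]
    index.setdefault(kmer, []).append(i)  # record every start position

read = "TACG"
positions = index.get(read, [])  # candidate alignment positions
print(positions)
```

A read absent from the index (here, anything not matching the reference) comes back with an empty list, which is exactly the behavior used to separate human from non-human reads.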

After cleaning the data of unwanted reads (adapter sequences, low-quality bases, and human DNA), it is essential to quantify how much data remains at each stage of the process. To count the total number of reads, the zcat function is used to tally the total number of FASTQ file lines, and the output is divided by four, since each read spans four lines of the FASTQ file.
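The same computation can be reproduced in Python on a toy gzipped FASTQ: write three four-line records, count the lines, and divide by four.

```python
import gzip

# Build a toy gzipped FASTQ with three reads (four lines per read), then
# count reads exactly as the zcat pipeline does: total lines / 4.
with gzip.open("toy.fastq.gz", "wt") as fh:
    for i in range(3):
        fh.write(f"@read{i}\nACGT\n+\nIIII\n")

with gzip.open("toy.fastq.gz", "rt") as fh:
    n_lines = sum(1 for _ in fh)

n_reads = n_lines // 4
print(n_reads)  # 3 reads
```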

Figure 4
  • Initial # of reads: 3,304,183
  • # of reads after AdapterRemoval: 3,304,183
  • # of reads after Bowtie 2 Alignment: 3,297,531

Furthermore, to learn more about the data, the GC (guanine-cytosine) content is calculated using the QC-stats function, as seen in Figure 5.

Figure 5
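As a simplified stand-in for the QC-stats output, GC content can be computed directly from the sequence lines (the reads below are made up):

```python
# Simplified GC calculation over sequence lines (made-up reads; QC-stats
# reports this along with other per-file statistics).
reads = ["GGCCGCAT", "ATATGCGC"]

gc = sum(seq.count("G") + seq.count("C") for seq in reads)
total = sum(len(seq) for seq in reads)
gc_percent = 100 * gc / total
print(f"GC content: {gc_percent:.1f}%")  # 10 of 16 bases are G or C
```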

GC content matters because it correlates with the type of bacteria: bacteria specific to a particular biome may have comparatively lower or higher GC content than those of other biomes. GC content also correlates strongly with the stability of DNA, since G-C base pairs are held together by three hydrogen bonds rather than the two of A-T pairs. A higher GC content therefore means a higher temperature is required to denature the DNA, information that is useful for PCR amplification and for identifying the source environment of the bacteria. Because our samples are metagenomic, we do not know the exact GC percentage to expect, so the value is used simply to check that the two read files of each paired-end sample have similar content, validating that the library preparation worked. Next, using a program called Kraken, we taxonomically categorized and quantified the data and inspected the output in the Linux terminal using the vim text editor, as seen in Figure 6.

Figure 6

Kraken’s output is handy, as the actual classifications of the microbes can now be seen. In Figure 6, only a small fraction of the total classifications is visible (that is all that fit in the screenshot). The letters prefixing the rankings abbreviate Domain, Kingdom, Phylum, Class, Order, Family, Genus, and Species (the taxonomic hierarchy). The number at the end of each line is the proportion of the corresponding species (or domain, phylum, etc. if the species was not identified) in the sample. The program cannot identify many of the species because they have yet to be discovered or are absent from the reference database.
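To make the report layout concrete, here is a toy report in the rank-prefixed, pipe-delimited style described above (the taxa and proportions are made up, not real Kraken output), filtered down to species-level entries:

```python
# A toy report in the rank-prefixed, pipe-delimited style described above
# (taxa and proportions are made up, not real Kraken output).
report = """\
d__Bacteria 98.0
d__Bacteria|p__Proteobacteria 60.0
d__Bacteria|p__Proteobacteria|g__Escherichia|s__Escherichia_coli 12.5
d__Bacteria|p__Actinobacteria 30.0
"""

# Keep only entries classified all the way down to species level
species = {}
for line in report.splitlines():
    taxon, proportion = line.rsplit(" ", 1)
    if "s__" in taxon:
        name = taxon.split("|")[-1].removeprefix("s__")
        species[name] = float(proportion)

print(species)
```

Lines that stop at domain or phylum level carry no "s__" field, which is how reads that could not be classified to species show up in the report.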

Visual & Categorical Analysis in R

After finally cleaning and classifying the raw data, we move to R to visualize and categorize it. Human Microbiome Project gastrointestinal data and a set of mystery samples were added to the project for comparison, and our new goal was to create visualizations and conduct statistical analyses.

The first step in RStudio is to load the packages needed to visualize the data; ggplot2 is a package used to create all sorts of graphs and plots. The data is then read in using the read.csv function. Next, the geom_bar function of ggplot2 is used to create a bar plot of the phylum classifications of the various samples:

Figure 7

As shown in the bar graph in Figure 7, there are three samples from each environment: the first three are gastrointestinal, the next three come from a mystery environment, and the final three come from shared transportation spaces. Some distinctions are already apparent. Gastrointestinal samples contain far more Bacteroidetes than the other two environments, and the first mystery sample consists almost entirely of Actinobacteria. Despite these initial observations, it is difficult to draw a conclusion about the mystery samples while they are grouped only broadly by phylum, as seen in line 4 of the code. Before making a bar graph of more specific taxonomic groups, though, it is worth analyzing the variance among the samples. Principal Component Analysis (PCA), which reduces the number of variables, or dimensions, in the dataset, makes it easier to see the variance and patterns among groups. As seen in Figure 8, a few lines of code create this PCA plot, with the points colored by environment.

Figure 8
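The core of the PCA step can be sketched without ggplot2: center the abundance matrix, take its singular value decomposition, and project the samples onto the first two principal components. The matrix below is a made-up stand-in for the real taxon-abundance table, with two tight "environments" baked in:

```python
import numpy as np

# Made-up samples-by-taxa abundance matrix (rows are samples): two tight
# clusters standing in for the real taxon-abundance table.
X = np.array([
    [10.0, 1.0, 0.5],
    [ 9.5, 1.2, 0.4],
    [ 1.0, 8.0, 3.0],
    [ 0.8, 9.0, 2.5],
])

# PCA via SVD of the mean-centered matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt[:2].T  # coordinates of each sample on the first two PCs

# Fraction of total variance captured by each component
explained = S**2 / np.sum(S**2)
print(pcs)
print(explained)
```

Because the two clusters are far apart relative to their internal spread, almost all of the variance lands on the first component, which is exactly the kind of tight-versus-scattered pattern read off the real plot below.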

A few conclusions can be drawn from the plot in Figure 8. One is that both the environmental (mass transit) and gut samples show little variance, since the points within each group cluster tightly. The mystery samples, meanwhile, are spread far apart, which tells us they come from an environment where the types of microbes vary considerably. From these initial observations, one might be quick to suggest soil, an environment known to contain large amounts of Actinobacteria and to be highly variable. That conclusion is likely flawed, however; a more specific grouping than the first bar graph is needed before comparing against other studies. When ranked by genus, the taxonomic abundances of the mystery samples look far more similar to those in Figure 9, which comes from a research paper on skin pore taxonomic abundance, not soil. Even though Figure 9 comes from a study on the impact of pomegranate juice on human skin microbiota, we can compare its control groups with our data.

Figure 9

As it turns out, Propionibacterium and Staphylococcus were the most common microbes in the mystery samples. Still, we would lack this specificity without narrowing down the taxonomic grouping from phylum to genus. Indeed, human skin was the correct environment from which the mystery samples were taken.

Conclusions & Significance

Although the main visualization and analysis stages of this project were simplified compared to the actual MetaSUB workflows, projects like these highlight the significance of the research, help participants explore the domain, and help train scientists and graduate students to conduct bioinformatics analyses. With so many unknown species lurking in mass transit spaces around the world, studying the microbiome can spur innovation in the field. The possible applications are endless: the data can play a crucial role in urban planning, city design, public health, and the discovery of new species. As the MetaSUB website says, the “data will… [enable] an era of more quantified, responsive, and smarter cities.” Especially after this past year’s COVID-19 pandemic, researching and analyzing the microbial environments of spaces that billions of people share is all the more important, if not necessary. In fact, MetaSUB recently conducted an analysis very similar to the one above for SARS-CoV-2, the key difference being that it involved an RNA virus.

Credits:

MILRD & Trelify for providing the data, a mentor, and steps for the process.

Paul Scheid from MILRD for his support and immense help with editing.

Sources Used:

Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

Cell Systems Papers: https://www.sciencedirect.com/science/article/pii/S2405471215000022

Human Microbiome Project: https://www.hmpdacc.org/

Kraken: http://ccb.jhu.edu/software/kraken/

MetaSUB: http://metasub.org/

PathoMap: http://www.pathomap.org/

Pomegranate and Human Skin Microbiota Paper: https://www.researchgate.net/figure/Relative-abundance-of-skin-microbiota-before-and-after-pomegranate-and-placebo_fig1_336383548
