Corona’s core compared to other viruses

Building phylogenetic trees in Python

Published in

Orikami blog

8 min readJun 16, 2020

The world has been taken hostage by that little cluster of proteins called SARS-CoV-2 and causing COVID-19. While sitting at home, locked down, I started pondering about biology, epidemiology, and the wider perspective. It seems to make sense comparing the current pandemic to previous outbreaks of contagious diseases. Quite frequently this happens based on number of infected people, fatalities, and basic reproduction numbers (r0). From a data science perspective several dashboards have been composed that allow monitoring day-to-day information on hospitalization, ICU occupation, infected and diseased people. Additionally, other effects on society are also monitored such as traffic, hotel bookings, and restaurant visits. In this blog I would like to take a somewhat other perspective and try to compare the nature of SARS-CoV-2 to other viruses. I do so by using available genetic data of viruses to build phylogenetic trees using Python.

Sequenced genomic data

Before we go into the trees, let’s first take a look at genomes. You most certainly heard of the sequencing of the human genome. The sequencing of the human genome was preceded by the genomes of simpler species such as yeast (Saccharomyces cerevisiae) and fruit flies (Drosophila melanogaster); and followed by many other species such as cows (Bos taurus), saltwater crocodiles (Crocodylus porosus), and red birds of paradise (Paradisaea rubra). As shown by the provided links, this data is not stowed away in basements of national institutes, or saved in the outskirts of the solitary, august scientists’ PCs, but made publicly available. Everyone with a computer and internet connection can take notice of this information and use it.

The readily availability of genomic data is special

The readily availability and practical character of this data is rather special. For a lot of datasets that is often not the case. Many highly practical datasets, such as collected by weather institutes, Facebook or Google are proprietary, and not freely available. Other datasets, such as astronomical data is freely available, but often not very practical. It is customary for new publications in genomic research to publish the data online for others to check and build upon.

Viral genomes

In this blog I would like to show you how to analyse genomic data on a timely topic: the SARS-CoV-2 virus causing COVID-19 and it’s relation to other viruses. The relation between biological species is often depicted in phylogenetic trees. In such trees species with similar characteristics are grouped close together with short branches in between. Species that have little in common are positioned far apart. That means, for example, that hairy animals are grouped together and separated from the feathery or spineless species. The particular branching of the tree can provide insight into the evolutionary development and relationships. A good introduction into phylogenetic trees can be found in this blog from Khan Academy.

Today we do not have to rely on morphology of the species alone to build such trees, but we can make use of the core of their being: their DNA. (Or RNA for many viruses, but let’s not dwell on that topic now.) For this blog I have gathered a number of viruses notoriously infamous for causing various diseases. The associated RNA sequences can be easily located and downloaded from the NCBI website. The viruses I have used for this blog are listed in the table below.

Table with the name of the virus, the NCBI accession number, and a description of the disease caused by the virus.

Based on this table the actual download of the viruses’ genetic code is not difficult. Many websites show how to do so; for example the Biopython cookbook chapter 9 which deals with access to NCBI databases. The python code below is based on this cookbook and shows how I approached the download:

Firstly the table above, with the accession numbers of all viruses, is loaded into memory.
Secondly, the accession numbers in this table are used to search the NCBI database, which hosts a lot of genetic data. The Entrez module provides a convenient way to communicate with this database. It is advised to provide your e-mail address such that NCBI can contact you in case of problems. The accession numbers from the table are cast into a query-string that is used to query the NCBI database. The query result is subsequently read and stored in the variable result. All identifiers in this result are then fetched from the NCBI database and stored in the list seq_records.
Thirdly some basic characteristics are plotted (more on that in a minute), and lastly the downloaded sequences are saved in a ‘fasta’ file for later usage.

Python code to download the genetic sequences from NCBI and make some basic plots.

Now a bit more on the plots. I have plotted a few — rather basic — characteristics of the sequences (see below). First of all the length of the sequences, which represents the total number of nucleotides in the genetic code: the total number of guanine (G), adenine (A), cytosine (C) and thymine (T) in DNA or uracil (U) in RNA. As can be seen from the figure the sequences are quite different in length. While the corona-family is as long as 30000 nucleotides, the influenza type viruses are as short as 1000–1500 nucleotides.

Additionally the percentage of the nucleotides that equal G or C is also plotted. The GC-content can vary quite a bit between species, and also across the genome. It has been used for classifying bacteria, and some mammals have remarkably high GC-content. The GC-content is easy to calculate and frequently inspected. Therefore I have added a plot showing GC-content, which shows some differences between the selected viruses.

Genome length and GC content of the selected viruses

Alignment

To properly compare the sequences, we need to align them together. That means, we need to search for parts in the sequences that are similar to each other. Which parts are the same, which are similar, and where are parts missing from one genome or the other? The sequences vary quite a bit in length. And, since they originate from very different virus-families, they are expected to be quite different from each other. Aligning all of them together is therefore a stretch — like comparing apples and oranges. If we want to do that anyway there are a number of alignment tools to our disposal. One of the best and fastest tools around is called MUSCLE, which can even be run online.

The code below shows how to do so using Python, inspired again by the BioPython cookbook example. The core of this code is the execution of the alignment, no surprises there. Alignment of sequences is a computationally intensive task. Therefore, the best way to align sequences in Python is to first download the MUSCLE executable and call this program from within Python. That is done by defining the command line in- and output together with a number of options. Importantly, you need to provide the location and name of the executable. Additionally, since our sequences are rather dissimilar, attempting to reach fine alignments with more than two iterations of the algorithm are futile.

When the alignment is completed, the result can be obtained from the standard output of the MUSCLE executable. This result is plotted and saved for further usage as ‘fasta’ file.

Python code to align the downloaded sequences and create a plot visualizing the alignment.

Printing the resulting alignment is not so insightful; it is visually more attractive to create a figure. As can be seen in the figure below the overlap is quite sparse, especially for the influenza viruses. This makes actually sense, since these sequences are the shortest. There is actually quite a bit of overlap between the corona-family of viruses: SARS, MERS, and COVID-19. Since they are members of the same family, it is to be expected that they look alike, also from a genetic perspective.

Figure that shows the overlap in genetic code between the viruses. The flu-viruses are much shorter, and therefore have little overlap with the corona-family of viruses. Figure was inspired by Damien Farrell.

Building phylogenetic trees

Using this alignment result we can start calculating the distance between the sequences and build a phylogenetic tree. Different methods can be used for calculating the distance between sequences. Unfortunately the documentation for these methods is rather brief. The ‘identity’ method seems to reflect the fraction of mismatches, while the ‘blastn’ method provides different scores for matches and mismatches. For simplicity reasons I gladfully have applied the ‘identity’ method.

From biology it is known that different type of mutations do not happen equally frequently. Changes from one purine nucleotide to another (A↔G) and changes from one pyrimidine nucleotide to another (C↔T) or (C↔U) are called transitions. Changes between purine nucleotides and pyrimidine nucleotides are called transversions. Transitions are more likely to happen than transversions. Therefore it seems to make sense to use different penalties for different type of mutations (transitions or transversions). This can be done by using the method ‘trans’.

Next, the tree is constructed using the UPGMA method: Unweighted Pair Group Method with Arithmetic Mean. This method joins the two nearest clusters into a higher-level cluster at each step. It is a rather simple and fast method. Other methods exist as well, most notably the parsimony method. This method joins two clusters together that need the least amount of base pair mutations. One can pretty easily try other methods by replacing the UPGMA by other algorithms as laid out in the Biopython documentation.

All that is left that is a bit of visual polishing; I remove the non-terminal names, and replace the names of the other terminals by names that are easier to comprehend. When that is done, the tree can be printed in ascii, which has a bit of the ’80 flavour, or plotted in a plot as shown below.

Python code to build and plot the phylogenetic tree.

From the figure it can be seen that Covid-19 is most similar to the SARS and MERS viruses, as expected. The figure also shows the flu-viruses nicely clustered together, as are two types of HIV viruses. Clearly, the corona family, among which Covid-19, is unrelated to the flu-viruses, which is the one of the reasons the human immune system has trouble fighting this disease. The Corona family has remarkably longer branches than, say, the flu viruses. Although SARS, MERS and COVID-19 are next of kin, apparently they are still not so close.

Closing words

As mentioned before, comparing such a wide variety of viruses is like comparing apples and oranges. In fact, these viruses could be considered incomparable. Nevertheless, for me it has been an insightful exercise to use the publicly available genomic data to build up a phylogenetic tree. Bits and pieces in the used code can be repurposed to build trees for, say, the genes that code for haemoglobin in ungulates. I will save that discussion for another blog. In the meantime I would be very interested in hearing your opinion on this blog!