Published in


Genomic sequencing: A necessity for the thriving of the human race

Figure 1: Photo by National Cancer Institute on Unsplash

During the past few months, all of humanity is held captive by this hideously destructive virus. SARS-CoV-2, also known as COVID-19, is very well known to us by now. Where many helpless individuals have fallen prey to this deadly virus raging like wildfire, others among us are finding a cure for this virus. This is where genomic sequencing comes into the picture.

So what is Genome?

According to some experts, the genome is an organisms complete set of instructions. These instructions are responsible for giving a giraffe its long neck or elephant its humongous size. It is also responsible for assigning life-threatening properties to a virus, like CoronaVirus. As stated by Dr Preeti Kumar, Vice President- Public Health System Support at the Public Health Foundation of India (PHFI),

As we know well that virus undergoes mutation, as they replicate. Mutation are tiny errors in the genetic code that occur during the duplication process. While some are harmless, others can bring about such changes in the virus that can make it more transmissible or deadly. So, it is important to keep a track of the genomic sequencing in order to check the virulence of the virus. As the number of differences between viral strains increase, it’s likely that the resulting viruses will also behave differently in terms of how fast it spreads, the kind of symptoms and the intensity of the disease it causes. The differences can be beneficial, harmful or inconsequential.

Analysing different genetic code of viruses from different patients helps identify the transmission of the virus in real-time and numerous strains dominating a particular area. Consequently, helping to build different strategies for clinical healthcare in that region. Furthermore, sequencing helps identify the superspreaders and hotspots. Therefore, an understanding of viral genomic sequencing can help researchers and scientists design the vaccines and also in combating the persistently mutating and evolving virus. After laying out the importance of genomic sequencing, let us proceed to the sequencing and analysis segment.

Python provides various support libraries explicitly designed for the sequencing of DNA. We will be using Bio python in python language for sequencing analysis. The following points are addressed in this article:

  1. Juxtapose the DNA and amino acids sequence of various viruses.
  2. Compare the GC content (Guanine-Cytosine content).
  3. Analysis of the frequency of the amino acids in the sequence.
  4. Similarity measure between the protein sequences.
  5. Visualization of the 3D structure of the sequence.

We will use Biopython for parsing DNA sequence data. It provides a simple uniform interface to input and output various sequence file formats. Here, we will use .FASTA file format for DNA sequence and .PDB for parsing 3D protein structure.


The data used for this analysis is taken from National Center for Biotechnology Information(NCBI) site. Links for the data are available on the GitHub repository mentioned at the end of this article. The data consists of .FASTA format for DNA sequence and .PDB(Protein Data Bank) for 3D structure analysis.

The DNA has a double helix structure consisting of two strands. Each strand is attached with the other through nucleotides with either double bond or triple bond, depending on the nucleotide pair. Whereas RNA is single-stranded. Every single character in the DNA sequence is called a nucleotide. Nucleotides are the building blocks of DNA and RNA sequence. The DNA and RNA sequence consists of four different nucleotides. The nucleotide for DNA are :

  • A: adenine
  • C: cytosine
  • G: guanine
  • T: thymine

In RNA sequence, Thymine is replaced by another nucleotide called as :

  • U: Uracil

Figure 2 shows a clear chemical structure of all five nucleotides.

Figure 2: Types of Nucleotides in a genomic sequence (Taken from

The nucleotide A only attaches itself with T with a double bond, thus forming an AT pair. On the other hand, only C and G can attach, forming a triple bond pair, as shown in figure 3. Any other pair combination is not possible. These are the only possible combinations, AT and GC.

Figure 3: Bonds of GC and AT pairs (Taken from

Importing necessary files and Reading the data

from Bio.Seq import Seq 
from Bio.SeqUtils import GC, seq3
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
from Bio import SeqIO
from collections import Counter
import matplotlib.pyplot as plt
from Bio.PDB import PDBParser
import nglview as nv
import numpy as np
import pandas as pd
import warnings

For the analysis using python, some libraries need to be imported. BioPython library is used to parse and analyse the data, whereas nglview and matplotlib libraries are used for visualization purposes. These and some more libraries are imported as shown in the above code snippet.

#Parsing fasta files
COVID ="Data\covid19_sequences.fasta","fasta")
MERS ="Data\mers_sequences.fasta","fasta")
SARS ="Data\sars_sequences.fasta","fasta")
EBOLA ="Data\ebola_sequences.fasta","fasta")
ZIKA ="Data\zika_sequences.fasta","fasta")
HIV ="Data\hiv_sequences.fasta","fasta")
#Parsing pdb files
COVID_structure = PDBParser().get_structure("COVID_6lu7","Data\COVID_6lu7.pdb")
MERS_structure = PDBParser().get_structure("MERS_6lu7","Data\MERS_6war.pdb")
SARS_structure = PDBParser().get_structure("SARS_6lu7","Data\SARS_6vxx.pdb")
EBOLA_structure = PDBParser().get_structure("EBOLA_6lu7","Data\EBOLA_6nae.pdb")
ZIKA_structure = PDBParser().get_structure("ZIKA_6lu7","Data\ZIKA_6co8.pdb")
HIV_structure = PDBParser().get_structure("HIV_6lu7","Data\HIV_1dmp.pdb")

Ensuing importing of libraries, we are ready for reading the virus data from the downloaded files. For reading the sequencing data, a SeqIO reader is used. The file contains all data and metadata along with the genomic sequence. However, we are interested in the genomic sequence of the virus’s DNA for our analysis purpose. Thus, the .seq method is used to access only the genomic sequence data we are interested in, which we can use for further analysis. This is shown in the below snippet. Along with this, the output below shows the string of DNA sequence with a length of up to ten nucleotides.

COVID_sequence = COVID.seq
MERS_sequence = MERS.seq
SARS_sequence = SARS.seq
EBOLA_sequence = EBOLA.seq
ZIKA_sequence = ZIKA.seq
HIV_sequence = HIV.seq
print("First 10 nucleotides in DNA sequence of COVID DNA are : ",COVID_sequence[:10])
print("First 10 nucleotides in DNA sequence of MERS DNA are : ",MERS_sequence[:10])
print("First 10 nucleotides in DNA sequence of SARS DNA are : ",SARS_sequence[:10])
print("First 10 nucleotides in DNA sequence of EBOLA DNA are : ",EBOLA_sequence[:10])
print("First 10 nucleotides in DNA sequence of ZIKA DNA are : ",ZIKA_sequence[:10])
print("First 10 nucleotides in DNA sequence of HIV DNA are : ",HIV_sequence[:10])
First 10 nucleotides in DNA sequence of COVID DNA are : ATTAAAGGTT
First 10 nucleotides in DNA sequence of MERS DNA are : ATTTAAGTGA
First 10 nucleotides in DNA sequence of SARS DNA are : ATATTAGGTT
First 10 nucleotides in DNA sequence of EBOLA DNA are : CGGACACACA
First 10 nucleotides in DNA sequence of ZIKA DNA are : AGTTGTTGAT
First 10 nucleotides in DNA sequence of HIV DNA are : GGTCTCTCTG

The above output for the code snippet shows that the nucleotides A, T, C, G are present, and U is absent, thus ensuring that the sequence is a DNA sequence and not an RNA sequence.

Length of the Sequence and the GC content

GC content of DNA is the number of G-C pairs of nucleotides. It is an important criterion for accessing the stability of the DNA. The higher the GC pairs, the more stable is the DNA. Also, a higher GC-content level indicates a relatively higher melting temperature. Thus, under pressure, such as when exposed to heat, the GC-rich sequences can take far more abuse than GC-low sequences. GC-content is usually expressed as a percentage value and is calculated as:

Figure 4: GC Content ratio

AT content is the number of pairs of A and T nucleotides. AT/GC ratio is calculated as,

Figure 5: AT-GC ratio

Both these ratios give independent information regarding the GC content. The first ratio gives the absolute GC content in the sequence. In comparison, the second ratio gives information of GC concerning AT content. In python, the GC content is calculated using the inbuilt function GC(param), where the param is the complete genomic sequence of the virus. This is shown in the code below,

print("Number of GC pair in COVID sequence is: ",GC(COVID_sequence))
print("Number of GC pair in SARS sequence is : ",GC(SARS_sequence))
print("Number of GC pair in MERS sequence is : ",GC(MERS_sequence))
print("Number of GC pair in EBOLA sequence is: ",GC(EBOLA_sequence))
print("Number of GC pair in ZIKA sequence is : ",GC(ZIKA_sequence))
print("Number of GC pair in HIV sequence is : ",GC(HIV_sequence))
Number of GC pair in COVID sequence is : 37.97277865097148
Number of GC pair in SARS sequence is : 40.840901446350486
Number of GC pair in MERS sequence is : 41.17764272192886
Number of GC pair in EBOLA sequence is : 45.48127921020848
Number of GC pair in ZIKA sequence is : 51.258327165062916
Number of GC pair in HIV sequence is : 42.119594815379585

The sequence's length is the total characters in the DNA sequence, i.e. the total number of nucleotides on one DNA strand. Length is an important parameter for calculating the similarity of the two sequences, which we will compute later in this article. The length of the sequence is calculated in python as shown below,

print("Length of COVID sequence is : ",len(COVID_sequence))
print("Length of MERS sequence is : ",len(MERS_sequence))
print("Length of SARS sequence is : ",len(SARS_sequence))
print("Length of EBOLA sequence is : ",len(EBOLA_sequence))
print("Length of ZIKA sequence is : ",len(ZIKA_sequence))
print("Length of HIV sequence is : ",len(HIV_sequence))
Length of COVID sequence is : 29903
Length of MERS sequence is : 30111
Length of SARS sequence is : 29730
Length of EBOLA sequence is : 19043
Length of ZIKA sequence is : 10808
Length of HIV sequence is : 9181

Conversion of DNA to proteins/Amino acids

Proteins analysis forms an important part of genomic sequencing. But the downloaded data is in the form of a DNA sequence. Thus for sequencing purpose, it is necessary to convert the DNA sequence to a protein sequence. After obtaining the protein sequence, analysis can be done based on the frequency of a particular protein. The Proteins frequency is how many times a particular protein is repeated in the whole sequence and its position. The frequency of proteins affects the behavioural pattern and the characteristics of the virus.

Figure 6: Conversion of DNA sequence to amino acid chain (Taken from

The conversion process from DNA to a complete protein sequence is called protein biosynthesis. The process is shown in figure 6. It is composed of two different stages.

  • The first stage is called transcription. This is where a strand of double helix DNA (a gene) is unwound so that one side can be read by an enzyme called RNA polymerase(shown in green colour). Further, this converts the grey colour DNA to the pink colour mRNA sequence.
  • The second part of the process is called translation. The translation is translating the sequence of a messenger RNA (mRNA) molecule to a polypeptide chain sequence. The nucleotides in the mRNA are read as groups of three called codons. Codon is a nucleotide triplet that encodes an amino acid. Each codon corresponds to a particular amino acid. To illustrate this, a codon takes three nucleotides from the mRNA sequence and converts the three-nucleotide into a single protein. This process is continued until the end of the chain; thus, the amino sequence's length is one-third of the RNA genome sequence length. This conversion from mRNA to polypeptide chain is called translation.

Transcribe and Translation in BioPython is done by implementing .transcribe() and .translate() method. The below cell shows DNA, mRNA and protein chain in COVID virus(First 10 characters). Every character in the protein sequence is a single protein structure. Furthermore, it also indicates that U in mRNA replaces the T nucleotide in the DNA sequence. In the protein chain, I, K, G are individual proteins.

print("DNA sequence for COVID is : ",COVID_sequence[:10])
print("mRNA sequence for COVID is : ",COVID_mRNA[:10])
print("Protein/AA sequence for COVID is : ",COVID_proteins[:10])
DNA sequence for COVID is : ATTAAAGGTT
mRNA sequence for COVID is : AUUAAAGGUU
Protein/AA sequence for COVID is : IKGLYLPR*Q

There are 20 types of proteins in the sequence, denoted by symbols,

  • alanine — ala — A
  • arginine — arg — R
  • asparagine — asn — N
  • aspartic acid — asp — D
  • cysteine — cys — C
  • glutamine — gln — Q
  • glutamic acid — glu — E
  • glycine — gly — G
  • histidine — his — H
  • isoleucine — ile — I
  • leucine — leu — L
  • lysine — lys — K
  • methionine — met — M
  • phenylalanine — phe — F
  • proline — pro — P
  • serine — ser — S
  • threonine — thr — T
  • tryptophan — trp — W
  • tyrosine — tyr — Y
  • valine — val — V

Protein Analysis

The number of occurrences of a specific protein and its location in the sequence plays a crucial role in species development. If certain amino acids are optimal for protein structure, natural selection should have acted over evolutionary time to increase these amino acids frequency. And by analysing this frequency, we can find a lot more information regarding the virus's function. Python provides a way to calculate the frequency of all the proteins present in the sequence. It can be calculated using the method count_amino_acids(). The output of this function which is the frequency, is then plotted on a graph using matplotlib. The results of the proteins frequency for the different virus is shown below via a line graph,

Figure 7: Line graphs of, type of protein to its frequency in a specific sequence

The above graph shows that COVID-19, MERS, and SARS show similar graph patterns, which manifest their related nature. Whereas HIV, ZIKA and EBOLA display a different scenario regarding the distribution of the amino acids in the protein sequence.

Similarity Index of COVID with other viruses

The similarity index gives a value that states how similar the two sequences are compared to each other. It can vary between 0 to 100. Where 0 means they are not at all similar, while 100 means they are completely the same. Both of these are extreme and thus improbable values for similarity index for two distinct virus sequences. It is calculated as the number of the same nucleotides to the total nucleotides. Another method for calculating similarity Index is by using hamming distance and edit distance.

Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other or the minimum number of errors that could have transformed one string into the other. It is used for error detection or error correction and also used to quantify the similarity of DNA sequences. It only allows substitution operation to be performed on the sequence.

Edit distance: Edit distance is a way of quantifying how dissimilar two strings (e.g., sequence) are to one another by counting the minimum number of operations required to transform one string into the other. e.g. Levenshtein distance is a form of Edit distance. One major difference between hamming and edit is, hamming is used on the same length string. In contrast, the edit distance can be used for the different string as it allows many more operation on the string like insertion, deletion, substitution.

Shown below is the similarity index of covid with other viruses.

COVID_n_MERS = pairwise2.align.globalxx(COVID_sequence,MERS_sequence, one_alignment_only=True, score_only=True) 
Similarity_with_MERS = COVID_n_MERS/max(len(COVID_sequence),len(MERS_sequence)) * 100
COVID_n_SARS = pairwise2.align.globalxx(COVID_sequence,SARS_sequence, one_alignment_only=True, score_only=True)
Similarity_with_SARS = COVID_n_SARS/max(len(COVID_sequence),len(SARS_sequence)) * 100
COVID_n_EBOLA = pairwise2.align.globalxx(COVID_sequence,EBOLA_sequence, one_alignment_only=True, score_only=True)
Similarity_with_EBOLA = COVID_n_EBOLA/max(len(COVID_sequence),len(EBOLA_sequence)) * 100
COVID_n_ZIKA = pairwise2.align.globalxx(COVID_sequence,ZIKA_sequence, one_alignment_only=True, score_only=True)
Similarity_with_ZIKA = COVID_n_ZIKA/max(len(COVID_sequence),len(ZIKA_sequence)) * 100
COVID_n_HIV = pairwise2.align.globalxx(COVID_sequence,HIV_sequence, one_alignment_only=True, score_only=True)
Similarity_with_HIV = COVID_n_HIV/max(len(COVID_sequence),len(HIV_sequence)) * 100
print("Similarity index of COVID with MERS virus is : ",Similarity_with_MERS)
print("Similarity index of COVID with SARS virus is : ",Similarity_with_SARS)
print("Similarity index of COVID with EBOLA virus is : ",Similarity_with_EBOLA)
print("Similarity index of COVID with ZIKA virus is : ",Similarity_with_ZIKA)
print("Similarity index of COVID with HIV virus is : ",Similarity_with_HIV)
Similarity index of COVID with MERS virus is : 69.4065291753844
Similarity index of COVID with SARS virus is : 82.83784235695416
Similarity index of COVID with EBOLA virus is : 50.443099354579815
Similarity index of COVID with ZIKA virus is : 34.07684847674146
Similarity index of COVID with HIV virus is : 30.140788549643847

From the above cell output, it can be seen that the similarity index of COVID with SARS is more than that of EBOLA, ZIKA or HIV, which can also be corroborated with the help of the graphs shown above. This shows that the COVID and SARS belong to the same family. Thus have greater similar nucleotides than other viruses.

3D structure of mentioned viruses

The function of proteins is dependent on the shape of the proteins. The 3D structure of proteins provides important information regarding the orientation and connectivity of various elements and determines whether they can interact with specific molecules. It is critical for studying protein folding, evolution, function prediction. As shown in the figure below, the sequence is connected to the neighbouring molecule and is entangled differently for every virus. This is well evident from the 3D structure. The 3D structures can be accessed from the python notebook uploaded on the Github link given below.

Figure 8: Visualization of 3D structure

Raw data used for the above analysis, along with complete genomic sequence analysis code, can be found on GitHub.


  1. Calculation of GC content and AT/GC ratio aids in examining the melting point stability of the sequence for higher temperature. Among the analysed virus sequences, ZIKA has the highest GC content, i.e. higher stability than others. In comparison, COVID-19 having the lowest GC content from the above viruses.
  2. COVID started with 29903 DNA sequence length. The length of the amino acid sequence in COVID becomes 29903/3 = 9967 proteins. The same is the case with other viruses. This conversion from DNA to Amino acids is a two-step process having transcription and translation.
  3. Calculating the frequency of a protein in the amino acid sequence and later visually comparing the frequency on a line graph, we found out that viruses belonging to the same group have a similar frequency line graph. For example, COVID, SARS and MERS, which belong to the same family, have similar graphs.
  4. The number of leucine content in COVID, MERS and SARS are highest, whereas HIV has arginine, EBOLA has serine and ZIKA having leucine protein as the highest, followed by serine.
  5. Based on the pairwise alignment, a similarity index was calculated. Calculation of similarity index concluded that the MERS, SARS and COVID have much higher similitude than other viruses.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store