SARS-CoV-2 genome Analysis With Biopython
Simple tutorial using biopython library to analyze the SARS-CoV-2’s genome in python.
In this article, we will identify the viral proteins in the novel SARS-CoV-2 genome with biopython and do a comparative analysis with the SARS and bat coronaviruses.
Materials and methods:
- Download the genome file from the kaggle dataset: SARS-CoV-2 genbank file. (NCBI accession number: NC_045512)
- SARS-CoV-2 genome analysis was carried with Biopython and DNA features viewer libraries.
- PSI-BLAST search was used to compare our SARS-CoV-2 proteins with the SARS and Bat coronavirus proteins.
1- SARS-CoV-2 genetic information:
Genome sequencing has increasingly become an important tool for studying disease outbreaks. The SARS-CoV-2 genome used in this study was sequenced from a sample of bronchoalveolar lavage fluid from a single patient who was a worker at the market and who was admitted to the Central Hospital of Wuhan on 26 December 2019 .
2- SARS-CoV-2’s genome:
The SARS-CoV-2 genome has 29903 nucleotides (sequence of A T C and G), to have an idea of how small it is, the human genome has more than 3 billion nucleotides. The small size of the viruses and microorganisms make them easier to study compared to the human genome.
We start our analysis by importing the Biopython library and reading the genome file. The code below displays the SARS-CoV-2’s first 1000 nucleotides (out of 29903).
This sequence of nucleotides contains all the information encoded in the virus. Understanding this genetic information is the key to find cures and vaccines. So, the question here is: How do we extract information from this long sequence of letters?
Before moving forward, let’s breakdown this long string to single characters and see the distribution of the nucleotides (A,T,C,G) over the SARS-CoV-2’s genome. The DNA sequence from NCBI is used as a starting point.
How do we extract information from this long sequence of letters? This process is called gene expression: Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins.
- Transcription: DNA is copied out into a messenger RNA (mRNA)
- Translation: mRNA is translated into amino acids
- Amino acid folding: A sequence of 20 or more amino acids (the building blocks of proteins) form a protein.
Transcription is the first step in gene expression. It involves copying a gene’s DNA sequence to make an RNA molecule.
- Basically the mRNA is a copy of our DNA. However, in RNA, a base called uracil (U) replaces thymine (T) as the complementary nucleotide to adenine (that’s the only difference, T is replaced by U).
- We can transcribe our DNA sequence with biopython’s transcribe() function
To see the difference ( T replaced by U) between both DNA and RNA, we align both sequences together:
Translation is the process that takes the information passed from DNA as messenger RNA and turns it into a series of amino acids.
It is essentially a translation from one code (nucleotide A T C G sequence) to another code (amino acid sequence).
How does this translation happen? As in any language, we need a dictionary for translation. In this case, the amino acid dictionary is the table below. The nucleotides are read in groups of three “AUG GCC CAG UUA …”. Each triplet is called a codon and codes for a specific amino acid.
There are 61 codons for 20 amino acids, and each of them is “read” to specify a certain amino acid out of the 20 commonly found in proteins.
Luckily, with the translate() function, biopython translates the mRNA to amino acids chains. Chains are separated with a * which is the stop codon ( UAA, UAG and UGA) .
- We have several chains of a total of 9967 amino acids separated with stop codons *
- The split() function splits the sequence at any stop codon and keeps the amino acids chains separated. This makes it easier for further analysis.
Now that we have our protein sequences, we will use the BLAST search.
We will try to find the protein sequences already available in the databases that are the most similar to our protein sequences. (Hint: In this case, most probably the proteins that will have the highest similarity with our SARS-CoV-2 belong to the SARS coronavirus or Bat coronavirus).
An example of how BLAST works:
- PSI-BLAST webpage: https://www.ebi.ac.uk/Tools/sss/psiblast/
- Copy/Paste our 83 amino acid chain (from table 1 above) : AQADEYELMYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV
- The figure below shows the result of our 83 amino acids chain: It’s the Envelope small membrane protein!
5- Open reading frames:
An open reading frame is a portion of a DNA molecule that, when translated into amino acids, contains no stop codons. The genetic code reads DNA sequences in groups of three base pairs, which means that a double-stranded DNA molecule can read in any of six possible reading frames — three in the forward direction and three in the reverse. A long open reading frame is likely part of a gene .
The figure above shows the localization of the ORFs in the SARS-CoV-2 genome. From left to right: ORF1ab (Replicase protein), S (Spike protein), E (envelope protein) and N (Nucleoprotein).
Next, we’ll use the biopython find_orfs_with_trans() function to find the amino acid sequences corresponding to the ORFs found above in the figure 11 to run a BLAST search and find out more about the functionality of those viral proteins.
After printing out ORF sequences, we run a BLAST search as explained earlier in figure 10, and find the proteins with the highest affinity to our sequences in the BLAST database. The table below shows the results of the 6 sequences printed in figure 12.
This article is a short version of my Kaggle notebook COVID-19: Proteins identification with Biopython. If you would like to reproduce the code, please check the notebook.
We did a genome analysis with biopython and managed to find SARS-CoV-2's main viral proteins (figure below) : Spike protein (S), Nucleoprotein (N), Membrane protein (M), Envelope protein (E) and the replicase polyprotein (1ab).
 A new coronavirus associated with human respiratory disease in China https://www.nature.com/articles/s41586-020-2008-3#rightslink
 NIH. Open Reading Frames https://www.genome.gov/genetics-glossary/Open-Reading-Frame