The Startup
Published in

The Startup

SARS-CoV-2 genome Analysis With Biopython

Simple tutorial using biopython library to analyze the SARS-CoV-2’s genome in python.

SARS-CoV-2's genome analysis with biopython. Created with [Image by Author]

In this article, we will identify the viral proteins in the novel SARS-CoV-2 genome with biopython and do a comparative analysis with the SARS and bat coronaviruses.

Materials and methods:

  • Download the genome file from the kaggle dataset: SARS-CoV-2 genbank file. (NCBI accession number: NC_045512)
  • SARS-CoV-2 genome analysis was carried with Biopython and DNA features viewer libraries.
  • PSI-BLAST search was used to compare our SARS-CoV-2 proteins with the SARS and Bat coronavirus proteins.

1- SARS-CoV-2 genetic information:

Genome sequencing has increasingly become an important tool for studying disease outbreaks. The SARS-CoV-2 genome used in this study was sequenced from a sample of bronchoalveolar lavage fluid from a single patient who was a worker at the market and who was admitted to the Central Hospital of Wuhan on 26 December 2019 [1].

2- SARS-CoV-2’s genome:

The SARS-CoV-2 genome has 29903 nucleotides (sequence of A T C and G), to have an idea of how small it is, the human genome has more than 3 billion nucleotides. The small size of the viruses and microorganisms make them easier to study compared to the human genome.

Figure 1. DNA is a sequence of 4 nucleotides: A T C G. Source:

We start our analysis by importing the Biopython library and reading the genome file. The code below displays the SARS-CoV-2’s first 1000 nucleotides (out of 29903).

Figure 2. SARS-CoV-2: First 1000 nucleotides (out of 29903).

This sequence of nucleotides contains all the information encoded in the virus. Understanding this genetic information is the key to find cures and vaccines. So, the question here is: How do we extract information from this long sequence of letters?

Before moving forward, let’s breakdown this long string to single characters and see the distribution of the nucleotides (A,T,C,G) over the SARS-CoV-2’s genome. The DNA sequence from NCBI is used as a starting point.

Figure 3. Nucleotides distribution in SARS-CoV-2’s genome. [Image by Author]

How do we extract information from this long sequence of letters? This process is called gene expression: Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins.

Figure 4. Gene expression: DNA is transcripted to an RNA molecule, RNA is translated to amino acids, amino acids chains form proteins. Source:
  • Transcription: DNA is copied out into a messenger RNA (mRNA)
  • Translation: mRNA is translated into amino acids
  • Amino acid folding: A sequence of 20 or more amino acids (the building blocks of proteins) form a protein.

3- Transcription:

Transcription is the first step in gene expression. It involves copying a gene’s DNA sequence to make an RNA molecule.

  • Basically the mRNA is a copy of our DNA. However, in RNA, a base called uracil (U) replaces thymine (T) as the complementary nucleotide to adenine (that’s the only difference, T is replaced by U).
  • We can transcribe our DNA sequence with biopython’s transcribe() function

To see the difference ( T replaced by U) between both DNA and RNA, we align both sequences together:

Figure 5. Alignment of the first 50 nucleotides of DNA and RNA sequences

4- Translation

Translation is the process that takes the information passed from DNA as messenger RNA and turns it into a series of amino acids.

It is essentially a translation from one code (nucleotide A T C G sequence) to another code (amino acid sequence).

How does this translation happen? As in any language, we need a dictionary for translation. In this case, the amino acid dictionary is the table below. The nucleotides are read in groups of three “AUG GCC CAG UUA …”. Each triplet is called a codon and codes for a specific amino acid.

Figure 6. The genetic Codon Chart to pair each codon (3 nucleotides) with its corresponding amino acid. Source:

There are 61 codons for 20 amino acids, and each of them is “read” to specify a certain amino acid out of the 20 commonly found in proteins.

Luckily, with the translate() function, biopython translates the mRNA to amino acids chains. Chains are separated with a * which is the stop codon ( UAA, UAG and UGA) [2].

Figure 7. Amino acid sequence translated from RNA.
Figure 8. Total number of amino acids in the SARS-CoV-2 genome.
  • We have several chains of a total of 9967 amino acids separated with stop codons *
  • The split() function splits the sequence at any stop codon and keeps the amino acids chains separated. This makes it easier for further analysis.
Figure 9. Protein sequences separated from the long amino acid sequence.
Table 1. SARS-CoV-2’s protein sequences with more than 50 amino acids.

Now that we have our protein sequences, we will use the BLAST search.

We will try to find the protein sequences already available in the databases that are the most similar to our protein sequences. (Hint: In this case, most probably the proteins that will have the highest similarity with our SARS-CoV-2 belong to the SARS coronavirus or Bat coronavirus).

An example of how BLAST works:

  • PSI-BLAST webpage:
  • The figure below shows the result of our 83 amino acids chain: It’s the Envelope small membrane protein!
Figure 10. Screenshot of the BLAST search results of our amino acid sequence.

5- Open reading frames:

An open reading frame is a portion of a DNA molecule that, when translated into amino acids, contains no stop codons. The genetic code reads DNA sequences in groups of three base pairs, which means that a double-stranded DNA molecule can read in any of six possible reading frames — three in the forward direction and three in the reverse. A long open reading frame is likely part of a gene [3].

In order to visualize the open reading frames in our SARS-CoV-2 genome, we need to install the dna_features_viewer library for visualization and the bcbio-gff library for parsing.

Figure 11. Visualization of the ORFs in the SARS-CoV-2 genome. ORF1ab (Replicase protein), S (Spike protein), N (Nucleoprotein), E (Envelope protein) [Image by Author]

The figure above shows the localization of the ORFs in the SARS-CoV-2 genome. From left to right: ORF1ab (Replicase protein), S (Spike protein), E (envelope protein) and N (Nucleoprotein).

Next, we’ll use the biopython find_orfs_with_trans() function to find the amino acid sequences corresponding to the ORFs found above in the figure 11 to run a BLAST search and find out more about the functionality of those viral proteins.

Figure 12. ORFs. 1- Amino acid sequence, Sequence length, Strand, ORF’s position in the sequence

After printing out ORF sequences, we run a BLAST search as explained earlier in figure 10, and find the proteins with the highest affinity to our sequences in the BLAST database. The table below shows the results of the 6 sequences printed in figure 12.

Table 2. SARS-CoV-2’s Viral proteins and their match percentage with SARS and Bat coronaviruses from BLAST search.
Figure 13. Amino acid sequence length of SARS-CoV-2’s ORFs. [Image by Author]

This article is a short version of my Kaggle notebook COVID-19: Proteins identification with Biopython. If you would like to reproduce the code, please check the notebook.


We did a genome analysis with biopython and managed to find SARS-CoV-2's main viral proteins (figure below) : Spike protein (S), Nucleoprotein (N), Membrane protein (M), Envelope protein (E) and the replicase polyprotein (1ab).

Figure 14. SARS-CoV-2-19’s viral proteins. Source:


[1] A new coronavirus associated with human respiratory disease in China

[2] Overview of Translation:

[3] NIH. Open Reading Frames



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store