Covid-19 Protein Analysis using Python

Learn the basics of protein analysis with the covid-19 spike protein (part 1 of a 5-part tutorial)

Varun Sendilraj
Analytics Vidhya
4 min read · Sep 20, 2021

Source: Orchard Software

Introduction

Proteomics is the study of proteins in biological systems. Work in this field typically looks at a protein's structure, functions, and interactions, and one of its most important goals is to characterize the protein's 3D structure. In the next articles we will explore ways to analyze the 3D structure of the covid-19 spike protein. But before we get to that, let's start by getting an understanding of the DNA sequence we are working with and perform protein synthesis using Python.

Setting up the environment

  1. Download the Anaconda Package
  2. Download Python 3
  3. Install Biopython from here
  4. Open up Jupyter Notebook

Downloading the Data

For this demonstration we are going to need the FASTA sequence for the covid-19 genome. To download the data for the project use the wget command below:

If you are on Windows or Mac, the wget command is usually not available by default, so use the urllib library instead:

Windows/Mac:
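As a rough sketch of the urllib route, the snippet below wraps urllib.request.urlretrieve in a small helper. The helper name download_fasta, the output file name, and GENOME_URL are all assumptions made for illustration; GENOME_URL in particular is only a placeholder and has to be swapped for the real address of the genome file.

```python
import urllib.request
from pathlib import Path

# Hypothetical placeholder: replace with the real address of the genome FASTA file.
GENOME_URL = "https://example.org/path/to/covid19_genome.fasta"


def download_fasta(url: str, destination: str) -> Path:
    """Download the file at `url` to `destination` and return the local path."""
    target = Path(destination)
    urllib.request.urlretrieve(url, str(target))
    return target


# Example call, commented out because GENOME_URL is only a placeholder:
# download_fasta(GENOME_URL, "covid19_genome.fasta")
```

Because this only relies on the Python standard library, it should behave the same way on Windows, Mac, and Linux.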

Understanding the FASTA File

Let's start off our analysis by importing the Biopython library and parsing the genome file:
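A minimal sketch of this step, assuming Biopython is installed. To keep the example runnable on its own, it first writes a tiny made-up FASTA file (the file name and sequence are toy values, not the real genome) and then parses it with SeqIO.parse.

```python
import tempfile
from pathlib import Path

from Bio import SeqIO

# Write a toy FASTA file so the example runs on its own; in the tutorial
# this would be the downloaded genome file instead.
fasta_path = Path(tempfile.mkdtemp()) / "toy_genome.fasta"
fasta_path.write_text(">toy_record made-up example sequence\nATGGTTCATTAA\n")

# SeqIO.parse returns an iterator of SeqRecord objects, one per FASTA entry.
records = list(SeqIO.parse(str(fasta_path), "fasta"))
print(f"Parsed {len(records)} record(s) from {fasta_path.name}")
```

Wrapping the iterator in list() is fine for a single genome; for very large files it may be better to loop over the iterator directly.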

A FASTA file can hold several records, not just the sequence of interest. Let's iterate through the FASTA file to isolate and print out the information we need:
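Here is one way the loop could look, assuming Biopython is installed. The two records below are made-up toy entries read from an in-memory string, so the IDs and sequences are illustrative only.

```python
from io import StringIO

from Bio import SeqIO

# Toy two-record FASTA text standing in for the downloaded genome file.
toy_fasta = StringIO(
    ">toy_record_1 first made-up record\n"
    "ATGGTTTAA\n"
    ">toy_record_2 second made-up record\n"
    "ATGCCCGGGTGA\n"
)

records = list(SeqIO.parse(toy_fasta, "fasta"))

# Each SeqRecord exposes the header fields and the sequence itself.
for record in records:
    print("ID:         ", record.id)
    print("Description:", record.description)
    print("Length:     ", len(record.seq))

# Keep the sequence of the record we care about for later analysis.
dna = records[0].seq
```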

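As we see above, the alphabet of our sequence is SingleLetterAlphabet(). Since we are working with DNA, let's convert the alphabet to unambiguous_dna() so we can perform DNA-specific operations later in the tutorial. Note that this step only applies to Biopython versions before 1.78; from 1.78 onward the Bio.Alphabet module has been removed, Seq objects no longer carry an alphabet, and the step can be skipped: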

Now if we were to print out the sequence, it would look very intimidating and confusing:

So instead, let's count the number of each nitrogenous base in the genome using a defaultdict:
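A small sketch of the counting step, using a short made-up DNA string in place of the full genome. A defaultdict(int) starts every missing key at 0, so each base can be incremented without checking whether it has been seen before.

```python
from collections import defaultdict

# Toy DNA string; in the tutorial this would be the full genome sequence.
dna = "ATGGTTCATTAA"

base_counts = defaultdict(int)
for base in dna:
    base_counts[base] += 1

for base in sorted(base_counts):
    print(base, base_counts[base])
```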

And let's look at the percentage of each base:
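The percentages follow directly from the counts: divide each count by the total number of bases and multiply by 100. The sketch below does this for a short made-up DNA string; the helper name base_percentages is an assumption for illustration.

```python
from collections import Counter


def base_percentages(sequence: str) -> dict:
    """Return the share of each base in `sequence` as a percentage."""
    counts = Counter(sequence)
    total = sum(counts.values())
    return {base: 100 * count / total for base, count in counts.items()}


# Toy DNA string; in the tutorial this would be the full genome sequence.
percentages = base_percentages("ATGGTTCATTAA")
for base, pct in sorted(percentages.items()):
    print(f"{base}: {pct:.1f}%")
```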

Covid-19 Protein Synthesis

Now that we have an understanding of the genome, let's convert the DNA into proteins. The first step of this process is called transcription. This is the process in which a DNA sequence is converted into an RNA molecule with the help of the enzyme RNA polymerase. We can simulate this with Biopython’s transcribe() method:
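A minimal sketch of transcription, assuming Biopython is installed and using a short made-up DNA string instead of the full genome. Seq.transcribe() treats the sequence as the coding strand and returns a new Seq with every T replaced by U.

```python
from Bio.Seq import Seq

# Toy coding-strand DNA; in the tutorial this would be the genome sequence.
dna = Seq("ATGGTTCATTAA")

mrna = dna.transcribe()
print(mrna)
```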

With the above code we can see that T (thymine) is replaced by U (uracil), which shows that the transcription was successful. The next step is to convert this RNA into a protein through translation, the process by which ribosomes synthesize proteins. As with transcription, this can be done with the translate() method:
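A minimal sketch of translation, assuming Biopython is installed. The mRNA below is a made-up toy string whose length is a multiple of three. Seq.translate() reads it codon by codon with the standard genetic code and writes each stop codon as an asterisk.

```python
from Bio.Seq import Seq

# Toy mRNA: AUG GUU CAU UAA AUG CCC UGA
mrna = Seq("AUGGUUCAUUAAAUGCCCUGA")

amino_acids = mrna.translate()
print(amino_acids)
```

With a real genome the length is often not a multiple of three, in which case Biopython warns about a partial codon at the end of the sequence.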

The stop codons are represented by the asterisk (*). A stop codon tells the ribosome to stop building the current protein at that site. With that knowledge, let's split the sequence into its separate proteins and organize them in a data frame:
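One possible way to do the split, assuming pandas is installed. The amino acid string is a made-up toy example, and the column names sequence and length are assumptions chosen for illustration.

```python
import pandas as pd

# Toy translated sequence; each "*" marks a stop codon.
amino_acids = "MVH*MP*MKTLLA*"

# Split at the stop codons and drop the empty pieces that appear when
# the string ends with "*" or two stop codons sit next to each other.
chains = [chain for chain in amino_acids.split("*") if chain]

proteins = pd.DataFrame({
    "sequence": chains,
    "length": [len(chain) for chain in chains],
})
print(proteins)
```

Storing the length next to each chain makes it easy to sort or filter the candidates later on.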

And the output should look like this:

Image By: Varun Sendilraj

Let's look at the longest and shortest amino acid sequences (proteins):
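A sketch of how the longest and shortest chains could be pulled out of a data frame, assuming pandas is installed. The four sequences and the column names are made-up values for illustration.

```python
import pandas as pd

# Toy amino acid chains standing in for the chains split from the genome.
proteins = pd.DataFrame({"sequence": ["MVH", "MP", "MKTLLA", "MSDNGPQ"]})
proteins["length"] = proteins["sequence"].str.len()

# nlargest / nsmallest return the top rows ordered by the given column.
longest = proteins.nlargest(2, "length")
shortest = proteins.nsmallest(2, "length")

print(longest)
print(shortest)
```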

Longest Sequences:

Image By: Varun Sendilraj

Shortest Sequences:

Image By: Varun Sendilraj

Finally, let's look at the most common amino acids:
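A quick sketch of the tally using collections.Counter on a made-up toy amino acid string. The stop markers are removed first so that only real residues are counted.

```python
from collections import Counter

# Toy translated sequence; each "*" marks a stop codon.
amino_acids = "MVH*MP*MKTLLA*"

# Remove the stop markers, then count each one-letter amino acid code.
residue_counts = Counter(amino_acids.replace("*", ""))
print(residue_counts.most_common(3))
```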

In the end, we have taken a covid-19 genome and simulated protein synthesis. In the next tutorial we will take the PDB file of the SARS-CoV-2 spike protein and do some protein visualization and 3D analysis.

Varun Sendilraj
Analytics Vidhya

I am a researcher at Georgia Tech and Emory Winship Cancer Institute. I work mainly with AI-enabled digital pathology, biomedical engineering, and ML.