Covid-19 Protein Analysis using Python

Learn the basics of protein analysis with the covid-19 spike protein (part 1 of 5 of tutorial)

Varun Sendilraj

Published in

Analytics Vidhya

4 min read · Sep 20, 2021

Introduction

Proteomics is the study of proteins in biological systems. The field typically examines a protein's structure, function, and interactions, and one of its most important goals is to characterize the protein's 3D structure. In the next articles we will explore ways to analyze the 3D structure of the covid-19 spike protein. Before we get to that, let's start by getting an understanding of the genome sequence we are working with and perform protein synthesis using Python.

Setting up the environment

Download the Anaconda Package
Download Python 3
Install Biopython from here
Open up Jupyter Notebook

Downloading the Data

For this demonstration we are going to need the FASTA sequence for the covid-19 genome. To download the data for the project use the wget command below:

# Get the data
!wget https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/covid19_basic%20_protien_analysis/sequence.fasta

If you are on Windows or Mac, the wget command may not be available, so use the urllib library instead:

Windows/Mac:

import urllib.request
url = 'https://raw.githubusercontent.com/VarunSendilraj/Bioinformatics/main/covid19_basic%20_protien_analysis/sequence.fasta'
filename = 'sequence.fasta'
urllib.request.urlretrieve(url, filename)

Understanding the FASTA File

Let's start off our analysis by importing the Biopython library and parsing the genome file:

import Bio
from Bio import SeqIO  # library used to parse the file
from Bio import Seq

covid19 = SeqIO.parse('sequence.fasta', 'fasta')

A typical FASTA file can hold several records, and each record stores metadata such as an ID and a description in addition to the sequence of interest. Let's iterate through the FASTA file to isolate and print out the information we care about:

for rec in covid19:
    seq = rec.seq  #genome sequence
    print(rec.description)
    print(seq[:10])
    print(seq.alphabet)  # works only on Biopython < 1.78, where Seq objects still carry an alphabet
--------------------------------------------------------------------
#expected output:
NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTT
SingleLetterAlphabet()
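To make the file format concrete, here is a minimal sketch of what a FASTA parser does, written in plain Python rather than with Biopython. A record starts with a header line beginning with ">", and the lines that follow hold the sequence. The function name read_fasta and the two-record sample text are made up for illustration; SeqIO.parse handles many more details than this.

```python
def read_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence lines are joined together
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical two-record sample, not real data
sample = """>record_1 toy example
ATTAAAGGTT
TATACC
>record_2 another toy example
GGCC
"""

records = list(read_fasta(sample))
for description, sequence in records:
    print(description, sequence[:10], len(sequence))
```

Each tuple plays the role of rec.description and rec.seq in the loop above.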

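As the loop's output shows, the alphabet of our sequence is SingleLetterAlphabet(). Since the FASTA file stores the genome in DNA letters (A, T, G, C), let's convert the alphabet to IUPAC.unambiguous_dna so we can perform DNA-specific operations later in the tutorial. Note that the Bio.Alphabet module was removed in Biopython 1.78; on newer versions, Seq objects no longer carry an alphabet and this step can be skipped.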

from Bio import Seq
from Bio.Alphabet import IUPAC  # available only in Biopython < 1.78

seq = Seq.Seq(str(seq), IUPAC.unambiguous_dna)

Now if we were to print out the sequence, it would look very intimidating and confusing:

print(seq)
--------------------------------------------------------------------
#expected output:
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAA
...........
TGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

So instead, let's count how many times each nitrogenous base occurs in the genome using a defaultdict:

from collections import defaultdict
count = defaultdict(int)
for letter in seq:
    count[letter] += 1

total = sum(count.values())
count
--------------------------------------------------------------------
#expected output:
defaultdict(int,
    {'A': 8954, 
     'T': 9594, 
     'G': 5863, 
     'C': 5492})

And let's analyze the percentage of each base:

# use n for the per-base count so the count dictionary is not overwritten
for letter, n in count.items():
    print(f'{letter}: {100.*n / total} {n}')
--------------------------------------------------------------------
#expected output:
A: 29.943483931378122 8954
T: 32.083737417650404 9594
G: 19.60672842189747 5863
C: 18.366050229074006 5492
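A common way to summarize these counts is the GC content, the share of bases that are G or C. The sketch below computes it from the counts printed above; the helper name gc_content is an illustrative choice and is not part of the original notebook.

```python
def gc_content(base_counts):
    """Return the percentage of bases that are G or C."""
    total = sum(base_counts.values())
    gc = base_counts.get("G", 0) + base_counts.get("C", 0)
    return 100.0 * gc / total


# Base counts copied from the expected output above
base_counts = {"A": 8954, "T": 9594, "G": 5863, "C": 5492}
gc_percent = gc_content(base_counts)

print(f"Genome length: {sum(base_counts.values())}")
print(f"GC content: {gc_percent:.2f}%")
```

With these counts the GC content comes out just under 38%, which is simply the sum of the G and C percentages listed above.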

Covid-19 Protein Synthesis

Now that we have an understanding of the genome, let's convert the DNA into proteins. The first step of this process is called transcription, in which a DNA sequence is converted into an RNA molecule with the help of the enzyme RNA polymerase. We can simulate this with Biopython's transcribe() method and then count the bases of the result as before:

rna = seq.transcribe()  # transcription: every T becomes U

rnaCount = defaultdict(int)
for letter in rna:
    rnaCount[letter] += 1

total = sum(rnaCount.values())
print(rnaCount)
--------------------------------------------------------------------
#expected output:
defaultdict(int, {'A': 8954, 'U': 9594, 'G': 5863, 'C': 5492})
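For intuition, here is roughly what this kind of transcription amounts to on a plain string: the sequence is copied and every T becomes U. This is a simplified sketch with a made-up function name (transcribe_dna), not Biopython's implementation.

```python
def transcribe_dna(dna):
    """Return the RNA version of a coding-strand DNA string (T -> U)."""
    return dna.upper().replace("T", "U")


dna_prefix = "ATTAAAGGTT"  # first ten bases of the genome, as printed earlier
rna_prefix = transcribe_dna(dna_prefix)
print(rna_prefix)  # AUUAAAGGUU
```

Because only T is rewritten, the counts for A, G, and C stay the same and the old T count reappears under U.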

The counts above show that T (thymine) has been replaced by U (uracil), so the transcription was successful. The next step is to convert the RNA into protein through translation, the process by which ribosomes synthesize proteins. As in the example above, this can be done with a single method, translate():

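protien = rna.translate(stop_symbol="*")  # (*) represents a stop codon
print(repr(protien))  # repr gives the abbreviated form shown below; print(protien) would print the full sequence
--------------------------------------------------------------------
#expected output: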
Seq('IKGLYLPR*QTNQLSISCRSVL*TNFKICVAVTRLHA*CTHAV*LITNYCR*QD...KKK', HasStopCodon(IUPACProtein(), '*'))
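Under the hood, translation reads the RNA three bases (one codon) at a time and looks each codon up in the genetic code. The sketch below builds the standard codon table by hand and translates the first 27 bases of the genome. The names CODON_TABLE and translate_rna are illustrative, and unlike Biopython this version does not handle ambiguous bases or alternative codon tables.

```python
from itertools import product

BASES = "UCAG"
# Amino acids of the standard genetic code, listed in the order of the
# codons UUU, UUC, UUA, UUG, UCU, ... (each position cycling through UCAG)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_rna(rna):
    """Translate an RNA string codon by codon; '*' marks a stop codon."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3))


rna_prefix = "AUUAAAGGUUUAUACCUUCCCAGGUAA"  # first 27 bases, transcribed
peptide = translate_rna(rna_prefix)
print(peptide)  # IKGLYLPR*
```

The result matches the start of the expected output above. The full genome has 29,903 bases, which is not a multiple of three, so a codon-by-codon reading from the first base leaves two bases unused at the end.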

The stop codons are represented by asterisks (*). A stop codon tells the ribosome to stop building the current protein at that site. With that knowledge, let's split the sequence into its separate amino acid chains and organize them in a data frame:

aa = protien.split("*")
ncov = [str(i) for i in aa]
ncov_len = [len(str(i)) for i in aa]

# store the amino acid chains in a DataFrame
import pandas as pd
df = pd.DataFrame({'Amino Acids': ncov, 'Length': ncov_len})
df.head()

And the output should look like this:

Let's look at the longest and shortest amino acid sequences (proteins):

df.nlargest(5, "Length")
df.nsmallest(5, "Length")

Longest Sequences:

Shortest Sequences:
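The same split-and-rank idea can be sketched without pandas. The example below uses only the first few chains visible in the translate() output earlier, so the lengths are illustrative rather than results for the full genome, and the min_length cutoff of 5 is an arbitrary assumption.

```python
# Prefix of the translated sequence shown in the expected output earlier
protein_prefix = "IKGLYLPR*QTNQLSISCRSVL*TNFKICVAVTRLHA*CTHAV*LITNYCR*QD"

# Splitting on the stop symbol gives one string per amino acid chain
chains = protein_prefix.split("*")

# Rank the chains from longest to shortest
ranked = sorted(chains, key=len, reverse=True)
for chain in ranked:
    print(len(chain), chain)

# Keep only chains with at least min_length residues (arbitrary cutoff)
min_length = 5
long_chains = [chain for chain in chains if len(chain) >= min_length]
```

Two stop codons in a row produce an empty string when splitting, which is one reason very short or empty chains can show up among the shortest sequences.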

Finally, let's look at the most common amino acids:

from collections import Counter
Counter(protien).most_common(10)
--------------------------------------------------------------------
#expected output:
[('L', 886),
 ('S', 810),
 ('*', 774),
 ('T', 679),
 ('C', 635),
 ('F', 593),
 ('R', 558),
 ('V', 548),
 ('Y', 505),
 ('N', 472)]

In the end, we have taken a covid-19 genome and simulated protein synthesis. In the next tutorial we will take the PDB file of the SARS-CoV-2 spike protein and do some protein visualization and 3D analysis.

Completed Jupyter Notebook: Bioinformatics/Covid-19 Protien Analysis part 1.ipynb at main · VarunSendilraj/Bioinformatics (github.com)

Part 2: Coming Soon