Biopython to Run Bioinfromatics Tools and Protein Sequences Alignment
In our previous tutorial we described how to deal with different biological file formats (Fasta, Genbank etc.), how to search and download the data from biological databases (ExPASy, NCBI databases) with Biopython modules. Finally we fetched 44 amino acid sequences of tumor protein p53 from mammalian species and wrote them into tp53.fa file. If you did all the steps from previous tutorial you can use your output file for this pipeline or you can copy it from /data/userXXX/tutorials_data/ directory (userXXX is your user name). In this tutorial we will show how to perform different bioinformatics procedures with our protein sequences of tumor protein p53. Most of our steps will be performed with Biopython functions, so it is necessary to have a basic knowledge of Python programming language syntax.
Firstly, we need to open the terminal window by clicking “SHOW TERMINAL” button.
Type the following commands “mkdir biopython2” and “cd biopython2” to create and move into working directory. Now copy the file tp53.fa from /data/userXXX/tutorials_data/or /data/userXXX/biopython/directories (cp [directory]/tp53.fa ./).
Now we should run a ClustalW2 tool from EMBOSS package to create a multiple alignment of tp53 protein sequences:
isub -c 2 -r 7.5 -t clust -e '/srv/dna_tools/clustalw/clustalw2 -TYPE=PROTEIN -INFILE=/data/userXXX/biopython2/tp53.fa -OUTFILE=/data/userXXX/biopython2/tp53.aln -OUTPUT=CLUSTAL'
where -TYPE=PROTEIN specifies import sequence type, -OUTFILE and -OUTPUT specify output file and output file format.
You will see 2 output files tp53.aln and tp53.dnd in your working directory (ls).
tp53.aln file contains multiple alignment in clustalw file format (less).
All the following python commands you can run locally in the python command line (python) or you can create a new script with all these commands written one after another and run it locally (python [script_name].py) or submit the script as a task:
isub -c 2 -r 7.5 -t scr1 -e 'python
For the last option you need to specify the full path for all input/output files or directories. Biopython provides many opportunities for changing file formats. For example you can run:
from Bio import AlignIO
align = AlignIO.read("/data/userXXX/biopython2/tp53.aln", "clustal")
to read an alignment file into SeqIO object and
from Bio import SeqIO
SeqIO.write(align, "/data/userXXX/biopython2/out.fa", "fasta")
to write the alignment into out.fa file in Fasta format.
tp53.dnd file contains phylogenetic tree that was built based on final alignment. You can open (less) this file, however, it is more convenient to red it in the following way:
from Bio import Phylo tree = Phylo.read("/data/userXXX/biopython2/tp53.dnd", "newick")
You can see that this tree looks quite similar to the real mammalian phylogenetic tree.
Now let’s try to run BLAST from python script. There are two possible options: you can run BLAST locally with your own database (Bio.Blast.Applications module) or run it over the Internet (Bio.Blast.NCBIWWW module). We will use the second option to find homologs of one of our tp53 proteins to illustrate this process. Use help command to read about qblast command line options:
from Bio.Blast import NCBIWWW
record = list(SeqIO.parse("/data/userXXX/biopython2/tp53.fa", "fasta"))
read alignment file,
result_handle = NCBIWWW.qblast("blastp", "swissprot", record.seq)
run protein BLAST over SwissProt database,
from Bio.Blast import NCBIXML
blast_record = NCBIXML.read(result_handle)
NCBIXML.read command allows to parse BLAST XML output. Firstly, we need to specify E-value threshold:
E_VALUE_THRESH = 0.05
and then to iterate through all BLAST hits to filter out alignments with e-value (alignment.hsps)> 0.05 and print BLAST hit’s e-value, length and SwissProt id:
for alignment in blast_record.alignments: for hsp in alignment.hsps: print('name:', alignment.title) print('length:', alignment.length)
To see the query and BLAST hits alignments use these commands in the previous for loop:
You can see that most of the hits are Tumor proteins p53, p73, p63 from different species. You may search each of these proteins (using id) in SwissProt database to get an additional information, for example:
from Bio import ExPASy from Bio import SwissPort handle = ExPASy.get_sprot_raw("Q95330.1") seq_record = SeqIO.read(handle, "swiss") print(seq_record.id) print(seq_record.description)
Let’s try some more actions that we can do with our alignment. First of all, let’s calculate the summary information about our alignment. To make it we need to specify a new alphabet for alignment object which contains not only amino acids, but also gaps:
from Bio import Alphabet from Bio.Alphabet import IUPAC from Bio.Align import AlignInfo filename = "tp53.aln" alpha = Alphabet.Gapped(IUPAC.protein) c_align = AlignIO.read(filename, "clustal", alphabet=alpha)
summary_align = AlignInfo.SummaryInfo(c_align)
Alignment summary object can be further used for different purposes. For example we can calculate a Position Specific Score Matrix (PSSM) from our alignment. PSSM is a count matrix, where for each column in the alignment the number of each alphabet letters is counted and totaled.
consensus = summary_align.dumb_consensus() pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore = ['-'])
Each column corresponds to one of 20 amino acids, each row corresponds to alignment positions. The numbers in the matrix are equal to the number of times when specific amino acid was observed at specific position. Xs denote positions with no clear consensus.
Also we can print this consensus without matrix:
Finally we can calculate a substitution matrix from alignment summary object. Substitution matrix for protein alignments is a 20 aa * 20 aa matrix where each position is proportional to the expected probability of this specific substitution.
replace_info = summary_align.replacement_dictionary() from Bio import SubsMat my_arm = SubsMat.SeqMat(replace_info) my_lom = SubsMat.make_log_odds_matrix(my_arm)
As you can see diagonal values are positive because of the high probability to see the same amino acid in the same position of another homologous protein. ND values mean that there is not enough data to estimate a probability of this substitution from alignment.
You may also interested in
Originally published at insidedna.me.