A Full Understanding of Pairwise Alignment with Biopython

Comparison with EMBOSS Needle Tools for Global and Local Alignments

Eugenia Anello
Nov 26, 2020 · 7 min read
Image for post
Image for post
Figure 1: Photo by geralt on pixabay

Biopython is an international association of developers of freely available Python tools for computational molecular biology. In particular, it allows to deal with Sequence Alignments, that are methods able to compare and detect similarities between Biological Sequences. Sequence similarity means that the sequences compared have similar or identical residues at the same positions of the alignment.

In this article, I’m going to focus on the Pairwise Alignment. My goal is to replicate the same results obtained with Biopython with respect to Emboss Needle, an efficient webserver that provides many tools for sequence alignment. When we need to handle a huge volume of data, it’s preferable to have a good knowledge of Biopython to obtain the output faster than manually doing it on a website. For prior knowledge, I suggest you read my previous article, in which I explained the concepts of Similarity, Identity, Global and Local Alignments in a simple way. These concepts will be useful while I try to apply them with Biopython. Before talking about pairwise alignment, I’m going to introduce the standard sequence class that deals with sequences.

Biological Sequences are surely the central object in Bioinformatics and are essentially strings of letters like AGTACACTGGT, which is the most common way that sequences are seen in biological file formats. Biopython deals with the sequences through the Seq object. In most ways, we can deal with Seq objects as if they were normal Python strings, for example getting the length, or iterating over the elements:

from Bio.Seq import Seqmy_seq = Seq("ASEIELV")for i, letter in enumerate(my_seq):
print("%i %s" % (i, letter))
'''
0 A
1 S
2 E
3 I
4 E
5 L
6 V
'''
len(my_seq)#7my_seq[4:12]
#Seq('ELV')

Although Seq object supports different methods from normal Python strings:

  • obtain the complement or reverse complement of a Seq object
  • do the transcription
  • translate mRNA into the corresponding protein sequence

For nucleotide sequences, we can easily obtain the complement or reverse complement of a Seq object using the methods complement and reverse_complement:

from Bio.Seq import Seq
my_seq = Seq("GATCGAT")
print(my_seq.complement())
#CTAGCTA
print(my_seq.reverse_complement())
#ATCGATC

To transcribe the coding strand into the corresponding mRNA, we’ll use the transcribe method:

coding_dna = Seq("ATGGCCATT")
messenger_rna = coding_dna.transcribe()
print(messenger_rna)
#AUGGCCAUU

Now let’s translate this mRNA into the corresponding protein sequence using the method translation:

coding_dna.translate()
Seq('MAI')

The SeqRecord allows to hold a sequence (as a Seq object) with identifiers (ID and name), description and optionally annotation and sub-features. It is the basic data type for the Bio.SeqIO sequence input/output interface. In the example below we can see the informations offered by SeqRecord Object:

  • seq: The sequence itself, typically a Seq object
  • id: The primary ID used to identify the sequence
  • name: a common name/id for the sequence
  • description: a “human” readable description
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
record = SeqRecord(
Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
id="YP_025292.1",
name="HokC",
description="toxic membrane protein, small",
)
print(record)
'''
ID: YP_025292.1
Name: HokC
Description: toxic membrane protein, small
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')'''

The pairwise alignment is the process of aligning two sequences to each other by optimizing the score between them. Bio.pairwise2 is the module for these type of alignments and contains the same algorithms as water (local) and needle (global) from the EMBOSS suite and returns the same results if we use the same default parameters.

The names of the alignment functions in this module follow the convention <alignment type>XX where <alignment type> is either “global” or “local” and XX is a 2 character code indicating the parameters it takes. The first character indicates the parameters for matches (and mismatches), and the second indicates the parameters for gap penalties.

The match parameters are:

  • x = No parameters. Identical characters have a score of 1, otherwise 0.
  • m = A match score is the score of identical chars, otherwise mismatch score.
  • d = A dictionary returns the score of any pair of characters
  • c = A callback function returns scores

The gap penalty parameters are:

  • x = No gap penalties
  • s = Same open and extend gap penalties for both sequences
  • d = The sequences have different open and extend gap penalties
  • c = A callback function returns the gap penalties

All alignment functions have the following arguments:

  • Two sequences: strings, Biopython sequence objects or lists. Lists are useful for supplying sequences which contain residues that are encoded by more than one letter.
  • penalize_end_gaps: boolean. Whether to count the gaps at the ends of an alignment. By default, they are counted for global alignments but not for local ones. Setting penalize_end_gaps to (boolean, boolean) allows you to specify for the two sequences separately whether gaps at the end of the alignment should be counted.
  • gap_char: string (default: '-'). Which character to use as a gap character in the alignment returned. If your input sequences are lists, you must change this to ['-']
  • open: gap penalty when a gap is opened. It should be negative.
  • extend: gap penalty when a gap is extended. It should be negative.

The Global Alignment is a method that aligns and compares two sequences along with their entire length and comes up with the best alignment that displays the maximum number of nucleotides or amino acids aligned.

Example 1:

Image for post
Image for post
Figure 2: Photos by Anna Tarazevich and Pixabay on Pexels

Let’s align the Hemoglobin of a human and the Hemoglobin of a Gorilla. Since humans are close relatives with monkeys, the two protein sequences should match very well and have a high percentage of identity. I created a file named gorilla_hum.fa with the two sequences:

>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
AVHASLDKFLASVSTVLTSKYR
>sp|P01923|HBA_GORGO Hemoglobin subunit alpha OS=Gorilla gorilla gorilla OX=9595 GN=HBA PE=1 SV=1
VLSPADKTNVKAAWGKVGAHAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR

The Hemoglobin sequences are taken from UniProtKB/Swiss-Prot, that is a high quality manually annotated and non-redundant protein sequence database. The human sequence is in https://www.uniprot.org/uniprot/P69905.fasta, while the gorilla’s sequence is in https://www.uniprot.org/uniprot/P01923.fasta.

The function SeqIO.parse is used to parse the file in format FASTA containing the sequences and we convert the output of the function into a list.

from Bio.Seq import Seq
from Bio import SeqIO
from Bio import pairwise2
from Bio.pairwise2 import identity_match
from Bio.Align import substitution_matrices

seq_records = list(SeqIO.parse("./data/alignments/gorilla_hum.fa", "fasta"))

The goal is to obtain the global alignment between the first two sequences using the function globalds where d stands for the dictionary that returns the score of any pair of characters and s is used to have same open and extend gap penalties for both sequences. So we select the first two elements of the list and we extract only the Seq object, not the other informations like id, name, description.

blosum62 = substitution_matrices.load("BLOSUM62")seq1=seq_records[0].seq
print(len(seq1))
seq2=seq_records[1].seq
print(len(seq2))
alignment = pairwise2.align.globalds(seq1,seq2, blosum62, -10, -0.5,penalize_end_gaps=False)[0]
print(alignment)
Image for post
Image for post
Figure 3: Global alignment between the first two sequences

The length of the alignment is 142, calculated making the difference between the end position and the start position. Thus, end-start=142–0=142. Now we can try to calculate the identity percentage between the two aligned sequences, so we want to know the number of residues presented at corresponding positions in two sequences being compared.

identical_residues=sum([True for i in range(len(alignment.seqA)) if (alignment.seqA[i]==alignment.seqB[i] and alignment.seqA[i]!='-' and alignment.seqB[i]!='-')])print('The number of identical residues is: {}'.format(identical_residues))print('The length of the alignment is: {}'.format(alignment.end-alignment.start))print("The identity percentage between seq1 and seq2 is: {}/{}={:.3f}".format(identical_residues,alignment.end-alignment.start,identical_residues/(alignment.end-alignment.start)))
Image for post
Image for post

To validate the results, let’s look at what we can obtain from the webserver EMBOSS Needle.

Image for post
Image for post
Figure 4: Insert the two sequences in Emboss Needle using Global Alignment tool
Image for post
Image for post
Figure 5: Results of Emboss Needle

There are the equivalent results obtained with Biopython.

The Local Alignment is an algorithm that finds the most similar regions in two sequences being aligned. So it’s useful for sequences that aren’t similar in character or in length, but they still can have in common biologically important motifs.

Example 2:

Let’s continue the other example applying the local alignment through the function localds.

alignment = pairwise2.align.localds(seq1,seq2, blosum62, -10, -0.5,penalize_end_gaps=False)[0]
print(alignment)
Image for post
Image for post
Figure 6: Local alignment between the two sequences

We can observe that the start position is 1 instead of 0. So the length of the alignment will be 142–1=141, one unit smaller than global alignment’s length. Now let’s calculate as before the identity percentage of the local alignment:

identical_residues=sum([True for i in range(len(alignment.seqA)) if (alignment.seqA[i]==alignment.seqB[i] and alignment.seqA[i]!='-' and alignment.seqB[i]!='-')])print('The number of identical residues is: {}'.format(identical_residues))print('The length of the alignment is: {}'.format(alignment.end-alignment.start))print("The identity percentage between seq1 and seq2 is: {}/{}={:.3f}".format(identical_residues,alignment.end-alignment.start,identical_residues/(alignment.end-alignment.start)))
Image for post
Image for post

As before, we compare the result with the webserver Emboss Needle, in which we obtained:

Image for post
Image for post
Figure 7: Local alignment using Emboss Needle

The outputs show that the percentages of identity and similarity are higher with local alignment because the length of the alignment is one unit smaller. Moreover, the similarity percentage is 100%, then we are very closely related to Gorilla!

Final thoughts: In the article, I tried with only 2 sequences, but it’s possible to try with many other protein sequences, you can try pairwise alignments with more data. I recommend you to use Pycharm, an integrated development environment (IDE) used in computer programming, specifically for the Python language. This software makes it easier when you have many files to handle and to use the terminal at the same time.

I hope you find this article useful.

References:

http://biopython.org/DIST/docs/tutorial/Tutorial.html

The Computational Biology Magazine

Share your knowledge, research and ideas related to…

Eugenia Anello

Written by

I am a Data Science student and a Traveller enthusiast | I learn something new everyday | https://www.linkedin.com/in/eugenia-anello-545711146

The Computational Biology Magazine

Share your knowledge, research and ideas related to Computational Biology

Eugenia Anello

Written by

I am a Data Science student and a Traveller enthusiast | I learn something new everyday | https://www.linkedin.com/in/eugenia-anello-545711146

The Computational Biology Magazine

Share your knowledge, research and ideas related to Computational Biology

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store