# Needleman-Wunsch algorithm for DNA sequence alignment

--

Hi everyone!

DNA, or deoxyribonucleic acid, is a double helix molecule made out of four nucleotide bases, namely Adenine (A), Guanine (G), Thymine (T) and Cytosine (C), that always pair together in same way, A with T; C with G. ‘Genome’ is a complete set of DNA of an organism, which contains instructions that allow an organism to develop and grow. The human genome is 3.2 billion bases long.

Have you ever wondered how much similar your DNA is to your parent’s DNA or to someone else’s DNA?

## What is Sequence Alignment?

Sequence alignment is a method of comparing sequences like DNA or protein in order to find similarities between two or more sequences. This will provide you with an answer to the question: whether two sequences have evolved from a common ancestor or not. It is useful in determining evolutionary relationships between different species. There are two types of pairwise alignment methods,

1. Global alignment — This is suitable to compare two sequences across their entire length. Needleman-Wunsch algorithm (1970) is used for optimal global alignment.
2. Local alignment — This is suitable to identify local similarities between two sequences , useful when sequences are very distant and when one sequence is significantly shorter than the other. Smith-Waterman algorithm (1980) is used for optimal local alignment.

Additionally, more than two sequences can be aligned using multiple sequence alignment methods. In this post, we are going to look at the alignment between two sequences.

## Global alignment

Consider the two sequences: X= ATGCTAGT and Y= ATGTTAT

Two sequences can be aligned in numerous ways. But during sequence alignment we seek for the one that captures genuine similarities. The degree of similarity or differences between two sequences has to be measured. How to determine the sequence which has maximum matches? What is the measure of similarity between two sequences?

We can come up with a measure by defining a scoring scheme/function. A simple scoring scheme is to use +1 score for matches, -1 for mismatches and 0 for gaps. Different scoring schemes are used in real world sequence alignment problems. The substitution matrix/similarity matrix for a simple scoring function will look like this:

Consider these alignments!

By comparing these alignment scores, we can determine the best sequence alignment to be the one with the maximum score.

A gap in an alignment is introduced by deletion or insertion of a base. Note that in previous examples we didn’t consider the gap penalties. But in real scenario, the penalty for gap should be higher than mismatch score. Because a gap shifts the reading frame. There are several gap penalty functions such as constant gap penalty, linear gap penalty, affine gap penalty etc.

Alignment score tells us which alignment is the best alignment; In other words, which alignment has the highest alignment score, but it doesn’t tell us how to find the best alignment. Obviously, we can generate all possible alignments and calculate their score individually. But! Do you know the number of unique global alignments possible for two N length sequences? It can be determined by the equation given below,

If we have two sequences with 12 residues, then we should consider around 1 million possible unique alignments and calculate their alignment scores individually to determine the best alignment. Still do you think it’s possible to find every alignment, calculate its scores and then determine the best alignment? I would say No!

This is where dynamic programming comes into play!

Dynamic programming was invented by Richard Bellman in 1953. If a problem can be solved by breaking it into simpler sub problems, then dynamic programming methods are applicable to solve them. Overlapping subproblems and optimal substructure are the two attributes a problem should have in order to apply Dynamic programming methods. This is widely used in the field of bioinformatics and computational biology. Needleman-Wunsch algorithm is an application of dynamic programming. This was invented by Saul Needleman and Christian Wunsch in 1970. Basically, Alignment of two long sequence is done using optimal alignment of their prefixes.

Let’s start coding,

The Needleman-Wunsch algorithm requires two matrices: score matrix and traceback matrix. The algorithm consists of following steps:

## 1. Initialization of matrices

This is how both matrices look like after initialization, where the linear gap penalty = -1 is used.

## 3. Deduce the best alignment from traceback matrix

Traceback begins with the bottom right-most cell (last cell to be filled). Move according to the value in the cell until ‘done’ cell is reached.

## How to interpret the best alignment from above matrix?

The cell value ‘diag’ interprets that residues from two sequences are aligned, ‘up’ can be interpreted as a gap added in top sequence or insertion. Similarly, ‘left’ can be interpreted as a gap added in left sequence or deletion.

This is the optimal alignment derived using Needleman-Wunsch algorithm.

Protein sequence alignment is more preferred than DNA sequence alignment. Because DNA sequences are made of only 4 bases (A, G, C, T), while protein sequences are made of 20 amino acid residues. It is less likely to get a match by chance in protein sequence alignment.

Protein sequence alignments doesn’t vary much from DNA sequence alignment. Unlike DNA, proteins have 20 bases. The only difference between DNA alignment and protein alignment is the substitution matrix. Having weighted scores is important in protein sequence alignment. There are two widely used families of substitution matrices for protein alignment. Those are,

1. PAM

2. BLOSUM

Finally, Studying the DNA is an important aspect in biology. DNA sequence analysis has become a major research topic nowadays and is applied in many fields such as Forensic Biology, Biotechnology, Virology and so on. This article would’ve given you the general idea about comparing two whole sequences.