GPU Accelerated Sequence Matching

GPU accelerated Neeldman Wunsch Algorithm

Anuradha Wickramarachchi
The Bioinformatics Press
3 min readFeb 11, 2018

--

In this article I will be introducing you to the Needleman Wunsch algorithm and its implementation on a CUDA powered NVIDIA GPU.

CUDA Computing

Introduction

Needleman Wunsch algorithm is used to match sequences in a pairwise manner. Also this algorithm is used as a subroutine in searching for a DNA sequence in a large database. This is a dynamic programming approach where a scoring scheme is used.

Source: Link

As you can see in the image match scores 1, mismatch scores -1 and a gap scores -1. Matches are equal nucleotides overlapping and mismatches are different nucleotides overlapping at the same position. Gap is introduced when one sequence skips a match of a mismatch state by just leaving a gap. The final motive of the dynamic programming approach is to obtain the alignment with the highest aggregate score. In this article I will assume that you know what is dynamic programming and how to execute a CUDA code. You can refer to this article if you need to know more.

Dynamic Programming Approach

Moving either Right or Down will introduce a Gap. Moving diagonally will either create a Match (If the row and column has same letter) or a Mismatch. Therefore, the score of a particular cell is the sum of the parent cell’s score and the score of the movement (Match, Mismatch or Gap).

Computation

As you can see, first row and first column becomes inevitable Gaps. Therefore, the score keeps decrementing by one. For clarity of the dynamic step consider the bottom right corner cell. The maximum possible value for that cell is 0, obtained by traversing diagonally. This type of dynamic programming calculations can be GPU optimized by running calculations concurrently for rightward diagonals as depicted in the image below.

Source: Link

In this case, we can calculate cells with same number concurrently. Altogether, we just have to run the algorithms X+Y times where X and Y are the lengths of the two sequences. This can be further improved by tiling the matrix and computing with shared memory. Let’s keep that for future. If this was to be done in normal manner we would need X * Y number of calculations. For a sequences of length 1000bp its a gain of nearly 500 times. Lets see the CUDA code for this computation. Before that a small not on backtracking.

Backtracking

We must build the alignment by backtracking parents of each cell. There could be several parents as well. Having parents for each of the cell will drastically reduce the performance of the calculation and reduce performance gain of CUDA computation. Therefore what I have done is simply traversed from the bottom right corner to initial cell. For this a simple Depth First Search (DFS) can be done with path replication. This allows building the complete set of paths. To make it easy for you I have included all the support headers I have implemented and complete code. The complete repository with few other algorithms can be found in this link, i hope you won’t mind giving me a star ;-) link.

--

--