Marrying DNA Alignment Algorithms with Neural Networks

Sep 23, 2020

Statistical classifiers are mathematical models that use example data to find patterns in features that predict a label. Most statistical classifiers assume the features are arranged into rows and columns, like a spreadsheet, but many kinds of data do not conform to this structure. Sequences are one example of non-conforming data, which is why this data is usually stored in a text document, not a spreadsheet. To avoid forcing data through the wrong analytical tools, we need new approaches for statistical classifiers, which is why we developed dynamic kernel matching (DKM).

Before we can develop approaches to handle non-conforming data, we need to understand it. Let us start by considering sequences. The essential property of a sequence is that both the content and the order of the symbols in the sequence convey information. Sequence data is non-conforming because some sequences are longer than others, resulting in irregular numbers of features. Even when sequences are the same length, a pattern of symbols shared between these sequences can appear at different positions, preventing the same feature from appearing at the same position across all samples.
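To make both problems concrete, here is a small sketch in plain Python. The four-letter alphabet and the `one_hot` helper are assumptions made for illustration and are not part of the released code.

```python
ALPHABET = "ACGT"  # assumed nucleotide alphabet, for illustration only

def one_hot(sequence):
    """Encode each symbol as a one-hot row, giving one row of features per symbol."""
    return [[1 if symbol == letter else 0 for letter in ALPHABET]
            for symbol in sequence]

# Problem 1: sequences of different lengths give different numbers of feature rows.
short_features = one_hot("ATG")
long_features = one_hot("CCATG")
print(len(short_features), len(long_features))  # 3 5

# Problem 2: even at equal length, the shared pattern "ATG" starts at different offsets.
print("ATGCC".find("ATG"), "CCATG".find("ATG"))  # 0 2
```

Neither case fits a fixed set of spreadsheet columns: in the first the number of rows varies, and in the second the same pattern lands in different columns.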

To run a statistical classifier on a sequence, the challenge is to determine which feature should be paired with each weight. When matching features to weights, we want the match to preserve the order of the information in the sequence (that is to say, the order of information is important). Given the immense number of possible arrangements between features and weights, the problem appears computationally intractable. However, it turns out there is a family of algorithms for efficiently solving these sorts of problems, collectively known as sequence alignment algorithms [1]. Sequence alignment algorithms are tools for uncovering patterns in nucleic acid and protein sequences. Given how useful they have proven to be in biology, it should not be surprising that they may prove useful in other fields as well. By defining the similarity between features and weights as the inner product, we can use a sequence alignment algorithm to match features to weights, selecting the alignment of features and weights that exhibits the maximal response to the data.
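The idea can be sketched with a Needleman-Wunsch-style dynamic program in plain Python. This is a simplified illustration, not the released Keras layer: the function names, the one-hot toy vectors, and the gap penalty of zero are all assumptions chosen to keep the example short.

```python
def dot(u, v):
    """Inner product, used here as the similarity between a feature and a weight."""
    return sum(a * b for a, b in zip(u, v))

def alignment_score(features, weights, gap=0.0):
    """Best order-preserving match between feature steps and weight steps.

    score[i][j] holds the best total similarity when aligning the first i
    feature steps with the first j weight steps. The gap penalty of 0.0 is
    an assumption made for this sketch.
    """
    n, m = len(features), len(weights)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + dot(features[i - 1], weights[j - 1])
            skip_feature = score[i - 1][j] + gap
            skip_weight = score[i][j - 1] + gap
            score[i][j] = max(match, skip_feature, skip_weight)
    return score[n][m]

# Toy one-hot symbols (hypothetical three-letter alphabet).
A, B, C = [1, 0, 0], [0, 1, 0], [0, 0, 1]
weights = [A, B]  # weights that respond to "A followed by B"

print(alignment_score([A, B, C, C], weights))  # 2.0, pattern at the start
print(alignment_score([C, C, A, B], weights))  # 2.0, same pattern, later position
print(alignment_score([B, A], weights))        # 1.0, wrong order, only one symbol matches
```

The score is the same wherever the pattern appears and for any sequence length, but it drops when the symbols arrive in the wrong order. The table has one cell per pair of feature step and weight step, so the cost grows with the product of the two lengths instead of with the number of possible arrangements.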

There are many kinds of non-conforming data other than sequences. Sets are another example. Sets are like sequences except that the order of the symbols does not matter. When confronted with sets instead of sequences, we can use algorithms that solve the assignment problem instead of sequence alignment algorithms [2]. Think of algorithms for solving the assignment problem as essentially doing the same thing as a sequence alignment algorithm, but for sets. In fact, equivalents to sequence alignment algorithms exist for (i) sets, (ii) trees, and (iii) graphs, making it possible to use DKM on non-conforming features represented by these structures. (Unlike sequence alignment, the general problem of graph alignment is considered NP-hard.) Thus, we can use DKM for many kinds of non-conforming data, swapping out the sequence alignment algorithm for the appropriate algorithm to match features to weights.
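For sets, the matching step can be sketched as an assignment problem. The example below solves it by brute force over every possible assignment, which is only practical for tiny inputs; an efficient solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the realistic choice. The function names and toy vectors are assumptions for illustration, and the sketch assumes there are at least as many features as weights.

```python
from itertools import permutations

def dot(u, v):
    """Inner product, used here as the similarity between a feature and a weight."""
    return sum(a * b for a, b in zip(u, v))

def best_assignment(features, weights):
    """Give each weight its own feature so that the total similarity is maximal.

    Brute force over all assignments, for illustration only.
    Returns (score, choice), where choice[j] is the index of the feature
    assigned to weights[j].
    """
    if len(features) < len(weights):
        raise ValueError("this sketch assumes at least as many features as weights")
    best_score, best_choice = float("-inf"), None
    for choice in permutations(range(len(features)), len(weights)):
        score = sum(dot(features[i], w) for i, w in zip(choice, weights))
        if score > best_score:
            best_score, best_choice = score, choice
    return best_score, best_choice

# Toy one-hot symbols (hypothetical three-letter alphabet).
A, B, C = [1, 0, 0], [0, 1, 0], [0, 0, 1]
weights = [A, B]

print(best_assignment([B, A, C], weights))  # (2, (1, 0))
print(best_assignment([A, C, B], weights))  # (2, (0, 2))
```

Unlike the sequence case, shuffling the features leaves the score unchanged. Only the indices in the returned assignment move.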

I have implemented the Needleman-Wunsch algorithm [3] in TensorFlow as a Keras layer [4,5]. We find that it achieves competitive results on biological datasets, outperforming other approaches for classifying sequences. Because the code is simple, it should be straightforward to adapt to PyTorch. To give this approach a try, download the following Keras layer and associated script [4,5].

Written by Jared Ostmeyer