NBEAM: How I Wrote an Ultra-Fast DNA Sequence Alignment Algorithm in JavaScript

Science. JavaScript. Speed. Can we pick only one?

Unfortunately, the idea that these three concepts are mutually exclusive is widely held among developers (and scientists). Bioinformatics packages are written by biologists-turned-developers, frequently in Python and Perl, and are often not optimized for speed (or even usability). JavaScript has a spotty history, riddled with complaints about implementation details (Brendan Eich wrote it in only ten days, c’mon!). It’s common to assume that if you want speed you need to work with the hardware. How can you achieve “close to the metal” speed with JavaScript?

Google’s V8 JavaScript Engine has come a long way. In this article I’m going to explain how I wrote and implemented an ungapped degenerate DNA sequence alignment algorithm in JavaScript (browser + node) using ArrayBuffers and bitwise operators.

A full JavaScript library containing the algorithm and other useful features is available in a GitHub repository, keithwhor/NtSeq. Before I go into the nitty-gritty details of the algorithm, NBEAM (Nucleotide Bitwise Exhaustive Alignment Mapping), I would like to clarify some definitions for biologists and developers alike.

First, to biologists: The algorithm contained here is for matching ungapped sequences and implements no substitution matrix. It scores exact matching only: 1 for a match (degenerate nucleotide or not) and 0 for no match. I’m sure it can be expanded upon to use a more complex scoring system. It’s MIT licensed, so go nuts! Just give credit where credit is due.

For developers: Some lingo!

Nucleotide:
A “letter” of DNA. Typically A, T, G or C. In RNA, U is used instead of T.

Degenerate Nucleotide:
A “letter” representing one or more possible nucleotides. There are 15 possible symbols, 16 if you include no matches (represented as “-”). For more information, you can check out nucleic acid notation on Wikipedia.

Ungapped:
The DNA sequence contains no “gaps” or null nucleotides. While the algorithm itself supports gaps, it doesn’t make any effort to check gapped variants of sequences. It will always count a gap in a provided sequence as “not a match.”

Substitution Matrix:
A way of scoring nucleotide comparisons based upon how likely one nucleotide is to randomly mutate to a different type of nucleotide over a certain amount of time. This type of scoring is not used here, but it’s good to clarify.

Genome:
You’ve probably heard this one. A string of nucleotides containing all of the genetic information of a specific organism.

The Problem

First, let’s consider what we want to do. We have two sequences, let’s call them seqSearch and seqGenome. The problem we’re trying to solve is that we want to find A) the best match of seqSearch in seqGenome and B) all subsequent “close” matches of seqSearch in seqGenome, ranked by similarity, along with the position of where they match.

So, if seqSearch is “ATGC” and seqGenome is “ATGGCATGC”, we would expect a match list that looks like the following (with position 0-indexed):

Rank: 1, Position: 5, Matches: 4 / 4, Result: ATGC
Rank: 2, Position: 0, Matches: 3 / 4, Result: ATGG
Rank: 3, Position: 1, Matches: 2 / 4, Result: TGGC
… etc.

An algorithm to implement this is relatively straightforward:

function search(seqSearch, seqGenome) {
  var sLen = seqSearch.length;
  var gLen = seqGenome.length;
  // Create a Uint32Array to initialize all values to 0
  var map = new Uint32Array(gLen + sLen);
  var curChar;
  var offset;
  for (var j = 0; j < sLen; j++) {
    curChar = seqSearch[j];
    /* As we progress through our seqSearch,
      our offset in our map (the position
      we're matching at) gets changed.
      This is because we're trying to compare
      and aggregate match count based on the
      alignment of seqSearch as compared to
      its first (0-index) position. */
    offset = sLen - j;
    for (var i = 0; i < gLen; i++) {
      if (curChar === seqGenome[i]) {
        ++map[offset + i];
      }
    }
  }
  /* Convert map back into regular array,
    so we can use array methods */
  return [].slice.call(map);
}
// Call our search
var alignmentMap = search(seqSearch, seqGenome);
// Map position values
alignmentMap = alignmentMap.map(function(value, index) {
  return {position: index - seqSearch.length, matches: value};
});
// Sort so the best match (most matches) comes first
alignmentMap.sort(function(a, b) {
  return b.matches - a.matches;
});
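To see it working end to end, here is a compact, self-contained restatement of the code above (with the match list sorted best match first), run against the “ATGC” / “ATGGCATGC” example:

```javascript
// Compact version of the naive search, for demonstration purposes.
function naiveSearch(seqSearch, seqGenome) {
  var sLen = seqSearch.length;
  var map = new Uint32Array(seqGenome.length + sLen);
  for (var j = 0; j < sLen; j++) {
    for (var i = 0; i < seqGenome.length; i++) {
      // aggregate match counts by alignment offset
      if (seqSearch[j] === seqGenome[i]) ++map[sLen - j + i];
    }
  }
  return [].slice.call(map).map(function(value, index) {
    return { position: index - sLen, matches: value };
  }).sort(function(a, b) {
    return b.matches - a.matches; // best match first
  });
}

var top = naiveSearch('ATGC', 'ATGGCATGC')[0];
console.log(top.position, top.matches); // 5 4
```

The top-ranked alignment is the exact match at position 5, just as in the expected match list above.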

This is dead simple. Looking at aggregating and sorting the results, our map call will run in O(n) and our sort call (with a quicksort implementation) should run with an average time complexity of O(n log n). However, since we have a nested loop comparing every nucleotide of seqSearch with every nucleotide of seqGenome, our search call runs with a time complexity of O(n²). Yikes!

Let’s take a break right here — the following algorithm makes no attempt to achieve better performance in terms of time complexity. It still runs in O(n²). However, knowing that this is a bottleneck of such exhaustive sequence comparisons, we can try to optimize this simple search algorithm to be as performant as possible.

Initial Implementation Issues

The first problem with the above implementation is that it does not match degenerate nucleotides. For example, if seqSearch contains “W” (which should match “A” or “T”), it will only accurately increase the match counter if seqGenome contains a “W” as well. No good!

The second problem (and it relates to the first) is our conditional statement. Normally, branch prediction (to learn more, read my favorite Stack Overflow answer of all time) would try to speed this up for us:

if (curChar === seqGenome[i]) {
++map[offset + i];
}

If there are no matches in the entire sequence (let’s say all As versus all Ts), a branch predictor can learn to skip the command contained within every time. It will always be right, and voila: you’ll run very quickly! (The benchmarks for NtSeq actually show this result for cases of 0% identity between sequences. Neat!)

However, in the case of genomic data, matches are (more or less) randomly distributed, meaning a branch predictor is going to perform very poorly (and always check the conditional statement).

These conditional checks take time. You’re doing string or array element access and character comparison for every single nucleotide in both seqSearch and seqGenome.

What if there were a way to A) match degenerate nucleotides adequately and B) match and count multiple nucleotides in a sequence at once instead of one at a time?

It turns out there’s a way to do both, and JavaScript is very good at it!

Storing Sequence Information Using Binary Data

What if instead of using strings to compare sequence data, we used integers?

First of all, why would we do that?

Let’s start with a quick rundown of integers. JavaScript numbers are actually double-precision floats, but bitwise operators treat their operands as signed 32-bit integers. This means they have 1 sign bit and 31 bits containing the information about the number.

1 represented in binary as a 32-bit integer is:

00000000 00000000 00000000 00000001

I’ve separated the bytes of the 32-bit integer here for readability. As a refresher: whenever a value passes through a bitwise operation in JavaScript, you’re working with 4 bytes (32 bits) of data.

Well, neat. It turns out we can use this property of integers to store eight degenerate nucleotide values in one integer.

How, you ask? Remember that for degenerate nucleotides, there are 16 possible symbols (including “no match”). This is, conveniently, 2⁴ different possibilities and can thus be represented using only four bits of data. We can now create some semantics for how to read these four bits of data and interpret them as nucleotides (or degenerate versions of such).

1000 is A (8)
0100 is T (4)
0010 is G (2)
0001 is C (1)

This allows us to do some wonderful things!

To get the values of degenerate nucleotides:
W = (A | T) = (1000 | 0100) = 1100
N = (A | T | G | C) = (1000 | 0100 | 0010 | 0001) = 1111

To match nucleotides:
A & A = (1000 & 1000) = 1000
A & W = (1000 & 1100) = 1000
A & T = (1000 & 0100) = 0000

And so on! If you’re confused, please check up on Bitwise Operators before reading further. You should at least understand OR (|), AND (&), UNSIGNED BITSHIFT RIGHT (>>>) and BITSHIFT LEFT (<<).

Now that we have all our potential nucleotide symbols stored in four bits of data, we can lump strings of eight nucleotides into one 32-bit integer (32 / 4 = 8). “ATGCATGC” becomes 1000 0100 0010 0001 1000 0100 0010 0001, for example.
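In code, the encoding and the match test look like this (a minimal sketch; the nucleotide-to-integer mapping mirrors the table above):

```javascript
// 4-bit nucleotide codes: A=1000, T=0100, G=0010, C=0001
var NT = { A: 8, T: 4, G: 2, C: 1 };
NT.W = NT.A | NT.T;               // 1100 (12), "A or T"
NT.N = NT.A | NT.T | NT.G | NT.C; // 1111 (15), "any nucleotide"

// Two nucleotides match if their codes share at least one set bit
function ntMatch(a, b) { return (a & b) !== 0; }

console.log(ntMatch(NT.A, NT.W)); // true  (A matches "A or T")
console.log(ntMatch(NT.A, NT.T)); // false
```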

… Introducing ArrayBuffers!

Luckily, JavaScript has a really, really easy way to deal with binary data: the ArrayBuffer. An ArrayBuffer is simply an array of bytes that can be interfaced with as 8-bit, 16-bit or 32-bit integers (signed or unsigned) using typed array views like Uint8Array, Uint16Array and Uint32Array. For simplicity’s sake I’m not using signed integers in these examples — the “U” preceding the interface names means “unsigned.”

Keep in mind that you can only modify ArrayBuffers using views, or the Uint8Array (etc.) data types. Instantiating a new data view on an ArrayBuffer will not change the underlying data, merely how it is presented. This means we can set the values of an ArrayBuffer using a Uint8Array and convert it to a Uint32Array by simply instantiating a new view on the underlying buffer.
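To make that concrete, here is a tiny sketch of two views sharing one buffer (all four bytes are set to the same value so the printed result doesn’t depend on the machine’s byte order):

```javascript
// Two views over the same ArrayBuffer: writes through the
// Uint8Array view are visible through a Uint32Array view.
var buf = new ArrayBuffer(4);
var u8 = new Uint8Array(buf);
u8[0] = u8[1] = u8[2] = u8[3] = 0xFF; // set all four bytes
var u32 = new Uint32Array(buf);       // new view, same underlying data
console.log(u32[0]); // 4294967295 (0xFFFFFFFF)
```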

Reading our sequence data into an array of 32-bit integers can now be simplified:

function convertSequenceToArrayOfIntegers(sequenceString) {
  // two nucleotides per byte (4 bits each),
  // padded to a multiple of 4 bytes so a Uint32Array view is valid
  var sequenceBytes = Math.ceil(sequenceString.length / 2);
  var paddedBytes = Math.ceil(sequenceBytes / 4) * 4;
  var sequenceBuffer = new ArrayBuffer(paddedBytes);
  var uint8view = new Uint8Array(sequenceBuffer);
  /*
    lookupTable should be an object mapping each of the 16
    nucleotide characters to their respective 4-bit integer
    value.
    The following would do this for string "AT"
      uint8view[i] starts as 0000 0000 (binary)
      uint8view[i] = (1000) << 4 ("A" in binary bit shifted left 4)
      uint8view[i] is now 1000 0000 (binary)
      uint8view[i] |= (0100) ("T" in binary)
      uint8view[i] is 1000 0000 and gets OR (|) with 0100,
        resulting in 1000 0100
  */
  for (var i = 0; i < sequenceBytes; i++) {
    uint8view[i] = (lookupTable[sequenceString[i * 2]] || 0) << 4;
    uint8view[i] |= lookupTable[sequenceString[i * 2 + 1]] || 0;
  }
  // sequenceBuffer was modified through uint8view;
  // we can now view it as an array of 32-bit ints
  return new Uint32Array(sequenceBuffer);
}
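As a quick sanity check, here is the byte-packing step applied to a short string (a minimal sketch; lookupTable holds just the codes we need here):

```javascript
// Pack two 4-bit nucleotide codes per byte, as described above.
var lookupTable = { A: 8, T: 4, G: 2, C: 1, W: 12, N: 15 };

function seqToBytes(s) {
  var bytes = new Uint8Array(Math.ceil(s.length / 2));
  for (var i = 0; i < bytes.length; i++) {
    bytes[i] = (lookupTable[s[i * 2]] || 0) << 4;  // high nibble
    bytes[i] |= lookupTable[s[i * 2 + 1]] || 0;    // low nibble
  }
  return bytes;
}

var b = seqToBytes('ATGC');
console.log(b[0], b[1]); // 132 33  (1000 0100, 0010 0001)
```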

If we run this method on both of our input sequences, we’ll have transformed each of them into arrays of 32-bit integers. We can then use these arrays to perform some very quick nucleotide comparisons.

Comparing Sequences Using Bitwise Operators

Now, the fun stuff! Earlier on I mentioned that we can compare two 4-bit nucleotides for a match using the BIT AND (&) operator.

A & A = (1000 & 1000) = 1000, for example.

In this case, any time you AND two matching nucleotides, you get a sequence of bits that contains at least one 1. When there’s no match, the sequence of four bits will always be 0000.

The wonderful thing about bitwise operators is that (ideally, running on the processor level) they only take one processor cycle (your processor likely runs over 2 billion cycles per second) and act on an entire 32-bit register at once. What does that mean? It means we can compare eight degenerate nucleotides at a time when they’re stored as 32-bit integers, and it only takes a fraction of a nanosecond.

“ATGCATGW & ATATWWNN” would become:

1000 0100 0010 0001 1000 0100 0010 1100 &
1000 0100 1000 0100 1100 1100 1111 1111
=======================================
1000 0100 0000 0000 1000 0100 0010 1100

Awesome! Now how do we count the number of matches in the resulting bit string efficiently?

First, make note that comparing two degenerate nucleotides can result in a 4-bit sequence that contains more than one set bit. This means we can’t simply count bits for a match. (W & W = (1100 & 1100) = 1100.)

However, non-matching nucleotides will always result in 0000.

So what we can do is collect a match flag for every 4-bit group. We do this with UNSIGNED RIGHT BIT SHIFT (>>>) and BIT OR (|), collapsing each group of four bits down to its rightmost bit: a 1 indicates a match, a 0 indicates no match.

// matches = 1000 0100 0000 0000 1000 0100 0010 1100 (binary)
matches |= matches >>> 1;
/*
1000 0100 0000 0000 1000 0100 0010 1100 (matches) |
0100 0010 0000 0000 0100 0010 0001 0110 (matches >>> 1)
=======================================
1100 0110 0000 0000 1100 0110 0011 1110
*/
matches |= matches >>> 2;
/*
1100 0110 0000 0000 1100 0110 0011 1110 (matches) |
0011 0001 1000 0000 0011 0001 1000 1111 (matches >>> 2)
=======================================
1111 0111 1000 0000 1111 0111 1011 1111
*/

You’ll notice that the value of matches, though modified (and looking nothing like our original integer!), now has only eight important bits: the rightmost bit of each group of four. Everywhere there was a match there’s a 1, and everywhere there was no match, there’s a 0.

We can clear out the unneeded bits with a simple & with the hex flag 0x11111111. (0001 0001 0001 0001 0001 0001 0001 0001).

matches &= 0x11111111;
/*
1111 0111 1000 0000 1111 0111 1011 1111 (matches) &
0001 0001 0001 0001 0001 0001 0001 0001
=======================================
0001 0001 0000 0000 0001 0001 0001 0001
*/
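As a runnable sketch, the two shift-and-OR steps plus the mask can be wrapped in a helper and checked against the worked example (0x8400842C is the AND result from above):

```javascript
// Collapse each 4-bit group to a single match flag in its rightmost bit.
function collapseToFlags(matches) {
  matches |= matches >>> 1;
  matches |= matches >>> 2;
  return matches & 0x11111111;
}

// 0x8400842C = 1000 0100 0000 0000 1000 0100 0010 1100
console.log(collapseToFlags(0x8400842C).toString(16)); // "11001111"
```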

Beautiful! Now it comes down to counting the number of 1-bits in the bitstring.

There are a number of ways to do this. One is an algorithm that runs in O(n), where n is the number of set bits:

function countBits(n) {
var count = 0;
while (n !== 0) {
count++;
n &= (n - 1);
}
return count;
}

There’s also the Hamming weight algorithm referenced on another beautiful StackOverflow answer that runs in O(1):

function countBits(n) {
n = n - ((n >> 1) & 0x55555555);
n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
return (((n + (n >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

Aha! O(1)! Must be better! The big problem we encounter, however, is that our bit count in our actual integer is relatively sparse.

We only have a maximum of eight bits that could be counted. This means our O(n) solution will take a maximum of about 32 processor cycles (conditional check, increment, bit AND, subtraction — all multiplied by 8 is 32) and minimum of 0 cycles, whereas our O(1) Hamming Weight solution always takes around 14 processor cycles based on the number of operations (rough count, correct me if I’m wrong).

Now to our actual data — in two randomly distributed sequences of DNA, you’re only likely to encounter a match 25% of the time, meaning our O(n) solution will average 8 processor cycles when confronted with real data. This will outperform Hamming Weight.

Cool, so we’ll use that, right?

Not so fast! Turns out there’s a better O(1) solution than Hamming Weight. You can use a lookup table (as an array) to count bits in a bitstring. LookupBits[0] would be 0, LookupBits[1] would be 1, LookupBits[2] would be 1, LookupBits[3] would be 2… etc. The problem with this solution is that it takes a ton of memory. There are 4,294,967,296 different unsigned 32-bit integers, each mapping to a count between 0 and 32, so even at a single byte per entry the lookup table would be 4GB in size. Ugh.

  1. You probably don’t have that much memory to spare for a lookup table
  2. Even if you did, a lookup table this size is unmanageable. Cache misses will slow your process to a screeching halt.

So how is this usable? Well, we know we only actually have eight important bits. Let’s look at our resultant bit string again:

0001 0001 0000 0000 0001 0001 0001 0001

Every important bit is contained in the rightmost grouping of every four bits. There are only eight of them, too. Meaning there’s only actually one byte of relevant data and 256 possible combinations of results. Turns out we can use bit shifts again to aggregate our data to the rightmost byte!

// matches = 0001 0001 0000 0000 0001 0001 0001 0001
matches |= matches >>> 3;
/*
0001 0001 0000 0000 0001 0001 0001 0001 (matches) |
0000 0010 0010 0000 0000 0010 0010 0010 (matches >>> 3)
=======================================
0001 0011 0010 0000 0001 0011 0011 0011
*/
matches |= matches >>> 6;
/*
0001 0011 0010 0000 0001 0011 0011 0011 (matches) |
0000 0000 0100 1100 1000 0000 0100 1100 (matches >>> 6)
=======================================
0001 0011 0110 1100 1001 0011 0111 1111
*/

You’ll notice the variable storing our match data, though it looks quite different yet again, has now aggregated the match flags into the rightmost four bits of each set of 16 bits.

0001 0001 0000 0000 0001 0001 0001 0001

has become:

0001 0011 0110 1100 1001 0011 0111 1111

We can now shift 1100 over 12 bits, clear out the unimportant bits, and BIT OR (|) it with 1111 on the right:

matches = ((matches >>> 12) & 0xF0) | (matches & 0xF);
/*
First:
0000 0000 0000 0001 0011 0110 1100 1001 (matches >>> 12) &
0000 0000 0000 0000 0000 0000 1111 0000 (0xF0)
=======================================
0000 0000 0000 0000 0000 0000 1100 0000

Then:
0001 0011 0110 1100 1001 0011 0111 1111 (matches) &
0000 0000 0000 0000 0000 0000 0000 1111 (0xF)
=======================================
0000 0000 0000 0000 0000 0000 0000 1111
  Finally:
0000 0000 0000 0000 0000 0000 1100 0000 |
0000 0000 0000 0000 0000 0000 0000 1111
=======================================
0000 0000 0000 0000 0000 0000 1100 1111
*/
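Putting the whole flag-collapsing pipeline together as one function (a sketch mirroring the steps above), checked against the same worked example:

```javascript
// Collapse each nibble to a match flag, then pack the eight
// flags into the rightmost byte.
function packMatchFlags(matches) {
  matches |= matches >>> 1;
  matches |= matches >>> 2;
  matches &= 0x11111111;
  matches |= matches >>> 3;
  matches |= matches >>> 6;
  return ((matches >>> 12) & 0xF0) | (matches & 0xF);
}

// 0x8400842C is the AND result from the worked example
console.log(packMatchFlags(0x8400842C).toString(2)); // "11001111"
```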

Awesome! We now have all of our match data in the rightmost byte, and can check it in a lookup table that only has 256 entries! (Phew! You can pre-generate this lookup table as a Uint8Array when your program loads using either Hamming Weight or our sparse bitcount example above.)

return bitCountLookup[matches];
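Pre-generating that 256-entry table is cheap; here is a sketch using the sparse bit-count function from earlier:

```javascript
// Build the 256-entry lookup table of bit counts once at startup.
function countBits(n) {
  var count = 0;
  while (n !== 0) {
    count++;
    n &= (n - 1);
  }
  return count;
}

var bitCountLookup = new Uint8Array(256);
for (var i = 0; i < 256; i++) {
  bitCountLookup[i] = countBits(i);
}

console.log(bitCountLookup[0xCF]); // 6 matches for flag byte 1100 1111
```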

Great, now we can count matches of nucleotides eight comparisons at a time. All that’s left to do is rewrite our initial search function.

Final Implementation

If you’ve been following through so far, we can now go back to our first string comparison implementation but replace our conditional check with some bit operations to run the full algorithm. I won’t go through the full explanation, but for every set of two integers we have to make sure we “walk” the integers over each other (bit shifting left and right appropriately) to make sure we catch the potential matches at every position, and store the matches relative to the offset.

The following code is the full excerpt from NtSeq.MatchMap in my NtSeq library. I’ve commented parts that may be confusing. The wording and variable names are a bit different than above.

MatchMap.prototype.__countMatches = function(int, bitCount) {
  int |= int >>> 1;
  int |= int >>> 2;
  int &= 0x11111111;
  int |= int >>> 3;
  int |= int >>> 6;
  return bitCount[((int >>> 12) & 0xF0) | (int & 0xF)];
};

MatchMap.prototype.__execute = function(queryBuffer, searchSpaceBuffer) {

  var queryInts, spaceInts, queryIntsLength, spaceIntsLength,
    arrLen, mapBuffer, mapArray, A, B, A1, A2, T, cur, pos, i, k,
    adjustNeg, adjustPos, fnCountMatches, bitCount;

  queryInts = new Uint32Array(queryBuffer, 4);
  spaceInts = new Uint32Array(searchSpaceBuffer, 4);
  fnCountMatches = this.__countMatches;
  bitCount = __bitCount;

  queryIntsLength = queryInts.length | 0;
  spaceIntsLength = spaceInts.length | 0;

  arrLen = (queryIntsLength + spaceIntsLength) << 3;
  mapBuffer = new ArrayBuffer(4 * arrLen);
  mapArray = new Uint32Array(mapBuffer);

  for (k = 0 | 0; k < queryIntsLength; k++) {

    A = queryInts[k];

    // set offset, x << 3 is shorthand for x * 8
    // because there are 8 nucleotides per int
    cur = (queryIntsLength - k) << 3;

    for (i = 0 | 0; i < spaceIntsLength; i++) {
      // if match without shifting is non-zero, count matches
      (T = A & spaceInts[i]) &&
        (mapArray[(i << 3) + cur] += fnCountMatches(T, bitCount));
    }

    // start walking "A" along searchSpace (seqGenome) by
    // bitshifting left and right, adjust offsets accordingly
    A1 = A >>> 4;
    A2 = A << 4;

    adjustNeg = cur - 1;
    adjustPos = cur + 1;

    // break loop if A1 and A2 have been shifted far enough
    // to zero them both out
    while (A1 || A2) {

      for (i = 0 | 0; i < spaceIntsLength; i++) {
        B = spaceInts[i];
        pos = (i << 3); // === i * 8
        // if the match result is non-zero, count matches.
        (T = A1 & B) &&
          (mapArray[pos + adjustNeg] += fnCountMatches(T, bitCount));
        (T = A2 & B) &&
          (mapArray[pos + adjustPos] += fnCountMatches(T, bitCount));
      }

      // keep "walking" / shifting current integer to each offset
      A1 >>>= 4;
      A2 <<= 4;
      --adjustNeg;
      ++adjustPos;

    }

  }

  // return our buffer, we can instantiate a Uint32Array
  // on it if we wish
  return mapBuffer;

};

Now we’ve gotten somewhere. ☺

Benchmarks

So how does this actually perform? Turns out pretty well. While it has the same time complexity, it runs 6–7x faster than the “simple” implementation. Here’s some data taken February 7th, 2015:

Benchmark         |  naive | search | naiveScore | searchScore
------------------------------------------------------------------
  1,000,000,   0% |    9ms |    3ms |  9.00ns/nt |   3.00ns/nt
 10,000,000,   0% |   63ms |    5ms |  6.30ns/nt |   0.50ns/nt
100,000,000,   0% |  621ms |   60ms |  6.21ns/nt |   0.60ns/nt
  1,000,000,  25% |   15ms |    6ms | 15.00ns/nt |   6.00ns/nt
 10,000,000,  25% |  124ms |   17ms | 12.40ns/nt |   1.70ns/nt
100,000,000,  25% | 1249ms |  233ms | 12.49ns/nt |   2.33ns/nt
  1,000,000,  50% |   15ms |    2ms | 15.00ns/nt |   2.00ns/nt
 10,000,000,  50% |  131ms |   20ms | 13.10ns/nt |   2.00ns/nt
100,000,000,  50% | 1305ms |  234ms | 13.05ns/nt |   2.34ns/nt
  1,000,000, 100% |   14ms |    2ms | 14.00ns/nt |   2.00ns/nt
 10,000,000, 100% |  144ms |   18ms | 14.40ns/nt |   1.80ns/nt
100,000,000, 100% | 1471ms |  240ms | 14.71ns/nt |   2.40ns/nt

“naive” represents the first version of the alignment algorithm listed here (basic string comparison), and “search” represents the full implementation using bitwise operators. The score is measured in nanoseconds per nucleotide comparison, and was performed on a 2.4GHz processor.

The trial titles represent total search space (seqSearch length multiplied by seqGenome length) and average sequence identity (match) between the two sequences.

Conclusions

Well, JavaScript performs this optimized bit op algorithm pretty quickly.

As mentioned, the V8 engine in node.js averages about 5 processor cycles per nucleotide comparison (including memory access, storage and aggregation of data), which is about as “close to the metal” as you can get.

Personally, I think it’s time we start taking a look at using JavaScript more seriously as a “heavy-lifting” language. The V8 team at Google has done a fantastic job at making their JavaScript engine performant. As far as scientific computing is concerned, the potential is there. JavaScript is extremely accessible (requiring only a browser and text editor to execute) and can be introduced to young scientists and bioinformaticians easily. There are also arguments to be made for scaling node processes, but that can be saved for another day.

Finally, the NBEAM algorithm as described here, though implemented in JavaScript, is actually language-agnostic. Please feel free to modify it to suit your needs (I’d be interested to see GPU implementations!), but credit is always appreciated. If anybody can aid in verifying the novelty of NBEAM and would like to help get it published, don’t hesitate to reach out!

As a final reminder, the NtSeq JavaScript library and full implementation of the NBEAM algorithm is available at https://github.com/keithwhor/NtSeq.

Contact

You can visit my personal website at keithwhor.com, follow me on twitter @keithwhor, visit my github or just shout out “hi” if you happen to see me walking the streets of San Francisco.

Thanks for the read!