Burrows Wheeler Transform

1. Introduction:

Sandesh Bhusal
AlgoPods
3 min readFeb 18, 2019

--

Taking a small detour from the current series on string comparison, we are going to be looking at the Burrows Wheeler Transform algorithm, that forms the heart of many compression engines in today’s technology stacks.

The Burrows Wheeler Transform tries to bring together the characters in a string that are the same, i.e. clusters the common characters together. This can be specially important in cases like DNA sequencing, where there are limited alphabets from { A, T, C, G } only. For an example, the BWT transform of “ATCGATGCGCATCCGAT$” will be “TGC$GGTCTGCCCTAAAA”. Here, we can see that the common symbols tend to clump together, like the four A’s at the end of the sequence. BWT is not a compression algorithm in itself, but helps in many compression algorithms.

2. Implementation :: Creating a BWT of a given string:

For a Burrows Wheeler Transform, we take the string S[0…n] and append a delimiter ‘$’ at the end. Delimiters can be inserted at the beginning, end, both or none. The delimiter sign will be the sign of lowest weight, i.e. value. It can also be considered as a character of the highest weight. It can be altogether omitted too.

We rotate the string, and generate a table of rotations. The table is sorted in a lexicographical order and a string is created from all the last characters in the strings from the sorted table. BWT transform of the string is obtained. Let’s see this in action.

For a string ‘apple’, we have:

The sorted table becomes:

Remember, that the $ character precedes all the other characters. This is what I meant earlier when I said it has the lowest weight.

From the table, we can see that the BWT transform of the string will become “e$lppa”. Along with the transformed string, we take the position of the original string i.e. 2 in the table, which will be used in decoding the BWT transform later on.

Fig: Implementation of Burrows Wheeler. Start from the leftmost column and proceed towards the right -> . Here ^ denotes the beginning of the string and | denotes the end of the string. Sorting can take into account the string delimiters, ^ and | as characters of either highest priority or the lowest priority.

2. Decoding a BW Transformation of a string.

Now that we have created a BWT transformation of the string, it is time to do stuff like compression with it. For an example, say, a BWT transformed string “AAAAATTTTCCCGGGGG” can be represented as “5A4T3C5G”. This is just an example and there exist very highly efficient algorithms for this purpose.

After receiving the transformed string, we need to decode it to its original pattern text. The BWT string, accompanied with the original string position is used to decode the received transformed string into original text.

Let’s begin with the original matrix. We rotated the original text in the rightward direction while encoding the BWT. This time, we will simply rotate the pattern to the left. And we will continue appending the pattern to the matrix table and sorting the list.

Let’s take an example of transformed string ‘epa’ with index 0.

Step 1: Create matrix:

Step 2: Sort Matrix:

Step 3: Shift Matrix to Right.

Step 4: Append original Characters:

Step 5: Go through steps 1 to 4 upto the length of the string.

Finally, we obtain a matrix as follows:

From our original index, we can see, ape is the original String.

Code Implementation:

Here is the code implementation for both transforming and reverse transforming a string from BWT. The codes are self explanatory (Once you run them, they will produce useful outputs) and hosted on gist under GNU GPL.

BWT Transformation Code:

Reverse BWT Transformation Code:

The time complexities of the codes can be minimized a lot. I highly encourage the reader to seek them out and discuss them on the comments.

--

--