File Compression in Rust 2: Burrows-Wheeler Transform

Bits Of Thought
5 min read · Oct 24, 2022


In the previous part of this series, we learned about a popular static encoding and decoding algorithm for compression: Huffman Coding. As we learned, Huffman codes are statically generated, meaning they do not consider the “environment” a symbol occurs in, instead relying solely on the statistical distribution of file contents. In this article, we learn a strategy known as the Burrows-Wheeler transform that further decreases file size by accounting for patterns in the input, advancing toward a Rust encoding pipeline akin to bzip2’s implementation!

Dynamic Encodings

While Huffman codes are optimal in the sense that no alternative mapping of bytes to symbols results in better compression, using them does not produce the smallest possible output size! The reason is the additional information we can extract from already encoded and decoded bytes, which static algorithms ignore. For example, if we generate Huffman codes for a file containing the repeated sequence 0x10,0x20,0x30,0x40, each of our four bytes would be encoded with two bits because they all have the same frequency. With the knowledge that 0x20 always follows 0x10 and so forth, we could encode our file as repetitions of already processed byte sequences: instead of two bits per input byte, a description like “repeat this four-byte sequence n times” takes only a handful of bytes in total, a far better compression ratio. Algorithms that receive feedback from already processed file contents are considered dynamic and are widely used in practice.

Burrows-Wheeler Transform

Enriching data with dynamic information

Instead of designing a new algorithm to perform dynamic file compression, the Burrows-Wheeler transform allows us to fold dynamic information into the static frequencies of symbols, improving the efficiency of Huffman Coding! The basic idea is to “transform” our file so that a subsequent static encoding produces smaller output. Of course, the transformation must be reversible to allow for decompression.

Overview of compression pipeline

Rotating Blocks

Our transformation operates on blocks of input of a given size. The initial step of our algorithm consists of producing every rotation of our input block and sorting the rotations lexicographically. To rotate a block, we shift every symbol to the right as if we had a ring buffer, thereby shifting the last symbol in from the left.
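
As a first, naive sketch, generating and sorting the rotations might look like this (the function name rotations is our own choice, not prescribed by anything above):

```rust
/// Produce every rotation of `block`. The i-th rotation shifts the
/// block right i times, wrapping the last i symbols around to the front.
fn rotations(block: &[u8]) -> Vec<Vec<u8>> {
    (0..block.len())
        .map(|i| {
            let split = block.len() - i; // where the wrapped-around part begins
            let mut rot = Vec::with_capacity(block.len());
            rot.extend_from_slice(&block[split..]); // symbols shifted in from the left
            rot.extend_from_slice(&block[..split]); // remainder of the block
            rot
        })
        .collect()
}

fn main() {
    let mut table = rotations(b"FERRIS");
    table.sort(); // lexicographic sort, yielding the table below
    assert_eq!(table[1], b"FERRIS"); // the original block ends up at row 1
}
```

Materializing a full copy of the block for every rotation is wasteful; we will avoid exactly that in the real implementation further down.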

Let’s go through a simple example for the block “FERRIS”. First, we generate all six rotations (counting the unrotated block as the 0th) and sort them. Treating each rotation as a row, we end up with the following “table”:

ERRISF # 5th rotation
FERRIS # 0th rotation
ISFERR # 2nd rotation
RISFER # 3rd rotation
RRISFE # 4th rotation
SFERRI # 1st rotation

Here comes the weird part: we take the last column of our resulting table and the row index of our original block and combine them to form our transformation output. For “FERRIS”, that is the column “FSRREI” together with row index 1. At this point, one commonly wonders if it is even possible to reconstruct the initial input given nothing but the last column and a row index. It turns out it is, using the following approach:

1: We know that the first column contains the same symbols as the last but is lexicographically sorted, allowing us to reconstruct it.

2: Now we have a mapping between all symbols that are adjacent in the original input: in each row, the last symbol immediately precedes the first one. Recall that every rotation merely shifts the last symbol around to the beginning.

3: To reconstruct our original input, we start at the row index and follow our “adjacency map” until we have recovered the entire block.
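
Let’s trace this with our example. Recall that transforming “FERRIS” gave us the last column “FSRREI” and row index 1; sorting that column yields the first column “EFIRRS”. We start at row 1 and read its first symbol, “F”. The row ending in that “F” is row 0, so its first symbol “E” must come next; the row ending in “E” is row 4, contributing “R”; and so on until all six symbols spell “FERRIS” again. For duplicate symbols like the two “R”s, the k-th occurrence in the last column corresponds to the k-th occurrence in the first column, which keeps the mapping unambiguous.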

We now know how to transform and reconstruct a block! Processing our input file is just the repetition of this procedure until all blocks have been transformed.
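
In code, this outer loop is little more than iterating over fixed-size chunks. A sketch, assuming the BwtBlock type and bwt_encode function we flesh out in the next section:

```rust
/// Transform a whole input block by block. The final block may be
/// shorter than `block_size`; the per-block transform handles any length.
fn transform_file(input: &[u8], block_size: usize) -> Vec<BwtBlock> {
    input.chunks(block_size).map(bwt_encode).collect()
}
```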

But why would this seemingly arbitrary transformation improve our Huffman compression? The reason is that repeated substrings of our input tend to be preceded by the same symbol, and those preceding symbols are now grouped together in the last column! Take, for instance, the English word “the”: all rotations starting with “he” are sorted into adjacent rows, and hence the preceding “t”s cluster in our last column. This clustering of preceding characters is how the Burrows-Wheeler transform lets us encode dynamic information in static data.

Transferring to Rust

Let’s implement what we just learned in Rust; we will use the same project and types from the previous part.
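
First, some scaffolding. Here is a minimal sketch of the setup, with names of our own choosing (BwtBlock, rotation_table):

```rust
/// One transformed block: the last column of the sorted rotation
/// table, plus the row at which the original block ended up.
pub struct BwtBlock {
    pub last_column: Vec<u8>,
    pub row_index: usize,
}

/// The rotation "table": each rotation is represented only by the
/// index of its first symbol in the block, not by a copied row.
fn rotation_table(block: &[u8]) -> Vec<usize> {
    (0..block.len()).collect()
}
```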

Most of this is boilerplate. Notably, our table represents all our rotations: since duplicating the input slice for every row would be far too memory-intensive, we instead represent each rotation by the index of its first symbol.
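
With that in place, the transform itself might look as follows; again a sketch under the same assumptions, not verbatim project code:

```rust
/// Transform one block: sort all rotations lexicographically and emit
/// the last column together with the original block's row index.
/// Assumes a non-empty block.
fn bwt_encode(block: &[u8]) -> BwtBlock {
    let mut table = rotation_table(block);

    // Sort the start indices by comparing the rotations they stand
    // for, i.e. the block split and wrapped around at each index.
    table.sort_by(|&a, &b| {
        let rot_a = block[a..].iter().chain(block[..a].iter());
        let rot_b = block[b..].iter().chain(block[..b].iter());
        rot_a.cmp(rot_b)
    });

    BwtBlock {
        // The last symbol of the rotation starting at `start` is the
        // symbol just before it in the slice, wrapping at the front.
        last_column: table
            .iter()
            .map(|&start| block[(start + block.len() - 1) % block.len()])
            .collect(),
        // The original block is the rotation starting at index 0.
        row_index: table.iter().position(|&start| start == 0).unwrap(),
    }
}
```

Comparing whole rotations makes this sort expensive for large blocks; production implementations like bzip2 put considerable effort into faster sorting, but the naive version is fine for understanding the transform.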

We do all the heavy lifting in this code: we sort our rotations and map them to the output column. To fill the last column, we figure out each rotation’s final symbol by looking at the symbol preceding its start index in our slice.
And that is all there is to encoding! But without a function for inverting the transformation and reconstructing our input, we can’t do anything yet. Let’s implement decoding next. We can make decoding more efficient by not explicitly generating a map between the first and last symbols of each rotation. Instead, we enumerate each row of our last column before sorting it: when accessing row i in the first column, the stored row number implicitly tells us the row of the next symbol!
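
A sketch of that idea, reusing our hypothetical BwtBlock from above:

```rust
/// Invert the transform: reconstruct the original block from the
/// last column and the row index of the original block.
fn bwt_decode(block: &BwtBlock) -> Vec<u8> {
    // Enumerate the last column, remembering each symbol's row...
    let mut first_column: Vec<(u8, usize)> = block
        .last_column
        .iter()
        .enumerate()
        .map(|(row, &symbol)| (symbol, row))
        .collect();

    // ...then sort it (stably!) to obtain the first column. Each entry
    // now pairs a first-column symbol with the row whose *last* symbol
    // is that same occurrence, i.e. the row holding the next symbol.
    first_column.sort_by_key(|&(symbol, _)| symbol);

    // Follow the implicit pointers, starting at the original block's row.
    let mut output = Vec::with_capacity(first_column.len());
    let mut row = block.row_index;
    for _ in 0..first_column.len() {
        let (symbol, next_row) = first_column[row];
        output.push(symbol);
        row = next_row;
    }
    output
}
```

Stability matters here: Rust’s sort_by_key is stable, which is exactly what preserves the k-th-occurrence correspondence between the two columns.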

As you can see, reconstruction, despite sounding more daunting, is not more complicated than the initial transformation.

And that concludes part two of our file compression series. When you try to connect our new Burrows-Wheeler transform to our Huffman encoder, you will notice that it doesn’t decrease file size yet. That is because, although the BWT clusters similar characters in its output, we have not yet made any modification to their static distribution! With Huffman Coding being a static procedure, rearranging symbols changes nothing about the compression ratio.

In the next part, we will implement a simple transformation that sits between our Burrows-Wheeler transform and our Huffman backend to actually remap our now-clustered symbols and thus reduce our output file size. Stay tuned!
