File compression in Rust: Huffman Coding

Bits Of Thought
Sep 26, 2022 · 7 min read

Compression is somewhat magical: we are all acquainted with popular formats such as zip, jpeg, mp4, and mp3 that compress their contents, sometimes losslessly and sometimes lossily. While many popular compression algorithms are very complex, simple lossless encodings can reduce many files to a fraction of their size with just a few lines of code!

In this multi-part series, we will implement different compression algorithms using Rust, working towards a full Burrows-Wheeler pipeline similar to bzip2. For this part, we lay the groundwork by shrinking file sizes using a static algorithm: Huffman Coding. But before we begin, we need to establish some practical foundations. You can find complete sources on the GitHub Repository.

Static Encodings

In our case, every file is represented as a series of bytes, symbols ranging from 0 to 255. The idea behind a static encoding is to replace each input symbol, a byte, with an output symbol that will, on average, have a shorter length. The encoding is static because it does not consider where in the file a symbol appears: every byte is translated to the same code.

As an example, let’s consider the English string “Fearless concurrency”. If we save this text using UTF-8, it will be 20 bytes large, 8 bits per character. But do we really need a whole 8 bits per character? For one, this text only contains 12 distinct characters, so an easy optimization would be to represent every character with just 4 bits instead of 8. Furthermore, some symbols occur more often than others: we can lower the size of our text further if we choose a shorter encoding for common characters than for uncommon or non-existent ones.
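One possible coding table for “Fearless concurrency” looks as follows. The concrete codes here are reconstructed for illustration and depend on how ties are broken during construction, so the codes in the original illustration may differ:

    Symbol   Count   Code
    c        3/20    000
    e        3/20    001
    r        3/20    010
    n        2/20    110
    s        2/20    111
    F        1/20    0110
    (space)  1/20    01110
    a        1/20    01111
    l        1/20    1000
    o        1/20    1001
    u        1/20    1010
    y        1/20    1011

With this assignment, the whole string needs 69 bits, just under 9 bytes.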

As you can see, we used a coding table to effectively compress our input “file”, which is now a lot smaller than 20 bytes. Of course, we want to later recreate the text “Fearless concurrency” from our output, meaning that decoding a compressed file must always result in our original input. We achieve this by choosing codes that are unique and prefix-free: if one code were the prefix of another, decoding a file would be ambiguous. If the codes fulfill these conditions, we can use the same table in the reverse direction to decompress a file.

Huffman coding is simply a process to generate an optimal, prefix-free coding table. Optimal means that no other prefix-free coding table results in a smaller file size than the one obtained using Huffman codes!

Huffman Codes

Prefix-free codes can be visualized as the leaves of a binary tree. The root marks a code’s beginning, and every edge along the path to a leaf adds a one or a zero to the resulting bit-code (depending on whether the right or left edge was taken). Because codes are assigned only to leaves, never to inner nodes, the resulting codes are implicitly prefix-free.

The Huffman-tree for the previous coding table

Notice that each node includes the frequency of all its subtree’s leaves, the root always having a frequency of 1.0 (20/20).

Huffman discovered that the two nodes with the lowest frequencies always sit at the bottom of an optimal tree (otherwise, you could improve a file’s compression by swapping a less frequent code higher up in the tree with a more frequent one further down). This, combined with the observation that every subtree is itself optimal, gives a way to construct an optimal coding table for any file.

The Huffman-Tree for a file is constructed as follows:

1. Create a node for each of the 256 bytes, including its frequency
2. Choose the two parentless nodes with the lowest frequencies
3. Create a new parent node containing both nodes and their combined frequency
4. While there is more than one root left, repeat from 2.

Construction steps of our Huffman-tree; nodes in a step are linked to the resulting parent.

The above diagram illustrates how the Huffman-Tree of our example could have been constructed. Steps of the same color can be reordered, and their children switched, because the frequencies involved are equal. As you can see, this means that there are multiple optimal Huffman-Trees for a file, each one using different codes.

Once we have the tree, it is straightforward to traverse it and create the coding table. Now let’s finally get to some coding!

Calculating Codes

Representing a Huffman-Tree in Rust

Rust’s borrow-checker makes working with trees a bit of a nuisance. A Node can’t contain a plain reference to its children, else their lifetime would have to be known when specifying the parent’s type. One fix is to have each parent own its children, making use of Box<T> . A more flexible representation is to wrap each child in Rc<RefCell<T>> : this way we can take references to children and modify them without having to move in and out of an Option<Box<T>> . We end up with the following definition:
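A minimal sketch of such a definition (the exact code lives in the GitHub repository; apart from mask, which we refer to explicitly later on, the field names here are illustrative):

    use std::cell::RefCell;
    use std::cmp::Ordering;
    use std::rc::Rc;

    // One node of the Huffman-Tree: leaves carry a byte, inner nodes carry children.
    #[derive(Default)]
    struct Node {
        byte: u8,        // the symbol; only meaningful for leaves
        freq: usize,     // occurrences of this subtree's symbols in the input
        mask: Vec<bool>, // the bit-code, filled in by a later traversal
        left: Option<Rc<RefCell<Node>>>,
        right: Option<Rc<RefCell<Node>>>,
    }

    // Nodes are compared by frequency alone, giving them the natural order
    // that the BinaryHeap below relies on.
    impl Ord for Node {
        fn cmp(&self, other: &Self) -> Ordering {
            self.freq.cmp(&other.freq)
        }
    }
    impl PartialOrd for Node {
        fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
            Some(self.cmp(other))
        }
    }
    impl PartialEq for Node {
        fn eq(&self, other: &Self) -> bool {
            self.freq == other.freq
        }
    }
    impl Eq for Node {}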

Our fn create_tree() will receive the absolute count of each byte in a file, freqs , and return two results: a byte-indexed collection of every leaf, used for encoding, and the root of our tree, used for decoding.

Constructing the Tree

Before creating any parent nodes, we first need to construct every leaf:
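A sketch of this first part of fn create_tree() , under the assumptions above (the function body is assembled over the next few snippets):

    use std::cmp::Reverse;
    use std::collections::BinaryHeap;

    fn create_tree(freqs: &[u32; 256]) -> (Vec<Rc<RefCell<Node>>>, Rc<RefCell<Node>>) {
        let mut heap = BinaryHeap::new();
        let mut leaves = Vec::with_capacity(256);

        for byte in 0..=255u8 {
            let leaf = Rc::new(RefCell::new(Node {
                byte,
                freq: freqs[byte as usize] as usize,
                ..Node::default()
            }));
            leaves.push(Rc::clone(&leaf)); // byte-indexed: leaves[b] is the leaf for byte b
            heap.push(Reverse(leaf));      // Reverse turns the max-heap into a min-heap
        }
        // ... continued below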

Because our input symbols are simply bytes, we create 256 new nodes, fill in their fields, and insert them into a BinaryHeap . The Reverse wrapper reverses the nodes’ natural order, so that we can extract nodes with minimal frequency from the heap.

Recalling step two of the earlier pseudo-algorithm, we next need to repeatedly create new parents containing the two least-frequent roots until only one root is left. In Rust, this is a simple while loop:
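Continuing the sketch inside fn create_tree() :

        // Merge the two least-frequent roots until only one remains.
        while heap.len() > 1 {
            let Reverse(a) = heap.pop().unwrap();
            let Reverse(b) = heap.pop().unwrap();
            let freq = a.borrow().freq + b.borrow().freq;
            let parent = Rc::new(RefCell::new(Node {
                freq,
                left: Some(a),
                right: Some(b),
                ..Node::default()
            }));
            heap.push(Reverse(parent));
        }
        let Reverse(root) = heap.pop().unwrap();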

We’re almost done computing a Huffman-Tree! The last step is to traverse it and calculate the resulting codes, saving them in the mask field. We will use a depth-first search, keeping the current working set in a Vec<T> .

We start with the root and in each step, insert a 0 at the end of left.mask and a 1 at the end of right.mask . The stack will be empty once we have visited every inner node, leaving us with the finished Huffman-Tree!
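As a sketch, completing fn create_tree() :

        // Depth-first traversal: each child's code is the parent's code plus one bit.
        let mut stack = vec![Rc::clone(&root)];
        while let Some(current) = stack.pop() {
            let current = current.borrow();
            if let (Some(left), Some(right)) = (&current.left, &current.right) {
                {
                    let mut l = left.borrow_mut();
                    l.mask = current.mask.clone();
                    l.mask.push(false); // 0 for the left edge
                }
                {
                    let mut r = right.borrow_mut();
                    r.mask = current.mask.clone();
                    r.mask.push(true); // 1 for the right edge
                }
                stack.push(Rc::clone(left));
                stack.push(Rc::clone(right));
            }
        }

        (leaves, root)
    }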

We are done writing fn create_tree() . Next, we have to implement encoding and decoding.

Translating Codes

Once we begin translating codes, we need to find a way of sharing the Huffman-Tree between compression and decompression of a file. Recall that all the information we need to construct our tree is each byte’s frequency. There are multiple approaches to sharing the coding table:

  1. Simply include the table (or byte counts) in the compressed file
  2. Transform the coding table into canonical form and include it in the compressed file
  3. Start with a count of zero for each symbol, recalculating the Huffman-Tree after translating a single code (this will mostly be advantageous for small files)

We will be implementing method 1, prepending the count of each byte to a compressed file. While easy to implement, simply including the byte frequencies is not very space efficient: 256 32-bit numbers take a whole kilobyte.

Encoding

We calculate the frequency of each byte and use the information to call our create_tree() function. The tuple we return from create_tree() already contains an array representing the coding table. Because we use Rc<T> for referencing our nodes, the fact that we do not store root in a variable causes all inner nodes to be dropped! We iterate over the input file, looking up the corresponding Huffman code by indexing into the coding table. Since codes can be shorter or longer than 8 bits, we do some bit-shifting to chop our result into bytes.
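Put together, a sketch of the encoder, reusing the helpers above (the repository’s version may differ in details such as error handling and I/O):

    // Encode `input`, prepending the 256 little-endian byte counts as metadata.
    fn encode(input: &[u8]) -> Vec<u8> {
        let mut freqs = [0u32; 256];
        for &b in input {
            freqs[b as usize] += 1;
        }

        // Binding the root to `_` drops it immediately, freeing every inner
        // node; the leaves stay alive through the returned collection.
        let (leaves, _) = create_tree(&freqs);

        let mut out = Vec::new();
        for count in freqs {
            out.extend_from_slice(&count.to_le_bytes());
        }

        // Pack the variable-length codes into whole bytes.
        let (mut current, mut filled) = (0u8, 0u32);
        for &b in input {
            for &bit in leaves[b as usize].borrow().mask.iter() {
                current = (current << 1) | bit as u8;
                filled += 1;
                if filled == 8 {
                    out.push(current);
                    current = 0;
                    filled = 0;
                }
            }
        }
        if filled > 0 {
            out.push(current << (8 - filled)); // pad the final byte with zeros
        }
        out
    }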

Decoding

Decoding is somewhat more complicated: we read the input bit by bit and traverse the Huffman-Tree’s edges as we go along. But before we do that, we first read the byte counts that we prepended to the output file in encode .
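A sketch of that step, still under the same assumptions (completed in the next snippet):

    fn decode(input: &[u8]) -> Vec<u8> {
        // Read back the 256 little-endian byte counts prepended by `encode`.
        let mut freqs = [0u32; 256];
        for i in 0..256 {
            freqs[i] = u32::from_le_bytes(input[i * 4..i * 4 + 4].try_into().unwrap());
        }
        let (_, root) = create_tree(&freqs);
        // ... continued below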

Notice that this time we store the root of our tree instead of its leaves, since we will traverse it during decoding.

Now we can get to the actual translation:
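A sketch completing fn decode() ; note that the byte counts tell us how many symbols to expect, so the zero padding in the final byte is never misread as another code:

        let total: usize = freqs.iter().map(|&c| c as usize).sum();
        let mut out = Vec::with_capacity(total);
        let mut node = Rc::clone(&root);

        'file: for &byte in &input[1024..] {
            for shift in (0..8).rev() {
                // Follow the right edge for a 1-bit, the left edge for a 0-bit.
                let next = {
                    let n = node.borrow();
                    if (byte >> shift) & 1 == 1 {
                        Rc::clone(n.right.as_ref().unwrap())
                    } else {
                        Rc::clone(n.left.as_ref().unwrap())
                    }
                };
                node = next;
                if node.borrow().left.is_none() {
                    // Reached a leaf: emit its byte and restart at the root.
                    out.push(node.borrow().byte);
                    if out.len() == total {
                        break 'file;
                    }
                    node = Rc::clone(&root);
                }
            }
        }
        out
    }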

And that’s it! We can now use this to encode any stream of bytes. Because we prepend 1 KB of metadata to each file, small files will vastly increase in size, but the overhead becomes less significant with larger file sizes.

Let’s try encoding Goethe’s Faust and see how much we can compress it by:

$ cargo run -- -i faust.in -o encoded
Finished dev [unoptimized + debuginfo] target(s) in 0.33s
Running `target/debug/compression -i faust.in -o encoded`
Size before: 206028 bytes.
Size after: 127510 bytes [61%].

In the next part, we will implement a Burrows-Wheeler transform, enriching our static Huffman Coding with dynamic information. Stay tuned!
