Algorithms In Context #3: Huffman Codes

Can Bayar · Published in Analytics Vidhya · Sep 13, 2020

Did you ever wonder how data compression programs like WinRAR or WinZip work? And did you ever ask yourself: if we can compress the data, why not store and use it that way in the first place? If so, welcome to the third chapter of the series: Huffman Codes.

So, I just solved my 200th question on HackerRank after being away for nearly two years. It happened to be about Huffman decoding, and I found myself writing this post. Let me start by asking a question: How can you modify the data on your computer so that it takes up less space? By using fewer bits to represent the same data.

Well done, Sherlock, we couldn’t have thought of that! Don’t get too cocky though, because the simplest ideas are usually the most elegant. Consider a text file consisting of 10,000 characters. Assuming you use 8 bits to store a single character, the text will take up 80,000 bits on your computer. But if we can find a way to use fewer bits per character, we can lower the storage needs of the file.

Prefix Codes

Let’s simplify our constraints a little so we can make some calculations. We still have a 10,000-character text, but it contains only 6 letters, from a to f. We can represent 6 letters with 3 bits each if we use fixed-length codes, and the entire text will take up 10,000 × 3 = 30,000 bits in memory.

As you can guess, there are other ways to represent the characters. Instead of using 3 bits for each character, we can use variable-length codes, which can perform significantly better in terms of storage. In order to achieve this, we will follow two rules:

  • Characters that occur more frequently in the text will be represented with fewer bits
  • Codewords shall be unambiguously decodable, which is why these are called prefix codes: no codeword is a prefix of any other codeword

The example below is taken directly from MIT’s Introduction to Algorithms book, an amazing book that I strongly recommend to anyone with an interest in algorithms. In the table, you can see the letter frequencies and the fixed-length and variable-length representations of each letter for our text.
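
Letter    Frequency    Fixed-length code    Variable-length code
a         45%          000                  0
b         13%          001                  101
c         12%          010                  100
d         16%          011                  111
e         9%           100                  1101
f         5%           101                  1100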

Using the variable-length codes, if you want to write the word “add” you need to write 0 111 111, which requires 2 bits fewer than its fixed-length counterpart of 000 011 011. Let’s do the math, using the frequencies and the variable-length codes, for our 10,000-character text:
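
Weighting each codeword length by its letter’s frequency gives the average number of bits per character:

0.45 × 1 + 0.13 × 3 + 0.12 × 3 + 0.16 × 3 + 0.09 × 4 + 0.05 × 4 = 2.24 bits per character, and 2.24 × 10,000 characters = 22,400 bits.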

We reduced the space needed for our text from 30,000 bits to 22,400 bits, a little more than a 25% improvement. Much higher percentages are achievable, but this is good enough for this example.

Constructing Huffman Codes

Now let’s talk about how we derived the variable-length codes. It is very important to use unambiguously decodable codes, because otherwise you will not be able to tell which character your algorithm should produce in the decompression stage.

A binary tree will help us derive our variable-length codes. At each step, we take the two nodes with the lowest frequencies, merge them under a parent node and assign their total frequency to that parent (a priority queue can help us do that, as sketched in the code further below). We continue doing this until all letters are merged under a single root. Here is the final tree constructed for our text file example (frequencies are given as percentages):
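
root (100)
├── 0 → a (45)
└── 1 → (55)
    ├── 0 → (25)
    │   ├── 0 → c (12)
    │   └── 1 → b (13)
    └── 1 → (30)
        ├── 0 → (14)
        │   ├── 0 → f (5)
        │   └── 1 → e (9)
        └── 1 → d (16)

Internal nodes show the total frequency of their subtree, and the labels on the edges are the bits.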

Once the tree is constructed, all you need to do is follow the bits on the edges to acquire the corresponding variable-length code of a letter. This gives us an unambiguous way to decode each letter, because there is only a single path to each leaf.
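
Here is a rough sketch of how the construction might look in C++. The HuffmanTreeNode structure and the buildHuffmanTree / buildCodeTable names are just one possible setup, kept compatible with the decoder I share further down:

#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

// A minimal node type: leaves carry a letter, internal nodes only carry a frequency.
struct HuffmanTreeNode {
    char data;          // meaningful only for leaf nodes
    int frequency;
    HuffmanTreeNode* left = nullptr;
    HuffmanTreeNode* right = nullptr;

    HuffmanTreeNode(char d, int f) : data(d), frequency(f) {}
    bool isLeaf() const { return left == nullptr && right == nullptr; }
};

// Orders the priority queue so that the node with the LOWEST frequency comes out first.
struct CompareFrequency {
    bool operator()(const HuffmanTreeNode* a, const HuffmanTreeNode* b) const {
        return a->frequency > b->frequency;
    }
};

HuffmanTreeNode* buildHuffmanTree(const unordered_map<char, int>& frequencies) {
    priority_queue<HuffmanTreeNode*, vector<HuffmanTreeNode*>, CompareFrequency> minHeap;

    // Start with one leaf node per letter.
    for (const auto& entry : frequencies) {
        minHeap.push(new HuffmanTreeNode(entry.first, entry.second));
    }

    // Repeatedly merge the two lowest-frequency nodes under a new parent
    // until a single root remains.
    while (minHeap.size() > 1) {
        HuffmanTreeNode* left = minHeap.top(); minHeap.pop();
        HuffmanTreeNode* right = minHeap.top(); minHeap.pop();

        HuffmanTreeNode* parent = new HuffmanTreeNode('\0', left->frequency + right->frequency);
        parent->left = left;    // the left edge represents bit 0
        parent->right = right;  // the right edge represents bit 1
        minHeap.push(parent);
    }

    return minHeap.top();  // the root of the Huffman tree
}

Reading the codewords off the tree is then a simple traversal: append a 0 for every left edge and a 1 for every right edge, and record the accumulated bits whenever you reach a leaf.

// Collects the variable-length codeword of every letter by walking the tree.
void buildCodeTable(const HuffmanTreeNode* node, const string& prefix,
                    unordered_map<char, string>& codes) {
    if (node == nullptr) {
        return;
    }
    if (node->isLeaf()) {
        codes[node->data] = prefix;  // the path from the root spells out the codeword
        return;
    }
    buildCodeTable(node->left, prefix + "0", codes);
    buildCodeTable(node->right, prefix + "1", codes);
}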

Here is one extra thing to notice: this is a greedy algorithm. At every step we merge the two nodes with the lowest frequencies, which is the locally optimal choice, and as a result the most frequent letters end up with the shortest codewords. You can read more about greedy algorithms in my previous post.

Now let’s write some code. Compressing the text is easy: all you need to do is replace each character with its corresponding codeword, so I will only sketch that step briefly below. The function I will share in full is the decompression one, which I used to solve the HackerRank question I mentioned at the start of the post.
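
A minimal sketch of that replacement step, assuming the code table produced by the buildCodeTable helper above:

// Replaces every character with its codeword to produce the encoded bit string.
string encodeHuffman(const string& text, const unordered_map<char, string>& codes) {
    string encoded;
    for (char letter : text) {
        encoded += codes.at(letter);
    }
    return encoded;
}

And here is the decompression function: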

void decodeHuffman(HuffmanTreeNode* root, string encoded) {
    HuffmanTreeNode* runner = root;
    for (char current : encoded) {
        // Follow the edge that matches the current bit.
        if (current == '0') {
            runner = runner->left;
        } else {
            runner = runner->right;
        }
        // Reaching a leaf means a complete letter has been decoded.
        if (runner->isLeaf()) {
            cout << runner->data;
            runner = root;
        }
    }
}

We iterate through each character (which can be either ‘0’ or ‘1’) in the encoded string and take the corresponding route in the constructed tree. Whenever we encounter a leaf node, we know that we have found a letter. Then we reset our pointer to the root and continue until the entire text is decoded.

Isn’t it amazing that something so simple is so effective? Huffman codes provide a very simple and fast lossless compression method that inspired many more compression algorithms to come.

If we can do lossless compression on data, then why not always use the compressed data in our computers? Because compressed data is not very easy to work with. In fact, we would need to do extra operations all the time to work with compressed data, which wouldn’t be very efficient. For example, when we use fixed-length codes to store our text, we know that each letter is exactly 3 bits away from the previous and the next letter. Can you say how many bits away the next letter is, without decoding it first, when you use variable-length codes? Of course not. That is why we use fixed formats to store the data, even though they contain some redundancy.

That’s all for today, I’ll spend the rest of the evening listening to Iron Maiden. I hope this short post was enlightening, and see you in the next chapter.
