Huffman Encoding — Compression basics in Python

Ram Rathi · Published in Hashtag by IECSE · 6 min read · Jul 26, 2019

Huffman compression is one of the fundamental lossless compression algorithms. (Lossless algorithms are those that can compress and decompress data without any loss of information.)

It banks on the basic idea of representing the most commonly recurring subunit with the smallest number of bits. Often this recurring subunit is a character, as we'll assume in this article.

It’s a widely used technique: JPEG images, MP3 files and DNA sequence compression all make use of it.

The best way to learn about this is by making a compressor ourselves! We’ll be using Python to build a (not the most efficient, but working) model of a Huffman compressor.

Basics of Huffman Coding

This algorithm exploits recurring characters to our advantage: we try to represent these recurring characters using fewer bits than they would normally take. For example, let’s assume we have a large text file containing 1000 occurrences of ‘A’ and only ~200 occurrences of each of the other letters.

Normally, each letter would take up 8 bits of data (1 byte). This means the total size of the data would be 8 * (number of characters) bits.

But if there were some way to represent ‘A’ using fewer than 8 bits, say 2 bits, we would reduce our size (ideally) by 1000 * (8 - 2) = 6000 bits.

For larger files, this would be a significant saving. One more thing to keep in mind is that ‘A’ wouldn’t be the only letter represented in fewer than 8 bits. For example, along with ‘A’, we might represent ‘C’ and ‘D’ with 3 bits each.

So how do we decide how to represent each letter? Well, thanks to David A. Huffman, we can find the most efficient way to represent these characters in binary form using something called a Huffman tree. This tree gives us an optimal binary code, taking into account the frequency of each character.

For example, let’s say we want to compress a DNA sequence: we have a text file containing the string “AGGACTAAAAACG”.

The distribution of each letter stands at -

‘A’ : 7

‘G’ : 3

‘C’ : 2

‘T’ : 1

Our Huffman tree would look something like this -

An example of a Huffman tree

Don’t worry if you don’t know how this tree was made; we’ll come to that in a bit. But for now, let’s look at how much we can compress this string using the new representations.

Normally, we would represent ‘A’, ‘G’, ‘T’ and ‘C’ with 8 bits each. But now the new representations are -

A: 0 (1 bit)

G: 10 (2 bits)

C: 110 (3 bits)

T : 111 (3 bits)

So something like ‘ACT’ would be represented as 0110111.

Some more examples of DNA sequences being converted to bits and their decimal representations

You can see here that we represented ‘ACT’ in only 7 bits instead of the usual 24 bits it would need. Similarly, our original string compressed would be -

Size = 7*(1 bit) + 3*(2 bit) + 2*(3 bit) + 1*(3 bit) = 22 bits!

That’s a good reduction from our usual (7+3+2+1)*8 = 104 bits.
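As a quick sanity check, a few lines of Python (using the code table from the example tree above) confirm the 22-bit figure:

```python
# Huffman codes taken from the example tree above
codes = {'A': '0', 'G': '10', 'C': '110', 'T': '111'}

message = "AGGACTAAAAACG"
encoded = ''.join(codes[ch] for ch in message)

print(encoded)       # the compressed bit string
print(len(encoded))  # 22 bits, vs. 13 * 8 = 104 bits uncompressed
```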

I hope this gave you a quick insight into how Huffman compression works. Next, let’s see how we build a Huffman Tree.

Building Huffman Trees

The first and most fundamental step of building a Huffman tree is calculating the occurrences of each character. Since we’ll be using Python, a dictionary data structure would be the easiest way to do this.

Now suppose this is the current character-occurrence chart. We start by taking the two least occurring nodes (characters) and joining them. The result is considered a new node, with an occurrence value equal to the sum of the values of both nodes, i.e. the total occurrences of both these characters.

Joining the 2 least occurring nodes, ‘C’ and ‘T’

Now we take the next two lowest value nodes and join them too. Once we join two nodes, we start considering them as one combined unit node.

We repeat this step until there is only one node in total. This node is our Huffman tree.

Now, to find the new representation of each character, we add a 0 for each left and 1 for each right edge. Here’s an image representation to help you understand what this means.

This time ‘round, I’m hoping you’ve understood how we got this tree.
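The edge-labelling rule (‘0’ for left, ‘1’ for right) can also be expressed directly as a tree walk. Here is a minimal sketch; the nested-tuple layout of the tree is just an assumption for illustration, not the representation our compressor will use below:

```python
# A toy Huffman tree for "AGGACTAAAAACG", written as nested tuples:
# a leaf is a character, an internal node is a (left, right) pair.
tree = ('A', ('G', ('C', 'T')))

def assign_codes(node, prefix=""):
    """Walk the tree, appending '0' for left edges and '1' for right edges."""
    if isinstance(node, str):        # leaf: the path so far is its code
        return {node: prefix or "0"}
    left, right = node
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes

print(assign_codes(tree))
# {'A': '0', 'G': '10', 'C': '110', 'T': '111'}
```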

Python Code

We’re going to be using a heap as the preferred data structure to form our Huffman tree. In Python, ‘heapq’ is a standard library module that lets us implement this easily.

Let’s start by making a function named encode, which accepts data in a string format.

import heapq

def encode(data):
    key = dict()      # maps each character to its bit-string encoding
    encoded = str()   # the compressed output
    corp = dict()     # character frequencies

Next, let’s count the frequency of each character in the data and store it in our key.

    # Count the frequency of characters in data
    for _ in data:
        corp[_] = corp.get(_, 0) + 1
        # corp.get returns a default value of 0 if the key is not found

Next, we turn our frequency dictionary into a min-heap of (frequency, characters) nodes and merge it down as previously discussed, at each step adding a ‘0’ to the codes of the characters on the left and a ‘1’ to those on the right.

    # Build a min-heap of (frequency, characters) nodes
    htree = [(freq, char) for char, freq in corp.items()]
    heapq.heapify(htree)

    while len(htree) > 1:
        left_node = heapq.heappop(htree)
        right_node = heapq.heappop(htree)
        for _ in left_node[1]:
            # Add a 0 to the encoding of all characters on the left
            key[_] = '0' + key.get(_, "")
        for _ in right_node[1]:
            # Add a 1 to the encoding of all characters on the right
            key[_] = '1' + key.get(_, "")
        # Join both as one node and push it back onto the heap
        heapq.heappush(htree, (left_node[0] + right_node[0],
                               left_node[1] + right_node[1]))

Let’s print out the encodings for the sake of understanding -

    for x in key:
        print(x, key[x])

The rest of the code is pretty straightforward: we replace each character with its encoding, pad the result with 0’s so its length is a whole number of bytes, and return the encoded string.

    '''
    Replace each character with its bit encoding and
    append it to our string "encoded"
    '''
    for char in data:
        encoded += key[char]

    # Pad with 0's so the length is a multiple of 8 bits
    rem = (8 - len(encoded) % 8) % 8
    key['rem'] = rem
    encoded += '0' * rem
    return encoded

And that’s pretty much it. Simply pass a string into this function and it’ll return the compressed data as a string of bits.

Remember, to decode the compressed data, you’ll need the key too. I’ve provided the whole code below if you want to have a look at it and try it out for yourself. The decoding part should make sense, as it’s just the opposite of what we did above.
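To give a sense of the reverse direction, here is a minimal sketch of a decoder. It assumes the key format produced by the encoder above (characters mapped to bit strings, with the padding length stored under ‘rem’); it is an illustration of the idea, not the article’s full implementation:

```python
def decode(encoded, key):
    """Reverse the encoding: key maps characters to their bit strings."""
    # Strip the 0's that were appended as padding
    rem = key.get('rem', 0)
    if rem:
        encoded = encoded[:-rem]
    # Invert the key so we can look characters up by their bits
    inverse = {bits: char for char, bits in key.items() if char != 'rem'}
    decoded, buffer = "", ""
    for bit in encoded:
        buffer += bit
        if buffer in inverse:  # Huffman codes are prefix-free, so a
            decoded += inverse[buffer]  # match is always a whole character
            buffer = ""
    return decoded

key = {'A': '0', 'G': '10', 'C': '110', 'T': '111', 'rem': 2}
print(decode("010100110111000001101000", key))  # "AGGACTAAAAACG"
```

Because Huffman codes are prefix-free, no code is a prefix of another, so the decoder can commit to a character the moment its buffer matches an entry in the inverted key.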

Conclusion

Now that you know how Huffman Compression works, let me warn you that this was just a very basic view of compression, and might not be the best way to implement it when you want to build a real-world compressor. Nevertheless, it’s a good starting point to understand how it all works.

Hope this article proved to be helpful. Any corrections or suggestions are welcome in the comments.
