Greedy Huffman in Ruby

Ruby Devs
Ruby Devs
Sep 3, 2018 · 2 min read

As we all know Huffman Coding is a coding technique used to do lossless(which can be recovered fully unlike lossy compression which in contrast recovers data approximately)compression of the data.

In this we are generally given a set of n characters(c) from alphabets A, where c belongs to A, their frequency is also given f(c), and we have to find a binary code for each character such that summation of frequency with character bits is minimum.

For example if we are given a set of characters, a(10), c(13) , e(15), y(3), q(2), h(8), k(9), with their frequency in brackets.

If we are not using any sort of compression, then these 7 characters can be represented by 3 bit codes, so that the total number of bits taken would be equal to (10+13+15+3+2+8+9) * 4= 240 characters.

But if we are to use Huffman encoding, which is based on greedy methodology, meaning that we can get a locally optimal solution and the locally optimally solution of each subproblem step by step gives us the complete and correct solution. Since we are compressing the data, we also have to make sure now that we maintain a proper prefix code system(meaning there is no whole code word in system which works as prefix of another word),

For e.g say ‘01’ represents a, ‘010’ represents b, then it violates the prefix code, as we won’t be able to tell while decompressing the file, what to decompress say ‘01010’ too. So we have to ensure that we have a prefix code applied.

We can implement this technique with the help of two data structures

  1. Priority Queues
  2. Binary Trees

For Priority Queue implementation, we are using min-heap, you can check this:

But the reason we are using min-heap, is because in this greedy approach we are taking the two smallest elements from our priority queue and we will merge them to form a node of a tree. Now while traversing that tree, we will assign 0 to the node when we move left and 1 to the node when we move right.

The path from root to the leaf , will then determine the code we need to encode the data.

This will print this:

c =>[0, 0], q =>[0, 1, 0, 0], y =>[0, 1, 0, 1], h => [0, 1, 1, 1], e =>[1, 0, 1, 1], k => [1, 1, 0, 1], a => [1, 1, 1, 1],

which if we calculate according to the number of bits would be equal to 214 bits. (4*47) + (2*13) bits.

Ruby Devs

Written by

Ruby Devs

The things we encounter daily as rubyists and sometimes while switching jobs :p

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade