Greedy Huffman in Ruby
As we all know Huffman Coding is a coding technique used to do lossless(which can be recovered fully unlike lossy compression which in contrast recovers data approximately)compression of the data.
In this we are generally given a set of n characters(c) from alphabets A, where c belongs to A, their frequency is also given f(c), and we have to find a binary code for each character such that summation of frequency with character bits is minimum.
For example if we are given a set of characters, a(10), c(13) , e(15), y(3), q(2), h(8), k(9), with their frequency in brackets.
If we are not using any sort of compression, then these 7 characters can be represented by 3 bit codes, so that the total number of bits taken would be equal to (10+13+15+3+2+8+9) * 4= 240 characters.
But if we are to use Huffman encoding, which is based on greedy methodology, meaning that we can get a locally optimal solution and the locally optimally solution of each subproblem step by step gives us the complete and correct solution. Since we are compressing the data, we also have to make sure now that we maintain a proper prefix code system(meaning there is no whole code word in system which works as prefix of another word),
For e.g say ‘01’ represents a, ‘010’ represents b, then it violates the prefix code, as we won’t be able to tell while decompressing the file, what to decompress say ‘01010’ too. So we have to ensure that we have a prefix code applied.
We can implement this technique with the help of two data structures
- Priority Queues
- Binary Trees
For Priority Queue implementation, we are using min-heap, you can check this:
But the reason we are using min-heap, is because in this greedy approach we are taking the two smallest elements from our priority queue and we will merge them to form a node of a tree. Now while traversing that tree, we will assign 0 to the node when we move left and 1 to the node when we move right.
The path from root to the leaf , will then determine the code we need to encode the data.
This will print this:
c =>[0, 0], q =>[0, 1, 0, 0], y =>[0, 1, 0, 1], h => [0, 1, 1, 1], e =>[1, 0, 1, 1], k => [1, 1, 0, 1], a => [1, 1, 1, 1],
which if we calculate according to the number of bits would be equal to 214 bits. (4*47) + (2*13) bits.
