C program for Text Compression using Huffman Coding

Adeesha Savinda de Silva
Nov 15, 2020



This method of coding was introduced to the world in 1951, while David A. Huffman was a Ph.D. student at MIT. Given the task of finding an efficient way of coding, he came up with the idea of using a frequency-sorted binary tree and proved that it produces an optimal code. He published the method in 1952 in a paper titled “A Method for the Construction of Minimum-Redundancy Codes”, which can be found at this link: http://compression.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf.

You can find my implementation of Huffman coding for text compression at my repository.
link: https://github.com/adeesha-savinda/huffman-encode-decode

Huffman coding is based on the idea of representing the symbols that occur most frequently with the fewest bits and the symbols that occur least frequently with the most bits. Before getting on to the action, a few core concepts need to be covered; they are described below. If you are already comfortable with them, feel free to skip to the practical example section.

Data Compression

In communication, data compression is a form of source coding in which the number of bits used to represent the data is reduced. The overall aim is to encode the data with fewer bits than the original representation requires. This coding is usually done at the source, before the data is transmitted.
Consider five symbols A, B, C, D and E to be transmitted, with probabilities of occurrence 0.5, 0.2, 0.1, 0.1 and 0.1 respectively. Since there are 5 symbols (letters), a fixed-length code needs at least three (3) bits to represent one symbol. But if we use the Huffman coding method, this can be reduced as follows.

  • A — 000 can be represented as A — 0
  • B — 001 can be represented as B — 10
  • C — 010 can be represented as C — 110
  • D — 011 can be represented as D — 1110
  • E — 100 can be represented as E — 1111

Thus the overall number of bits required to represent the text is reduced. For example, the message “ABCDE” is reduced from

“000 001 010 011 100” (15 bits) to “0 10 110 1110 1111” (14 bits). The saving on this short message is only 1 bit, but on average the Huffman code needs 0.5×1 + 0.2×2 + 0.1×3 + 0.1×4 + 0.1×4 = 2 bits per symbol instead of 3, because the frequent symbol A dominates.

The idea is to replace the most frequently occurring symbols with a small number of bits and the least frequently occurring symbols with a larger number of bits. If you examine the codes above closely, you will find that no code is a prefix of any other code. This prefix-free property is what lets the encoded bit stream be decoded unambiguously, without any separators between symbols.
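To make the arithmetic concrete, here is a toy C snippet (my own sketch, not code from the linked repository) that encodes the message “ABCDE” with both the fixed 3-bit code and the Huffman code listed above, and compares the bit counts.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* One valid Huffman code for P(A)=0.5, P(B)=0.2, P(C)=P(D)=P(E)=0.1 */
    const char *huff[] = { "0", "10", "110", "1110", "1111" };
    const char *message = "ABCDE";

    size_t fixed_bits = 0, huff_bits = 0;
    for (const char *p = message; *p; p++) {
        fixed_bits += 3;                       /* fixed-length: 3 bits per symbol */
        huff_bits  += strlen(huff[*p - 'A']);  /* variable-length Huffman code    */
        printf("%c -> %s\n", *p, huff[*p - 'A']);
    }
    printf("fixed: %zu bits, Huffman: %zu bits\n", fixed_bits, huff_bits);
    return 0;
}

Running it prints 15 bits for the fixed code and 14 bits for the Huffman code, matching the counts above.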

Lossless vs Lossy Compression

The result of data compression can be either lossy or lossless. In lossy compression, some of the less important data is discarded to reduce the number of bits, so only an approximation of the original data can be recovered after decoding; a loss of data occurs. Video and image compression algorithms typically work this way.
In lossless data compression, on the other hand, no information is lost: the original data can be reconstructed exactly from the compressed data. Lossless compression is used in many applications; for example, it is used in the ZIP file format. Huffman coding is a lossless method.

Steps to build the Huffman tree

A Huffman tree is used to generate codes that are distinct from each other, in the sense that no code is a prefix of another. Using this method, the most frequently occurring symbols get the fewest bits, and the remaining symbols get longer codes accordingly. A minimal C sketch of these steps is given after the list.

  1. List all the distinct symbols together with their frequencies or probabilities.
  2. Take the two symbols (or nodes) with the lowest probabilities and create a new common parent node whose probability is the sum of the two. Always combine the two lowest-valued nodes.
  3. Add the new node to the table in place of the two nodes it replaces.
  4. Repeat steps two and three until only one node remains.
  5. Assign ‘0’ to every left branch and ‘1’ to every right branch (or the other way around, as long as you are consistent).
  6. Read the bits from the root of the tree down to each symbol and record its particular bit pattern.
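The sketch below is a minimal, independent C implementation of these steps (it is not the program from the linked repository). It builds a Huffman tree for a hard-coded frequency table matching the A–E example from earlier and prints the code for each symbol. Ties between equal frequencies can be broken either way, so the exact codes printed may differ from the ones listed earlier while still being equally optimal.

#include <stdio.h>
#include <stdlib.h>

struct node {
    unsigned char symbol;        /* leaf symbol (unused for internal nodes) */
    unsigned freq;               /* frequency/weight of this subtree        */
    struct node *left, *right;
};

static struct node *new_node(unsigned char sym, unsigned freq,
                             struct node *l, struct node *r)
{
    struct node *n = malloc(sizeof *n);   /* error handling omitted for brevity */
    n->symbol = sym;
    n->freq = freq;
    n->left = l;
    n->right = r;
    return n;
}

/* Steps 2-4: repeatedly merge the two lowest-frequency nodes. */
static struct node *build_tree(struct node **pool, int count)
{
    while (count > 1) {
        int lo1 = 0, lo2 = 1;
        if (pool[lo2]->freq < pool[lo1]->freq) { lo1 = 1; lo2 = 0; }
        for (int i = 2; i < count; i++) {
            if (pool[i]->freq < pool[lo1]->freq) { lo2 = lo1; lo1 = i; }
            else if (pool[i]->freq < pool[lo2]->freq) { lo2 = i; }
        }
        /* Step 3: replace the two lowest nodes with their combined parent. */
        struct node *parent = new_node(0, pool[lo1]->freq + pool[lo2]->freq,
                                       pool[lo1], pool[lo2]);
        pool[lo1] = parent;
        pool[lo2] = pool[count - 1];
        count--;
    }
    return pool[0];
}

/* Steps 5-6: walk from the root, appending '0' for left and '1' for right. */
static void print_codes(const struct node *n, char *prefix, int depth)
{
    if (!n->left && !n->right) {          /* reached a leaf: print its code */
        prefix[depth] = '\0';
        printf("%c : %s\n", n->symbol, prefix);
        return;
    }
    prefix[depth] = '0';
    print_codes(n->left, prefix, depth + 1);
    prefix[depth] = '1';
    print_codes(n->right, prefix, depth + 1);
}

int main(void)
{
    /* Step 1: the symbols and their frequencies (here 5:2:1:1:1, i.e. the
       probabilities 0.5, 0.2, 0.1, 0.1, 0.1 from the earlier example). */
    unsigned char symbols[] = { 'A', 'B', 'C', 'D', 'E' };
    unsigned freqs[]        = {  5,   2,   1,   1,   1  };
    struct node *pool[5];

    for (int i = 0; i < 5; i++)
        pool[i] = new_node(symbols[i], freqs[i], NULL, NULL);

    char prefix[64];
    print_codes(build_tree(pool, 5), prefix, 0);
    return 0;
}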

A Practical Example

Enough of the theory. Let's get to work. Consider the following sentence.

“THIS IS COMMUNICATION ENGINEERING”

The following table represents the letters and their occurrences in the above sentence. You can use tally marks to count the number of occurrences.

Figure 1 — Tabulated data of the number of occurrences

After creating the table of occurrences, the Huffman tree can be created as shown in the following diagram. Arrange all the letters in ascending order of their frequency of occurrence.

Note: Space is denoted as < > in the following diagram.

Figure 2 — Huffman Tree from the above table of occurrences

After generating the Huffman tree, the code for each symbol or letter is determined by following the branches from the top of the tree to the bottom, concatenating a 1 (one) or 0 (zero) to the code at each branch. This is shown in Figure 3 below, which determines the code representing T.

Figure 3 — Determine the code for symbol T using the Huffman tree

In the original text “THIS IS COMMUNICATION ENGINEERING”, there are 33 characters. ASCII coding uses 8 bits (1 byte) to represent one letter, so encoding this text in ASCII takes a total of 264 bits (i.e. 33 × 8 bits). Using Huffman coding, this is reduced to 119 bits; the compressed message is 45.08% of the size of the original. The representation of the symbols is shown in Figure 4 below.

Figure 4 — ASCII vs Huffman Code comparison

The following Figure 5 shows how the message is encoded and the number of bits for each symbol.

Figure 5 — How the message is encoded

The C Program Implementation

This implementation of text compression using Huffman coding consists of two programs: one is used to encode and the other to decode.

Full code: https://github.com/adeesha-savinda/huffman-encode-decode.git

The encode program accepts a text file as an argument. The file is compressed using Huffman coding, and two output files are produced: one containing the frequency table and one containing the compressed data, with .table and .huffman as their file extensions respectively.

The decode program accepts the above two files. The frequency-table file states how often each character occurs in the original text, similar to the table shown in Figure 1. This file is used to regenerate the Huffman tree, which is then used to decode the compressed .huffman file back to the original text.

Note: This program only supports the ASCII character set

The full Huffman encode program is available in the repository linked above.
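To give a feel for what the encoder does, here is a small, independent sketch of its first stage: counting how often each byte occurs in the input file and writing the counts out as a frequency table. The one-"value count"-per-line output format used here is only an assumption for illustration; the actual .table format produced by the repository's program may differ.

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <input.txt>\n", argv[0]);
        return 1;
    }

    FILE *in = fopen(argv[1], "rb");
    if (!in) {
        perror("fopen");
        return 1;
    }

    /* Count the occurrences of every possible byte value. */
    unsigned long freq[256] = { 0 };
    int c;
    while ((c = fgetc(in)) != EOF)
        freq[c]++;
    fclose(in);

    /* Emit one "<byte value> <count>" pair per line for every byte that
       actually appears; the decoder can rebuild the same tree from this. */
    for (int i = 0; i < 256; i++)
        if (freq[i] > 0)
            printf("%d %lu\n", i, freq[i]);

    return 0;
}

The real encoder would then build the Huffman tree from these counts (as in the sketch shown earlier), derive a code for each character, and pack the codes bit by bit into the .huffman output file.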

The full Huffman decode program is also available in the repository.
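As an illustration of the core decoding step, the sketch below (again my own, not the repository's program) walks the Huffman tree bit by bit: go left on '0', right on '1', and emit a symbol every time a leaf is reached. A tiny tree for the A–E example is hard-coded here; the real decoder first rebuilds the tree from the .table file and reads the bits from the .huffman file.

#include <stdio.h>
#include <stdlib.h>

struct node {
    char symbol;                  /* meaningful only at leaves */
    struct node *left, *right;
};

static struct node *leaf(char s)
{
    struct node *n = calloc(1, sizeof *n);   /* error handling omitted */
    n->symbol = s;
    return n;
}

static struct node *internal(struct node *l, struct node *r)
{
    struct node *n = calloc(1, sizeof *n);
    n->left = l;
    n->right = r;
    return n;
}

int main(void)
{
    /* Hard-coded tree for the codes A=0, B=10, C=110, D=1110, E=1111. */
    struct node *root =
        internal(leaf('A'),
                 internal(leaf('B'),
                          internal(leaf('C'),
                                   internal(leaf('D'), leaf('E')))));

    const char *bits = "01011011101111";    /* "ABCDE" encoded */

    /* Walk the tree: left on '0', right on '1'; emit a symbol at each leaf. */
    const struct node *cur = root;
    for (const char *p = bits; *p; p++) {
        cur = (*p == '0') ? cur->left : cur->right;
        if (!cur->left && !cur->right) {
            putchar(cur->symbol);
            cur = root;
        }
    }
    putchar('\n');                          /* prints: ABCDE */
    return 0;
}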
