Information Theory

Digya Acharya
10 min read · Sep 18, 2019


For most of us, communication is something that comes naturally; it binds living creatures together so we can connect, work and share ideas with countless people around the world. Because modern communication systems are so convenient and widely accessible, we often take them for granted. The ability to select a message at one point and reproduce an exact replica of it at another does not intrigue us. Yet this was the fundamental problem, and an area of intense research, in the 20th century¹. To appreciate how remarkable the evolution of communication and information theory is, let's go back in time and consider telecommunications in the 1940s.

Back then, the telephone system was growing rapidly; however, the major problem was the difference between the received and the transmitted signal. The received signal was not the transmitted signal alone but the transmitted signal plus a bit of noise (Figure 1). When the signal was amplified, the noise was amplified too, resulting in a degraded signal at the receiver's end⁵ (Figure 2).

Figure 1: The receiver received both the source signal and the noise.
Figure 2: The amplifier amplified both signal and noise.

“Before 1948, there was only the fuzziest idea of what a message was. There
was some rudimentary understanding of how to transmit a waveform and
process a received waveform, but there was essentially no understanding of how to turn a message into a transmitted waveform.”
[Gallager, Claude Shannon: A Retrospective, 2001 pg. 2683]²

In 1948, Shannon published the paper "A Mathematical Theory of Communication" in the Bell System Technical Journal and showed how information can be quantified with absolute precision. He believed that all information media (telephone signals, text, radio waves, images, essentially every mode of communication) can be encoded in bits¹. Now, instead of simply amplifying the message, one can read the digitized message of the transmitted signal (a sequence of 0s and 1s) and repeat it exactly at the receiver's side⁵.

In his paper, Shannon introduced four major concepts²:

  • Channel Capacity & The Noisy Channel Coding Theorem
  • Digital Representation
  • Source Coding
  • Entropy & Information Content

Channel Capacity & The Noisy Channel Coding Theorem

To carry any message to the receiver, a medium or channel is required. For instance, students (the receiver) are able to listen to the teacher (the source) in a classroom through the air (the channel). The electromagnetic field can be used as a channel for longer-distance communication. The amount of information that a channel can carry is its capacity, measured in bits per second, though nowadays we more often use units like megabits per second (Mbit/s) or megabytes per second (MB/s).

According to Shannon, every communication channel has a speed limit (also known as the Shannon limit) below which we can transmit information with arbitrarily small error, even over a noisy channel with faint signals. Above the limit, however, error-free communication is impossible: the channel cannot carry information faster than its capacity without losing some of it, no matter how sophisticated the error-correction scheme or how aggressively the data is compressed.

For instance, suppose we want to download images from Google. There is a certain speed limit of the internet connection (the channel) below which we can download the images without any loss of quality. The average rate of transfer can be deduced from the average size of the encoded images and the channel's capacity⁵:

Rate of transfer = Capacity / (Average size)
= (10 Mbit/s) / (1 Mbit/file)
= 10 files/s

We can download the images much faster by improving their encoding. With the best possible encoding, the maximum rate of image transfer is⁵:

Maximum rate of transfer = Capacity/(Shannon’s Entropy)
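As a quick sanity check, here is the same arithmetic as a small Python sketch. The 10 Mbit/s capacity and 1 Mbit average file size come from the example above; the 0.5 Mbit source entropy is an assumed value, used only to illustrate that a better encoding raises the transfer rate.

```python
capacity_mbit_per_s = 10.0    # channel capacity from the example
avg_encoded_size_mbit = 1.0   # average file size with the current encoding
source_entropy_mbit = 0.5     # assumed entropy of the image source

rate = capacity_mbit_per_s / avg_encoded_size_mbit    # files per second now
max_rate = capacity_mbit_per_s / source_entropy_mbit  # files per second with optimal coding

print(f"Rate with current encoding: {rate:.0f} files/s")               # 10 files/s
print(f"Maximum rate with optimal encoding: {max_rate:.0f} files/s")   # 20 files/s
```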

Shannon’s noisy channel coding theorem states that, as long as we transmit below the channel capacity, we can add enough redundancy to reconstruct the information perfectly (with a probability as close to 1 as we like), even over a noisy channel. There is, however, a minimum amount of redundancy required to retrieve the original message as it was. Example: if we take a CD and scratch it, it will still play back perfectly, because the data on it is stored with redundancy. This is similar to the redundancy in English words: whenever we read "she l*v*es dogs", we can easily fill in the blanks.
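To make the idea of redundancy concrete, here is a minimal sketch (not Shannon's construction, just the simplest possible error-correcting code): each bit is repeated three times, and the receiver takes a majority vote, so any single flipped bit within a triple is corrected.

```python
import random

def encode(bits):
    """Repetition code: transmit each bit three times."""
    return [b for bit in bits for b in (bit, bit, bit)]

def noisy_channel(bits, flip_prob=0.1):
    """Flip each transmitted bit independently with probability flip_prob."""
    return [b ^ (random.random() < flip_prob) for b in bits]

def decode(bits):
    """Majority vote over each group of three received bits."""
    return [int(sum(bits[i:i + 3]) >= 2) for i in range(0, len(bits), 3)]

message = [1, 0, 1, 1, 0, 0, 1, 0]
received = noisy_channel(encode(message))
print(decode(received) == message)  # usually True: isolated bit flips are corrected
```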

Digital Representation

Shannon realized that any information source (text, sound, image, video) can be represented as 0s and 1s before being passed to the channel. Thus, the content of the message is irrelevant to its transmission.
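As a tiny illustration (the UTF-8/ASCII encoding here is just the familiar choice, not anything specific to Shannon), here is a piece of text turned into the bits that would actually travel over the channel:

```python
message = "hi"
bits = "".join(format(byte, "08b") for byte in message.encode("utf-8"))
print(bits)  # 0110100001101001: the text as a sequence of 0s and 1s
```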

Source coding

Basically, source coding removes redundancy from the information so that it can be represented as compactly as possible. The optimal source code in common use today, Huffman coding, has an interesting story of invention². Shannon introduced the Shannon-Fano code in 1948, which was not always optimal. So, as an alternative to the final examination, Fano gave his class the option of writing a term paper on finding the most efficient coding method. Just when Huffman, a student of Fano, was about to give up, he realized that an optimal code could be obtained by representing the most frequent symbols with the shortest codewords, building the code tree up from the least frequent symbols.

Suppose a person X loves pizza: he orders it most of the time at a restaurant and never orders any foods other than mo:mo, spaghetti and sandwich. Traditionally, a fixed-length codeword was assigned to each food (say, 00 for pizza, 01 for mo:mo, 10 for spaghetti and 11 for sandwich).

All the codewords are 2 bits long regardless of how common each food is, so the average length of a codeword we send is 2 bits.

Let’s use a Huffman code now: we assign "pizza" the shortest codeword and give longer codewords to the less common foods.

Now the average length of a codeword is 1.75 bits, and we have reached the optimal length.
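The exact numbers depend on how often X orders each food. A distribution consistent with the 1.75-bit figure is pizza 1/2, mo:mo 1/4, spaghetti 1/8 and sandwich 1/8 (an assumption here); under that assumption, the sketch below compares the fixed 2-bit code with the variable-length code 0, 10, 110, 111:

```python
# Assumed ordering frequencies, chosen to reproduce the 1.75-bit average above.
p = {"pizza": 0.5, "mo:mo": 0.25, "spaghetti": 0.125, "sandwich": 0.125}

fixed_code   = {"pizza": "00", "mo:mo": "01", "spaghetti": "10",  "sandwich": "11"}
huffman_code = {"pizza": "0",  "mo:mo": "10", "spaghetti": "110", "sandwich": "111"}

def avg_length(code, probs):
    """Expected number of bits per order."""
    return sum(probs[food] * len(word) for food, word in code.items())

print(avg_length(fixed_code, p))    # 2.0 bits
print(avg_length(huffman_code, p))  # 1.75 bits
```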

Entropy and Information Content

The amount of information that can be sent down a noisy channel depends on the transmit power and the bandwidth¹. We can send information using high power and low bandwidth, or high bandwidth and low power. Traditionally, narrow-band radios were used, which focused all their power into a small range of frequencies and were highly susceptible to interference: extreme power was confined to small portions of the spectrum². Shannon offered a solution to this by quantifying the amount of information in a signal, stating that information is the amount of unexpected data contained in a message¹. He termed the information content of a message 'entropy'. Thus, the more a transmission resembles random noise (unexpected bits), the more information it holds².

Identifying the outcome of a fair coin flip (two equally likely outcomes) provides less information (lower entropy) than specifying the outcome of a roll of a fair die (six equally likely outcomes): 1 bit versus log₂6 ≈ 2.58 bits. Likewise, the entropy of a fresh, ordered deck of cards increases as we shuffle it more and more.

Some history:

“Coding is dead”

In the early 1970s, researchers believed that coding had a limited role, its applications confined to deep-space communication and the military². In 1971, a coding workshop entitled "Future Directions" was held in St. Petersburg, Florida². After spending a day or two discussing future possibilities of coding, the researchers concluded that everything of interest to their community was finished and that there was no future direction, except out². At the conference, Robert McEliece gave an infamous talk entitled "Coding is Dead"²:

The thesis of his talk was that he and other coding theorists formed a small inbred group that had been isolated from reality too long. He illustrated this talk with a single slide showing a pen of rats that psychologists had penned in a confined space for an extensive period of time. I cannot tell you here what those rats were doing, but suffice it to say that the slide has since been borrowed many times to depict the depths of depravity into which a disconnected group can fall. The rats made Lord of the Flies look like a school picnic. All this depraved behavior had its parallel in the activities of coding theorists.
Too many equations had been generated with too few consequences… Coding theorist professors had begotten more coding theory PhDs in their own image…no one else cared; it was time to see this perversion for what it was. Give up this fantasy and take up a useful occupation… Coding is dead.
[Lucky, R. Lucky Strikes Again ’93, Compilation of articles from IEEE Spectrum]

“Coding theory is not dead,” Irwin Jacobs said at the conference as he pulled a 4-bit integrated shift register out of his pocket², “and this is why.”

The growing availability of hardware technology paved the way for coding's revival.

Now, let’s come back to the present :D

Calculating entropy

We know that a highly improbable event carries more information than a very likely event. Therefore, our measure of information content h(x) should depend on the probability distribution p(x), and we look for an h(x) that is a monotonic function of p(x)⁶. If two events x and y are unrelated (independent), the total information gained from observing both should be the sum of the information gained from each, while their joint probability factorizes⁶:

h(x, y) = h(x) + h(y)
p(x, y) = p(x) · p(y)

The logarithm satisfies both requirements, which gives

h(x) = −log p(x)

The negative sign ensures that the information is positive or zero⁶, since p(x) ≤ 1. Shannon took the logarithm to be binary, in which case h(x) is measured in bits, but other bases can be used as well.

Earlier, we discussed that there is a fundamental limit on how short the average message for communicating events can be made. This limit, the average message length using the best possible code, is called the entropy of p, written H(p). It is obtained by taking the expected value of h(x) under the distribution p(x)⁶:

H(p) = −Σ p(x) log p(x)
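Here is a small sketch that computes H(p) for the examples mentioned so far; the pizza distribution is the assumed one from the source-coding section above.

```python
from math import log2

def entropy(probs):
    """H(p) = -sum p(x) log2 p(x), in bits; terms with p(x) = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                 # fair coin: 1.0 bit
print(entropy([1/6] * 6))                  # fair die: ~2.585 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))  # assumed pizza distribution: 1.75 bits
```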

Entropy in thermodynamics:

Ludwig Boltzmann shook the world of physics in 1877 by forging a link between a property of the microscopic world (multiplicity) and properties of the macroscopic world (energy, entropy and equilibrium constants), which lent strong support to the atomic theory.

Consider a set of N identical objects that are to be divided amongst a set of bins, with nᵢ objects in the ith bin. There are N ways to choose the first object, (N − 1) ways to choose the second, and so on, leading to a total of N! ways to allocate all N objects to the bins, where N! denotes the product N × (N − 1) × · · · × 2 × 1. However, rearrangements of objects within a bin are indistinguishable, and in the ith bin there are nᵢ! ways of reordering the objects. The total number of distinct ways of allocating the N objects to the bins is therefore⁶:

W = N! / (n₁! n₂! · · ·)

which is called the multiplicity. The entropy is now defined as the logarithm of the multiplicity scaled by 1/N⁶:

H = (1/N) ln W = (1/N) ln N! − (1/N) Σᵢ ln nᵢ!

If we consider the limit N → ∞, with the fractions nᵢ/N held fixed, and apply Stirling's approximation ln N! ≈ N ln N − N⁶, we get:

H = −Σᵢ (nᵢ/N) ln(nᵢ/N)

where we have used the fact that the bin counts sum to N, that is, Σᵢ nᵢ = N.
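As a quick numerical check, here is a small sketch showing that (1/N) ln W, computed directly from the factorials, approaches −Σᵢ pᵢ ln pᵢ as N grows (the distribution used is just the pizza example from earlier):

```python
from math import lgamma, log

def ln_multiplicity(counts):
    """ln W = ln(N!) - sum_i ln(n_i!), computed via log-gamma for numerical stability."""
    n_total = sum(counts)
    return lgamma(n_total + 1) - sum(lgamma(n + 1) for n in counts)

p = [0.5, 0.25, 0.125, 0.125]
for n_total in (8, 80, 8000, 800000):
    counts = [int(pi * n_total) for pi in p]
    print(n_total, ln_multiplicity(counts) / n_total)

print(-sum(pi * log(pi) for pi in p))  # limiting value: ~1.213 nats
```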

Entropy in Big bang theory (Arrow of time):

Entropy is the only quantity in the physical sciences that picks out a particular direction in time, sometimes called an arrow of time. As one goes "forward" in time, the second law of thermodynamics says, the entropy of an isolated system can increase but not decrease. Just as a newly purchased deck of cards is perfectly ordered, our universe started in a low-entropy state; its entropy must increase over time, eventually resulting in the heat death of the universe.

Figure 3: Universe expanding to the high entropy state. Image source: http://www.science4all.org/article/entropy/

Cross entropy

Now, let’s get back to the example where person X loves pizza. Suppose he has a friend Y whose favorite food is mo:mo.

The two of them eat the same foods, just at different frequencies. Initially, Y ordered food using X's code. However, his codewords came out longer on average: X's code is optimized for X's probability distribution⁴, and Y has a different distribution for which that code is suboptimal. The average length of a codeword when X uses his own code is 1.75 bits, while when Y uses X's code it is 2.25 bits. It could have been even worse if the two weren't so similar!⁵

This length — the average length of communicating an event from one distribution with the optimal code for another distribution — is called cross-entropy⁵.

Thus, Y should be assigned a different code, matched to his own distribution, to make his average order length optimal.

So, now we have four possibilities:

  • When X uses his own code: H(p) = 1.75 bits
  • When Y uses X’s code: Hp(q) = 2.25 bits
  • When Y uses his own code: H(q) = 1.75 bits
  • When X uses Y’s code: Hq(p) = 2.375 bits

Therefore, cross-entropy is not symmetric: Hp(q)≠Hq(p)⁵.
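Here is a sketch that reproduces all four numbers. The distributions are assumptions (they are not stated explicitly above, but they are consistent with the 1.75, 2.25 and 2.375-bit figures): X orders pizza, mo:mo, spaghetti and sandwich with probabilities 1/2, 1/4, 1/8, 1/8, and Y with probabilities 1/8, 1/2, 1/4, 1/8.

```python
from math import log2

# Assumed distributions over (pizza, mo:mo, spaghetti, sandwich),
# chosen to reproduce the figures quoted above.
p = [0.5, 0.25, 0.125, 0.125]   # X's ordering frequencies
q = [0.125, 0.5, 0.25, 0.125]   # Y's ordering frequencies

def cross_entropy(freq, code_dist):
    """Average bits when events drawn from `freq` are sent with the optimal code for `code_dist`.

    With an ideal code, the codeword for an event of probability r is about
    -log2(r) bits long, so the average length is -sum freq(x) * log2(code_dist(x)).
    """
    return -sum(f * log2(c) for f, c in zip(freq, code_dist))

print(cross_entropy(p, p))  # H(p)  = 1.75  bits
print(cross_entropy(q, p))  # Hp(q) = 2.25  bits
print(cross_entropy(q, q))  # H(q)  = 1.75  bits
print(cross_entropy(p, q))  # Hq(p) = 2.375 bits
```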

Let’s conclude:

We have learned about information theory and entropy, seen some interesting concepts from Shannon's paper, and looked at how the principle of entropy applies to other fields, all the way back to the big bang. There are more interesting concepts built on entropy, such as mutual information, KL divergence and Jensen-Shannon distance, which are explained wonderfully here.

References:

  1. Shannon, C. E.: A Mathematical Theory of Communication.
  2. Aftab et al.: Information Theory.
  3. DeDeo, Simon: Information Theory for Intelligent People.
  4. Olah, Christopher (colah): Visual Information Theory.
  5. Science4All: Shannon's Information Theory.
  6. Bishop, Christopher: Pattern Recognition and Machine Learning.
