**Part 2: Information Theory | Statistics for Deep Learning**

Information theory is a branch of applied mathematics, and it is often regarded as a dry topic that only marginally touches machine learning (ML). The field was founded by Claude Shannon in the 1940s to explore the fundamental limits of signal processing.

I almost skipped this topic because of its complexity and the lack of simple explanations. It only started making sense after I went through a number of web links and textbooks, which motivated me to write this post in simple terms. I only intend to offer my two cents on the intuition behind **Information Theory** and its significance in ML, in layman's terms. So I dedicate this post to novices in statistics who have stumbled upon information theory and are struggling to understand it.

Please note that this post assumes prior knowledge of random variables, probability distributions (discrete and continuous) and expected values.

**What is Information theory?**

Information can be thought of as an ordered sequence of symbols that carries meaning. The basic idea of information theory is to quantify and communicate information, with **Entropy** (discussed below) as its central measure.

**Self-information:** It is a measure of the information associated with a single outcome of a discrete random variable. It is defined as the logarithm of the reciprocal of the probability of an event, so the less likely an event is, the more information its occurrence carries. Let’s understand it with an example:

Message 1: “*The sun rises in the east*.” This is an almost certain event, so its probability is close to 1. The amount of information obtained in this case is therefore zero or close to zero (self-information is never negative).

Message 2: “*It is going to rain in the desert tomorrow*.” This is a highly unlikely event, with a much lower probability than the first message. So it gives us more information than the previous one.

Self-information **I** is given by:

$$I(x_i) = \log\frac{1}{p(x_i)} = -\log p(x_i)$$

where **X** is a discrete random variable with possible outcomes **x1**, **x2**, …, **p(xi)** is the probability that outcome **xi** occurs, and **I(xi)** is the amount of information obtained when **xi** occurs.

From the formula above, we can conclude that the information content depends only on the probability **p(xi)** and not on the actual value of the outcome. Self-information is expressed in units of information: bits (also called shannons) for base-2 logarithms, nats for natural logarithms, or hartleys for base-10 logarithms. It deals with a single outcome only.
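As a quick illustration of the formula above, here is a minimal Python sketch. The function name `self_information` and the probability of 0.01 for rain in the desert are assumptions made up for this example, not measured values.

```python
import math

def self_information(p, base=2):
    """Self-information log(1/p) of an event with probability p, 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log(1 / p, base)

# "The sun rises in the east": treated as a certain event here.
certain = self_information(1.0)
# A fair coin landing heads.
coin = self_information(0.5)
# "Rain in the desert tomorrow": 0.01 is an assumed probability.
rare = self_information(0.01)

print(certain, coin, rare)  # 0 bits, 1 bit, and roughly 6.64 bits

# Changing the base only changes the unit: nats (base e) or hartleys (base 10).
coin_nats = self_information(0.5, base=math.e)
```

The rarer the event, the larger the value, which matches the intuition from the two messages.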

**Entropy:**

Entropy (Shannon entropy by default) describes the degree of uncertainty, or disorder, of a random variable that can take multiple states. It is the expected value of the self-information over the values of a discrete random variable, that is, the probability-weighted average of the self-information of its outcomes:

$$H(X) = \sum_i p(x_i)\, I(x_i) = -\sum_i p(x_i) \log p(x_i)$$

This can also be written in terms of expected values as:

$$H(X) = \mathbb{E}_{x \sim P}[I(x)] = -\mathbb{E}_{x \sim P}[\log p(x)]$$

It is directly related to the minimum expected number of binary (yes/no) questions needed to identify the value of the random variable. The larger the entropy, the more questions we have to ask and the more uncertainty there is about the value of the random variable. Entropy gives a lower bound on the average number of bits (*units can vary*) required per symbol to encode symbols drawn from a distribution **P**. Whether the entropy is low or high depends on the distribution: it is low when the outcome is nearly certain and high when the outcomes are close to equally likely.
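To make the link between entropy and uncertainty concrete, here is a small Python sketch. The helper name `entropy` and the example distributions are assumptions chosen for illustration.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    # Outcomes with p == 0 are skipped: p * log(1/p) tends to 0 as p tends to 0.
    return sum(p * math.log(1 / p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])      # maximum uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])    # outcome is fairly predictable
sure_thing = entropy([1.0, 0.0])     # no uncertainty at all
eight_way = entropy([1 / 8] * 8)     # eight equally likely outcomes

print(fair_coin, biased_coin, sure_thing, eight_way)
```

The fair coin comes out at 1 bit, the biased coin at roughly 0.47 bits, and the certain outcome at 0 bits. The eight equally likely outcomes give about 3 bits, which is the number of yes/no questions needed to single out one of eight options by halving the candidates each time.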

**Differential Entropy** is the analogue of entropy for a continuous random variable **x**: unlike the discrete case above, the sum over outcomes is replaced by an integral over the probability density.
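One way to get a feel for differential entropy is to approximate the integral of -p(x) ln p(x) numerically. The sketch below does this for a zero-mean normal distribution and compares the result with the known closed form 0.5 * ln(2 * pi * e * sigma^2). The function names, the value sigma = 0.1 and the integration range are assumptions chosen for this example.

```python
import math

def gaussian_pdf(x, sigma):
    """Density of a zero-mean normal distribution with standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def differential_entropy(pdf, lo, hi, steps=20_000):
    """Approximate -integral of p(x) ln p(x) dx over [lo, hi] with a midpoint sum."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        px = pdf(lo + (k + 0.5) * dx)
        if px > 0:
            total -= px * math.log(px) * dx
    return total

sigma = 0.1  # assumed value, chosen to give a narrow distribution
numeric = differential_entropy(lambda x: gaussian_pdf(x, sigma), -10 * sigma, 10 * sigma)
closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

print(numeric, closed_form)  # both roughly -0.88 nats
```

Note that the result is negative. Unlike the entropy of a discrete variable, differential entropy can be below zero when the density is concentrated in a narrow region.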

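*Usage:* In decision trees, entropy is used to choose the best split at each level: the split that most reduces the entropy of the resulting subsets (that is, the one with the largest information gain) is preferred.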
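A rough sketch of how a split could be scored this way is shown below. The labels and the two candidate splits are toy data invented for this example, and `label_entropy` and `information_gain` are hypothetical helper names. Real decision tree libraries may use other criteria (such as Gini impurity) and different implementations.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in bits) of the class labels in a node."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * label_entropy(child) for child in children)
    return label_entropy(parent) - weighted

# Toy node with 5 "yes" and 5 "no" labels (made-up data).
parent = ["yes"] * 5 + ["no"] * 5

# Candidate split A separates the classes perfectly.
split_a = [["yes"] * 5, ["no"] * 5]
# Candidate split B leaves both children almost as mixed as the parent.
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(gain_a, gain_b)  # split A has the larger gain, so it would be preferred
```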

**Kullback-Leibler (KL) Divergence:**

It measures how different two probability distributions **P** and **Q** over the same random variable **x** are. It is often described as a distance between them, although it is not a true distance metric.

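The formula for *discrete distributions* (binomial, Poisson, geometric, Bernoulli, etc.) is given by:

$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

And for *continuous distributions* (uniform, normal, chi-squared, etc.), with probability densities **p(x)** and **q(x)**, by:

$$D_{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$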

There are three basic differences between a continuous and a discrete probability distribution. First, the probability that a continuous variable takes any specific value is zero. Second, because of this, a continuous probability distribution cannot be expressed in tabular form. Lastly, we need an equation or formula to describe such a distribution, and this equation is called the probability density function.

The significance of this divergence is that it gives the extra amount of information needed to send a message containing symbols drawn from distribution **P** when we use a code designed to minimize the length of messages drawn from distribution **Q**. It is called a divergence rather than a distance because it is asymmetric (in general, the divergence from **P** to **Q** differs from the divergence from **Q** to **P**) and does not satisfy the *triangle inequality*.

It is always non-negative, and it is **0** only when the two distributions are equal (**P** = **Q**) for *discrete variables*, or equal almost everywhere for *continuous variables*. KL divergence can be used in supervised learning when fitting a model to a particular data distribution.
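The properties above (asymmetry, non-negativity, and a value of zero for identical distributions) can be checked with a small Python sketch for the discrete case. The function name `kl_divergence` and the two example distributions are assumptions made up for illustration.

```python
import math

def kl_divergence(p, q, base=2):
    """KL divergence from P to Q for discrete distributions given as aligned lists."""
    # Assumes q[i] > 0 wherever p[i] > 0; otherwise the divergence is infinite.
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # fair coin
q = [0.9, 0.1]  # heavily biased coin

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)

print(kl_pq, kl_qp, kl_pp)  # roughly 0.737 bits, 0.531 bits, and 0
```

Swapping the arguments changes the result, which is the asymmetry mentioned above, and comparing a distribution with itself gives zero.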

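**Cross Entropy**, which is closely related to KL divergence (it equals the entropy of **P** plus the KL divergence from **P** to **Q**), is used to define the *loss function* for *optimization* in machine learning. Because the entropy of **P** does not depend on **Q**, minimizing the cross entropy with respect to **Q** is equivalent to minimizing the KL divergence.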
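The relationship between cross entropy, entropy and KL divergence can be verified numerically. The sketch below uses natural logarithms (nats), as is common for ML loss functions. The helper names, the example distributions and the three-class prediction are assumptions invented for this illustration, not the API of any particular library.

```python
import math

def entropy(p):
    """Shannon entropy of P in nats."""
    return sum(pi * math.log(1 / pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL divergence from P to Q in nats (assumes q[i] > 0 wherever p[i] > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """Cross entropy of Q relative to P in nats."""
    return sum(pi * math.log(1 / qi) for pi, qi in zip(p, q) if pi > 0)

# Check the identity: cross entropy = entropy of P + KL divergence from P to Q.
p = [0.2, 0.5, 0.3]
q = [0.1, 0.7, 0.2]
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)

# As a classification loss: one-hot true label vs. assumed predicted probabilities.
true_label = [0.0, 1.0, 0.0]
predicted = [0.1, 0.7, 0.2]
loss = cross_entropy(true_label, predicted)  # equals -ln(0.7), roughly 0.357

print(lhs, rhs, loss)
```

For a one-hot label the entropy of **P** is zero, so the cross-entropy loss and the KL divergence coincide.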

**This is my limited understanding of the topic so far, and I wish to expand my horizons. Once I have, I will certainly add more content to this article. Thank you for your time. If you spot any incorrect information or would like to share feedback, please use the comments section below.**