# What is Entropy and why does Information gain matter in Decision Trees?

According to Wikipedia,

*Entropy* refers to disorder or uncertainty.

**Definition:** *Entropy* is the measure of **impurity**, **disorder**, or **uncertainty** in a bunch of examples.

## What does Entropy actually do?

*Entropy controls how a Decision Tree decides to **split** the data. It actually affects how a **Decision Tree** draws its boundaries.*

*The Equation of Entropy:*

`Entropy = - Σ p(i) * log2( p(i) )`

where `p(i)` is the fraction of examples belonging to class `i`.

# What is Information gain and why does it matter in Decision Trees?

**Definition:** *Information gain (IG)* measures how much “information” a feature gives us about the class.

*Why does it matter?*

**Information gain** is the main criterion used by **Decision Tree algorithms** to construct a Decision Tree. The algorithm will always try to maximize **Information gain**: the **attribute** with the highest **Information gain** is tested/split first.

*The Equation of Information gain:*

`Information gain = Entropy(parent) - [weighted average] Entropy(children)`

**To understand Entropy and Information gain, let's draw a simple table with some features and labels.**

Here in this table, `Grade`, `Bumpiness`, and `SpeedLimit` are the features and `Speed` is the label. There are four observations in total.

*First, let's work with the `Grade` feature.*

In the `Grade` column there are four values, and corresponding to those values there are four labels. Let's consider all the labels as a parent node.

`SSFF => parent node`

(Here `S` stands for a slow example and `F` for a fast one.)

So, what is the entropy of this parent node?

Let's find out. First, we need to find the *fraction of examples* that are present in the parent node. There are two types of examples *(slow and fast)* in the parent node, and the parent node contains four examples in total.

```
1. P(slow) => fraction of slow examples in parent node
2. P(fast) => fraction of fast examples in parent node
```

Let's find `P(slow)`:

`P(slow) = no. of slow examples in parent node / total no. of examples = 2/4 = 0.5`

Similarly, the fraction of fast examples `P(fast)` will be `2/4 = 0.5`.

So, the **entropy** of the parent node:

```
Entropy(parent) = - {0.5 log2(0.5) + 0.5 log2(0.5)}
                = - {-0.5 + (-0.5)}
                = 1
```

So the *entropy* of the parent node is `1`.
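The same calculation can be checked in Python with `scipy.stats.entropy` (the library mentioned later in this article):

```python
from scipy.stats import entropy

# Class counts in the parent node SSFF: 2 slow, 2 fast.
# scipy normalizes the counts to probabilities automatically;
# base=2 gives the result in bits.
parent = entropy([2, 2], base=2)
print(parent)  # 1.0
```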

Now, let's explore how a Decision Tree algorithm constructs a Decision Tree based on Information gain.

First, let's check whether the parent node should be split by `Grade` or not.

If the **Information gain** from the `Grade` feature is greater than that from all other features, then the parent node can be split by `Grade`. To find the *Information gain* of the `Grade` feature, we need to virtually split the parent node by the `Grade` feature.

Now we need to find the entropy of both of these child nodes. Splitting by `Grade` puts three examples (`SSF`) in the left child and one (`F`) in the right child.

The **entropy** of the right child node `(F)` is `0`, *because all of the examples in this node belong to the same class*.

Let's find the **entropy** of the left child node, `SSF`. In this node there are two types of examples present, so we need to find the **fraction of slow and fast examples** separately for this node:

```
P(slow) = 2/3 = 0.667
P(fast) = 1/3 = 0.333
```

So,

```
Entropy(SSF) = - {0.667 log2(0.667) + 0.333 log2(0.333)}
             = - {-0.39 + (-0.53)}
             ≈ 0.9
```

We can also find the *entropy* by using the `scipy` library.
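For instance, the entropy of the `SSF` node follows directly from its class counts (2 slow, 1 fast); `scipy.stats.entropy` normalizes the counts itself:

```python
from scipy.stats import entropy

# Left child node SSF: 2 slow examples, 1 fast example.
# base=2 returns the entropy in bits.
e_ssf = entropy([2, 1], base=2)
print(round(e_ssf, 3))  # 0.918
```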

Now we need to find `Entropy(children)` as a weighted average.

```
Total number of examples in parent node:      4
Total number of examples in left child node:  3
Total number of examples in right child node: 1
```

*Formula of Entropy(children) with weighted avg.:*

```
[Weighted avg] Entropy(children) =
      (no. of examples in left child node  / total no. of examples in parent node) * (entropy of left node)
    + (no. of examples in right child node / total no. of examples in parent node) * (entropy of right node)
```

So, Entropy(children) with weighted avg. = `3/4 * 0.9 + 1/4 * 0` = **0.675**

So,

```
Information gain(Grade) = 1 - 0.675
                        = 0.325
```

The **Information gain** from the `Grade` feature is `0.325`. The **Decision Tree algorithm** chooses the split with the highest Information gain to *construct* the **Decision Tree**, so we need to check all the features before splitting the tree.
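Putting the steps together, the gain calculation can be sketched as one small function. This is only a sketch: the name `information_gain` and its count-based interface are my own choices, not part of scipy or any tree library:

```python
from scipy.stats import entropy

def information_gain(parent_counts, children_counts):
    """Information gain of a split, given class counts per node.

    parent_counts:   class counts in the parent node, e.g. [2, 2] for SSFF.
    children_counts: one list of class counts per child node,
                     e.g. [[2, 1], [0, 1]] for the SSF | F split on Grade.
    """
    n = sum(parent_counts)
    parent_entropy = entropy(parent_counts, base=2)
    # Weighted average of the children's entropies, weighted by node size.
    children_entropy = sum(
        sum(c) / n * entropy(c, base=2) for c in children_counts
    )
    return parent_entropy - children_entropy

# The Grade split: parent SSFF -> children SSF and F.
print(round(information_gain([2, 2], [[2, 1], [0, 1]]), 3))  # 0.311
```

With full precision this comes out as ≈ 0.311 rather than 0.325, because the hand calculation above rounds the `SSF` entropy down to 0.9 first; the comparison between features is unaffected.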

## Information gain from `Bumpiness`

The **entropy** of the left and right child nodes is the same, because they contain the same classes: **entropy(bumpy)** and **entropy(smooth)** both equal `1`.

So, the **entropy(children)** with weighted avg. for `Bumpiness`:

```
[weighted avg.] entropy(children) = 2/4 * 1 + 2/4 * 1
                                  = 1
```

Hence,

```
Information gain(Bumpiness) = 1 - 1
                            = 0
```

So far we have these **Information gains**:

```
IG(Grade)     => 0.325
IG(Bumpiness) => 0
```

## Information gain from `SpeedLimit`

The **entropy** of the left child node will be `0`, *because all of the examples in this node belong to the same class.* Similarly, the **entropy** of the right child node is `0`.

Hence, **Entropy(children)** with weighted avg. for `SpeedLimit`:

```
[weighted avg.] entropy(children) = 2/4 * 0 + 2/4 * 0
                                  = 0
```

So, the **Information gain** from `SpeedLimit`:

```
Information gain(SpeedLimit) = 1 - 0
                             = 1
```

## Final Information gain from all the features:

```
IG(Grade)      => 0.325
IG(Bumpiness)  => 0
IG(SpeedLimit) => 1
```
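As a sanity check, all three gains can be computed in one go with `scipy`. The class counts below simply encode the splits described above (SSF | F for `Grade`, mixed SF | SF for `Bumpiness`, pure SS | FF for `SpeedLimit`):

```python
from scipy.stats import entropy

parent = entropy([2, 2], base=2)  # SSFF parent node -> 1 bit

# Weighted child entropies for each split.
grade      = 3/4 * entropy([2, 1], base=2) + 1/4 * entropy([0, 1], base=2)
bumpiness  = 2/4 * entropy([1, 1], base=2) + 2/4 * entropy([1, 1], base=2)
speedlimit = 2/4 * entropy([2, 0], base=2) + 2/4 * entropy([0, 2], base=2)

print("IG(Grade)      =>", round(parent - grade, 3))
print("IG(Bumpiness)  =>", round(parent - bumpiness, 3))
print("IG(SpeedLimit) =>", round(parent - speedlimit, 3))
```

Note that `Grade` comes out as ≈ 0.311 with full precision; the `0.325` above reflects rounding the child entropy to 0.9 before the weighted average. The ranking of the features, which is all the algorithm cares about, is the same either way.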

As we know,

a Decision Tree algorithm constructs a Decision Tree based on the feature that has the highest Information gain.

Here we can see that `SpeedLimit` has the highest *Information gain*, so the final **Decision Tree** for this dataset will split on `SpeedLimit` first.
