A Quick Start With Decision Tree
A decision tree is a graphical representation of all the possible ways to reach a decision based on specific conditions. These conditions are usually if-then tests that can generally be answered with yes or no. The larger the tree, the more conditions it contains. Each condition poses a question, and the answers lead us step by step toward the final decision.
Let’s illustrate this with a real-life example:
- Suppose we want to play badminton on a particular day, say Saturday. How will we decide whether to play or not?
- We might go out and check whether it is hot or cold, check the wind speed and the humidity, and see what the weather looks like, i.e. whether it is sunny, cloudy, or rainy. We take all these factors into account to decide whether to play (see the sketch after this list).
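As a rough sketch, the same decision can be written directly as nested if-then rules. The features and thresholds below are made-up assumptions for illustration, not values learned from data:

# A minimal if-then sketch of the badminton decision.
# The features and thresholds are illustrative assumptions only.
def play_badminton(outlook, humidity, wind_speed):
    if outlook == "rainy":
        return "Don't Play"
    if outlook == "sunny":
        # Assume high humidity makes a sunny day uncomfortable for play.
        return "Don't Play" if humidity > 70 else "Play"
    # Cloudy day: assume only a strong wind stops the game.
    return "Don't Play" if wind_speed > 20 else "Play"

print(play_badminton("sunny", humidity=65, wind_speed=10))  # Play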
A tree structure represents the decision with the following components (illustrated in the sketch after this list):
- Decision node: It defines a test on a single attribute.
- Leaf node: It shows the value of the target attribute.
- Edge: It corresponds to one outcome of a split, connecting a test to the next node.
- Path: It is a conjunction of tests, from the root to a leaf, that produces the final decision.
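To see these components in an actual tree, here is a minimal scikit-learn sketch trained on made-up toy weather data; the feature values and labels are hypothetical:

# A minimal scikit-learn sketch on hypothetical toy data.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [humidity %, wind speed km/h]
X = [[65, 10], [80, 25], [50, 5], [90, 30], [70, 12], [85, 20]]
y = ["Play", "Don't Play", "Play", "Don't Play", "Play", "Don't Play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# In the printout, each test line is a decision node, each branch is an
# edge, and each "class: ..." line is a leaf node holding the target value.
print(export_text(tree, feature_names=["humidity", "wind_speed"]))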
Node impurity measures how mixed the target values are within a node. A node is impure if its cases take more than one value for the response (target) variable. A node is pure if all of its instances have the same value for the response variable, i.e. impurity = 0.
These are the two most popular methods for measuring node impurity:
- Entropy
- Gini
Entropy
In a decision tree, entropy measures the disorder or uncertainty in a set of data; in other words, it measures the impurity of that data.
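For a node whose classes occur with probabilities p1, ..., pk, entropy is the sum of -p * log2(p) over those probabilities. The snippet below computes it with NumPy, assuming the target label is stored in the last column of a NumPy array: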
import numpy as np

def get_entropy(data):
    # Assumes the target label is stored in the last column of the array.
    label_col = data[:, -1]
    _, counts = np.unique(label_col, return_counts=True)
    probabilities = counts / counts.sum()
    entropy = np.sum(probabilities * -np.log2(probabilities))
    return entropy
A simple example of entropy:
Consider a bag under two different scenarios:
- Bag A has 100 green balls. Peter wants to choose a green ball from this bag. Here, Bag A has an entropy of 0, because zero impurity means total purity.
- We replace 40 green balls in bag A with red balls, and similarly, we replace 10 green balls with black balls. Now, John wants to choose a green ball from this bag. In this case, the probability of drawing a green ball will drop down from 1.0 to 0.5 due to the increase in the bag’s impurity.
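We can check these numbers with the get_entropy function defined above; the two arrays below are just the bags written out as single-column label arrays:

import numpy as np

# Reusing get_entropy from above; each row is one ball, the column is its colour.
bag_a = np.array([["green"]] * 100)
bag_b = np.array([["green"]] * 50 + [["red"]] * 40 + [["black"]] * 10)

print(get_entropy(bag_a))  # ~0: a pure bag has zero entropy
print(get_entropy(bag_b))  # ~1.36 bits: the mixed bag is more impure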
Information Gain
Information gain is the primary criterion the decision tree algorithm uses to build the tree. The tree is grown so as to maximize information gain, so the attribute with the highest information gain is tested or split on first.
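Concretely, information gain is the entropy of the parent node minus the weighted average entropy of the child nodes produced by a split. The sketch below uses a hypothetical split and reuses get_entropy from above:

import numpy as np

def information_gain(parent, children):
    # Gain = entropy(parent) - weighted average entropy of the children.
    n = len(parent)
    weighted = sum((len(child) / n) * get_entropy(child) for child in children)
    return get_entropy(parent) - weighted

# Hypothetical split of 10 days on some attribute (e.g. outlook).
parent = np.array([["Play"]] * 6 + [["Don't Play"]] * 4)
left = np.array([["Play"]] * 5 + [["Don't Play"]] * 1)
right = np.array([["Play"]] * 1 + [["Don't Play"]] * 3)

print(information_gain(parent, [left, right]))  # ~0.26 bits gained by this split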
Gini Index
Like entropy, the Gini index is another criterion a decision tree can use to decide its splits: the tree measures the impurity of each node with Gini and chooses the split that reduces it the most. For a two-class problem, Gini impurity ranges from 0 (pure) to 0.5 (maximally mixed). In practice, Gini is often preferred over entropy because it avoids the logarithm and is slightly cheaper to compute, although both criteria usually select very similar splits.
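For comparison, here is a minimal Gini impurity function in the same style as get_entropy above (Gini = 1 - sum of squared class probabilities):

import numpy as np

def get_gini(data):
    # Same layout as get_entropy: the target label is the last column.
    label_col = data[:, -1]
    _, counts = np.unique(label_col, return_counts=True)
    probabilities = counts / counts.sum()
    return 1.0 - np.sum(probabilities ** 2)

print(get_gini(np.array([["Play"]] * 5 + [["Don't Play"]] * 5)))  # 0.5, most impure two-class node
print(get_gini(np.array([["Play"]] * 10)))                        # 0.0, a pure node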
I hope this helps you understand the basics of the decision tree and a few related ideas. There are a few more topics, such as detailed entropy calculations and problems to overcome like overfitting, which I will explain in my upcoming blog.
Happy Learning!
References: https://scikit-learn.org/stable/modules/tree.html
Originally published at https://www.numpyninja.com on June 7, 2021.