Decision Tree - For Beginners
· A decision tree is a graphical representation of all possible solutions to a decision based on certain conditions.
· Decision Trees are versatile machine learning algorithms that can perform both classification and regression tasks, and even multioutput tasks. They are most widely used in supervised learning.
· They are powerful algorithms, capable of fitting complex datasets.
· Decision Trees are commonly built with the CART (Classification and Regression Trees) algorithm (a quick code sketch of a tree in action follows below).
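To make this concrete before the theory, here is a minimal sketch of a decision tree classifier using scikit-learn, whose trees are based on an optimised CART implementation. This assumes scikit-learn is installed; the dataset and variable names are purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a CART-based decision tree classifier and check held-out accuracy.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

Swapping DecisionTreeClassifier for DecisionTreeRegressor handles regression tasks in the same way.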
Decision Tree Terminology
Before we get into how a decision tree works, we need to understand some of its terminology.
· Root node: This is the topmost node, from which the decision tree is constructed. It represents the entire population/sample and gets divided into two or more nodes (the purple root node in the figure above).
· Parent/Child node: The root node is a parent node, and every node branched from a parent node is called a child node.
· Branch/Sub-tree: A subsection of the tree, formed by splitting a node.
· Splitting: Dividing a node into two or more sub-nodes on the basis of some condition.
· Leaf node: A node that cannot be divided further, e.g. the orange (lowermost) nodes in the figure above.
· Pruning: Removing unwanted branches from a tree.
CART Algorithm
The CART (Classification and Regression Trees) algorithm is structured as a sequence of questions, the answers to which determine what the next question, if any, should be. The result of these questions is a tree-like structure whose ends are leaf/terminal nodes, at which there are no more questions.
Construction of a decision tree depends on the following.
· Entropy: This is a measure of randomness (impurity) in the data.
· Formula: Entropy(S) = -P(yes)·log2(P(yes)) - P(no)·log2(P(no))
S is the total sample space
P(yes) is the probability of yes and P(no) is the probability of no
If the number of yes equals the number of no, i.e. P(yes) = P(no) = 0.5, then Entropy(S) = 1
If the data contains only yes or only no, i.e. P(yes) = 1 or 0, then Entropy(S) = 0
· Information Gain: Information Gain measures the reduction in entropy. It decides which attribute is selected for splitting at each step while building the decision tree, starting from the root node.
Formula: Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
where (Weighted Avg) * Entropy(each feature) is the information of each feature.
The tree has to be simple to understand, and to achieve that we need to choose splits that are pure. A commonly used measure of purity is information: the information value tells us how much information a feature gives us about the class. The split with the highest information gain is selected as the first split, and the process continues down the tree.
· Gini Impurity: First, let's understand pure vs. impure: pure means the data in the selected sample all belong to the same class, while impure means the data contain a mixture of different classes.
Gini Impurity measures the likelihood that a randomly chosen sample would be incorrectly classified. Formula: Gini(S) = 1 - Σ P(i)², summed over the classes i. If the dataset is pure, the likelihood of incorrect classification is 0; it increases as the classes get more mixed (all three measures are computed in the sketch below).
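To make these three measures concrete, here is a small from-scratch sketch using only Python's standard library; all function and variable names are my own for illustration. It reproduces the special cases listed for entropy and shows an information-gain computation on a toy split.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of P(class) * log2(P(class))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(S) = 1 - sum over classes of P(class)**2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy(S) minus the weighted average entropy after splitting on a feature."""
    n = len(labels)
    weighted = 0.0
    for v in set(feature_values):
        subset = [lab for lab, f in zip(labels, feature_values) if f == v]
        weighted += (len(subset) / n) * entropy(subset)
    return entropy(labels) - weighted

# The special cases from above:
print(entropy(["yes"] * 5 + ["no"] * 5))  # balanced yes/no -> 1.0
print(entropy(["yes"] * 10))              # pure sample -> 0.0 (printed as -0.0)
print(gini(["yes"] * 10))                 # pure sample -> 0.0

# A feature that separates the classes perfectly recovers all the entropy,
# so its information gain equals Entropy(S) = 1.0 here.
labels = ["yes", "yes", "no", "no"]
feature = ["a", "a", "b", "b"]
print(information_gain(labels, feature))  # -> 1.0
```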
How to construct a decision tree
· Find out the entropy of the whole dataset using the dependent (target) variable.
· Find the information gain of each of the features (independent variables), and select the feature with the highest information gain as the root node.
· Split on that feature and repeat the steps on each branch to obtain the whole decision tree (a sketch of this with scikit-learn follows below).
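As a sketch of these construction steps in practice (again assuming scikit-learn is installed), setting criterion="entropy" makes the library choose splits by information gain, and export_text prints the question asked at each node of the fitted tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# criterion="entropy" selects splits by information gain.
clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(data.data, data.target)

# Print the condition tested at each node of the learned tree.
print(export_text(clf, feature_names=list(data.feature_names)))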
Now that you have built the tree, it's time to test it on the test data, and you may find that the tree has actually overfitted the training data; this is one of the major disadvantages of a decision tree. To reduce overfitting, we remove branches that make use of features of low importance. This method is called pruning (or post-pruning); it reduces the complexity of the tree and improves the test accuracy. Another method, called early stopping or pre-pruning, tries to stop the tree-building process early, before the tree is fully grown. One problem with pre-pruning is that if we stop too early we may underfit the data. Both approaches are sketched below.
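Here is a sketch of both pruning styles with scikit-learn; the specific parameter values (max_depth=4, min_samples_leaf=5, ccp_alpha=0.01) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unrestricted tree: likely to overfit the training data.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning (early stopping): limit depth and leaf size before growing.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                             random_state=0).fit(X_train, y_train)

# Post-pruning: grow fully, then prune via cost-complexity (ccp_alpha).
post = DecisionTreeClassifier(ccp_alpha=0.01,
                              random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, model.score(X_test, y_test))
```

In practice, ccp_alpha is usually tuned by cross-validation; scikit-learn's cost_complexity_pruning_path method returns the candidate values to search over.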
Advantages of Decision Tree
· Easy to understand
· Can handle both categorical and numerical data
· Robust to outliers
Disadvantages of Decision Tree
· Prone to Overfitting
· Need to be careful with parameter tuning
· Can create biased trees if some classes dominate
Interview Questions on Decision Tree
· What is a Decision Tree and how does it work?
· What is the impact of outliers on a Decision Tree?
· How do you avoid overfitting in a decision tree?
· How does a decision tree work with categorical and numerical features in a single dataset?
· What are the libraries used for constructing a decision tree?