Decision Tree - For Beginners
· A decision tree is a graphical representation of all possible solutions to a decision based on certain conditions.
· Decision Trees are versatile machine learning algorithms that can perform both classification and regression tasks, and even multioutput tasks. They are most widely used in supervised learning.
· They are powerful algorithms, capable of fitting complex datasets.
· Decision Trees are commonly built with the CART (Classification and Regression Trees) algorithm (a quick code sketch of a tree in action follows below).
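To make this concrete before the theory, here is a minimal sketch of a decision tree classifier using scikit-learn, whose trees are based on an optimised CART implementation. This assumes scikit-learn is installed; the dataset and variable names are purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a CART-based decision tree classifier and check held-out accuracy.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

Swapping DecisionTreeClassifier for DecisionTreeRegressor handles regression tasks in the same way.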
Decision Tree Terminology
Before we get into how a decision tree works, we need to understand some of its terminology.
· Root node: This is the topmost node, from which the decision tree is constructed. It represents the entire population/sample and gets divided into two or more nodes (the purple root node in the figure above).
· Parent/Child node: The root node is a parent node, and every node branched from a parent node is called a child node.
· Branch/Sub-tree: A subsection of the tree, formed by splitting a node.
· Splitting: Dividing a node into two or more sub-nodes on the basis of some condition.
· Leaf node: A node that cannot be divided further, e.g. the orange (lowermost) nodes in the figure above.
· Pruning: Removing unwanted branches from a tree.
CART Algorithm
The CART (Classification and Regression Trees) algorithm is structured as a sequence of questions, the answers to which determine what the next question, if any, should be. The result of these questions is a tree-like structure whose ends are leaf/terminal nodes, at which there are no more questions.
Construction of a decision tree depends on the following.
· Entropy: This is a measure of randomness (impurity) in the data.
· Formula: Entropy(S) = -P(yes)·log2(P(yes)) - P(no)·log2(P(no))
S is the total sample space
P(yes) is the probability of yes and P(no) is the probability of no
If the number of yes equals the number of no, i.e. P(yes) = P(no) = 0.5, then Entropy(S) = 1
If the data contains only yes or only no, i.e. P(yes) = 1 or 0, then Entropy(S) = 0
· Information Gain: Information Gain measures the reduction in entropy. It decides which attribute is selected for splitting at each step while building the decision tree, starting from the root node.
Formula: Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
where (Weighted Avg) * Entropy(each feature) is the information of each feature.
The tree has to be simple to understand, and to achieve that we need to choose splits that are pure. A commonly used measure of purity is information: the information value tells us how much information a feature gives us about the class. The split with the highest information gain is selected as the first split, and the process continues down the tree.
· Gini Impurity: First, let's understand pure vs. impure: pure means the data in the selected sample all belong to the same class, while impure means the data contain a mixture of different classes.
Gini Impurity measures the likelihood that a randomly chosen sample would be incorrectly classified. Formula: Gini(S) = 1 - Σ P(i)², summed over the classes i. If the dataset is pure, the likelihood of incorrect classification is 0; it increases as the classes get more mixed (all three measures are computed in the sketch below).
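To make these three measures concrete, here is a small from-scratch sketch using only Python's standard library; all function and variable names are my own for illustration. It reproduces the special cases listed for entropy and shows an information-gain computation on a toy split.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of P(class) * log2(P(class))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(S) = 1 - sum over classes of P(class)**2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy(S) minus the weighted average entropy after splitting on a feature."""
    n = len(labels)
    weighted = 0.0
    for v in set(feature_values):
        subset = [lab for lab, f in zip(labels, feature_values) if f == v]
        weighted += (len(subset) / n) * entropy(subset)
    return entropy(labels) - weighted

# The special cases from above:
print(entropy(["yes"] * 5 + ["no"] * 5))  # balanced yes/no -> 1.0
print(entropy(["yes"] * 10))              # pure sample -> 0.0 (printed as -0.0)
print(gini(["yes"] * 10))                 # pure sample -> 0.0

# A feature that separates the classes perfectly recovers all the entropy,
# so its information gain equals Entropy(S) = 1.0 here.
labels = ["yes", "yes", "no", "no"]
feature = ["a", "a", "b", "b"]
print(information_gain(labels, feature))  # -> 1.0
```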
How to construct a decision tree
· Find out the entropy of the whole dataset using the dependent (target) variable.
· Find the information gain of each of the features (independent variables), and select the feature with the highest information gain as the root node.
· Split on that feature and repeat the steps on each branch to obtain the whole decision tree (a sketch of this with scikit-learn follows below).
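As a sketch of these construction steps in practice (again assuming scikit-learn is installed), setting criterion="entropy" makes the library choose splits by information gain, and export_text prints the question asked at each node of the fitted tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# criterion="entropy" selects splits by information gain.
clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(data.data, data.target)

# Print the condition tested at each node of the learned tree.
print(export_text(clf, feature_names=list(data.feature_names)))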
Now that you have built the tree, it's time to test it on the test data, and you may find that the tree has actually overfitted the training data; this is one of the major disadvantages of a decision tree. To reduce overfitting, we remove branches that make use of features of low importance. This method is called pruning (or post-pruning); it reduces the complexity of the tree and improves the test accuracy. Another method, called early stopping or pre-pruning, tries to stop the tree-building process early, before the tree is fully grown. One problem with pre-pruning is that if we stop too early we may underfit the data. Both approaches are sketched below.
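Here is a sketch of both pruning styles with scikit-learn; the specific parameter values (max_depth=4, min_samples_leaf=5, ccp_alpha=0.01) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unrestricted tree: likely to overfit the training data.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning (early stopping): limit depth and leaf size before growing.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                             random_state=0).fit(X_train, y_train)

# Post-pruning: grow fully, then prune via cost-complexity (ccp_alpha).
post = DecisionTreeClassifier(ccp_alpha=0.01,
                              random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, model.score(X_test, y_test))
```

In practice, ccp_alpha is usually tuned by cross-validation; scikit-learn's cost_complexity_pruning_path method returns the candidate values to search over.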
Advantages of Decision Tree
· Easy to understand
· Can handle both categorical and numerical data
· Robust to outliers
Disadvantages of Decision Tree
· Prone to Overfitting
· Need to be careful with parameter tuning
· Can create biased trees if some classes dominate
Interview Questions on Decision Tree
· What is a Decision Tree and how does it work?
· What is the impact of outliers on a Decision Tree?
· How do you avoid overfitting in a decision tree?
· How does a decision tree work with categorical and numerical features in a single dataset?
· What are the libraries used for constructing a decision tree?