Understanding Decision Trees
You can relate our own decision-making process to the way a decision tree works. When an important event is about to happen, we prepare ourselves for the different scenarios in which it could unfold, i.e. we branch out a potential response for each scenario: if scenario A happens I will jump, else I will sit. We keep branching each scenario with its potential responses, making decisions one step at a time, until we reach a final outcome. When you visualize this process, the structure it forms is that of a tree.
Suppose we want to predict the outcome of a cricket match. We would typically look at the star players, the other players, the stadium, the pitch, the crowd, the season and so on, in an appropriate order. At each step, we consider whether that factor favours victory or defeat. Such a step-by-step walk takes us to a reasonable prediction of the outcome.
This is the decision tree algorithm — we build a tree of decisions that leads us to the final decision.
In machine learning terms, a decision tree is a supervised learning algorithm that can be used for both classification and regression.
This is different from logistic regression, where we would arrive at a probability based on the different parameters and then check whether that probability is above or below a threshold. Our mind does not always follow that process.
A decision tree splits your data set by applying conditions to it, in such a way that the data finally ends up in homogeneous buckets, i.e. buckets whose records all belong to a single class. When we build a decision tree, we usually need to regularize it. Regularization for a decision tree means controlling the growth of the tree: when a tree grows too large it tends to over-fit, so we restrict it from growing to its full potential. As a result, the end buckets are sometimes not homogeneous. If you look at the buckets at the leaf nodes, you will notice that some records belong to the positive class and some to the negative class, so we compute the probability of belonging to each class. These probabilities are called posterior probabilities.
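For instance, here is a minimal sketch with scikit-learn (the built-in breast cancer dataset is used purely as a convenient example) showing how a regularized tree exposes these posterior probabilities:

```python
# A minimal sketch; the dataset and parameter values are for illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Restricting max_depth regularizes the tree, so the leaf "buckets"
# are not forced to be perfectly homogeneous.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# predict_proba reports the class proportions of the leaf each record falls
# into; these are the posterior probabilities described above.
print(tree.predict_proba(X[:5]))
```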
This raises the question: how does the tree decide which column to use to create the branches and leaf nodes?
A decision tree learns by optimizing a loss function. In a decision tree, the loss function is a way of reducing impurity in the target column at each node (i.e. in the bucket at the leaf); in other words, it makes the target column at the leaf nodes more and more homogeneous.
To calculate the impurity of a bucket or node, we use Entropy.
Entropy
It measures the purity of a split. Suppose we have several attributes or features. Among these features, we have to decide which one to use as the main node, i.e. the parent node, to start splitting the data. To decide which feature to split on, we use the concept of entropy.
How to calculate Entropy:
Let’s first look at the formula for calculating the Entropy:
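For a binary classification node:

Entropy = −p log₂(p) − q log₂(q)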
Here, p is the probability of the positive class and q is the probability of the negative class. Let's say E1, E2 and E3 are some features. We need to build a tree using one of these features as the parent node. Suppose E2 is chosen as the parent node and E1 and E3 become the leaf nodes.
Out of 7 records, E2 has 5 positive and 2 negative inputs. E2 is then split into two leaf nodes. After the split, the data is divided in such a way that E1 contains 2 positive and 1 negative records, while E3 contains 3 positive and 1 negative. We now calculate the entropy of both leaves, E1 and E3, to decide which one to consider for the next split; the node with the higher entropy value is the one that gets split next. Further splits can follow, meaning the tree can keep growing with more leaf nodes.
Note: For a binary target, the value of entropy always lies between 0 and 1.
The objective is to find the column, and its threshold, that gives the maximum drop in entropy between two levels of nodes. This maximum drop in entropy is called Information Gain, and it is what drives further splitting of the nodes when you have multiple attributes.
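To make this concrete, here is a small sketch of the arithmetic using the counts from the E1/E2/E3 example above (only those illustrative counts are assumed):

```python
import math

def entropy(pos, neg):
    """Binary entropy of a node with `pos` positive and `neg` negative records."""
    total = pos + neg
    p, q = pos / total, neg / total
    # By convention, 0 * log2(0) is taken as 0.
    return sum(-x * math.log2(x) for x in (p, q) if x > 0)

# Counts from the example above: parent E2 and its children E1, E3.
parent   = entropy(5, 2)   # about 0.863
child_e1 = entropy(2, 1)   # about 0.918
child_e3 = entropy(3, 1)   # about 0.811

# Information gain = parent entropy minus the size-weighted average
# entropy of the children.
info_gain = parent - (3 / 7) * child_e1 - (4 / 7) * child_e3
print(round(parent, 3), round(info_gain, 3))
```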
Let us come back to regularization, which I mentioned at the start of the article. A decision tree has a few regularization parameters that we can use to control its size (listed below with their scikit-learn names; a short sketch follows the list):
- max_depth: the maximum length of a path from the root to a leaf
- min_samples_split: stop splitting a node further when the number of observations in it is below this value
- min_samples_leaf: the minimum number of samples a leaf node must have; when a leaf has too few observations, splitting it further will result in over-fitting
- max_features: the maximum number of features evaluated when looking for the best split
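As a rough sketch, these knobs map onto scikit-learn's DecisionTreeClassifier as shown below (the values are arbitrary examples, not recommendations):

```python
# A minimal sketch; dataset and parameter values are for illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

regularized = DecisionTreeClassifier(
    max_depth=4,            # longest allowed root-to-leaf path
    min_samples_split=20,   # a node needs at least 20 records to be split
    min_samples_leaf=10,    # every leaf must keep at least 10 records
    max_features=5,         # evaluate at most 5 features per split
    random_state=0,
).fit(X, y)

unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)

# The regularized tree is noticeably shallower than the fully grown one.
print(regularized.get_depth(), unrestricted.get_depth())
```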
Post Pruning — A Measure for Countering Over-fitting
Pruning means simplifying, compressing and optimizing a decision tree by removing sections of the tree that are non-critical or redundant for classifying instances.
Pruning techniques help decision trees generalize better on unseen data. A decision tree can be pruned before and/or after constructing it, and either one of the pruning methods is sufficient to counter over-fitting. In this article we will only focus on Cost Complexity Pruning.
Post pruning 'prunes' a decision tree after it has fully grown. It removes a sub-tree and replaces it with a leaf node whose label is the most frequent class of the sub-tree.
Mathematically, cost complexity pruning for a tree T is given by:
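Rα(T) = R(T) + α|T|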
Where,
- R(T) — Total training error of leaf nodes
- |T| — The number of leaf nodes
- α — complexity parameter (a non-negative real number)
The objective is to minimize the above cost complexity measure. Intuitively, trying to reduce only the training error R(T) leads to relatively larger trees (more leaf nodes), and a tree with many leaf nodes fits the training data very closely, resulting in over-fitting. The parameter α penalizes the number of leaf nodes and thus controls the complexity of the tree, which eventually reduces over-fitting.
Extending the cost complexity measure for a single node t, we get the following equation:
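Rα(t) = R(t) + α

(a single node counts as exactly one leaf, so the α|T| term reduces to α)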
where R(t) is the training error at node t.
For a sub-tree Tₜ rooted at an internal node t, the training error of the sub-tree is the sum of the training errors of its leaf nodes. Mathematically,
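R(Tₜ) = Σᵢ∈L R(i)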
where L is the set of leaf nodes of sub-tree Tₜ.
The cost complexity measure for sub-tree Tₜ is:
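Rα(Tₜ) = R(Tₜ) + α|Tₜ|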
We can infer that if the training error of node t and the training error of the sub-tree rooted at t (Tₜ) are equal, then the split made at node t is redundant, and the sub-tree Tₜ can be pruned. Generalizing this to the cost complexity measure, there is some value of α at which the cost complexity measure of node t and that of the sub-tree Tₜ become equal.
Setting the two measures equal and solving for α (the effective alpha), we get:
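R(t) + α = R(Tₜ) + α|Tₜ|

α = ( R(t) − R(Tₜ) ) / ( |Tₜ| − 1 )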
This α is called the effective alpha, and it gives the optimal sub-tree.
The effective alpha is computed for each non-terminal node of the tree, and the node with the minimum effective alpha is pruned. This procedure is repeated until only one node, the root, is left.
This method returns a set of effective alphas and the set of optimal sub-trees corresponding to those alphas. With α equal to zero, the biggest tree is chosen; as α approaches infinity, we are left with only the root node.
The DecisionTreeClassifier class in sklearn provides ccp_alpha as a parameter for post pruning. The parameter ccp_alpha acts as a threshold for the effective alphas, i.e. the pruning process continues until the minimal effective alpha of the pruned tree is greater than ccp_alpha. The DecisionTreeClassifier class also provides a method, cost_complexity_pruning_path, which implements the pruning process and returns the effective alphas (and the corresponding impurities of the pruned trees).
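Here is a minimal sketch of how that fits together (the dataset and the train/test split are just for illustration):

```python
# A minimal sketch of post pruning with scikit-learn's ccp_alpha;
# the dataset and alpha values are for illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# cost_complexity_pruning_path returns the effective alphas and the
# total leaf impurities of the corresponding pruned trees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas[:5], path.impurities[:5])

# Fit one tree per effective alpha and compare test accuracy;
# larger alphas give smaller (more heavily pruned) trees.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```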
If you find this blog helpful, kindly share and be a helping hand for anyone who would be interested in learning data science.