An Overview of the Decision Tree Model

Sai Varun Immidi · Published in Analytics Vidhya · Aug 23, 2020 · 14 min read
Image Source: Zylotech

Decision Trees are among the most interpretable models and can perform both classification and regression tasks. As the name suggests, a Decision Tree is a tree-like model that resembles an upside-down tree. At this point you might wonder: we already have classical machine learning models such as linear regression and logistic regression for regression and classification tasks, so why do we need another model like the Decision Tree? The answer is that classical linear models require the training data to be free from irregularities: missing values and outliers need to be handled, multicollinearity needs to be addressed, and a whole lot of other preprocessing needs to be done beforehand. With Decision Trees, on the other hand, very little data preprocessing is needed; they are robust enough to handle such problems and still reach a decision. Decision Trees are also capable of handling nonlinear data that classical linear models fail to handle, which makes them versatile enough to perform both regression and classification tasks. The full set of advantages and disadvantages associated with Decision Trees is discussed in the latter part of this article. Before that, let's start understanding Decision Trees.

Decision Trees build the tree by asking a series of questions of the data until a decision is reached; hence it is said that Decision Trees mimic the human decision-making process. During the tree-building process, the entire data set is divided into smaller and smaller subsets until a decision is reached. Let's first go through a few terminologies associated with Decision Trees.

A Few Terminologies in Decision Trees:

Root Node: The topmost node of the tree is the Root Node. All of the data is present at the Root Node, and the arrows in a decision tree generally point away from it.

Leaf Node or Terminal Node: If a particular node cannot be split further, it is considered a Leaf Node (also called a Terminal Node). The decisions or predictions are held by the Leaf Nodes, and the arrows in a decision tree generally point towards them.

Internal Node or Decision Node: The nodes between the root node and the leaf nodes are internal nodes. These nodes can be split further into sub-nodes.

Please refer to the image below for a better understanding of the terminology mentioned above.

Decision Tree Terminologies

Decision Trees are said to be highly interpretable because of their tree-like structure. To interpret a Decision Tree, we traverse down the tree, checking the condition associated with each node, until we reach a decision. Here, a decision refers to a prediction: a class label when performing a classification task, or a value when performing a regression task. By interpreting the Decision Tree, we get to know which features lead to a particular decision, although we will not be able to interpret a linear relationship between a feature and the target variable or its directional effect. If interpretability is a major concern, a Decision Tree would be near the top of the list. On the whole, we can think of interpreting a Decision Tree as a set of logical IF-THEN-ELSE statements, where the condition at a node is connected to the conditions of the nodes above it with the AND operator.
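As a small illustration of this IF-THEN-ELSE view, the sketch below hard-codes a hypothetical two-level tree (the feature names, thresholds, and decisions are made up for illustration) and shows how each path from the root to a leaf reads as an AND of conditions.

```python
# A hypothetical two-level decision tree for a loan approval task,
# written out as the nested IF-THEN-ELSE rules it represents.
def predict_loan_decision(income, credit_score):
    if income <= 50000:              # condition at the root node
        if credit_score <= 650:      # condition at an internal node
            return "Reject"          # decision held by a leaf node
        else:
            return "Approve"         # decision held by a leaf node
    else:
        return "Approve"             # decision held by a leaf node

# Reading one path: IF income <= 50000 AND credit_score > 650 THEN Approve
print(predict_loan_decision(income=40000, credit_score=700))  # Approve
```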

Having understood how to interpret a Decision Tree, let's next look at the high-level tree-building process.

The steps involved in the tree-building process are as follows (a minimal sketch in code follows the list):

1. Recursively partitioning the data into multiple subsets.
2. At each node, identifying the variable and the rule associated with it that gives the best split.
3. Applying the split at that node using the best variable and its rule.
4. Repeating steps 2 and 3 on the resulting sub-nodes.
5. Repeating this process until a stopping condition is reached.
6. Assigning decisions at the leaf nodes: the majority class label present at the node for a classification task, or the average of the target values present at the node for a regression task.
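The sketch below is a minimal, self-contained illustration of these steps for a classification task, assuming numeric features, binary (CART-style) splits, and Gini impurity; the function names, the toy data, and the simple depth-based stopping condition are my own choices, not part of any particular library.

```python
# Minimal recursive partitioning sketch (classification, Gini impurity, binary splits).
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Step 2: find the (feature index, threshold) with the lowest weighted impurity."""
    best, best_score = None, float("inf")
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Steps 1 to 6: recursively partition until a stopping condition, then assign a decision."""
    # Stopping condition: pure node, depth limit reached, or no useful split found.
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]      # step 6: majority class
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, threshold = split
    left_idx = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right_idx = [i for i, row in enumerate(rows) if row[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left_idx], [labels[i] for i in left_idx],
                           depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right_idx], [labels[i] for i in right_idx],
                            depth + 1, max_depth),
    }

# Tiny made-up dataset: [feature_0, feature_1] -> class label
rows = [[2.0, 1.0], [2.5, 1.2], [3.0, 3.5], [3.2, 3.8]]
labels = ["A", "A", "B", "B"]
print(build_tree(rows, labels))
```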

There exist different tree-building algorithms such as CART, CHAID, ID3, C4.5, and C5.0. The criterion used to select the feature that provides the best split differs between algorithms: the CART algorithm uses the Gini Index impurity measure, ID3 uses Information Gain, C4.5 uses the Gain Ratio, and so on for the other algorithms. But the overall tree-building procedure remains the same as outlined above.

At this point, you might have questions such as: how do we select the feature that provides the best split? How do we define the rule associated with that feature? And finally, what is the stopping condition? These questions are answered in the latter part of this article.

A few things should be noted about Decision Tree building: Decision Trees follow a top-down approach, and they are also said to follow a greedy approach. They are greedy because at every node split they are concerned only with the immediate result of that split; they do not take into consideration the effect of the split two or three nodes further down. One important implication of this greedy approach is that it makes Decision Trees high-variance models: a small change in the input data can result in a completely different tree structure and different final decisions.

Having gained a high-level understanding of Decision Trees and their building process, let's address, one by one, the questions that came up along the way.

How to select features which provide us with the best split at a node?

Before addressing this question, we need some understanding of the homogeneity associated with a node in a classification setting; the same notion can be extrapolated to a regression setting. As the name suggests, homogeneity refers to things being of the same kind, and the same definition extends to Decision Trees: in a classification task, a node is said to be homogeneous if the class labels of the data points at that node all belong to a single class. In a regression task, we instead speak in terms of the variance associated with a node.

In classification, the term best split refers to obtaining sub-nodes (child nodes) that are as homogeneous as possible upon splitting a parent node: ideally, the target class labels of the data points at each sub-node belong to a single class. In regression, the best split is the one that produces low-variance sub-nodes; computing the Mean Squared Error tells us about the variance of the data points present at a node. Let's now focus on how to identify the feature, together with some rule, whose split of a node results in the best split.

In a classification task, the feature that results in the best split is selected based on the difference in impurity (the purity gain) obtained upon splitting, that is, how well the feature separates the class labels of the target variable and produces sub-nodes that are as homogeneous as possible. At this point you might wonder how we quantify the homogeneity of a node. We quantify it using impurity measures; some of the popular metrics are Classification Error, Gini Index, and Entropy. Since these metrics measure the impurity of a node, the lower the metric value, the higher the homogeneity of the node. Let's look at these impurity measures in detail.

Impurity measures for quantifying the homogeneity of a node:

Classification Error is the error made in assigning class labels to the data points at a node based on the majority class label. After computing the probabilities associated with each class label, we know the majority class and assign it to all the data points at the node; in doing so, the data points belonging to the lower-probability classes are misclassified, and this misclassified fraction is the Classification Error.

The Gini Index accounts for the chance of a randomly chosen data point being misclassified. For a binary classification problem its value varies between 0 and 0.5: the lower the Gini Index, the smaller the chance that a random data point is misclassified, which helps in assigning decisions to leaf nodes without ambiguity. If all the data points at a node belong to a single class label, the Gini Index is 0, as the node is completely homogeneous. Similarly, if the classes are equally distributed, the Gini Index takes its maximum value of 0.5, as there is complete ambiguity in the class labels and the node is highly non-homogeneous.

Entropy, on the other hand, which comes from thermodynamics and Information Theory, accounts for the degree of disorder in the class labels at a node: the lower the entropy, the lower the disorder. For a binary classification problem its value ranges between 0 and 1. If all the data points belong to a single class label, the Entropy takes its minimum value of 0, as there is no disorder in the class labels. Similarly, if the classes are equally distributed, the Entropy takes its maximum value of 1, as there is complete disorder in the class labels and the node is completely non-homogeneous.

The formulas to compute the Classification Error, Gini Index, and Entropy are as follows:

Classification Error = 1 - max(pᵢ)
Gini Index = 1 - Σ pᵢ²
Entropy = - Σ pᵢ log₂(pᵢ)

where pᵢ corresponds to the probability of a data point belonging to the i-th class label and the sums run over the k different class labels.
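As a quick illustration, here is one straightforward way these three measures could be computed, assuming the class probabilities pᵢ at a node are already known (the function names are my own):

```python
# Simple implementations of the three impurity measures above,
# each taking the list of class probabilities at a node.
import math

def classification_error(p):
    return 1.0 - max(p)

def gini_index(p):
    return 1.0 - sum(pi ** 2 for pi in p)

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A perfectly balanced binary node (maximum impurity) vs. a pure node.
print(classification_error([0.5, 0.5]), gini_index([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 0.5 1.0
print(classification_error([1.0, 0.0]), gini_index([1.0, 0.0]), entropy([1.0, 0.0]))  # 0.0 0.0 0.0
```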

Gini Index and Entropy are more widely used for computing the homogeneity of a node than Classification Error, as Classification Error is not as sensitive as the other metrics in identifying the homogeneity of a node. Numerically, Gini Index and Entropy behave similarly: if we compute a Scaled Entropy, which is Entropy/2, the curves of the Scaled Entropy and the Gini Index almost touch each other. For a better understanding, please refer to the image below:

Impurity Measures variation

Hence, to select the feature that provides the best split, we look for the feature whose sub-nodes have a low value of one of the impurity measures or, equivalently, the feature that creates the maximum difference in the impurity measure before and after the split, meaning the maximum purity gain. To compute this difference, we first compute the impurity measure of the node before splitting and then compute the weighted average of the impurity measures of the sub-nodes obtained by splitting the node with a feature and its associated rule. The difference between the two is the purity gain. We aim for a higher purity gain, as it means more homogeneous sub-nodes, and we select the feature, among all candidate features, that creates the highest purity gain.
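As a hypothetical worked example, suppose a parent node holds 8 positive and 8 negative data points, so its Gini Index is 1 - (0.5² + 0.5²) = 0.5. A candidate split sends 7 positives and 1 negative to the left child and 1 positive and 7 negatives to the right child; each child then has a Gini Index of 1 - ((7/8)² + (1/8)²) ≈ 0.22, and since both children hold 8 points the weighted average is also about 0.22. The purity gain of this split is therefore roughly 0.5 - 0.22 = 0.28, and it would be preferred over any candidate split with a smaller gain.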

The same thought process extends to the regression setting, where we select the feature whose split results in the lowest-variance sub-nodes.

Let's move forward to our next question: how to identify the rule associated with a feature to split a node.

How to identify the rule associated with a feature to split a node?

To answer this question, we will restrict our discussion to the CART tree-building algorithm and explore different methods of defining the rule for a feature in order to split a node.

When building a tree with the CART algorithm, every node is split into two parts, meaning it performs a binary split at every node. Another characteristic of the CART algorithm is that each split is univariate: only one rule, associated with a single feature, is used to split a node, and multiple rules on different features are never combined in a single split. For multi-way splits there are other popular algorithms such as ID3, C4.5, C5.0, and CHAID.

Let's come back to our question of how to define the rule associated with a feature that provides the best split. If the predictor variable is a nominal categorical variable with k classes, the number of possible binary splits over its classes is 2^(k-1) - 1 (as illustrated in the short sketch after this paragraph). Among all possible splits, the one that results in the most homogeneous sub-nodes is chosen.
If the predictor variable is an ordinal categorical variable with n classes, the number of possible splits is n - 1. Again, the split that results in the most homogeneous sub-nodes is chosen.
If the predictor variable is continuous or numerical, discretization techniques are used to derive the rule. One technique is to arrange the values in ascending order and evaluate a split at each value, keeping the one that provides the best split. Other discretization methods, such as using the mean or percentiles of the variable, can also be used to define the rule for a numerical feature.
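To make the count for the nominal case concrete, the short sketch below enumerates every possible binary split of a hypothetical nominal feature with k = 4 made-up categories and confirms that there are 2^(4-1) - 1 = 7 of them.

```python
# Enumerating all binary splits of a nominal feature with k = 4 categories.
from itertools import combinations

categories = ["red", "green", "blue", "yellow"]   # k = 4 hypothetical categories

splits = set()
for size in range(1, len(categories)):
    for left in combinations(categories, size):
        right = tuple(c for c in categories if c not in left)
        splits.add(frozenset([left, right]))      # unordered pair = one binary split

print(len(splits))                                # 7, i.e. 2 ** (4 - 1) - 1
```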

Finally, among all the features with their best rules, the single feature whose split generates the greatest purity gain is selected to split the node.

Now that we have an understanding of the tree-building process and the concepts around it, let's answer our last question: what is the stopping condition?

What is the stopping condition?

In general, the tree-building process continues until all the features have been exhausted for splitting, or until the leaf nodes have been grown so far that each one holds only a minimal number of training data points.

If the tree is allowed to grow to its complete logical end, it becomes a high-variance model: it overfits the training data by effectively memorizing every data point in it. Once it is a high-variance model, a small change in the training data can alter the entire tree structure, and as a result the decisions associated with the leaf nodes may change. Such trees are not trustworthy for making decisions.

Stopping conditions can also be defined by assigning values to hyperparameters through hyperparameter tuning. Before going further, let's clarify what hyperparameters and hyperparameter tuning are. Hyperparameters are values defined by the modeler during model building; the learning algorithm takes these values into account before producing the final model, and the model cannot identify them implicitly from the data during training. To find the optimum hyperparameters we perform hyperparameter tuning. Some of the hyperparameters that control tree growth are max_depth, max_features, min_samples_leaf, min_samples_split, criterion, etc. By setting these hyperparameters we can prevent the tree from overfitting. Let's look at the methods to control overfitting of trees.

Methods to control Overfitting of trees:

Decision Trees have a high chance of overfitting the training data, which makes them high-variance models. To avoid overfitting, we follow one of two families of methods:
- Tree Truncation or Pre-Pruning Strategies
- Post-Pruning Strategies

Tree Truncation: Also called Pre-Pruning, because we prevent the tree from overfitting during model building itself. One naive truncation strategy in a classification setting is to define a threshold homogeneity value that is compared against the homogeneity of every node before splitting: if the node's homogeneity is below the threshold, we split the node further; if it is above the threshold, we convert the node into a leaf node. The other truncation strategy is to define hyperparameters during model building, such as those listed above, to control the tree's growth; a sketch of tuning them follows.
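As an example of controlling growth through hyperparameters, here is one possible way to tune a scikit-learn DecisionTreeClassifier with GridSearchCV; the dataset and the grid values are purely illustrative, not recommendations.

```python
# Hyperparameter tuning of a decision tree via cross-validated grid search.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "min_samples_split": [2, 10],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # hyperparameter combination with the best CV score
print(search.best_score_)    # mean cross-validated accuracy of that combination
```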

Post Pruning: In post pruning, we allow the tree to grow to its complete logical end and then prune it from the bottom up. Some of the popular post-pruning methods are Reduced Error Pruning and Cost Complexity Pruning. We prune nodes until there is no further reduction in purity gain.

Post Pruning
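For reference, scikit-learn exposes cost complexity pruning through the ccp_alpha parameter; the sketch below (with an illustrative dataset and no claim of being the best workflow) grows the tree, collects the effective alphas, and refits progressively more pruned trees.

```python
# Cost complexity (post) pruning with scikit-learn's ccp_alpha.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Effective alphas of the fully grown tree, from smallest to largest.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```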

Generally, Tree Truncation strategies are preferred over Post-Pruning strategies, as growing the full tree only to cut it back can be an overkill process.

We have now answered all the questions that were lingering in our minds during the tree-building process. Let's finish off this article by discussing the advantages and disadvantages of Decision Trees.

Advantages of Decision Trees:

  1. Versatile: Decision Trees can be used for building both classification and regression models.
  2. Fast: Once the hyperparameters are set through hyperparameter tuning, the Decision Tree building process is significantly fast.
  3. Minimal Data Preprocessing: Not much data preprocessing is needed, such as scaling or outlier treatment.
  4. Easily Interpretable: A Decision Tree is like a flow chart that can be interpreted easily without any mathematical understanding.
  5. Able to handle non-linear relationships: If there is a non-linear relationship between a predictor variable and the target variable, a Decision Tree can capture it by segmenting the data into smaller subsets and assigning a single decision to each subset through a leaf node.
  6. Handles Multicollinearity: Decision Trees can handle multicollinearity by selecting only one of a set of multicorrelated features for node splitting, since there is little to be gained from also splitting on the other correlated features.
  7. Non-parametric Model: Decision Trees do not make any assumptions about the distributions of the predictor variables or the errors associated with the predictions. Hence they are said to be non-parametric models.
  8. Feature Importance: Feature importances can be obtained after building a Decision Tree, telling us which significant features made the biggest contribution to predicting the target variable. Knowing these feature importances, we can perform model-based dimensionality reduction by keeping only the significant features (see the short sketch after this list).
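As a small illustration of point 8, here is how feature importances could be read off a fitted scikit-learn tree (the dataset and hyperparameters are only for demonstration):

```python
# Reading feature importances from a fitted decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(data.data, data.target)

for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```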

Disadvantages of Decision Trees:

  1. Loss of Inference: Using Decision Trees we can get to know the decision associated with a data point and the factors leading to the decision held by a leaf node, but we cannot learn a linear relationship between a predictor variable and the target variable, and as a result we cannot make inferences about the population.
  2. Loss of the numerical nature of a variable: If there is a numerical variable, after tree building an entire range of its values is assigned a single prediction value, so not all of the information present in the numerical variable is taken into consideration.
  3. Overfitting: If the tree is allowed to grow to its complete logical end, it overfits the training data. Overfitting can be controlled with the methods mentioned above, but it is an inherent problem of Decision Trees if left uncontrolled.

References:

  1. upGrad Learning Platform
  2. Decision Tree image source: https://www.zylotech.com/blog/common-algorithms-in-marketing-decision-trees
  3. Impurity Measures variation image source: bogotobogo.com
  4. Impurity Measures Formulas image source: upGrad Learning Platform
  5. Post Pruning image source: https://alanjeffares.wordpress.com/tutorials/decision-tree/
