ID3 (Iterative Dichotomiser 3)
The ID3 algorithm begins with the original dataset as the root node. On each iteration, it considers every unused attribute/feature of the dataset and calculates the entropy (or, equivalently, the information gain) of a split on that attribute. It then selects the attribute/feature with the smallest entropy (largest information gain). The dataset is then split or partitioned by the selected attribute/feature to produce subsets of the data, and the algorithm recurses on each subset, considering only attributes never selected before.
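As an illustration, the ID3 loop above can be sketched in Python for categorical features. This is a minimal sketch, not a full implementation: the function and attribute names are hypothetical, and real implementations add stopping criteria and pruning.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attributes):
    """rows: list of dicts mapping attribute name -> categorical value."""
    # Stop on a pure node or when no attributes remain: majority-class leaf.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Smallest weighted child entropy == largest information gain.
    def weighted_entropy(attr):
        total = 0.0
        for value in set(r[attr] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[attr] == value]
            total += len(subset) / len(labels) * entropy(subset)
        return total

    best = min(attributes, key=weighted_entropy)
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        # Recurse, considering only attributes never selected before.
        tree[best][value] = id3(sub_rows, sub_labels,
                                [a for a in attributes if a != best])
    return tree
```

The returned tree is a nested dict: internal nodes map an attribute to its values, and leaves hold the predicted class.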
Decision Tree
The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, it can be used for solving both regression and classification problems.
The split at each node is chosen greedily to minimize a cost function: a number of candidate split points over different variables/features are considered, and the split that yields the minimum Gini Index (i.e., the most homogeneous child nodes) is chosen. This process is carried out recursively for all sub-nodes of the tree.
Types
Decision Tree for Classification: Decision Tree which has categorical target variable (Classification Problems).
Decision Tree for Regression: Decision Tree which has continuous target variable (Regression Problems).
Decision Tree for Classification
Important Concepts
Entropy
Entropy is a measure of the uncertainty or diversity embedded in a random variable, and a common way to measure impurity.
For a two-class problem, entropy lies between 0 and 1. With more classes it can exceed 1 (its maximum is log2 of the number of classes), but the interpretation is the same: higher entropy means a higher level of disorder.
Suppose a variable X can take one of n values, v1, v2, . . ., vn.
The probabilities for each of these values can be represented as p1, p2, . . ., pn, where pi = P(X = vi).
With these probabilities we can calculate entropy as,
H(X) = -(p1 log2(p1) + p2 log2(p2) + . . . + pn log2(pn))
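As a quick sketch, the entropy formula can be implemented in a few lines of Python (treating any zero-probability term as contributing 0, by convention):

```python
import math

def entropy(probs):
    """H(X) = -sum(p_i * log2(p_i)); terms with p_i = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # two balanced classes -> 1.0
print(entropy([0.25, 0.25, 0.25, 0.25]))  # four balanced classes -> 2.0
print(entropy([1.0]) == 0)                # a pure node has zero entropy -> True
```

Note that with four balanced classes the entropy is 2, illustrating that entropy can exceed 1 when there are more than two classes.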
Examples
Let’s look at some examples and calculate the entropy for each.
Example 1
Here we have 30 data points out of which 16 are Green and 14 are Pink.
The entropy for the dataset can be calculated as,
H = -(16/30) log2(16/30) - (14/30) log2(14/30) ≈ 0.997
Here we have a nearly balanced dataset, so the entropy is quite high, close to 1.
Let’s look at another example.
Example 2
Here we have 12 data points and they all belong to the same category, Green.
Here we have a completely pure (one-class) dataset, so the entropy is at its lowest possible value, 0.
Example 3
Here we have 6 data points split equally between the Green and Pink categories, 3 each.
Here we have a perfectly balanced dataset, so the entropy is at its maximum, 1.
Looking at all 3 examples we can say: the lower the impurity, the lower the entropy.
Entropy is needed in a decision tree to measure the purity of each node after a split. Higher purity indicates a better classification after the split; this is how a decision tree decides how good a split is.
The ‘-’ sign in the entropy formula cancels the log terms, which are never positive (log2 of a probability is at most 0), so entropy itself is never negative.
In decision trees, entropy is used to calculate the information gain.
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute/feature. Constructing a decision tree is all about finding the attribute/feature that returns the highest information gain (i.e., the most homogeneous branches).
- We want to determine which attribute/feature in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
- Information gain tells us how important a given attribute/feature of the feature vectors is.
- We will use it to decide the ordering of attributes/features in the nodes of a decision tree.
In simple terms, Information Gain = Entropy(parent) - Weighted Average Entropy(children).
Our aim is to maximize information gain at each split.
Let’s look at some examples and calculate information gain.
Examples
Example 1
Here we have 30 data points, 16 Green and 14 Pink.
Entropy of Root Node,
After the split, we have two nodes,
- Left Node — 4 Green and 13 Pink
- Right Node — 12 Green and 1 Pink
Entropy of Left and Right Nodes,
Calculate the Weighted or Average Entropy,
For Left Node we have -
For Right Node we have -
Plugging in all the values, we get a Weighted or Average Entropy of about 0.62.
So the Information Gain for this split is about 0.997 - 0.62 ≈ 0.38.
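The numbers for this split can be checked with a short snippet, assuming the class counts shown: 16/14 at the root, 4/13 in the left node, and 12/1 in the right node.

```python
import math

def entropy(counts):
    """Entropy from class counts, e.g. [16, 14] for 16 Green and 14 Pink."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

parent = entropy([16, 14])     # root node
left = entropy([4, 13])        # left child: 4 Green, 13 Pink
right = entropy([12, 1])       # right child: 12 Green, 1 Pink
weighted = (17 / 30) * left + (13 / 30) * right
gain = parent - weighted
print(round(parent, 3), round(left, 3), round(right, 3))   # 0.997 0.787 0.391
print(round(weighted, 3), round(gain, 3))                  # 0.616 0.381
```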
Now that we know how to calculate information gain, let’s see how a decision tree uses it to split nodes.
Example 2
Let’s assume we have a dataset with features X1, X2 and X3 and label Y,
To find the best split we will try each of these features and calculate the information gain. We will then select the split with the maximum information gain.
Split on X1,
Calculate Gain and Entropy,
Split on X2,
Calculate Gain and Entropy,
Split on X3,
Calculate Gain and Entropy,
As we can see, we get the highest information gain when we split on feature X2 or X3, so we can choose either of these features to split our tree.
Another thing to notice is that for the split on X2 we also get the lowest entropy in the resulting nodes, indicating that most of the data has been classified by this split and we don’t need to split the tree much further.
So we can say it is more beneficial to split the tree on feature X2 than on any other feature.
In decision trees, information gain is used to find the best split.
Gini Index or Gini Impurity
The Gini index or Gini impurity measures the degree or probability of a particular element being wrongly classified when it is chosen at random and labeled at random according to the class distribution. It is calculated as Gini = 1 - (p1^2 + p2^2 + . . . + pn^2).
The degree of the Gini index varies between 0 and 1.
If Gini = 0 — All elements belong to a single class (a pure node)
If Gini = 1 — The elements are randomly distributed across many classes (a value only approached as the number of classes grows; for a two-class problem the maximum is 0.5)
If Gini = 0.5 — The elements are equally distributed between two classes
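A minimal sketch of the Gini impurity calculation from class counts:

```python
def gini(counts):
    """Gini impurity from class counts: 1 - sum of squared class proportions."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([10, 0]))   # pure node -> 0.0
print(gini([5, 5]))    # evenly split binary node -> 0.5
```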
Examples
Consider the dataset below.
Let’s look at Outlook. It has three values: Sunny, Overcast and Rain.
Gini — Sunny
Gini — Overcast
Gini - Rain
Weighted Sum of Gini Indices
Similarly, calculate the Gini for other features and select the feature with the least Gini index as a root node.
Let’s assume we get below Gini values for each of the features.
In a classification problem, the feature with the least Gini index is selected as the root node.
Based on the Gini values, the features of the weather table will be ordered in the decision tree as,
Temperature (1) > Outlook (2) > Humidity (3) > Windy (4)
Steps of CART Algorithms (Classification)
- Start with the complete training dataset, M features and N rows.
- Pick a feature, say X(m).
- Split the data into 2 nodes, n1 and n2, on one of the values of that feature (for example, > and <, or yes/no).
- Calculate the entropy for the root and for each node, n1 and n2.
- Calculate the information gain for the split.
- Repeat the process for the other features and pick the one that maximizes information gain.
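The steps above can be sketched as a single best-split search over binary yes/no features. This is a toy sketch; the feature names and the yes/no encoding are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Try a binary yes/no split on every feature; return the feature
    with the highest information gain."""
    parent = entropy(labels)
    best_feature, best_gain = None, -1.0
    for f in features:
        n1 = [l for r, l in zip(rows, labels) if r[f] == "yes"]
        n2 = [l for r, l in zip(rows, labels) if r[f] == "no"]
        if not n1 or not n2:
            continue  # the split leaves one side empty -> skip
        weighted = (len(n1) * entropy(n1) + len(n2) * entropy(n2)) / len(labels)
        if parent - weighted > best_gain:
            best_feature, best_gain = f, parent - weighted
    return best_feature, best_gain
```

Calling `best_split` on a small dataset returns the feature whose split yields the largest drop in entropy, which becomes the decision node.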
Decision Tree for Regression
The core algorithm for building decision trees, ID3, employs a top-down, greedy search through the space of possible branches with no backtracking. The ID3 algorithm can be used to construct a decision tree for regression by replacing Information Gain with Standard Deviation Reduction.
Important Concepts
Average
The average (mean) of the target values gives the prediction on the leaf nodes.
Standard Deviation
Standard deviation is used for tree building (branching): S = sqrt(((x1 - mean)^2 + . . . + (xn - mean)^2) / n).
Coefficient of Variation
The coefficient of variation is used to decide when to stop branching: CV = (S / mean) * 100%. We can use the count (n) as well.
Probability
P(c) = n(c) / n, the fraction of data points that take value c of the feature.
Relative Standard Deviation
S(T, X) = sum over values c of P(c) * S(c), the standard deviation of the target weighted by the probability of each feature value.
Standard Deviation Reduction
SDR(T, X) = S(T) - S(T, X), the decrease in standard deviation after the dataset is split on feature X.
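These quantities can be sketched in Python as follows (using the population standard deviation, as is conventional in this SDR calculation):

```python
import math

def stdev(values):
    """Population standard deviation: S = sqrt(sum((x - mean)^2) / n)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def cv(values):
    """Coefficient of variation in percent: (S / mean) * 100."""
    return stdev(values) / (sum(values) / len(values)) * 100

def sdr(target, feature_values):
    """Standard deviation reduction: S(T) - sum over values c of P(c) * S(T_c)."""
    weighted = 0.0
    for value in set(feature_values):
        subset = [t for t, f in zip(target, feature_values) if f == value]
        weighted += len(subset) / len(target) * stdev(subset)
    return stdev(target) - weighted
```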
Example
Let’s take an example of the dataset below,
Calculate the Average, Standard Deviation and Coefficient of Variation for Hours.
Hours
Let’s start with the feature Outlook and do the calculations; we can repeat the same calculations for the other features later.
Outlook
Sunny
Rain
Overcast
Relative Standard Deviation (Hours, Outlook)
Standard Deviation Reduction
Similarly, calculate the SDR for other features.
The feature with the highest SDR is chosen for the decision node. Here it is Outlook.
The dataset is divided based on the values of the selected feature. This process is run recursively on the non-leaf branches, until all data is processed.
In practice, we need some termination criteria: for example, when the coefficient of variation (CV) for a branch becomes smaller than a certain threshold, say 20%, and/or when the number of data points in a branch becomes small, say less than 2.
Threshold
- Coefficient of Variation (CV) < 20%
- Number of data points in a branch < 2
Only Overcast has a CV less than 20%, which means we don’t have to split that node any further. But Sunny and Rain have CVs greater than 20%, so we will have to split those nodes further.
Since Overcast doesn’t need to be split any further, we can get rid of the corresponding data.
The new dataset will look like,
We will repeat the same process to find the next feature to split by for each of the branches, Sunny and Rain.
Sunny
Let’s assume, for Sunny, we get the SDR for each feature as,
Since Windy has the highest SDR, it will be our next feature to split by,
Sunny — Windy
At this point the stopping criterion is met, as the number of data points on the Windy branches is less than 2. So we will stop here and just calculate the average, which is our final output for this branch.
Now let’s look at the Rain branch.
Rain
Let’s assume, for Rain, we get the SDR for each feature as,
Since Temperature has the highest SDR, it will be our next feature to split by,
Rain — Temperature
At this point the stopping criterion is met, as the number of data points on the Temperature branches is less than 2. So we will stop here and just calculate the average, which is our final output for this branch.
Steps of CART Algorithms (Regression)
- Start with the complete training dataset, M features and N rows.
- Calculate the standard deviation reduction w.r.t. the target for each feature.
- Select the feature with the highest standard deviation reduction.
- Split the data into nodes on one of the values of that feature (for example, > and <, or yes/no).
- Calculate the coefficient of variation for each node and check it against the threshold.
- Split a node further if it is still above the threshold; otherwise stop and output the average.
- Calculate the standard deviation reduction w.r.t. the remaining features for each node.
- Select the feature with the highest standard deviation reduction.
- Repeat until the stopping criteria are met.
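The steps above can be sketched end-to-end for categorical features. This is a toy sketch: the stopping thresholds follow the text, the feature names are hypothetical, and the helper functions are redefined so the snippet is self-contained (it assumes a non-zero target mean for the CV calculation).

```python
import math

def stdev(values):
    """Population standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def cv(values):
    """Coefficient of variation in percent (assumes a non-zero mean)."""
    return stdev(values) / (sum(values) / len(values)) * 100

def sdr(target, feature_values):
    """Standard deviation reduction for a candidate split."""
    weighted = 0.0
    for value in set(feature_values):
        subset = [t for t, f in zip(target, feature_values) if f == value]
        weighted += len(subset) / len(target) * stdev(subset)
    return stdev(target) - weighted

def regression_tree(rows, target, features, cv_threshold=20.0, min_rows=2):
    """Split on the feature with the highest SDR; stop and predict the mean
    when CV < threshold, too few rows remain, or no features are left."""
    if len(target) < min_rows or cv(target) < cv_threshold or not features:
        return round(sum(target) / len(target), 2)   # leaf: average
    best = max(features, key=lambda f: sdr(target, [r[f] for r in rows]))
    node = {best: {}}
    for value in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == value]
        sub_target = [t for r, t in zip(rows, target) if r[best] == value]
        node[best][value] = regression_tree(
            sub_rows, sub_target, [f for f in features if f != best],
            cv_threshold, min_rows)
    return node
```

As in the classification sketch, the result is a nested dict whose leaves hold the average target value for that branch.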
Assumptions
Here are a few important assumptions we make for decision trees -
- Whole training set is considered as root.
- Features should be categorical. If continuous, they are discretized prior to building the tree.
Advantages
- Decision trees are simple to understand and interpret.
- It is able to handle both numerical and categorical data.
- Performs well even with large datasets and requires little data preparation.
- Naturally deemphasizes irrelevant features.
Disadvantages
- A small change in the training data can result in a large change in the tree and consequently the final predictions.
- Practical decision-tree learning algorithms are based on heuristics (greedy algorithm). Such algorithms cannot guarantee to return the globally optimal decision tree.
- Overfitting: Decision-tree learners can create over-complex trees that do not generalize well from the training data.
I hope this article provides you with a good understanding of CART Algorithm.
If you have any questions or if you find anything misrepresented please let me know.
Thanks!
Credit: Andrew Ng, StatQuest and Krish Naik. Special thanks for their great content.