Classification And Regression Tree (CART)

Understanding CART

Gajendra
12 min read · Jun 29, 2022

CART is an algorithm that uses decision trees to solve classification and regression problems. The core algorithm for building decision trees is called ID3 (Iterative Dichotomiser 3).

ID3 (Iterative Dichotomiser 3)

The ID3 algorithm begins with the original dataset as the root node. On each iteration, the algorithm goes through every unused attribute/feature of the dataset and calculates the entropy (or the information gain) of that attribute. It then selects the attribute/feature with the smallest entropy (or, equivalently, the largest information gain). The dataset is then split, or partitioned, by the selected attribute/feature to produce subsets of the data. The algorithm continues to recurse on each subset, considering only attributes never selected before.
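To make the selection-and-recursion loop concrete, here is a minimal sketch in Python (the dictionary-per-row data format and function names are illustrative assumptions, not code from the article):

```python
import math
from collections import Counter

def entropy(rows, target):
    total = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, feature, target):
    total = len(rows)
    split_entropy = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r for r in rows if r[feature] == value]
        split_entropy += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - split_entropy

def id3(rows, features, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not features:        # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]  # leaf: the majority class
    best = max(features, key=lambda f: information_gain(rows, f, target))
    remaining = [f for f in features if f != best]   # never reuse a selected attribute
    return {best: {value: id3([r for r in rows if r[best] == value], remaining, target)
                   for value in {r[best] for r in rows}}}
```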

Decision Tree

The decision tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, it can be used for solving both regression and classification problems.

Decision Tree

The input variable/feature and the specific split point for each node are selected in a greedy way to minimize a cost function. A number of candidate split points across different variables/features are considered, and the split that results in the minimum value of the Gini Index (i.e., the most homogeneous child nodes) is chosen for that node. This process is carried out recursively for all sub-nodes in the tree.

Types

Decision Tree for Classification: Decision Tree which has categorical target variable (Classification Problems).

Decision Tree for Regression: Decision Tree which has continuous target variable (Regression Problems).

Decision Tree for Classification

Important Concepts

Entropy

It is a measure of the uncertainty or diversity embedded in a random variable, and a common way to measure impurity.

For a two-class problem, entropy lies between 0 and 1. Depending on the number of classes in your dataset, entropy can be greater than 1, but the interpretation is the same: a higher value means a higher level of disorder.

Suppose a variable X can take one of n values, v1, v2, …, vn.

Probabilities for each of these values can be represented as,

Probabilities

With these probabilities we can calculate Entropy as,

Entropy
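As a quick illustration, the formula can be written as a small Python function (a minimal sketch; the function name and example inputs are illustrative):

```python
import math

def entropy(class_counts):
    """Entropy of a node, given the number of data points in each class."""
    total = sum(class_counts)
    probabilities = [c / total for c in class_counts if c > 0]  # 0 * log(0) is treated as 0
    return -sum(p * math.log2(p) for p in probabilities)

print(entropy([8, 2]))  # ~0.72 -> mostly one class, low disorder
print(entropy([5, 5]))  # 1.0   -> perfectly mixed, maximum disorder
```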

Examples

Let’s look at some examples and calculate entropy.

Example 1

Here we have 30 data points out of which 16 are Green and 14 are Pink.

Green and Pink Classification

The Entropy for the dataset can be calculated as,

Entropy

Here, we have a nearly balanced dataset and the entropy is quite high, close to 1.
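Working it out from the counts (16 Green and 14 Pink out of 30):

Entropy = -(16/30)·log2(16/30) - (14/30)·log2(14/30) ≈ 0.484 + 0.513 ≈ 0.997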

Let’s look at another example.

Example 2

Here we have 12 data points and they all belong to the same category, Green.

Homogeneous Data
Entropy

Here, we have a completely homogeneous (pure) dataset and the entropy is 0.

Example 3

Here we have 6 data points, split equally between the Green and Pink categories.

Heterogeneous Data

Here, we have a perfectly balanced dataset and the entropy is at its maximum, 1.

Looking at all 3 examples, we can say: the lower the impurity, the lower the entropy.

Entropy is needed in a decision tree to measure the purity of each node after a split. Higher purity indicates a better classification after the split; this is how the decision tree decides how good a split is.

The ‘-‘ sign in the entropy formula cancels the negative values produced by taking the logarithm of probabilities (which are at most 1).

In decision trees, entropy is used to calculate the information gain.

Information Gain

The information gain is based on the decrease in entropy after a dataset is split on an attribute/feature. Constructing a decision tree is all about finding the attribute/feature that returns the highest information gain (i.e., the most homogeneous branches).

  • We want to determine which attribute/feature in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
  • Information gain tells us how important a given attribute/feature of the feature vectors is.
  • We use it to decide the ordering of attributes/features in the nodes of a decision tree.
Information Gain
Weighted Entropy

In simple terms, we can say,

Information Gain

Our aim is to maximize information gain at each split.
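A minimal sketch of this calculation in Python (the function and argument names are illustrative):

```python
def information_gain(parent_entropy, children):
    """children: a list of (sample_count, entropy) pairs, one per child node."""
    total = sum(count for count, _ in children)
    weighted_entropy = sum((count / total) * e for count, e in children)
    return parent_entropy - weighted_entropy

# A parent node with entropy 1.0 split into two children of 10 samples each
print(information_gain(1.0, [(10, 0.469), (10, 0.722)]))  # ~0.40
```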

Let’s look at some examples and calculate information gain.

Examples

Example 1

Here we have 30 data points: 16 Green and 14 Pink.

Dataset Representation

Entropy of Root Node,

Entropy — Root Node

After, the split we have two nodes,

  • Left Node — 4 Green and 13 Pink
  • Right Node — 12 Green and 1 Pink

Entropy of Left and Right Nodes,

Entropy — Left Node
Entropy — Right Node

Calculate the Weighted or Average Entropy,

Weighted or Average Entropy

For Left Node we have -

Left Node

For Right Node we have -

Right Node

Plugging all the values we get Weighted or Average Entropy as,

Weighted or Average Entropy

So Information Gain for this split is,

Information Gain
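Recomputing these quantities from the stated counts gives approximately:

Entropy(root) = -(16/30)·log2(16/30) - (14/30)·log2(14/30) ≈ 0.997

Entropy(left) = -(4/17)·log2(4/17) - (13/17)·log2(13/17) ≈ 0.787

Entropy(right) = -(12/13)·log2(12/13) - (1/13)·log2(1/13) ≈ 0.391

Weighted Entropy = (17/30)·0.787 + (13/30)·0.391 ≈ 0.616

Information Gain = 0.997 - 0.616 ≈ 0.38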

Now that we know how to calculate information gain, let’s see how a decision tree uses it to split nodes.

Example 2

Let’s assume we have a dataset with features X1, X2 and X3 and label Y,

Dataset

To find the best split we will try each of these features and calculate the information gain. We will then select the split with the maximum information gain.

Split on X1,

Split on X1

Calculate Gain and Entropy,

Split on X1

Split on X2,

Split on X2

Calculate Gain and Entropy,

Split on X2

Split on X3,

Split on X3

Calculate Gain and Entropy,

As we can see, we get the highest information gain when we split on features X2 and X3. So, we can choose either of these features to split our tree.

Another thing to notice is that the split on X2 also gives the lowest entropy in the resulting nodes, indicating that most of the data has been classified by this split and we don’t need to split the tree any further.

So we can say it is more beneficial to split the tree on feature X2 than on any other feature.

In a decision tree, information gain is used to find the best split.

Gini Index or Gini Impurity

The Gini index, or Gini impurity, measures the probability of a randomly chosen element being wrongly classified if it were labeled at random according to the distribution of classes in the node.

The Gini index varies between 0 and 1.

If Gini = 0 — all elements belong to a single class (the node is pure).

If Gini = 0.5 — elements are equally distributed between two classes (maximum impurity for a binary problem).

If Gini approaches 1 — elements are spread uniformly across many classes.

Gini
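The formula can be sketched in Python as follows (illustrative, not the article’s code):

```python
def gini(class_counts):
    """Gini impurity of a node, given the number of data points in each class."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini([10, 0]))  # 0.0  -> pure node
print(gini([5, 5]))   # 0.5  -> maximum impurity for two classes
print(gini([2, 3]))   # 0.48
```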

Examples

Consider the dataset below.

Weather Dataset

Let’s look at the feature Outlook; it has

Outlook

Gini — Sunny

Gini — Sunny

Gini — Overcast

Gini — Overcast

Gini - Rain

Gini — Rain

Weighted Sum of Gini Indices

Weighted Sum of Gini Indices
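In general, the weighted Gini index of a split is the size-weighted sum of the child nodes’ Gini values:

Gini(split) = Σ (n_k / n) · Gini(node_k)

where n_k is the number of rows that fall into the k-th child node and n is the total number of rows at the parent.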

Similarly, calculate the Gini for the other features and select the feature with the least Gini index as the root node.

Let’s assume we get below Gini values for each of the features.

Gini Values

In a classification problem, the feature with the least Gini index is selected as the root node.

Based on these Gini values, the decision tree for the weather table will use the features in the following order,

Temperature (1) > Outlook (2) > Humidity (3) > Windy (4)

Decision Tree

Steps of CART Algorithms (Classification)

  1. Start with the complete training dataset, M features and N rows.
  2. Pick a feature, say X(m).
  3. Split the data into 2 nodes, n1 and n2, on one of the values of that feature (for example, > and <, yes/no, etc.).
  4. Calculate the entropy of the root and of each node, n1 and n2.
  5. Calculate the information gain for the split.
  6. Repeat the process for the other features and split points, and pick the split that maximizes information gain (see the sketch below).
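A minimal, illustrative sketch of steps 2–6 in Python (binary split on a single feature value; the names and the tiny dataset at the bottom are assumptions, not code or data from the article):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(rows, features, target):
    """rows: a list of dicts. Returns (feature, value, gain) for the best binary split."""
    parent = entropy([r[target] for r in rows])
    best = (None, None, 0.0)
    for feature in features:                       # step 2: try each feature
        for value in {r[feature] for r in rows}:   # step 3: candidate split values
            left = [r[target] for r in rows if r[feature] == value]
            right = [r[target] for r in rows if r[feature] != value]
            if not left or not right:
                continue
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
            gain = parent - weighted               # steps 4-5: entropies and information gain
            if gain > best[2]:
                best = (feature, value, gain)      # step 6: keep the split with maximum gain
    return best

# A tiny made-up dataset, purely for illustration
data = [
    {"Outlook": "Sunny", "Windy": "No", "Play": "No"},
    {"Outlook": "Sunny", "Windy": "Yes", "Play": "No"},
    {"Outlook": "Overcast", "Windy": "No", "Play": "Yes"},
    {"Outlook": "Rain", "Windy": "No", "Play": "Yes"},
    {"Outlook": "Rain", "Windy": "Yes", "Play": "No"},
]
print(best_split(data, ["Outlook", "Windy"], "Play"))
```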

Decision Tree for Regression

The core algorithm for building decision trees, ID3, employs a top-down, greedy search through the space of possible branches with no backtracking. The ID3 algorithm can be used to construct a decision tree for regression by replacing information gain with standard deviation reduction (SDR).

Important Concepts

Average

The average gives the predicted value at the leaf nodes.

Average

Standard Deviation

The standard deviation is used for tree building (branching).

Standard Deviation

Coefficient of Variation

The coefficient of variation is used to decide when to stop branching. We can use the count (n) as well.

Coefficient of Variation

Probability

Probability

Relative Standard Deviation

Relative Standard Deviation

Standard Deviation Reduction

Standard Deviation Reduction
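These quantities can be sketched in Python like this (names are illustrative; whether to use the population or the sample standard deviation is a choice):

```python
import statistics

def sdr(target_values, feature_values):
    """Standard deviation reduction for splitting `target_values` by `feature_values`."""
    def sd(values):
        return statistics.pstdev(values)  # population SD; statistics.stdev gives the sample version

    total = len(target_values)
    groups = {}
    for x, y in zip(feature_values, target_values):
        groups.setdefault(x, []).append(y)
    weighted_sd = sum((len(g) / total) * sd(g) for g in groups.values())
    return sd(target_values) - weighted_sd

def coefficient_of_variation(values):
    """CV = standard deviation / mean; used to decide when to stop branching."""
    return statistics.pstdev(values) / statistics.mean(values)
```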

Example

Let’s take an example of the dataset below,

Dataset

Calculate the average, standard deviation and coefficient of variation for Hours.

Hours

Average — Standard Deviation — Coefficient of Variation

Let’s start with the feature Outlook and do the calculations; we can repeat the same calculation for the other features later.

Outlook

Outlook

Sunny

Outlook — Sunny

Rain

Outlook — Rain

Overcast

Outlook — Overcast

Relative Standard Deviation (Hours, Outlook)

Relative Standard Deviation (Hours, Outlook)

Standard Deviation Reduction

Standard Deviation Reduction

Similarly, calculate the SDR for other features.

Standard Deviation Reduction

The feature with the highest SDR is chosen for the decision node. Here it is Outlook.

The dataset is divided based on the values of the selected feature. This process is run recursively on the non-leaf branches, until all data is processed.

In practice, we need some termination criteria: for example, when the coefficient of variation (CV) for a branch becomes smaller than a certain threshold, say 20%, and/or when the number of instances in a branch becomes too small, say fewer than 2.

Threshold

  • Coefficient of Variation (CV) < 20%
  • Number of instances in a branch < 2 (see the sketch below)
Coefficient of Variation
Coefficient of Variance
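This check can be written compactly (a minimal sketch using the thresholds above; the function name is illustrative):

```python
import statistics

def should_stop(values, cv_threshold=0.20, min_rows=2):
    """Stop branching when the branch is homogeneous enough or too few rows remain;
    the branch then becomes a leaf that predicts the average of `values`."""
    cv = statistics.pstdev(values) / statistics.mean(values)
    return cv < cv_threshold or len(values) < min_rows
```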

Only Overcast has a CV of less than 20%, which means we don’t have to split that node any further. Sunny and Rain have CVs greater than 20%, so we will have to split those nodes further.

Outlook > Overcast > 43.5

Since Overcast doesn’t need to be split any further, we can remove the corresponding data from the working set.

The new dataset will look like,

Dataset

We will repeat the same process to find the next feature to split by for each of the branches, Sunny and Rain.

Sunny

Outlook — Sunny

Let’s assume, for Sunny, we get the SDR for each feature as,

SDR

Since Windy has the highest SDR, it will be our next feature to split by,

Sunny — Windy

Sunny — Windy
Decision Tree

At this point our stopping criterion is met, as the number of instances in the Windy branches is below the threshold. So we stop here and just calculate the average, which is our final output for this branch.

Now let’s look at the Rain branch.

Rain

Outlook — Rain

Let’s assume, for Rain, we get the SDR for each feature as,

SDR

Since Temperature has the highest SDR, it will be our next feature to split by,

Rain — Temperature

Rain — Temperature
Decision Tree

At this point our stopping criterion is met, as the number of instances in the Temperature branches is below the threshold. So we stop here and just calculate the average, which is our final output for this branch.

Steps of CART Algorithms (Regression)

  1. Start with the complete training dataset, M features and N rows.
  2. Calculate the standard deviation reduction with respect to the target for each feature.
  3. Select the feature with the highest standard deviation reduction.
  4. Split the data into nodes on the values of that feature (for example, > and <, yes/no, etc.).
  5. Calculate the coefficient of variation (and count) for each node and check it against the threshold.
  6. If a node meets the stopping criterion, stop and assign the node’s average as its prediction; otherwise split it further.
  7. Calculate the standard deviation reduction with respect to the remaining features for each node.
  8. Select the feature with the highest standard deviation reduction.
  9. Repeat until the stopping criterion is met for every branch (see the usage sketch below).
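In practice you rarely build CART by hand: scikit-learn’s tree estimators implement an optimized version of the CART algorithm. A minimal usage sketch (the tiny arrays below are made up purely for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up numeric data: two features per row
X = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 2], [0, 2]]

# Classification tree: categorical target, Gini-based splits
clf = DecisionTreeClassifier(criterion="gini", max_depth=3)
clf.fit(X, ["No", "Yes", "Yes", "No", "Yes", "No"])
print(clf.predict([[1, 1]]))

# Regression tree: continuous target, variance-reduction-based splits
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, [25.0, 46.0, 52.0, 30.0, 48.0, 23.0])
print(reg.predict([[0, 1]]))
```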

Assumptions

Here are a few important assumptions we make for decision trees -

  • The whole training set is considered as the root.
  • Features are assumed to be categorical; if values are continuous, they are discretized (binned) prior to building the tree.

Advantages

  • Decision trees are simple to understand and interpret.
  • They are able to handle both numerical and categorical data.
  • They perform well even with large datasets and require little data preparation.
  • They naturally de-emphasize irrelevant features.

Disadvantages

  • A small change in the training data can result in a large change in the tree and consequently the final predictions.
  • Practical decision-tree learning algorithms are based on heuristics (greedy algorithm). Such algorithms cannot guarantee to return the globally optimal decision tree.
  • Overfitting: Decision-tree learners can create over-complex trees that do not generalize well from the training data.

I hope this article provides you with a good understanding of the CART algorithm.

If you have any questions or if you find anything misrepresented, please let me know.

Thanks!

Credit: Andrew Ng, StatQuest and Krish Naik; special thanks for their great content.
