Decision Tree Regression

Rishabh Jain
4 min read · Jun 3, 2020


Before getting into decision tree regression, let's first discuss the decision tree itself. A decision tree is a type of supervised learning model. It predicts the output by passing an example through a tree of nodes. Selecting which attribute becomes the root node, a child node, or a leaf node is commonly done in one of two ways:

  • CART (Classification and Regression Trees) → uses the Gini Index (classification) as a metric.
  • ID3 (Iterative Dichotomiser 3) → uses Entropy function and Information Gain as metrics.

It’s time to discuss our target. Decision tree regression is used for continuous output problems. We can’t use the Gini impurity for decision tree regression because it works mainly on discrete class labels; instead, the tree is built in the ID3 style using Standard Deviation as the metric.

Decision Tree Regression

Let’s work through an example dataset in which we need to predict the number of hours played.

Dataset
  • We use standard deviation to measure the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation is zero.
Standard Deviation for one Attribute
  • Then we find the standard deviation for two attributes (target and predictor); a short sketch of both computations follows below:
Standard Deviation for two attributes
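
The dataset figure is not reproduced here, so the sketch below assumes the classic weather / hours-played table that this walkthrough appears to follow; the values are illustrative, though they do reproduce the 9.32 standard deviation quoted in the next section.

```python
from statistics import pstdev

# Assumed target column (hours played); its population standard deviation
# comes out to roughly 9.32, the figure used later in this walkthrough.
hours = [25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30]
print(round(pstdev(hours), 2))        # standard deviation for one attribute: 9.32

# Standard deviation for two attributes: group the target by a predictor
# (here an assumed Outlook column) and weight each group's deviation by its size.
outlook = ["Rainy", "Rainy", "Overcast", "Sunny", "Sunny", "Sunny", "Overcast",
           "Rainy", "Rainy", "Sunny", "Rainy", "Overcast", "Overcast", "Sunny"]
groups = {}
for value, h in zip(outlook, hours):
    groups.setdefault(value, []).append(h)
weighted_sd = sum(len(v) / len(hours) * pstdev(v) for v in groups.values())
print(round(weighted_sd, 2))          # S(Hours, Outlook) is about 7.66
```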

Standard Deviation Reduction

Standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).

  • The standard deviation of the target (hours played) is calculated, i.e., 9.32.
  • Now we calculate the standard deviation for every branch (attribute).
Standard Deviation for all Attributes
SDR for Outlook
  • The attribute with the largest standard deviation reduction is chosen for the decision node.

Outlook has the largest standard deviation reduction compared to the other attributes, hence it is selected as the root node.
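
To make the selection step concrete, here is a minimal sketch that computes the standard deviation reduction of every attribute on the same assumed weather table and picks the largest; the column values are illustrative assumptions, not taken from the article's figures.

```python
from statistics import pstdev

# Assumed predictor columns and target, consistent with the numbers quoted above.
data = {
    "Outlook":  ["Rainy", "Rainy", "Overcast", "Sunny", "Sunny", "Sunny", "Overcast",
                 "Rainy", "Rainy", "Sunny", "Rainy", "Overcast", "Overcast", "Sunny"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Windy":    [False, True, False, False, False, True, True,
                 False, False, False, True, True, False, True],
}
hours = [25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30]

def sdr(column, target):
    """Standard deviation reduction: S(T) minus the weighted S(T, X) after the split."""
    groups = {}
    for value, t in zip(column, target):
        groups.setdefault(value, []).append(t)
    weighted = sum(len(v) / len(target) * pstdev(v) for v in groups.values())
    return pstdev(target) - weighted

for name, column in data.items():
    print(name, round(sdr(column, hours), 2))   # Outlook shows the largest reduction (~1.66)

root = max(data, key=lambda name: sdr(data[name], hours))
print("root node:", root)                       # -> Outlook
```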

Now the dataset is divided based on the values of the selected attribute. This process runs recursively on the non-leaf branches until all the data is processed.

Overcast sub-divided branch

In practice, we need some termination criteria: for example, when the coefficient of variation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%), or when too few instances (n) remain in the branch (e.g., 3).
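
A minimal sketch of such a termination rule, assuming the 10% CV threshold and the 3-instance minimum used in this example (the function name and defaults are mine, not from the article):

```python
from statistics import mean, pstdev

def should_stop(values, cv_threshold=0.10, min_samples=3):
    """Stop splitting a branch once it is homogeneous enough or too small.

    CV (coefficient of variation) = standard deviation / mean. The 10% threshold
    and the 3-instance minimum mirror this walkthrough; both are tuning choices.
    """
    cv = pstdev(values) / mean(values)
    return cv < cv_threshold or len(values) <= min_samples

# The assumed "Overcast" subset of the target: its CV is about 8%, so it becomes a leaf.
print(should_stop([46, 43, 52, 44]))   # True
```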

  • The “Overcast” subset does not need any further splitting because its CV (8%) is less than the threshold (10%). The related leaf node gets the average of the “Overcast” subset.
Overcast less than 10%
  • However, the “Sunny” branch has a CV (28%) greater than the threshold (10%), which means it needs further splitting. We select “Windy” as the best node after “Outlook” because it has the largest standard deviation reduction.
Outlook — Sunny

Because the number of data points in both branches (FALSE and TRUE) is equal to or less than 3, we stop further branching and assign the average of each branch to the related leaf node.
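
As a quick check of this step, the sketch below splits the assumed five “Sunny” rows on Windy and hands each small child the average of its targets; the Windy flags and hours are illustrative, not read from the article's figure.

```python
from statistics import mean, pstdev

# Assumed "Sunny" subset as (windy flag, hours played) pairs.
sunny = [(False, 45), (False, 52), (True, 23), (False, 46), (True, 30)]
hours = [h for _, h in sunny]

# Split the subset on Windy and compare the weighted deviation with the parent's.
branches = {flag: [h for f, h in sunny if f == flag] for flag in (False, True)}
weighted = sum(len(v) / len(hours) * pstdev(v) for v in branches.values())
print(round(pstdev(hours) - weighted, 2))       # SDR of Windy inside the "Sunny" branch

# Both children hold 3 or fewer instances, so each becomes a leaf with its average.
for flag, values in branches.items():
    print(flag, len(values), round(mean(values), 1))
```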

Further Data Processing for Windy
  • Moreover, the “Rainy” branch has a CV (22%), which is more than the threshold (10%), so this branch needs further splitting. We select “Temp” as the best node here because it has the largest SDR.
Outlook — Rainy

Because the number of data points in all three branches (Cool, Hot and Mild) is equal to or less than 3, we stop further branching and assign the average of each branch to the related leaf node.

For Temp attribute

When more than one instance remains at a leaf node, we take the average of those instances as the final value for the target.
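
Putting the whole procedure together, here is a compact recursive sketch (on the same assumed weather table) that selects splits by standard deviation reduction, stops on the CV / instance-count criteria above, and stores the branch average at each leaf. Function names such as build and sdr are mine, not the article's.

```python
from statistics import mean, pstdev

COLUMNS = ("outlook", "temp", "humidity", "windy")
ROWS = [  # assumed rows: (outlook, temp, humidity, windy, hours played)
    ("rainy", "hot", "high", False, 25), ("rainy", "hot", "high", True, 30),
    ("overcast", "hot", "high", False, 46), ("sunny", "mild", "high", False, 45),
    ("sunny", "cool", "normal", False, 52), ("sunny", "cool", "normal", True, 23),
    ("overcast", "cool", "normal", True, 43), ("rainy", "mild", "high", False, 35),
    ("rainy", "cool", "normal", False, 38), ("sunny", "mild", "normal", False, 46),
    ("rainy", "mild", "normal", True, 48), ("overcast", "mild", "high", True, 52),
    ("overcast", "hot", "normal", False, 44), ("sunny", "mild", "high", True, 30),
]

def sdr(rows, col):
    """Standard deviation reduction obtained by splitting `rows` on column `col`."""
    target = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    weighted = sum(len(v) / len(target) * pstdev(v) for v in groups.values())
    return pstdev(target) - weighted

def build(rows, cv_threshold=0.10, min_samples=3):
    """Grow the tree recursively; leaves hold the average of their targets."""
    target = [r[-1] for r in rows]
    cv = pstdev(target) / mean(target)
    if cv < cv_threshold or len(rows) <= min_samples:
        return round(mean(target), 1)            # leaf: average of the remaining targets
    best = max(range(len(COLUMNS)), key=lambda c: sdr(rows, c))
    return {COLUMNS[best]: {
        value: build([r for r in rows if r[best] == value], cv_threshold, min_samples)
        for value in sorted({r[best] for r in rows}, key=str)
    }}

print(build(ROWS))   # nested dicts for decision nodes, branch averages at the leaves
```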

— — — — — — — — — — — — — -THANK YOU — — — — — — — —
