Decision Tree for Regression — The Recipe

Akshaya Sriram · Published in Analytics Vidhya · Jun 3, 2020

Regression refers to identifying the underlying relationship between the dependent and independent variables when the dependent variable is continuous. Many models can be used to predict a continuous variable; some of the common ones are Linear Regression, Decision Trees, k-Nearest Neighbors, etc.

A decision tree model is non-parametric in nature, i.e., it does not assume a fixed number of parameters in advance; its complexity can grow with the data. It has the structure of a tree. The Random Forest algorithm is an ensemble built from many decision trees. A decision tree basically cuts the dataset into many subsets.

Just like cutting a cake…

Photo courtesy: https://alltimelists.com/youve-been-cutting-the-cake-all-wrong/

The Structure of a Decision Tree

In general, the nodes on which the dataset is branched or cut are called Decision Nodes. The nodes where branching stops are called Leaf Nodes. The node on which the dataset is first cut is called the Root Node. Put another way, the data is grouped based on different conditions.

Yellow Nodes-Decision Nodes, Green Nodes-Leaf Nodes, Node highlighted in blue-Root Node

The learning aspect of a decision tree refers to identifying where to make these cuts on the dataset.

I write a line of code but how does it work?????
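For context, that line of code looks roughly like this with scikit-learn. This is a minimal sketch: X_train and y_train are placeholders for our training features and the chili powder quantities, and newer scikit-learn versions spell the criterion 'squared_error' instead of 'mse'.

```python
from sklearn.tree import DecisionTreeRegressor

# criterion='mse'  -> split quality is measured by mean squared error
# splitter='best'  -> at every node, try all candidate splits and keep the best one
model = DecisionTreeRegressor(criterion='mse', splitter='best')
model.fit(X_train, y_train)  # X_train: independent variables, y_train: chili powder (grams)
```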

Let us discuss with an example.

Assume we are working on a problem to predict the quantity of chili powder in grams (dependent variable) we should be adding to a dish, given that the independent variables to be considered are

i. Type of cuisine (Indian, Chinese, Thai),

ii. Presence of fresh chilies in the dish (0-No,1-Yes),

iii. Is the dish cooked for kids? (0-No,1-Yes),

iv. Base ingredient of the dish (Rice, Noodles, Vegetables),

v. Quantity of the dish prepared (in grams).

There may be many more factors to consider, but let’s limit to these five.

Our dataset

Other examples could have been used for explanation, but what’s the thrill if you can’t relate it to something so familiar?

(Note to the reader: the dataset has been created purely for the purpose of explanation; it may not match real-world values.)

The tree starts to grow by finding the independent variable and the threshold at which the reduction of the metric (specified as the ‘criterion’) is maximum.

Sequence of steps:

1. The ‘mse’ (mean squared error) of the dependent variable in the training data is computed. Here, ‘mean squared error’ describes the spread of the values around their average (like variance). (MSE is calculated because, in the code, criterion=‘mse’.)

The ‘y’ values in our data: 26, 15, 25, 30, 10, 24, 13, 8, 14, 27
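As a quick check, this first MSE can be computed by hand or with a couple of lines of NumPy:

```python
import numpy as np

y = np.array([26, 15, 25, 30, 10, 24, 13, 8, 14, 27])

mean_y = y.mean()                  # 19.2
mse = ((y - mean_y) ** 2).mean()   # average squared deviation from the mean
print(mse)                         # 57.36
```

This 57.36 is the starting MSE that every candidate split below is compared against.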

2. Next, the model tries to split the dataset by trying every single independent variable and every possible threshold as a condition and calculates the MSE after splitting the data.

What are the possible splits on the data?

For example, we try to split the data based on the presence of fresh chilies,

Data — after splitting on the variable chilies

Calculating MSE for the data after splitting on ‘presence of chilies’:

Reduction in MSE = 57.36 - 57.33 = 0.03
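Here the ‘MSE after splitting’ (57.33) is, in the usual formulation, the sample-weighted average of the MSEs of the two groups created by the split. A small helper for this calculation, reusable for any candidate split, could look like the sketch below (left_mask marks which rows fall on the left side of the split):

```python
import numpy as np

def mse(values):
    """Average squared deviation of the values from their own mean."""
    return ((values - values.mean()) ** 2).mean()

def mse_reduction(y, left_mask):
    """Reduction in MSE obtained by splitting y into y[left_mask] and the rest."""
    y = np.asarray(y, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    left, right = y[left_mask], y[~left_mask]
    mse_after = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
    return mse(y) - mse_after
```

This is the computation behind the 0.03 figure above.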

Similarly, let’s try to split our data on the variable ‘is the dish cooked for kids?’

Reduction in MSE = 57.36 - 57.302 = 0.058

Likewise, the data can possibly be split on cuisine, i.e., if type of cuisine = ‘Thai’, etc., and the corresponding reduction in MSE is computed.

What happens in the case of a continuous variable?????

‘Quantity of dish prepared’ is one such variable in our dataset.

The reduction in MSE is calculated keeping different values of the continuous variable as the threshold.

Reduction in MSE = 57.36 - 19.3 = 38.06

The process is iterated for different thresholds of the continuous variable.
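Continuing the sketch from above, that iteration is just a loop over candidate thresholds. The quantities array below is a hypothetical stand-in for the ‘quantity of dish prepared’ column, not the actual table values (scikit-learn actually uses midpoints between consecutive sorted values as thresholds, but the idea is the same):

```python
# quantities: np.array of the 'quantity of dish prepared' values, y: chili powder (grams)
# Candidate thresholds: every observed quantity except the largest,
# so that both sides of the split stay non-empty.
best_threshold, best_reduction = None, 0.0
for threshold in np.unique(quantities)[:-1]:
    reduction = mse_reduction(y, quantities <= threshold)
    if reduction > best_reduction:
        best_threshold, best_reduction = threshold, reduction
```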

The ‘reduction in MSE’ value tells us if the splitting of data on the particular variable is effective. In the same way, for every single variable the reduction in MSE is calculated.

3. The variable which gives the maximum reduction in MSE is thus chosen as the root node to split on (in the code, we mentioned splitter=’best’).

In our example, the variable ‘quantity of dish prepared’ with the threshold 1115 is chosen for the root node because it leads to the maximum reduction in MSE.

Hence, the first condition used to split the data is ‘quantity of dish prepared’ <= 1115.
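Once the tree is fitted, this root split can be read straight off the model. The sketch below assumes X_train is a pandas DataFrame, so that column names are available:

```python
feature_index = model.tree_.feature[0]            # node 0 is the root node
print(X_train.columns[feature_index],
      '<=', model.tree_.threshold[0])             # e.g. quantity of dish prepared <= 1115.0
```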

4. For the tree to grow, the nodes resulting from the root may be split further. We have 2 nodes branching from the root (one consisting of data where the quantity of the end dish is ≤ 1115, and the other with data where it is > 1115).

‘Which of those two nodes branch further?’ is answered based on the hyper-parameters that we specify.

The hyper-parameters ‘min_impurity_split’, ‘min_samples_leaf’, ‘min_samples_split’ and ‘min_impurity_decrease’ help us decide whether a node will branch. Once a node that can branch is selected, the same procedure repeats in order to find the variable and threshold which lead to the maximum reduction in MSE (the MSE of the values in the selected node).
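For instance, a more constrained tree could be requested like this (the values below are purely illustrative and not tuned for our toy dataset; newer scikit-learn versions spell the criterion 'squared_error'):

```python
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(
    criterion='mse',            # 'squared_error' in newer scikit-learn versions
    splitter='best',
    max_depth=3,                # stop after 3 levels of splits
    min_samples_split=4,        # a node needs at least 4 samples to be considered for a split
    min_samples_leaf=2,         # every leaf must keep at least 2 samples
    min_impurity_decrease=0.5,  # split only if the (weighted) MSE drops by at least this much
    max_leaf_nodes=5,           # cap the total number of leaves
)
model.fit(X_train, y_train)
```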

This process repeats until the max_depth of the tree or max_leaf_nodes is reached. Growth also stops when no further splitting is possible. The following image is the decision tree model for our training set.

Decision Tree Model for our data set
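A picture like the one above can be generated from the fitted model itself. This is a sketch; feature_names is assumed to hold the column names of the training features:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(14, 8))
plot_tree(model, feature_names=feature_names, filled=True)  # each box shows samples, value, mse and the split condition
plt.show()
```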

Representation:

Each block/node in the model gives us insights on 4 aspects:

1. Number of samples of the training data present in that node (samples)

2. Mean value of the points in that node (value)

3. MSE of the values in that node (mse)

4. The variable and threshold on which that node is split further.

Notice that the leaf nodes give only insights 1, 2 and 3 because they do not branch further.

We can see 10 leaf nodes in our final model because we had 10 unique values of the dependent variable in our data. Generally, a fully grown tree is not preferred; such a tree keeps growing until there is exactly one unique value in each leaf node.

(Damn…that would lead to one big tree if the training data is large).

It will also lead to another problem called overfitting of the model to our training data. Hence, it is essential to tune the hyper-parameters. For more details on the hyper-parameters, refer to

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor

The model is built. What next???

While predicting on an unseen data point, the model takes the independent variable values of the unseen point and checks them against the conditions that were learned on the training data. At some point, the unseen data point falls into one of the leaf nodes, and the final prediction is the average of the training points in that node. The model simply returns the ‘value’ insight of the node into which our unseen data falls.

Say we need to predict the quantity of chili powder for a dish with the following attributes: [Thai, with chilies, not made for kids, base ingredient: Rice, 850 grams of end dish]. The data point travels down the path shown in red and ends up in a leaf node (beyond this, no condition is checked). The ‘value’ insight of that node (the average chili powder of the training data points in that node) is 10, hence the predicted quantity of chili powder for this unseen data point is 10 grams.
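In code, this is a single call to predict. The sketch below assumes the categorical columns were one-hot encoded and that X_train is the pandas DataFrame used for training; the column names are illustrative and must match however the training data was actually encoded:

```python
import pandas as pd

# cuisine='Thai', fresh chilies present, not cooked for kids, base='Rice', 850 g of end dish
new_dish = pd.DataFrame([{
    'chilies': 1,
    'kids': 0,
    'quantity': 850,
    'cuisine_Indian': 0, 'cuisine_Chinese': 0, 'cuisine_Thai': 1,
    'base_Rice': 1, 'base_Noodles': 0, 'base_Vegetables': 0,
}])

new_dish = new_dish[X_train.columns]   # same column order as during training
print(model.predict(new_dish))         # -> [10.]  i.e. 10 grams of chili powder
```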

Photo courtesy: https://www.barnorama.com/20-hilarious-surprised-cats-photos/

Okay…Wait….What?? Average of points??? That’s exactly how the k-Nearest Neighbors algorithm predicts. Yes, k-NN takes the k ‘nearest points’ and returns their average as the predicted value. A Decision Tree Regressor works the same way, but to know which points to average over, it uses the splitting technique described above.

Happy Learning!!!!!!!

References:

[1] https://saedsayad.com/decision_tree_reg.htm

[2] https://scikit-learn.org/stable/modules/tree.html
