Regression Trees | Decision Tree for Regression | Machine Learning

How can regression trees be used to solve regression problems, and how do you build one?

Ashwin Prasad
Analytics Vidhya
7 min read · Aug 8, 2021


This blog assumes that the reader is familiar with the concepts of decision trees and regression. If not, refer to the blogs linked below.

What Are Regression Trees?

Having read the blogs above, or being already familiar with the relevant topics, you should by now understand what a decision tree is (the one we used for the classification task). A regression tree is simply a decision tree used for regression: it predicts continuous-valued outputs instead of discrete ones.

Mean Square Error

In Decision Trees for Classification, we saw how the tree asks the right question at the right node in order to give accurate and efficient classifications. Classification trees do this using 2 measures, namely entropy and information gain. But since we are now predicting continuous variables, we cannot compute entropy and follow the same process. We need a different measure: one that tells us how far our predictions deviate from the actual target. That is where mean square error comes in.

fig 1.1: Mean Square Error

Y is the actual value and Y_hat is the prediction. We only care about how much the prediction deviates from the target, not in which direction, so we square the differences and divide the sum by the total number of records.
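Written out, the formula in fig 1.1 is the standard mean square error, where n is the number of records:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$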

In the regression tree algorithm, we do the same thing as in classification trees, but we try to reduce the mean square error at each child node rather than the entropy.

Building a Regression Tree

Let’s consider a dataset where we have 2 variables, as shown below

fig 2.1: Dataset, X is a continuous variable and Y is another continuous variable
fig 2.2: The actual dataset Table

We need to build a regression tree that best predicts Y given X.

Step 1

The first step is to sort the data based on X (in this case, it is already sorted). Then take the average of the first 2 values of X, which is (1+2)/2 = 1.5 for the given dataset, and divide the dataset into 2 parts (Part A and Part B), separated by X < 1.5 and X ≥ 1.5.

Now, Part A consists of only one point, the first row (1, 1), and all the other points are in Part B. Take the average of all the Y values in Part A and the average of all the Y values in Part B separately. These 2 values are the predicted outputs of the decision tree for X < 1.5 and X ≥ 1.5 respectively. Using the predicted and actual values, calculate the mean square error and note it down.
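As a rough sketch of this step in Python (the numbers below are made up for illustration, since the full table lives in fig 2.2; only the first row (1, 1) and the sorted X values are taken from the example):

```python
import numpy as np

# Illustrative stand-in for the table in fig 2.2 -- not the article's actual Y values.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1.0, 1.2, 1.1, 1.3, 1.2, 3.0, 3.2, 3.1, 3.3, 3.2])

threshold = (x[0] + x[1]) / 2                            # average of the first 2 X values = 1.5
left_y, right_y = y[x < threshold], y[x >= threshold]    # Part A and Part B

# Each part is predicted by the mean of its Y values.
predictions = np.where(x < threshold, left_y.mean(), right_y.mean())
mse = np.mean((y - predictions) ** 2)
print(threshold, mse)
```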

Step 2

In Step 1, we took the average of the first 2 values of sorted X, split the dataset at that point and calculated the predictions. Now we repeat the same process, but this time with the average of the second pair of consecutive values of sorted X ((2+3)/2 = 2.5). We again split the dataset into Part A and Part B based on X < 2.5 and X ≥ 2.5, predict the outputs and compute the mean square error as in Step 1. This process is repeated for the third pair, the fourth pair, and so on up to the (n-1)th pair of consecutive values (where n is the number of records or rows in the dataset).
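Continuing the sketch from Step 1, the loop over all n-1 candidate thresholds can be wrapped in a small helper (again purely illustrative, reusing the x and y arrays defined above):

```python
def candidate_split_mses(x, y):
    """Return (threshold, mse) for every midpoint between consecutive sorted X values."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    results = []
    for i in range(len(x) - 1):
        if x[i] == x[i + 1]:
            continue                                 # duplicate X values give no new split
        threshold = (x[i] + x[i + 1]) / 2
        left_y, right_y = y[x < threshold], y[x >= threshold]
        predictions = np.where(x < threshold, left_y.mean(), right_y.mean())
        results.append((threshold, np.mean((y - predictions) ** 2)))
    return results
```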

Step 3

Now that we have the n-1 mean squared errors, we need to choose the point at which to split the dataset: the point whose split resulted in the lowest mean squared error. In this case, that point is X = 5.5, so the tree is split into 2 parts, X < 5.5 and X ≥ 5.5. The root node is selected this way, and the data points that go to the left child and the right child of the root node are recursively put through the same algorithm for further splitting.
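With the helper from Step 2, choosing the root split is just a matter of taking the minimum:

```python
best_threshold, best_mse = min(candidate_split_mses(x, y), key=lambda pair: pair[1])
print(best_threshold)   # 5.5 for the illustrative data used above
# Rows with x < best_threshold go to the left child, the rest to the right child,
# and each child is then put through the same procedure.
```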

Brief Explanation of What the Algorithm Is Doing

The basic idea behind the algorithm is to find the point in the independent variable at which to split the dataset into 2 parts so that the mean squared error is minimised. The algorithm does this repeatedly and forms a tree-like structure.

A regression tree for the dataset shown above would look like this

fig 3.1: The resultant Decision Tree

and the resultant prediction visualisation would be this

fig 3.2: The Decision Boundary

Well, the logic behind the algorithm is not rocket science. All we are doing is splitting the dataset at the points that best divide it and minimise the mean square error.
And the way we select these points is by iterating over all candidate splits, calculating the mean square error for each, and choosing the split with the lowest MSE. So it is only natural that this works.
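Putting the pieces together, a bare-bones recursive version of this procedure might look like the sketch below. It reuses candidate_split_mses from Step 2, and the min_samples argument is just a crude, illustrative stopping rule (a node with fewer than 2 * min_samples rows becomes a leaf), not the full pre-pruning discussed later.

```python
def build_tree(x, y, min_samples=2):
    """Recursively split (x, y) at the threshold with the lowest MSE.

    Returns a nested dict; a leaf simply stores the mean of its Y values.
    """
    splits = candidate_split_mses(x, y)
    if len(y) < 2 * min_samples or not splits:
        return {"prediction": y.mean()}
    threshold, _ = min(splits, key=lambda pair: pair[1])
    left = x < threshold
    return {
        "threshold": threshold,
        "left": build_tree(x[left], y[left], min_samples),
        "right": build_tree(x[~left], y[~left], min_samples),
    }


def predict(tree, value):
    """Walk the tree from the root down to a leaf for a single X value."""
    while "threshold" in tree:
        tree = tree["left"] if value < tree["threshold"] else tree["right"]
    return tree["prediction"]


tree = build_tree(x, y)
print(predict(tree, 4.0), predict(tree, 7.0))
```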

What happens when there are multiple independent variables?

Let us consider that there are 3 independent variables similar to the variable X from fig 2.2.
At each node, all 3 variables go through the same process that X went through in the example above: the data is sorted by each of the 3 variables separately, and the split point that minimises the MSE is found for each of them.
Out of the 3 variables and their best split points, the one with the lowest MSE is chosen.
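A hedged sketch of that extension, assuming X is now a 2-D array with one column per independent variable and reusing candidate_split_mses from Step 2:

```python
def best_split(X, y):
    """Return (feature_index, threshold, mse) with the lowest MSE over all columns of X."""
    best = None
    for j in range(X.shape[1]):                      # each independent variable in turn
        for threshold, mse in candidate_split_mses(X[:, j], y):
            if best is None or mse < best[2]:
                best = (j, threshold, mse)
    return best
```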

How are categorical variables handled?

When the independent variables are continuous, we select the split point with the least MSE using the iterative procedure described above. When given a categorical variable, we simply split on a (usually binary) question. For example, suppose we have a column specifying the size of a tumor in categorical terms: Small, Medium and Large.

The tree would split the dataset based on whether tumor size = Small, tumor size = Medium or tumor size = Large, or in some cases combine multiple values, depending on whichever question reduces the MSE the most. That question becomes the top contender for this variable (tumor size). The top contenders of the other variables are then compared with it, and the selection proceeds as described in "What happens when there are multiple independent variables?"
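For a small number of categories, one illustrative way to do this is to score every binary "is the value in this subset?" question, as in the sketch below (the tumor-size data is made up):

```python
from itertools import combinations

def categorical_split_mses(values, y):
    """MSE for every binary question of the form 'is the category in this subset?'."""
    categories = sorted(set(values))
    results = []
    for r in range(1, len(categories)):              # subsets and their complements both appear
        for subset in combinations(categories, r):
            mask = np.isin(values, subset)           # e.g. tumor size in {"Small", "Medium"}
            predictions = np.where(mask, y[mask].mean(), y[~mask].mean())
            results.append((set(subset), np.mean((y - predictions) ** 2)))
    return results

sizes = np.array(["Small", "Medium", "Large", "Small", "Medium", "Large"])
tumor_y = np.array([1.0, 2.0, 5.0, 1.2, 2.1, 5.2])
# For this made-up data, the best question separates "Large" from the rest.
print(min(categorical_split_mses(sizes, tumor_y), key=lambda pair: pair[1]))
```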

Dealing with Over-Fitting, and When to Stop Building the Tree?

From the previous blogs, you might already understand the problem of overfitting and how it affects machine learning models. Regression trees are prone to this problem.

When we want to reduce the mean square error, the decision tree can recursively split the dataset into a large number of subsets, to the point where a subset contains only one row or record. Even though this may reduce the MSE to zero, it is obviously not a good thing.

This is the famous problem of overfitting, and it is a topic of its own. The basic takeaway is that the model fits the existing data so perfectly that it fails to generalise to new data. We can use cross-validation methods to avoid this.

One way to prevent this, in the case of regression trees, is to specify in advance the minimum number of records or rows a leaf node can have.
The exact number is not easy to choose for large datasets, but cross-validation can be used for this purpose.
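In scikit-learn, for example, this corresponds to the min_samples_leaf parameter of DecisionTreeRegressor, and cross-validation can pick its value; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                # synthetic data, for illustration only
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# min_samples_leaf=1 would let the tree grow one leaf per row and overfit;
# cross-validation picks a larger value that generalises better.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"min_samples_leaf": [1, 2, 5, 10, 20, 50]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```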

Conclusion

To conclude, regression trees are simply decision trees used for regression, and they can be useful in many areas where the relationship between the variables is non-linear.
One thing to remember is that the algorithm is prone to overfitting, so it is better to specify the minimum number of records per leaf node in advance and to use cross-validation to find this value.

Thank You
