Basic Regression Models


Linear Regression and Regression Trees

by Satoru Hayasaka and Rosaria Silipo, KNIME

When we talk about Machine Learning algorithms, we often think of classification problems. Indeed, the most common problems in machine learning are about classification, mainly because predicting a few classes is often easier than predicting an exact number. A less commonly used branch of data science involves numerical predictions. A family of algorithms dedicated to solving numerical prediction problems is regression, in its basic and ensemble forms. In this article, we describe two basic regression algorithms: linear regression and regression tree.

The problem of numeric predictions

An overarching goal of regression analysis is to model known numerical outcomes based on the available input features in the training set. Classic case studies are stock price prediction, demand prediction, revenue forecasting, and even anomaly detection [1]. Most forecasting and prediction problems generally require numerical outcomes.

Many algorithms have been proposed over the years and, among those, many regression algorithms. Two basic, classic, and widely adopted regression algorithms are linear regression and regression tree. We want to explore the theory behind each of them, along with their pros and cons, to better understand when it is better to use one rather than the other.

Figure 1. MPG (y) vs. HP (x) in the Auto-MPG dataset

Let’s take a toy example to run our exploration: a small dataset with two numeric features, one as the target and one as the input. The “auto-MPG” dataset from the UC Irvine Machine Learning Repository describes 398 car types by brand, engine measures, and chassis features. Two of these attributes sound interesting for our little experiment: horsepower (HP) and miles per gallon (MPG) (Figure 1). It is likely that the two attributes are related.

Is it possible to build a regression model where MPG (outcome y) can be described through HP (input feature x)? The goal of the regression model is to build that function f(), so that y=f(x).
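If you want to follow along outside of KNIME, a minimal Python sketch like the one below can load the data and extract the two columns of interest. The file location and column names are assumptions based on the usual layout of the UCI repository; only pandas is required.

```python
import pandas as pd

# Column layout of the UCI "auto-mpg.data" file; missing horsepower values are coded as "?".
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model_year", "origin", "car_name"]

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
df = pd.read_csv(url, sep=r"\s+", names=columns, na_values="?")
df = df.dropna(subset=["horsepower"])      # a few cars have no horsepower value

x = df["horsepower"].to_numpy()            # input feature: HP
y = df["mpg"].to_numpy()                   # target: MPG
print(len(x), "cars loaded")
```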

Linear Regression

There are different approaches to regression analysis. One of the most popular is linear regression [2], in which we model the target variable y as a linear combination of the input features x:

\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p

If there is only one input feature, the resulting model describes a regression line. If there is more than one input feature, the model describes a regression hyperplane.

Figure 2 is an example of a linear regression model in a two-dimensional space. The slope and the intercept of the regression line are controlled by the regression coefficients.

Figure 2. A linear regression model fitting MPG (y) from HP (x) on the Auto-MPG dataset.

Fitting a linear regression model means adjusting the regression coefficients to best describe the relationship between x and y. To do so, we calculate the total error between the observed data and the linear regression predictions. The single error at each data point is referred to as a residual. The best model minimizes the total error, i.e. the residuals for all data points simultaneously.

The sum of squared residuals E is adopted as the total error:

E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

that is, the sum across all n data points in the training set of the squared difference between the real value y_i of the target and the value \hat{y}_i estimated by the linear regression model. This difference, y_i - \hat{y}_i, is the residual of the data point.

So, the regression coefficients of the linear regression model are found by minimizing the sum of squared residuals E. We are in luck: this optimization problem has a closed-form solution [2]. In a two-dimensional space, the optimum coefficients of the line \hat{y} = b_0 + b_1 x are obtained by setting the partial derivatives of E with respect to the coefficients to zero:

\frac{\partial E}{\partial b_0} = 0, \qquad \frac{\partial E}{\partial b_1} = 0

Which leads to:

b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}

Moving to a higher-dimensional space, the solution of the equation system takes the matrix form:

\hat{\mathbf{b}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}

where \mathbf{y} is the vector of target outcomes for all data rows in the training set and \mathbf{X} is the matrix of all data rows in the training set, with a column of ones for the intercept. The result \hat{\mathbf{b}} is the vector of estimated regression coefficients.
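As an illustration, the closed-form solution can be computed directly with NumPy. This is a sketch, not the workflow used for the figures; it assumes the x (HP) and y (MPG) arrays from the loading snippet above.

```python
import numpy as np

# Design matrix: a column of ones for the intercept, plus the HP column.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X b = y, i.e. b = (X^T X)^(-1) X^T y.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = b
print(f"MPG ≈ {intercept:.2f} + {slope:.4f} * HP")
```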

Regression Tree

Another popular regression approach, known as CART (Classification And Regression Trees), was introduced by Breiman and colleagues in the 1980s [2].

Instead of fitting all the data simultaneously, as in the construction of a linear regression model, the regression tree algorithm fits the data piecewise, one piece after the other. In a two-dimensional space, in interval A every x produces y = c(A); in interval B every x produces y = c(B); and so on. Such a piecewise model is a regression tree. In a higher-dimensional space, each interval becomes a region of the space.

Figure 3. A regression tree model fitting MPG (y) from HP (x) on the Auto-MPG dataset.

In a regression tree model, as you can see in Figure 3, a constant value is fitted within each segment of the input attributes. This way, the outcome variable is modeled from the input features without explicitly using a mathematical function.

Now let’s have a look at how a regression tree model can be constructed.

In the first step, we want to split the training set into two subsets. Therefore, we want to find the threshold S that best splits the input feature x into two segments. Within each segment m, the outcome y is modeled by the local mean value of y:

c(m) = \frac{1}{N_m} \sum_{x_i \in m} y_i

where c(m) is the constant outcome value in segment m, modeled as the average of the y values of the N_m data points falling in that segment.

The constant value in each segment does not necessarily have to be the mean value, it could be anything else, like for example the quadratic average or even a function [2].

In this scenario, what would be the best boundary S that splits the input feature x into two segments? Let’s first have a look at the error produced by such a split. Within each segment m, the error E(m) can be calculated as the sum of the squared distances of all points in the segment from the segment mean c(m):

E(m) = \sum_{x_i \in m} \left(y_i - c(m)\right)^2

Thus, the total error E is the sum of the errors over all segments m:

E = \sum_{m} E(m)

We need to find S so that the total error E is minimized.

In this example, we ran a brute-force search for the optimal split S, that is, we calculated the mean and the error of the two candidate segments for a boundary moving along the whole range of x, and we found that the optimum split is S = 93.5. We used a brute-force search strategy, but any other search strategy would have worked [3]. This split S becomes the root of the tree, as you can see in Figure 4.
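The brute-force search itself takes only a few lines. The sketch below assumes the same x and y arrays as before and scans candidate thresholds at the midpoints between consecutive sorted values of x.

```python
import numpy as np

def best_split(x, y):
    """Return the threshold S on x that minimizes the total squared error
    when each of the two resulting segments is modeled by its mean."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # Candidate thresholds: midpoints between consecutive sorted x values.
    candidates = np.unique((xs[:-1] + xs[1:]) / 2.0)
    best_s, best_err = None, np.inf
    for s in candidates:
        left, right = ys[xs <= s], ys[xs > s]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

s, err = best_split(x, y)
print(f"Optimal first split: S = {s:.1f}, total squared error = {err:.1f}")
```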

Then, we grow the tree by finding another split within each of the two segments, in the same way as for the previous split. We continue this process in each branch until we reach one of these stopping criteria:

- if all points in a node have identical values for all input features

- if the next split does not significantly reduce the total error

- if a split produces a node smaller than the minimum node size

Limiting the minimum node size and the tree depth is important in order to avoid overfitting.
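Outside of KNIME, the same kind of tree, with limited depth and minimum node size, can be fitted for example with scikit-learn's DecisionTreeRegressor. The parameter values below are purely illustrative and not necessarily those behind Figure 4.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Illustrative stopping parameters: limit depth and minimum leaf size to avoid overfitting.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10)
tree.fit(x.reshape(-1, 1), y)                      # single input feature: HP

print(export_text(tree, feature_names=["HP"]))     # print the learned splits as text
```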

By following the splits on a regression tree, we can easily reach the predicted outcome. The resulting tree is shown in Figure 4.

Figure 4. The final regression tree fitting MPG (y) from HP (x) on the Auto-MPG dataset.

Scoring Metrics

Once you fit a regression model to your data, how can you evaluate how accurate the model is? There are several goodness-of-fit metrics for this, for example the mean absolute error (MAE), the root mean squared error (RMSE), and R-squared, just to name a few.

The mean absolute error, or MAE, is calculated as the average of the absolute values of the residuals. It is on the same scale as the target variable y and can be interpreted as the average deviation of the model predictions from the observed data.

The root mean squared error, or RMSE, is calculated as the name suggests: the square root of the mean of the squared residuals. Like MAE, it describes the deviation between the observed data and the model. However, because the residuals are squared, more weight is given to large deviations, and the RMSE is consequently more sensitive to such data points.

R-squared is a relative measure of goodness-of-fit. It quantifies the proportion of the variability explained by the model. R-squared ranges from 0 to 1, with 0 indicating no variability explained by the model, and 1 indicating all variability explained by the model.
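All three metrics are straightforward to compute from the residuals. The sketch below reuses the arrays and the linear model coefficients from the earlier snippets; the same function can of course score the regression tree predictions as well.

```python
import numpy as np

def goodness_of_fit(y_true, y_pred):
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))                    # mean absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))             # root mean squared error
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                          # R-squared
    return mae, rmse, r2

# Example: evaluate the linear model fitted in the NumPy snippet above.
mae, rmse, r2 = goodness_of_fit(y, X @ b)
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}, R2 = {r2:.3f}")
```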

In our toy problem, with the current set of parameters, we get the goodness-of-fit as in the following table.

Pros and Cons

Comparing the prediction errors is not the only interesting comparison we can make. Indeed, which algorithm performs best depends on the data and on the task.

In general, however, regression trees have a few advantages.

- Ease of interpretation. We can walk through the tree and clearly understand the decision process that assigns an outcome value to a given input value.

- Execution speed. Since a large part of the data is filtered out at each split, the tree has to work on fewer and fewer data points as its construction proceeds. This also makes the model less sensitive to outliers.

- No data preparation required. Because of the simple math involved, a regression tree does not require statistical assumptions about the data or special preprocessing. By comparison, linear regression assumes, among other things, normally distributed errors and independent training instances.

Linear regression also has some clear advantages.

- Linearity. It makes the estimation procedure simple, and the resulting linear relationship is easy to understand and interpret.

- On problems where the relationship between inputs and target is approximately linear, it of course works best.

So — as usual — depending on the problem and the data at hand, one algorithm will be preferable to the other.

The KNIME workflow used to train the linear regression model and the regression tree is shown in Figure 5 and available on the KNIME Hub under https://kni.me/w/gSAlDSojYMbi9wgl .

We hope that this comparison has been useful to show and understand the main differences between linear regression and regression trees.

Figure 5. The KNIME workflow used to generate the regression tree and the linear regression fitting MPG (y) from HP (x) on the Auto-MPG dataset, available on the KNIME Hub at https://kni.me/w/gSAlDSojYMbi9wgl .

References

[1] R. Silipo, “Practicing Data Science”, KNIME Press, 2018

[2] M.R. Berthold, C. Borgelt, F. Höppner, F. Klawonn, R. Silipo, “Guide to Intelligent Data Science”, Springer, 2020

[3] M. Lisovyi, R. Silipo, “Machine Learning algorithms and the art of hyperparameter selection”, Towards Data Science, 2020


Rosaria has been mining data since her master's degree, through her doctorate and the job positions that followed. She is now a data scientist and KNIME evangelist.