Multiple Linear Regression in Machine Learning

A couple of weeks ago I wrote an article on simple linear regression, which I recommend reading before proceeding to this one. Machine learning is a very interesting topic that I have been studying in my free time. I hope this article sparks your interest in the subject or helps fuel it further.

Prerequisites

An extension of simple linear regression

In simple linear regression there is a one-to-one relationship between the input variable and the output variable. In multiple linear regression, as the name implies, there is a many-to-one relationship: instead of using just one input variable, you use several.

New considerations

Adding more input variables does not mean the regression will be better or offer better predictions. Multiple and simple linear regression have different use cases; one is not superior to the other. In some cases adding more input variables can make things worse; this is referred to as over-fitting.

Multicollinearity

When you add more input variables, you also create relationships among them. So not only are the input variables potentially related to the output variable, they are also potentially related to each other; this is referred to as multicollinearity. For example, in the delivery data set used later in this article, the number of deliveries and the gas price turn out to be correlated with each other. The optimal scenario is for all of the input variables to be correlated with the output variable, but not with each other.

What is over-fitting?

The answer below is cited from Quora:

When you’re at a concert, there’s both the symphony and the random noise. Fitting a perfect model is only listening to the symphony. Over-fitting is when you hear more noise than you need to, or worse, letting the noise drown out the symphony.
- William Chen

The model

The model for multiple linear regression is, as you would expect, very similar to the one for simple linear regression. It goes as follows:
f(X) = a + (B1 * X1) + (B2 * X2) + … + (Bp * Xp)
where X is the set of input variables, Xp is a specific input variable, Bp is the coefficient (slope) of the input variable Xp, and a is the intercept. Let’s test this out with an example!
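
As a minimal sketch of the formula in code (JavaScript, matching the Node.js tooling used later in this article; the function and variable names are my own, not from any library):

```javascript
// f(X) = a + B1*X1 + B2*X2 + ... + Bp*Xp
// a is the intercept, coefficients holds [B1..Bp], inputs holds [X1..Xp].
function predict(a, coefficients, inputs) {
  return coefficients.reduce((sum, b, i) => sum + b * inputs[i], a);
}

// Example: 1 + (2 * 4) + (3 * 5) = 24
console.log(predict(1, [2, 3], [4, 5])); // 24
```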

Preparation before doing multiple regression

  1. Collect a list of potential input variables and a potential output variable.
  2. Collect data on the variables.
  3. Check the correlation between each input variable and the output variable.
  4. Check the correlation among the input variables (a sketch of such a correlation check follows this list).
  5. Use the non-redundant input variables in the analysis to find the best fitting model.
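
Steps 3 and 4 both boil down to computing a correlation coefficient between two columns of data. Here is a minimal sketch of a Pearson correlation helper (my own function, not code from the original article):

```javascript
// Pearson correlation coefficient between two equally sized arrays.
// Returns a value in [-1, 1]; values near +/-1 indicate strong linear correlation.
function pearson(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((s, v) => s + v, 0) / n;
  const meanY = ys.reduce((s, v) => s + v, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}

// Perfectly correlated toy data -> 1
console.log(pearson([1, 2, 3], [2, 4, 6])); // 1
```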

Example

In this example we are going to use a data set that provides information about a postal delivery service. It provides four variables: milesTraveled (X1), numDeliveries (X2), gasPrice (X3) and travelTime (Y). The first three are our input variables, and travelTime is our output variable.

The code

There is quite a bit of code, so take some time to read and understand it. (My output looks different because I used the prettyjson module to format it.)
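
A rough sketch of what the code does, under the assumption that it checks every pairwise correlation and fits a simple regression for the SSE comparison; the data rows below are illustrative placeholders, not the actual postal data, and all names are my own:

```javascript
// Placeholder rows: the structure matches the article's variables, but
// these values are illustrative stand-ins, not the actual data set.
const deliveries = [
  { milesTraveled: 89, numDeliveries: 4, gasPrice: 3.84, travelTime: 7.0 },
  { milesTraveled: 66, numDeliveries: 1, gasPrice: 3.19, travelTime: 5.4 },
  { milesTraveled: 78, numDeliveries: 3, gasPrice: 3.78, travelTime: 6.6 },
  // ...the rest of the data set
];

const variables = ['milesTraveled', 'numDeliveries', 'gasPrice', 'travelTime'];
const column = (name) => deliveries.map((d) => d[name]);

// Pearson correlation coefficient (same helper as the earlier sketch).
function pearson(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((s, v) => s + v, 0) / n;
  const meanY = ys.reduce((s, v) => s + v, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}

// Fit y = a + b*x by least squares and return the sum of squared errors.
function simpleRegressionSSE(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((s, v) => s + v, 0) / n;
  const meanY = ys.reduce((s, v) => s + v, 0) / n;
  let cov = 0, varX = 0;
  for (let i = 0; i < n; i++) {
    cov += (xs[i] - meanX) * (ys[i] - meanY);
    varX += (xs[i] - meanX) ** 2;
  }
  const b = cov / varX;
  const a = meanY - b * meanX;
  return xs.reduce((sse, x, i) => sse + (ys[i] - (a + b * x)) ** 2, 0);
}

// Correlation of every variable with every other variable, plus the SSE
// of a simple regression predicting each variable from each other one.
const report = {};
for (const target of variables) {
  report[target] = {};
  for (const input of variables) {
    if (input === target) continue;
    report[target][input] = {
      correlation: pearson(column(input), column(target)),
      sse: simpleRegressionSSE(column(input), column(target)),
    };
  }
}
console.log(JSON.stringify(report, null, 2));
```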

When you run this code you get the following output…

Miles traveled has a strong correlation with travel time, but not with the other input variables. Number of deliveries has a strong correlation with travel time and gas price, but not miles traveled. Gas price has a good correlation with travel time and number of deliveries, but not miles traveled.

It is expected that when miles traveled is used as the output variable the SSE is quite high, because the miles traveled values are larger than those of any of the other variables. Even when each residual/error is proportionally small, the SSE can end up bigger than that of another variable that performs worse proportionally but works with smaller values. This is why it is important to look at the SSE in the context of how big the values of the variables you are working with are.
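
To make that concrete, here is a small made-up illustration: residuals that are only ~1% of values near 100 still produce a much larger SSE than residuals that are ~10% of values near 1.

```javascript
// Sum of squared errors over a list of residuals.
const sse = (residuals) => residuals.reduce((s, r) => s + r * r, 0);

// ~1% errors on values around 100: proportionally small, but large SSE.
console.log(sse([1.0, -1.2, 0.9]));   // ≈ 3.25
// ~10% errors on values around 1: proportionally large, but small SSE.
console.log(sse([0.1, -0.12, 0.09])); // ≈ 0.0325
```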

Following step 5, “Use the non-redundant input variables in the analysis to find the best fitting model”: all of the input variables have quite a strong correlation with travel time, so we do not need to drop any of them. Let us create a function that can calculate the travel time based on the input variables.

Using the model f(X) = a + (B1 * X1) + (B2 * X2) + … + (Bp * Xp), which was covered earlier in the article, the function ends up looking something like the sketch below (the coefficient values are placeholders; the real values come from fitting the regression to the data):
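
```javascript
// The coefficient values below are placeholders; the real a, b1, b2, b3
// come from fitting the regression to the postal data set.
const a = 0.0;   // intercept
const b1 = 0.0;  // slope for milesTraveled
const b2 = 0.0;  // slope for numDeliveries
const b3 = 0.0;  // slope for gasPrice

function predictTravelTime(milesTraveled, numDeliveries, gasPrice) {
  return a + b1 * milesTraveled + b2 * numDeliveries + b3 * gasPrice;
}
```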

Let us test the function with a delivery from the data set and see how well it performs. I chose to run the function with the 5th delivery as input.

The code
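
A sketch of the test call; the input values below are placeholders for the 5th delivery's actual milesTraveled, numDeliveries and gasPrice:

```javascript
// Placeholder inputs: substitute the 5th delivery's real values here.
const predicted = predictTravelTime(80, 3, 3.5);
console.log(predicted); // my run came out ~0.3 away from the actual 4.8
```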

The output

This is very good! The computer is totally ignorant of the exact travel time, which is 4.8; it only knows the input variables milesTraveled, numDeliveries and gasPrice. Still, it was only ~0.3 away from a perfect prediction. That is more than accurate enough to be useful in the real world.

Imagine you are running a postal service. You could provide your users with accurate estimates of when their packages will be delivered. You could also utilize the other relationships discovered between the variables, for instance to estimate how much gas money a driver needs based on how many packages they have to deliver. This is why step number 4, "Check the correlation among the input variables.", should not be skipped.

Conclusion

Thank you for reading to the end of the article; I hope it provided you with some useful knowledge on the topic of machine learning, and specifically Multiple Linear Regression. I will most certainly post more machine learning articles in the future! If you have any questions, please feel free to email me or send me a direct message on Twitter.