30 days of Data Science — Day 3: Multiple Linear Regression

Brian Rey · Published in Data and me · 8 min read · May 30, 2022

In previous articles, I explained what regression problems are and how we can use simple linear regression to solve some of them.

The truth is that, in reality, few things are influenced by only a single factor, so looking for a single best predictor may result in disappointment. This doesn’t mean that linear regression is a bad technique for solving problems, but treating simple linear regression as a silver bullet is not suitable at all.

Simple linear regression is not capable of dealing with our complex world

But no worries: the family of linear regression models has many members. Having set aside the simple linear regression model as not very useful (in most cases) beyond learning purposes, today we’ll meet its big brother: Multiple Linear Regression.

What is Multiple Linear Regression?

As I previously explained, in statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables.

  • The case of one explanatory variable is called simple linear regression.
  • For more than one explanatory variable, the process is called multiple linear regression.

In essence, multiple linear regression is the extension of simple linear regression that allows the model to deal with many more factors when determining the best line to fit our data.

Like simple linear regression models, multiple linear regression models fall under the category of supervised learning techniques, because we need data with known outcomes before we can fit any model.

Why do we need multiple linear regression?

Too much reliance on a single linear regression model

Let’s suppose we want to predict the price of a house. For instance, we can see if the number of bathrooms in a house affects price, holding all else equal. If we only have one variable affecting our dependent variable (housing prices), we would be creating a simple linear regression model. But what if we believed that more than one thing affects housing prices (e.g., view, neighborhood, proximity to the nearest city)?

This is where multiple linear regression becomes a great tool. Multiple regression can incorporate many independent variables into a single model and explain how those independent variables affect the dependent variable.

In our example above, if we wanted a model that explains our dependent variable more completely, we could add variables like school district, crime rate, or anything else that may help predict our target variable (price).

Time for the math formulas

When you understand it by what it is: an abstraction XD

At the end of the day, the goals for a linear regression algorithm might be described as:

  • Find the optimal weight (which produces the least error) for each of the predictors (independent variables) that we consider influential for our target (dependent variable).
  • Find the value of the term that remains constant for all data points (also called the y-intercept).
  • Find the error term due to randomness in the data (our data is a sample representing a larger population, not the population itself).
  • Check the error (how wrong our model is compared with the real data) and reduce it through repetition (see the sketch below).
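To make these goals concrete, here is a minimal sketch of that fit-check-reduce loop, written as gradient descent in plain NumPy. Everything in it (the data, the learning rate, the iteration count) is invented for illustration; in practice a library such as scikit-learn handles this for you.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: 100 observations, 3 predictors, known "true" weights.
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

weights = np.zeros(3)   # one weight per predictor
intercept = 0.0         # the constant term (y-intercept)
lr = 0.05               # learning rate

for _ in range(2000):
    predictions = X @ weights + intercept
    errors = predictions - y                      # how wrong we are per sample
    # Gradient of the mean squared error with respect to each parameter.
    weights -= lr * (2 / len(y)) * (X.T @ errors)
    intercept -= lr * (2 / len(y)) * errors.sum()

print(weights, intercept)  # should approach [2.0, -1.0, 0.5] and 3.0
```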

Here are some definitions in case you don’t remember them:

  • Independent variable: This is the variable we’ll use as a predictor (for example a person’s height) and is assumed to have a direct effect on the dependent variable.
  • Dependent variable: This is the variable we want to predict (for example a person’s weight) based on the independent variable.
  • Error (or residual): The difference between the predicted value and the real value.
  • Weights: The value that each parameter in the ML model has.
  • Loss: A measure of how wrong our model is in its estimations.

In the case of a simple linear regression model, we simply have to find the weight of a predictor, the final formula being something like:

y = β0 + β1x + ε (image credit: Clarence San)

When we translate it to a multiple linear regression model, the formula is not very different, except that each predictor has its own weight (which, in simpler terms, means how much impact that feature has on the target):

y = β0 + β1x1 + β2x2 + … + βnxn + ε (image credit: Open Data Science)
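To make the notation concrete, here is that formula written out in a few lines of NumPy; the intercept, weights, and observation values are invented for illustration.

```python
import numpy as np

b0 = 3.0                        # the y-intercept (β0)
w = np.array([2.0, -1.0, 0.5])  # one weight per predictor (β1, β2, β3)
x = np.array([1.2, 0.7, 3.1])   # one observation's predictor values

y_hat = b0 + w @ x              # ŷ = β0 + β1x1 + β2x2 + β3x3
print(y_hat)                    # 6.25
```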

Assumptions of the model

Like any model, it relies on a few assumptions to work properly:

  1. The correlation between the dependent and independent variables is strong, and the dependent variable is linearly related to the independent variables.
  2. The independent variables are not correlated with each other.
  3. The residuals should be normally distributed, meaning that the errors do not contain important information.
  4. The residuals should not contain any pattern.
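A rough way to sanity-check assumptions 3 and 4 is to fit a model and inspect its residuals. Below is a minimal sketch on synthetic data, using SciPy’s Shapiro-Wilk test for normality; for the no-pattern assumption you would typically also plot the residuals against the fitted values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Ordinary least squares via NumPy (design matrix with an intercept column).
X1 = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
residuals = y - X1 @ beta

# Shapiro-Wilk: a p-value well above 0.05 is consistent with normal residuals.
print(stats.shapiro(residuals))
```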

The problem of dealing with more features

In short, this is the kind of situation we want to avoid.

Because we’re dealing with more predictors, we run into the possibility of having predictors that are related to each other. This is called multicollinearity, a phenomenon unique to multiple regression that occurs when two variables that are supposed to be independent in reality have a high amount of correlation.

Correlation is the association between variables: it measures the extent to which two variables are related to each other. Two variables can have a positive correlation (they tend to move in the same direction), a negative correlation (they tend to move in opposite directions), or no correlation.

Independent variables should be independent of each other. When independent variables are correlated, it indicates that changes in one variable are associated with shifts in another variable, which creates a bias.

If the correlation is extremely high, it can produce unstable coefficient estimates, leading to wrong predictions and misinterpretations.

We don’t want bias at all.

Based on our example of predicting house prices, let’s suppose we have several features such as the square footage of living space, the square footage of the lot, the square footage above ground (excluding the basement), and the square footage of the basement.

We can assume that as the square footage of the basement increases, so does the square footage of the areas above the basement. Likewise, we assume that the overall square footage of the house will increase in tandem with these variables.
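A quick way to spot this kind of relationship is a pairwise correlation matrix of the predictors. Here is a small sketch with pandas; the column names and numbers are invented, with sqft_living deliberately constructed as the sum of the other two columns to mirror the example above.

```python
import pandas as pd

df = pd.DataFrame({
    "sqft_living":   [1180, 2570, 770, 1960, 1680],
    "sqft_above":    [1180, 2170, 770, 1050, 1680],
    "sqft_basement": [   0,  400,   0,  910,    0],
})

# Predictor pairs with correlations near +1 or -1 are a warning sign.
print(df.corr())
```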

How to deal with it?

The degree of multicollinearity can also impact how we solve the problem.

In many cases, we can overlook minor multicollinearity. So ask yourself what the issue is that you’re trying to solve and what your goal is, and then take the next steps. It’s always beneficial to know what your data is and how it behaves before working with it.

If the multicollinearity is too strong to ignore, you might have to prioritize which features to keep so that your model stays free of bias.
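One common way to quantify the severity before deciding what to drop is the variance inflation factor (VIF), available in statsmodels. Here is a sketch on the same kind of invented housing data; values far above the usual rule of thumb of 5 to 10 flag a redundant predictor.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Invented data: sqft_living is almost exactly the sum of the other two
# columns, so we expect very large VIFs.
df = pd.DataFrame({
    "sqft_living":   [1180, 2570, 770, 1960, 1680, 2250, 1050],
    "sqft_above":    [1180, 2170, 770, 1050, 1680, 1800, 1050],
    "sqft_basement": [   0,  400,   0,  910,    0,  400,    0],
})

X = sm.add_constant(df)  # VIF is usually computed with an intercept included
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 1))
```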

Real-life examples

Measuring the effect of fertilizer and water on crop yields.

Photo by Gautier Pfeiffer on Unsplash

Agricultural scientists often use linear regression to measure the effect of fertilizer and water on crop yields.

For example, scientists might use different amounts of fertilizer and water on different fields and see how it affects crop yield. They might fit a multiple linear regression model using fertilizer and water as the predictor variables and crop yield as the response variable. The regression model would take the following form:

Crop Yield = β0 + β1(amount of fertilizer) + β2(amount of water)

  • The coefficient β0 would represent the expected crop yield with no fertilizer or water.
  • The coefficient β1 would represent the average change in crop yield when fertilizer is increased by one unit, assuming the amount of water remains unchanged.
  • The coefficient β2 would represent the average change in crop yield when water is increased by one unit, assuming the amount of fertilizer remains unchanged.

Depending on the values of β1 and β2, the scientists may change the amount of fertilizer and water used to maximize the crop yield.
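As a sketch of how this fit might look in code, here is the crop-yield model in scikit-learn; all measurements are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: amount of fertilizer, amount of water (arbitrary units).
X = np.array([[1, 10], [2, 10], [2, 20], [3, 15], [4, 20], [4, 25]])
y = np.array([6.0, 7.5, 9.0, 9.5, 11.5, 12.5])  # crop yield

model = LinearRegression().fit(X, y)
print("β0 (yield with no fertilizer or water):", model.intercept_)
print("β1 (fertilizer), β2 (water):", model.coef_)
```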

Measure the effect that different training regimens have on player performance.

Photo by Anastase Maragos on Unsplash

Data scientists for professional sports teams often use linear regression to measure the effect that different training regimens have on player performance.

For example, data scientists in the NBA might analyze how different amounts of weekly yoga sessions and weightlifting sessions affect the number of points a player scores. They might fit a multiple linear regression model using yoga sessions and weightlifting sessions as the predictor variables and total points scored as the response variable. The regression model would take the following form:

Points scored = β0 + β1(yoga sessions) + β2(weightlifting sessions)

  • The coefficient β0 would represent the expected points scored for a player who participates in zero yoga sessions and zero weightlifting sessions.
  • The coefficient β1 would represent the average change in points scored when weekly yoga sessions are increased by one, assuming the number of weekly weightlifting sessions remains unchanged.
  • The coefficient β2 would represent the average change in points scored when weekly weightlifting sessions are increased by one, assuming the number of weekly yoga sessions remains unchanged.

Depending on the values of β1 and β2, the data scientists may recommend that a player participates in more or less weekly yoga and weightlifting sessions to maximize the points scored.
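The same kind of model could be fit with statsmodels, whose results object also exposes p-values for each coefficient via results.summary(); again, the numbers below are invented.

```python
import numpy as np
import statsmodels.api as sm

yoga = np.array([0, 1, 1, 2, 2, 3, 3, 4])     # weekly yoga sessions
lifting = np.array([1, 0, 2, 1, 3, 2, 4, 3])  # weekly weightlifting sessions
points = np.array([11.5, 12.0, 15.0, 15.5, 18.5, 18.0, 22.0, 22.5])

X = sm.add_constant(np.column_stack([yoga, lifting]))  # prepends the β0 column
results = sm.OLS(points, X).fit()
print(results.params)  # [β0, β1 (yoga), β2 (weightlifting)]
```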

Wrapping up

Day 3 of our “30 days of data science” challenge

Linear regression is a powerful tool, but it is not a silver bullet. After all, few things in life are influenced by only a single factor, so looking for a single best predictor may result in disappointment.

Multiple linear regression works similarly, with the difference that it can take many independent variables into account and help us understand how they affect the dependent variable. Adding more predictors brings the problem of multicollinearity (predictors correlated with each other), which can be addressed, or safely ignored, depending on our problem.

Thank you for reading! I hope this article helped you understand multiple linear regression. Stay tuned for future articles on my machine learning journey by following me on Medium.
