Multiple Linear Regression

Introduction

Akshay Hegde
The Startup
8 min read · Jul 15, 2020


Multiple Linear Regression is one of the algorithms used in Machine Learning to predict a dependent variable using several independent variables. It is the extension of Linear Regression, in which we have a single dependent and a single independent variable.

The equation of a straight line is given by y = mx + c, where m is the slope and c is the intercept.

But to be realistic, in most real-life scenarios simple Linear Regression is seldom used, as multiple parameters or attributes are necessary for prediction.

So Multiple Linear Regression comes in handy in these cases, but it is mostly underrated: a common trend among students is to learn Linear Regression and skip Multiple Linear Regression. Here we are going to explore the less-taken path, i.e. Multiple Linear Regression.

Before exploring Multiple Linear Regression, let's have some insights into Linear Regression and its fundamentals.

An Insight on Linear Regression

I will cut short my explanation as this blog is mainly focused on Multiple Linear Regression. When we think of Linear Regression, the points we ponder about are

  1. Line of Best Fit
  2. Cost Function
  3. Gradient Descent

The line of best fit is the line that has the least error when considering all the points in the plot; the error term can be calculated in many ways, such as RMSE, MSE, Euclidean distance, etc.
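To make the error term concrete, here is a minimal sketch in Python (assuming NumPy arrays x and y of equal length and a candidate line y = m*x + c) that computes the MSE and RMSE of that line against the data points.

import numpy as np

def line_errors(x, y, m, c):
    """Compute MSE and RMSE of the candidate line y = m*x + c."""
    predictions = m * x + c          # predicted output for every x
    residuals = y - predictions      # error of each point from the line
    mse = np.mean(residuals ** 2)    # Mean Square Error
    rmse = np.sqrt(mse)              # Root Mean Square Error
    return mse, rmse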

Cost Function

The Cost Function is defined as follows:

J(θ) = (1/m) · Σ_{i=1..m} ( h_θ(x_i) − y_i )²

where J(θ) is the cost function, i.e. the MSE (Mean Square Error) of the difference between the predicted output and the desired output.

h_θ(x) is the same as y = mx + c, the equation of a straight line, and it gives us the predicted output.

Our aim is to get the minimum Cost Function (error). To achieve this we differentiate the cost function and set the derivatives to zero, which gives us equations in the variables x and y; solving them gives the equation of the regression line.
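As a small sketch of what "differentiate and solve" gives in the single-variable case (assuming NumPy arrays x and y), setting the partial derivatives of the cost with respect to m and c to zero yields the familiar closed-form solution:

import numpy as np

def fit_line(x, y):
    """Closed-form least-squares fit obtained by setting dJ/dm and dJ/dc to zero."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the regression line passes through the point of means
    c = y_mean - m * x_mean
    return m, c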

The Gradient Descent concept is very much similar to a ball oscillating on a parabolic surface, which can be visualized as follows.

The above illustrates the Gradient Descent method, which takes us towards a minimum; whether it is a local or a global one depends on the parameters that we tune, such as the Learning Rate, which is to be set by the designer. The update rule is

θ_j := θ_j − α · ∂J(θ)/∂θ_j

where α is the Learning Rate.

The key takeaways from the above formula are

  1. We have a negative sign, which means we are moving in the direction opposite to the slope, i.e. towards the minimum.
  2. We take partial derivatives since the cost function involves more than one variable; this makes even more sense when we extend this to Multiple Regression, where many variables come into the picture.
  3. The error decreases as we move closer to the minimum.

The most important thing is that the Learning Rate should be optimal; otherwise we either don't make significant progress after every iteration or we oscillate across the Gradient Descent curve.
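A minimal Gradient Descent sketch for the straight-line case (assuming NumPy arrays x and y, and a hand-picked learning rate alpha) could look like this:

import numpy as np

def gradient_descent(x, y, alpha=0.01, iterations=1000):
    """Iteratively move m and c against the gradient of the MSE cost."""
    m, c = 0.0, 0.0
    n = len(x)
    for _ in range(iterations):
        error = (m * x + c) - y
        # Partial derivatives of the MSE cost with respect to m and c
        dm = (2.0 / n) * np.sum(error * x)
        dc = (2.0 / n) * np.sum(error)
        # Negative step: move in the direction opposite to the slope
        m -= alpha * dm
        c -= alpha * dc
    return m, c

If alpha is too small the updates barely move, and if it is too large the values oscillate or diverge, which is exactly the learning-rate trade-off described above.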

But the scenario changes when we move into multidimensional space, as there are many variables, so the partial derivative takes us towards the minimum in each of the dimensions, which can be visualized in the following diagram.

We can see the arrows, which point in the directions of the minimum; this is a 3D space, but things get complicated as we move up the hierarchy of multidimensional spaces.

So that was about Linear Regression.

An Intuition on Multiple Linear Regression

Multiple Linear Regression is the extension of Linear Regression, but it comes with a trade-off: it is a hiccup for the Data Analyst to visualize, since it takes them into multidimensional space, so they have to make use of other tools in order to dig deep into the correlations in the data.

Let's express the formula for Multiple Linear Regression:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

where x₁, x₂, …, xₙ are the independent variables and β₀, β₁, …, βₙ are the coefficients to be found.

This includes many independent variables, which are in disguise the attributes specified in the data set; we have to find the coefficients such that the n-dimensional hyperplane fits the data in its respective space appropriately.

To make things clearer we use a mathematical tool called Linear Algebra, which gives a matrix view of the above formula:

y = Xβ

Here y is the m × 1 vector of outputs, X is the m × (n + 1) matrix of attributes (with a leading column of ones for the intercept β₀), and β is the (n + 1) × 1 vector of coefficients, so the entire data set of m observations is handled in one matrix equation.
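As a sketch, the coefficient vector can be obtained directly from this matrix form with NumPy's least-squares solver (assuming X is the m × n attribute matrix and y the output vector):

import numpy as np

def fit_multiple_regression(X, y):
    """Solve y ≈ X_aug · beta in the least-squares sense."""
    # Prepend a column of ones so that beta[0] acts as the intercept
    X_aug = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return beta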

But there is a catch here: we don't require every parameter in the data set for prediction. There might be some parameters which don't contribute significantly to the Machine Learning model; we can eliminate those parameters so that the model concentrates on the ones that do.

Till now it was all mathematical intuitions and theories, but to put it into application we need Data Analysts and programmers who collaborate in a brainstorming session to give these variables a physical existence.

There are different methods for building a Multiple Linear Regression model; they are

  1. All in
  2. Backward Elimination
  3. Forward Selection

and many more…

We are mostly going to discuss the Backward Elimination method, which is one of the popular ones.

There is a great amount of intuition behind Backward Elimination; we will begin our journey by looking into the p-value.

What is p-value?

The p-value is a statistical hypothesis-testing tool: it takes into consideration a NULL hypothesis and an Alternative hypothesis, makes use of the probability (bell) curve, and tells us how likely the observed outcome is under the NULL hypothesis, so that we can judge it against cut-offs such as 5% and 95%.

So let’s dig deep into it.

Just like any other probability and statistics problem, we approach this again with our traditional coin-tossing example.

When we want to do this, we must know that the entire outcome space is a universe, and it is divided into two parts called

  1. NULL Hypothesis
  2. Alternative Hypothesis

Let us take a coin and make the hypothesis that the coin is a fair coin, which happens to be our naive or NULL Hypothesis.

NULL Hypothesis : The coin is a fair coin.

Alternative Hypothesis : The coin is biased.

The Alternative Hypothesis is simply the complement of the NULL Hypothesis, which is our default assumption.

So assume we toss the coin several times and note the outcomes.

The probability here is the chance of a head falling on every one of the subsequent tosses.

A probability of 50% means that in a single toss a head is 1 of the 2 possible outcomes; 25% means that for 2 tosses, getting a head both times is only 1 outcome out of 4; similarly, for n subsequent tosses the probability is (1/2)ⁿ.
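The same arithmetic can be checked in a couple of lines of Python:

# Probability of a fair coin showing heads on every one of n consecutive tosses
for n in range(1, 7):
    print(f"Toss {n}: {0.5 ** n:.2%}")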

So now let us take a scenario to explain this carefully.

Your friend is tossing a coin, and you believe that the coin is fair, which is our NULL Hypothesis; your friend has proposed the Alternative Hypothesis.

Your reaction after every toss.

Toss 1 : The probability is 50% and you think it is most common.

Toss 2 : The probability is 25% and you think it is common.

Toss 3 : The probability is 12.5% and you think of giving it another try.

Toss 4 : The probability is 6.25% and you start to suspect, but you still feel there is a possibility of it happening.

Toss 5 : The probability is 3.13% and your suspicion starts to turn into conviction.

Toss 6 : The probability is 1.56% and you are very sure that the coin is biased.

This is the curve which displays the transition of your thoughts from Toss 1 to Toss 6.

The p-value and the significance level go hand in hand: the NULL hypothesis sits at the peak of the bell curve, and as the observed outcome moves towards the tails it transitions towards the Alternative hypothesis.

In our case the Toss 6 probability is 1.56%, which means the support for the NULL hypothesis is only 1.56% and 98.44% is left for the Alternative Hypothesis, i.e. there is a 98.44% chance that the coin is biased.

So we commonly set a significance level of 5% before performing hypothesis testing; when the p-value falls below that threshold, the NULL hypothesis is discarded.

Application of p-value in Backward Elimination

Coming back to our main point, the p-value is the foundation of Backward Elimination; here our primary agenda is to remove the attributes which are not significant.

So our NULL hypothesis here is that an attribute does not contribute significantly to the model, i.e. its coefficient is effectively zero.

This means the higher the p-value, the more the attribute leans towards the NULL hypothesis and the less significant it is to the model; so we discard those attributes, and the attributes remaining are the significant ones.

statsmodels is a library extensively used in Python for calculating statistical measures.

We use the Ordinary Least Squares (OLS) method to generate the OLS summary table, which contains different measures and values; the p-value is also present there, and we use that value for our calculations.

The OLS summary table lists, among other things, the coefficient, standard error, t-statistic and p-value of every attribute.
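A sketch of how such a table can be produced with statsmodels (using a hypothetical attribute matrix X and target y just for illustration):

import numpy as np
import statsmodels.api as sm

# Hypothetical data: 100 observations of 3 attributes and a noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=100)

# Add the intercept column, fit Ordinary Least Squares and print the summary table
ols_model = sm.OLS(y, sm.add_constant(X)).fit()
print(ols_model.summary())   # coefficients, standard errors, t-statistics and p-values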

The Python code for an implementation of Backward Elimination could look like this (a sketch assuming the attribute matrix X already includes a column of ones for the intercept and y is the target vector):
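import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    """Repeatedly fit OLS, inspect the p-values and drop the weakest attribute."""
    X_opt = X.copy()
    while True:
        model = sm.OLS(y, X_opt).fit()       # same fit that produces the summary table
        p_values = model.pvalues
        if p_values.max() <= significance_level:
            break                            # every remaining attribute is significant
        worst = int(np.argmax(p_values))     # attribute with the largest p-value
        X_opt = np.delete(X_opt, worst, axis=1)
    return X_opt, model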

The above code fits the OLS model every time, reads the p-values (the same values shown in the summary table), checks for the largest one and deletes that particular attribute from the actual set of attributes.

Then we can apply the Linear Regressor from sklearn to the remaining parameters as usual to get the desired output.
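For example (a sketch assuming X_opt and y from the elimination step above):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split the reduced attribute set and fit a plain linear model on it
X_train, X_test, y_train, y_test = train_test_split(X_opt, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)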

Note: Unlike Linear Regression, Multiple Linear Regression is a bit hard to visualize, but we have to get the n-dimensional geometry right and think laterally.

Any comments or suggestions are seriously welcome, and another blog is in the queue, so stay updated!

References

  1. Image credits: Google Images
  2. The OLS summary table: https://www.statsmodels.org/stable/index.html
