Linear Regression: Everything From Math to Program Part-2

Gourav K Nayak
Published in The Startup · 7 min read · Feb 13, 2021

Hello Everyone!!!

This is the second part of the series Linear Regression: Everything From Math to Program.

Part 1: Regression with One Independent Variable

Part 2: Regression with Two Independent Variables

Part 3: Regression with Multiple Independent Variables

Earlier, in Part 1, we learned how to perform linear regression with one independent variable in Python and Excel. We also derived the general expressions for the slope and intercept of a one-independent-variable linear regression problem.

In this part, we continue our learning and derive a general expression for the multiple linear regression problem. Later, we will use the derived expression to solve a linear regression problem with two independent variables. We will represent each step in Excel and also implement the programming version in Python with the NumPy library. If you haven’t read Part 1 of this series, you might not be able to follow a few concepts around residual error. But if you think you are good with those concepts, you can skip it.

To understand multiple linear regression, we need some basic knowledge of matrices and differential calculus, which was probably part of your high school math. I will try to explain each step as thoroughly as possible and point you to some resources for the derivation. That should cover everything you need for this algorithm.

So, Let’s begin!!!

Regression with Two Independent Variables

As the name suggests, there are two independent variables, X1 and X2, that determine the value of the dependent variable Y. Also, the relationship of X1 and X2 with Y should be linear. The objective of the problem is to determine the plane that best fits all the given values of X1 and X2. Note that in regression with one independent variable, the objective was to find the best line.

It is possible to plot the graph and visualize the points using visualization tools like FusionCharts, Plotly, etc. We can also use 3-D plotting packages in R and Python to visualize the data. I will show that at the end as well.

Let’s view the example problem first.

Sample data

The general form of a multiple linear regression equation is:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ (equ. 1)

where β₀ is a constant and β₁, β₂, …, βₙ are the regression coefficients.

In the case of a two-independent-variable regression problem, equ. 1 simplifies to:

Y = β₀ + β₁X₁ + β₂X₂

In the general case (n independent variables and m given data points), every row can be written as:

Y₁ = β₀ + β₁X₁₁ + β₂X₁₂ + … + βₙX₁ₙ

Y₂ = β₀ + β₁X₂₁ + β₂X₂₂ + … + βₙX₂ₙ

Y₃ = β₀ + β₁X₃₁ + β₂X₃₂ + … + βₙX₃ₙ

⋮

Yₘ = β₀ + β₁Xₘ₁ + β₂Xₘ₂ + … + βₙXₘₙ

This data can be rewritten in matrix form as follows:

Y = Xβ

where Y is the m × 1 vector of observed values, X is the m × (n + 1) matrix whose first column is all ones (to carry β₀) and whose remaining entries are the feature values Xᵢⱼ, and β is the (n + 1) × 1 vector [β₀, β₁, …, βₙ]ᵀ.

Now that you understand the general form of the equation, we can safely say that for a set of multiple independent variables, the predicted dependent variable can be calculated using the formula below:

Yᵖ = Xβ

It is clearly evident that to compute Yᵖ, all the coefficients in β need to be determined first. A little matrix algebra will prove handy here.

From Part 1, we know that the error is the difference between the predicted and actual values of Y and is thus given by:

e = Yᵖ − Y

which, in matrix form for all the given data, can be written as:

e = Xβ − Y

The sum of the squared errors (RSS) will be:

RSS = eᵀe = (Xβ − Y)ᵀ(Xβ − Y)

On simplifying, we get:

RSS = YᵀY − 2βᵀXᵀY + βᵀXᵀXβ

Our goal is to minimize this error, so we take the partial derivative of RSS with respect to β and set it to zero:

∂RSS/∂β = 0

Substituting the simplified RSS from above and differentiating term by term:

∂RSS/∂β = −2XᵀY + 2XᵀXβ = 0

so XᵀXβ = XᵀY. And this is the formula we are seeking:

β = (XᵀX)⁻¹XᵀY

If you remember this formula, you will be able to solve any multiple-independent-variable regression problem.

Let’s start solving the problem we assumed and find the best-fit plane for our data. I will use Excel to show the manual calculations and Python in a Jupyter notebook for the programming part. You are free to refer to this code and implement it in any other language for your own data.

Step 1: Create or import the data

Data representation in Excel

We are using NumPy to create matrices. We will use two different variables to store our feature data (or independent variables) and labels (or dependent variable).

Matrices in Python
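The original code appears as screenshots in the article, so here is a minimal sketch of this step; the values in X and Y are hypothetical placeholders, not the article’s actual sample data (that lives in the linked Excel file):

```python
import numpy as np

# Hypothetical placeholder data: five observations, two features (X1, X2)
# and one label column (Y).
X = np.array([[1, 2],
              [2, 1],
              [3, 4],
              [4, 3],
              [5, 5]])
Y = np.array([[6], [5], [12], [11], [16]])

print(X.shape, Y.shape)  # (5, 2) (5, 1)
```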

Step 2: Transform the feature matrix

We need to append a column of ones to the feature matrix to take the bias variable β₀ into account.

Feature Matrix in Excel

In Python, this can be achieved in two steps. First, create a NumPy array of ones with as many rows as the given data (N = 5). Then, stack this array horizontally with the existing feature matrix.

Feature Matrix in Python
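Continuing the sketch from Step 1 (the column of ones is placed first, assuming β₀ is the first entry of the coefficient vector):

```python
# Column of ones for the intercept term beta_0, stacked in front of X.
ones = np.ones((X.shape[0], 1))
X_b = np.hstack((ones, X))

print(X.shape, X_b.shape)  # (5, 2) (5, 3)
```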

Note how the dimension of the feature matrix changed from (5, 2) to (5, 3) because of the additional column of ones.

Step 3: Find transpose of the feature matrix

In Excel, we can do this using the TRANSPOSE function.

Transpose of Feature Matrix in Excel

In Python, it can easily be performed using the transpose function provided by NumPy (np.transpose, or the .T attribute). Look at how the shape of the array is reversed here.

Transpose of Feature Matrix in Python
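Continuing the sketch:

```python
Xt = X_b.T  # equivalent to np.transpose(X_b)

print(X_b.shape, Xt.shape)  # (5, 3) (3, 5)
```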

Step 4: Multiply Transpose of Feature Matrix with Original Feature Matrix

We use the MMULT function in Excel to perform multiplication.

Multiplication of X transpose with X in Excel

In Python, we use the dot product (np.dot) to multiply two matrices of any compatible size. If one of the operands is a scalar, we use element-wise multiplication (np.multiply, or the * operator) instead.

Multiplication of X transpose with X in Python
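Continuing the sketch with np.dot (the @ operator would do the same):

```python
XtX = np.dot(Xt, X_b)  # X'X, same as Xt @ X_b

print(XtX.shape)  # (3, 3)
```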

Step 5: Inverse of Resultant Matrix

We use the MINVERSE function in Excel for finding the inverse of a matrix.

The inverse of the resultant matrix product in Excel

In Python, we can do this using NumPy’s matrix inverse function, np.linalg.inv.

The inverse of the resultant matrix product in Python
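Continuing the sketch:

```python
XtX_inv = np.linalg.inv(XtX)

# Sanity check: multiplying back should give (numerically) the identity.
print(np.round(np.dot(XtX, XtX_inv), 6))
```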

Step 6: Product of Transpose of Feature Matrix and Dependent Variable

Multiplication of X transpose and Y in Excel
Multiplication of X transpose and Y in Python
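Continuing the sketch, the same dot product applied to the label vector:

```python
XtY = np.dot(Xt, Y)  # X'Y

print(XtY.shape)  # (3, 1)
```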

Step 7: Find Regression Coefficient

The regression coefficients can be calculated as the dot product of the matrix obtained in Step 5 with the matrix obtained in Step 6.

Regression coefficients in Excel
Regression coefficients in Python
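Continuing the sketch, this assembles the normal equation β = (XᵀX)⁻¹XᵀY from the two previous results; NumPy’s built-in least-squares solver should agree, which makes a handy cross-check:

```python
beta = np.dot(XtX_inv, XtY)  # beta = (X'X)^-1 X'Y
print(beta)                  # column vector [beta_0, beta_1, beta_2]

# Cross-check against NumPy's least-squares solver.
print(np.linalg.lstsq(X_b, Y, rcond=None)[0])
```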

Step 8: Predict the dependent variable

We will calculate the predicted value of the label using a predict_label function in Python, which accepts the values of X1 and X2 as arguments.

Prediction from the best-fit plane in Excel
Prediction from the best-fit plane in Python
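A sketch of what such a predict_label function might look like, using the beta computed above (the article’s exact implementation may differ):

```python
def predict_label(x1, x2):
    """Predict Y from the fitted plane for given X1 and X2."""
    return beta[0, 0] + beta[1, 0] * x1 + beta[2, 0] * x2

print(predict_label(3, 4))  # prediction for X1 = 3, X2 = 4
```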

Step 9: Plotting

We have used the matplotlib package to plot the points in Python. In terms of implementation, this is definitely not the best way to plot the points, but I didn’t want to confuse people with loops. That won’t be the case in Part 3.

plotting in 3-D space using Python
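A minimal matplotlib sketch, assuming the variables from the earlier snippets: it scatters the observations in 3-D and draws the fitted plane as a surface:

```python
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection='3d')

# The original observations.
ax.scatter(X[:, 0], X[:, 1], Y[:, 0], color='red', label='data')

# Evaluate the fitted plane on a grid and draw it as a surface.
x1g, x2g = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 10),
                       np.linspace(X[:, 1].min(), X[:, 1].max(), 10))
yg = beta[0, 0] + beta[1, 0] * x1g + beta[2, 0] * x2g
ax.plot_surface(x1g, x2g, yg, alpha=0.3)

ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('Y')
plt.show()
```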

Please refer to my GitHub account for the complete program in a Jupyter Notebook as well as the Excel file. Click the star icon in the top-right corner on GitHub to show support for my project.

Conclusion

This is all for Part 2 of this series. In the final part, we will extend the scope of our problem by importing a dataset with more than two independent variables and building a Python program from scratch.

Let me know your reviews in the comments section. Follow me to get notified of my articles and connect with me on LinkedIn.

Stay safe and healthy.

Happy Learning!!!!
