Linear Regression: Everything From Math to Program

Gourav K Nayak
Published in The Startup
Jul 30, 2020

A few days ago I was trying to solve one of the problems given to me at my school. We were asked to impute some missing values in a given dataset. Those of you who are aware of the term ‘imputation’ know that there are countless algorithms published across various studies for different kinds of data.

Some of the algorithms for imputing a missing value are:

  1. 3M (Mean, Median, Mode) imputation
  2. Least Square Imputation
  3. Hot/Cold Deck Imputation
  4. Linear Regression Imputation
  5. Stochastic Regression Imputation

And many more…

In our problem, it was clearly mentioned that the given data has a linear relationship and we need to impute the values by a regression method. To my surprise, there were many resources available to study and solve such problems, and everyone was fantastic in their work. But I was not able to relate to any of them. The reason was that some of the posts solved a particular problem either by hand or with a programming library; a few others talked about the math for problems with one or two independent variables; and the rest presented solutions for multiple independent variables. There was no single post that approaches the problem from the ground up and builds it to the end, or at least I couldn’t find one. You can definitely refer to these resources to learn, but I found it difficult to devise relationships between them, because they come from different people and every individual has their own style of approaching the problem.

I will go through the math behind regression for one, two, and multiple independent variables, formulate the equations, and in the end develop a Python program to solve an example regression problem. Every problem will be discussed at a basic level, with appropriate examples. I will use a small dataset and solve it in Excel so that it is easy to view and understand. For the sake of simplicity, I am breaking this post into 3 parts. Each part discusses one piece of the problem, and you can skip forward to the topic that interests you.

Part 1: Regression with One Independent Variable

Part 2: Regression with Two Independent Variables

Part 3: Regression with Multiple Independent Variables

Prerequisites: linear equations, basic programming in any language (preferably Python), correlation and its significance.

This is my first blog on Medium.com and I am very excited for this. Hope it helps someone.

So, Let’s begin!!!

Regression with One Independent Variable

First, let’s understand what we mean by a regression problem with one independent variable.

When we have only one independent variable, say X, whose value determines the value of a dependent variable, say Y, and the relationship between them is linear, then it is a linear regression problem with one independent variable. A linear relationship between two variables means that when one value increases or decreases, the value of the other variable increases or decreases proportionally, i.e. by a multiple of some fixed amount that we call a constant, which is a scalar value.

Y = c.X, where c is a constant

It is possible to plot the points of linear data on a graph and then find the line that best fits that data. By ‘fit’ we mean the line that best describes the relationship in the data.

Let’s take an example and work on that to understand the concept.

Sample data

If we plot these points on a graph, it will look like the figure below.

Scatter Plot

As we discussed before, we need to find the equation of the line that best fits the given data. The equation of a straight line is:

Y = m·X + C

where m is the slope and C is the intercept of the line on the y-axis.

If we have the slope and the intercept, then we can find the value of Y for any value of X. In an ideal linear regression problem, all the points on the graph would fall on one line, but that is rarely the case in real life. Thus, we predict the best fit line for the data. Let’s see how.

Let’s assume the equation of the predicted best fit line for the given data is:

Ŷ = m·X + C
Representation for Linear regression

The important term to understand here is the error, or as some call it, the residual. This is the difference between the actual and predicted values of Y, given by:

e_i = Y_i - Ŷ_i (residual at i)
Ŷ_i = m·X_i + C (prediction at i)

Hence,

e_i = Y_i - m·X_i - C

According to the least squares approximation, the sum of squared errors can be written as:

E = Σ e_i² = Σ (Y_i - m·X_i - C)², summed over i = 1 to N

N: total number of data points available.

The best fit line is obtained by reducing this error to the least value possible, which is done by setting the gradient to zero.

∂E/∂C = -2 · Σ (Y_i - m·X_i - C) = 0, which gives

Σ Y_i = m · Σ X_i + N·C … (3)

∂E/∂m = -2 · Σ X_i · (Y_i - m·X_i - C) = 0, which gives

Σ X_i·Y_i = m · Σ X_i² + C · Σ X_i … (4)

Solving equations (3) and (4) for the slope and intercept of the best fit predicted line,

m = (Σ X_i·Y_i - N·X̄·Ȳ) / (Σ X_i² - N·X̄²) (slope of predicted line)
C = Ȳ - m·X̄ (intercept of predicted line)

where X̄ and Ȳ are the means of X and Y.

Let’s start solving the problem we set up and predict the best fit line for our data. I will use Excel to display the manual calculations and Python in a Jupyter notebook for the programming part. You are free to refer to this code and implement it in any other language for your own data.

Step 1: Create or import the data

Data representation in Excel

We will use NumPy for creating the example arrays in Python.

Data representation in Python
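The original notebook screenshot did not survive here, so the snippet below is a minimal sketch. The array values are hypothetical stand-ins (the sample data figure was lost), chosen so that every total used later in this walkthrough comes out the same as in the article: sum of X = 123, sum of X² = 4055, sum of XY = 8486.

```python
import numpy as np

# Hypothetical sample values (the original data figure was lost);
# chosen to reproduce the totals used in this article:
# sum(X) = 123, sum(X**2) = 4055, sum(X*Y) = 8486, mean(Y) = 51.6
X = np.array([5, 15, 22, 36, 45])
Y = np.array([9, 33, 47, 77, 92])
```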

Step 2: Calculate mean of dependent and independent variable

We add all the values of X (Sum = 123) and divide by the number of values (Count = 5) to obtain the mean of X as 24.6. Similarly, the mean of Y = 51.6.

Calculation of Mean in Excel
Calculation of Mean in Python
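In Python, the means come straight from NumPy. A sketch using the illustrative arrays from Step 1 (hypothetical values chosen to match the article’s totals):

```python
import numpy as np

X = np.array([5, 15, 22, 36, 45])  # illustrative data (original figure lost)
Y = np.array([9, 33, 47, 77, 92])

mean_x = X.mean()  # 123 / 5 = 24.6
mean_y = Y.mean()  # 258 / 5 = 51.6
```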

Step 3: Calculate sum of product of dependent and independent variable

We multiply the values of X and Y in each row to obtain the column XY, and then add them all to obtain the sum = 8486.

Calculation of Sum of Product of X and Y in Excel
Calculation of Sum of Product of X and Y in Python
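A sketch of the same step in Python, again on the illustrative arrays (hypothetical values matching the article’s totals):

```python
import numpy as np

X = np.array([5, 15, 22, 36, 45])  # illustrative data (original figure lost)
Y = np.array([9, 33, 47, 77, 92])

sum_xy = np.sum(X * Y)  # element-wise product of each row, then the total: 8486
```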

Step 4: Calculate sum of squares of the independent variable

We square the value of X in each row to obtain the column XX, and then add them all to obtain the sum = 4055.

Calculation of Sum of Square of X in excel
Calculation of Sum of Square of X in Python
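The Python equivalent, on the same illustrative array (hypothetical values matching the article’s totals):

```python
import numpy as np

X = np.array([5, 15, 22, 36, 45])  # illustrative data (original figure lost)

sum_xx = np.sum(X ** 2)  # square each X, then the total: 4055
```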

Step 5: Calculate the slope and intercept of the line

We substitute the calculated values into the formulae we derived above to obtain intercept = 0.4687 and slope = 2.0785.

Calculation of Slope and Intercept in Excel
Calculation of Slope and Intercept in Python
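In Python, the same substitution looks like the sketch below (using the illustrative arrays, whose totals match the article’s), and it reproduces the hand-calculated slope and intercept:

```python
import numpy as np

X = np.array([5, 15, 22, 36, 45])  # illustrative data (original figure lost)
Y = np.array([9, 33, 47, 77, 92])
n = len(X)

# m = (sum(XY) - N*mean(X)*mean(Y)) / (sum(X^2) - N*mean(X)^2)
slope = (np.sum(X * Y) - n * X.mean() * Y.mean()) / (np.sum(X ** 2) - n * X.mean() ** 2)
# C = mean(Y) - m * mean(X)
intercept = Y.mean() - slope * X.mean()
# slope is about 2.0785 and intercept about 0.4687, matching the Excel calculation
```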

The equation of the predicted line is:

Ŷ = 2.0785·X + 0.4687

Step 6: Predict the dependent variable

The easiest way to check the prediction is to use the data already available and see whether we obtain (approximately) the correct value. Substituting X = 5 into the best fit line:

Ŷ = 2.0785 × 5 + 0.4687 ≈ 10.86

which is quite close to the actual value.

Prediction from best fit line in Excel
Prediction for best fit line in Python
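The same check in Python, again on the illustrative arrays (hypothetical values matching the article’s totals):

```python
import numpy as np

X = np.array([5, 15, 22, 36, 45])  # illustrative data (original figure lost)
Y = np.array([9, 33, 47, 77, 92])
n = len(X)

slope = (np.sum(X * Y) - n * X.mean() * Y.mean()) / (np.sum(X ** 2) - n * X.mean() ** 2)
intercept = Y.mean() - slope * X.mean()

y_pred = slope * 5 + intercept  # about 10.86 for X = 5; compare with the sample's actual value
```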

Plotting the best fit line

Plot in python

Please refer to my GitHub account to get the complete program as a Jupyter notebook as well as the Excel file. Click the star icon in the top right corner on GitHub to show support for my project.

Conclusion

I hope you guys enjoyed reading this part 1 on simple linear regression and learned something today. Let me know your reviews in the comments section, and follow me to get my next articles in your suggestion lists.

Have a good time!!!!


Research Student pursuing Master’s In I.E at University Of Windsor, Software Developer by Profession, Bachelors in M.E, Machine Learning Enthusiast