Linear Regression: Everything From Math to Program Part-3

Gourav K Nayak

Published in The Startup · 5 min read · Feb 13, 2021

Hello All !!!

This is the third and final part of the series Linear Regression: Everything From Math to Program.

Part 1: Regression with One Independent Variable

Part 2: Regression with Two Independent Variables

Part 3: Regression with Multiple Independent Variables

If you have made it this far, you should now understand the concepts of linear regression and how those beautiful regression libraries work in the back-end. I don’t want to share anything fancy at this point and make this blog lengthy and complicated. There are no derivations in this part; we will use the same concepts that we derived in Part 2. We are also not performing the calculations in Excel, since the dataset we are working on is large and Excel won’t help us with that. So what are we going to do in this part?

We will build on what we have learnt in the last 2 parts by writing a program in Python to predict a dependent variable Y based on multiple independent variables X1, X2, X3 and so on. We are not using any popular library like Scikit-learn to build the linear regression model; there are plenty of tutorials for that. We will use numpy for the matrix operations and go through the task step by step so that even a novice can follow it.

Please refer to Part 2 for all the basic operations. In this tutorial, we are going to build a general program which can be applied to a dataset of any size.

Prerequisites: pandas, numpy

Please share your thoughts on this series in the comments section or connect with me on LinkedIn. I would be happy to help. Also, don’t forget to follow me on Medium for more such posts in the future.

Let’s begin!!!

Step 1: Understand the dataset

In this tutorial, we are using an insurance dataset from Kaggle. There are 7 columns in the dataset: age, sex, bmi, children, smoker, region and insurance charges. Our task is to predict the insurance charges on the basis of age, sex, bmi (body mass index), children (number of children covered by health insurance), smoker (whether the person smokes) and region (residential area of the beneficiary).

Step 2: Import libraries
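Going by the libraries named above, the imports for this step would look roughly like this. The aliases np and pd are the usual convention and an assumption here.

```python
import numpy as np   # matrix operations (transpose, products, solving)
import pandas as pd  # reading the CSV file and handling the columns
```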

Step 3: Read dataset

We will use the pandas library to read the CSV file that we have downloaded from Kaggle and to perform other operations.

There are 1338 rows and 7 columns.
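Here is a minimal sketch of reading the data with pandas. To keep it runnable without the download, it reads four made-up rows from an in-memory string; the column names and the file name insurance.csv in the comment are assumptions about the Kaggle file.

```python
import io

import pandas as pd

# Four made-up rows in the same 7-column layout as the insurance data.
# With the real file you would call pd.read_csv("insurance.csv") instead
# (the file name is an assumption).
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "20,female,28.0,0,yes,southwest,17000.0\n"
    "25,male,33.5,1,no,southeast,2000.0\n"
    "30,male,31.0,3,no,southeast,4500.0\n"
    "35,female,23.0,0,no,northwest,6000.0\n"
)

df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows, to eyeball the columns
```

On the full dataset, df.shape is where the 1338 rows and 7 columns show up.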

Until now we have worked with numerical independent variables. The other type is categorical, stored as text. In our dataset, these are sex (male or female), smoker (yes or no) and region (southwest, southeast, northwest, etc.). We need to convert these columns to integers before running the regression.

Step 4: Convert datatype of columns to integer

To do so, we will use the pandas astype method to convert each of these columns to the category datatype. We then use the cat.codes accessor to get an integer code for each category (for example, 1 for yes and 0 for no) and assign the encoded values to a new column.

convert categorical variables to integer
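A small sketch of this conversion on made-up rows is shown below. The column names are assumptions, and for brevity the encoded values overwrite the original columns instead of going into new ones.

```python
import pandas as pd

# Made-up rows containing only the three text columns.
df = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

for col in ["sex", "smoker", "region"]:
    # astype("category") stores the distinct values in sorted order;
    # cat.codes replaces each value by its position in that order.
    df[col] = df[col].astype("category").cat.codes

print(df)
```

Because the categories are sorted alphabetically, "no" becomes 0 and "yes" becomes 1, and "female" becomes 0 and "male" becomes 1.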

Step 5: Convert Pandas dataframe to numpy array

We could have implemented this program with pandas alone, but I wanted to reuse the code we developed in Part 2, so we convert the dataframe to a numpy array. We will use the to_numpy method provided by pandas to do so.
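As a quick illustration, here is to_numpy on a tiny made-up dataframe (the column names and values are placeholders):

```python
import pandas as pd

# Tiny made-up dataframe with numeric columns only.
df = pd.DataFrame({
    "age": [20, 35],
    "bmi": [28.0, 23.0],
    "charges": [17000.0, 6000.0],
})

data = df.to_numpy()  # 2-D numpy array, one row per dataframe row

print(type(data))
print(data.shape)
print(data.dtype)  # integer and float columns are combined into one float dtype
```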

Step 6: Prepare training and testing data

We use numpy array indexing to split the dataset into training features and labels and testing features and labels. The first 1334 rows are used to estimate the regression coefficients, and the last 4 rows are used to test the performance of the fitted equation.
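The split can be sketched with plain slicing. The array below is a stand-in with 10 rows, so the training part has 6 rows instead of 1334; the variable names and the assumption that the label sits in the last column are mine.

```python
import numpy as np

# Stand-in for the converted dataset: 10 rows, 3 feature columns + 1 label column.
data = np.arange(40, dtype=float).reshape(10, 4)

n_test = 4  # hold out the last 4 rows for testing

# Every column except the last is a feature; the last column is the label.
train_X, train_y = data[:-n_test, :-1], data[:-n_test, -1]
test_X, test_y = data[-n_test:, :-1], data[-n_test:, -1]

print(train_X.shape, train_y.shape)
print(test_X.shape, test_y.shape)
```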

Step 7: Calculate Regression coefficients

From here onwards, all the steps are the same as explained in Part 2. Please refer to Part 2 if any step is unclear.

Product of the transpose of the training feature matrix and the feature matrix itself

Product of Transpose of feature matrix and dependent variable

Note that the number of columns of the regression coefficient matrix (= 6) is 1 greater than the number of independent variables (= 5), because one extra coefficient is the intercept.
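The two products above, and the coefficients they lead to, can be sketched as follows. The data is synthetic and noise-free, with three features and known coefficients, so the result is easy to check. Using np.linalg.solve instead of an explicit matrix inverse is my choice here, not necessarily what Part 2 does.

```python
import numpy as np

# Synthetic training data (an assumption): 50 rows, 3 features, no noise.
rng = np.random.default_rng(0)
train_X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, 1.0, -3.0, 0.5])  # intercept first, then one per feature
train_y = true_beta[0] + train_X @ true_beta[1:]

# Prepend a column of ones so that the first coefficient acts as the intercept.
X = np.column_stack([np.ones(len(train_X)), train_X])

XtX = X.T @ X        # transpose of feature matrix times feature matrix
Xty = X.T @ train_y  # transpose of feature matrix times dependent variable

# Solve XtX @ beta = Xty; mathematically the same as inv(XtX) @ Xty.
beta = np.linalg.solve(XtX, Xty)

print(beta)
```

With three features the program returns four coefficients, one more than the number of independent variables.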

Step 8: Prediction

In this step we will predict the dependent variable (insurance charges) for the testing data that we created in step 6.

This may seem pretty daunting if you are not familiar with programming in Python. Let me break it down for you.

First, we create a numpy array whose size equals the number of testing rows (= 4). Then, for each testing row, we substitute the regression coefficients into the regression equation and calculate the predicted value of Y.
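A minimal sketch of that loop, with hypothetical coefficients and a made-up test matrix with two features so the arithmetic is easy to follow:

```python
import numpy as np

# Hypothetical coefficients: intercept first, then one per feature.
beta = np.array([10.0, 2.0, 3.0])

# Made-up test features: 4 rows, 2 features each.
test_X = np.array([
    [1.0, 1.0],
    [2.0, 0.0],
    [0.0, 4.0],
    [3.0, 2.0],
])

# One slot per testing row to hold the predicted Y.
predictions = np.zeros(len(test_X))

for i, row in enumerate(test_X):
    # Y = b0 + b1*X1 + b2*X2 + ...
    predictions[i] = beta[0] + np.dot(beta[1:], row)

print(predictions)  # [15. 14. 22. 22.]
```

The same result can be obtained without the loop as beta[0] + test_X @ beta[1:].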

As usual, you can refer to my GitHub account to get the complete program as a Jupyter Notebook. Click on the star icon in the top right corner on GitHub to show support for my project.

Conclusion

This was all I had to share with you guys on linear regression. Please drop your reviews of the posts in the comments section. You can connect with me on LinkedIn if you want to discuss anything. I am always happy to help :)

Till next post… Stay Safe and stay healthy and keep learning.

All the best!!!

Written by Gourav K Nayak