Linear Regression: Basic Machine Learning

Hey everyone, this blog post is a quick and brief introduction to a basic machine learning model, linear regression, using Python. I'll introduce the concept and its mathematical interpretation, then try working through it with Python. So let's start with some understanding of the concepts.
Linear regression is one of the basic modeling techniques in statistics, and it is also an introduction to the world of machine learning using predictive models. Now, to simplify things, let's start with what modeling is and how relevant it is to machine learning.
Modeling:
Let’s take an everyday example.
Let's assume that you live in a certain suburb and you need to travel to work every day by train. As you have been using public transport frequently, you have a decent understanding of how long you might take to reach work. Your mind has created a mental model of this situation from various data like commute time, distance to walk, train schedules, etc. This mental model helps you predict what time you need to leave home if you need to reach work at 9 AM (assuming this is when your work starts), how long you can spend on breakfast, what time you need to wake up, and the time spent getting ready for work. This is an understanding you have created mentally by grasping the relationships between all the data (commute time, train schedule, etc.) in this situation.
Similarly, a data scientist tries creating models by applying mathematical/statistical methods to data to understand relationships. This process is called modeling. Now let's see one such method of modeling.
Linear Regression:
Now, linear regression is basically a modeling tool where there is a dependent variable (the output: total time to reach work) and one or more independent variables (the inputs: train schedule, commute time) which have an effect on the dependent variable. Based on a model we can make inferences and better understand the relationships between the independent variables and the dependent variable. The predictions made from a model can still be useful for understanding the data and their relationships even if they are not completely accurate. Now let's try understanding the linear relationship between variables via the math behind it.
Linear relationship of x and y and the formula:
The equation for any line is y = mx + b, where x is the independent variable and y is the dependent variable. The line we want to plot for y (the prediction) as a function of x is parameterized by its slope m and its y-intercept b. Here is how we can interpret m and b in the line equation y = mx + b.

Let's assume that there is a linear relationship between x and y. Then you can say that if x increases by 1 unit, y will increase by exactly m units. b is a constant, also known as the y-intercept: if x = 0, then y = b. But in real-life data it is hard to say whether there is a linear relationship, and thus we use simple linear regression to understand the relationship between two data variables (x and y). This is something that you will understand better by the end of this blog. Now let's get into simple linear regression and see what it is all about.
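To make this concrete, here is a tiny Python sketch with made-up numbers for m and b:

```python
# Hypothetical slope and intercept, purely to illustrate y = mx + b.
m, b = 2.0, 5.0

print(m * 0 + b)  # 5.0 -> when x = 0, y equals the intercept b
print(m * 1 + b)  # 7.0 -> each 1-unit increase in x adds m units to y
print(m * 2 + b)  # 9.0
```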
Simple Linear Regression:
We can use an example to explain this better. Let us plot the values of y and x on a graph and try fitting a line through these points (which is basically a linear model), where the line aims to have roughly as many points on either side. This line will basically help us predict values of y.

The relationship between the two variables can be explored using simple linear regression. We are basically trying to create a predictive model of y from x even if x and y do not share a perfectly linear relationship. The predicted value is a continuous variable, and the predicted y value (from the line on the graph) and the true y value (the actual y value) will have a difference; these differences are called errors or residuals. Note that we could try fitting different lines through the scatter plot, but the idea is to get the best model, and that can be done by finding the model whose total squared residual is smallest (as close to zero as possible).
(More on this can be explored by understanding mean squared error, MSE.)
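As a quick illustration of residuals and MSE (with made-up numbers, not the Boston data we use later):

```python
import numpy as np

# Made-up data points and a candidate line y = 2x + 0.5.
x = np.array([1.0, 2.0, 3.0, 4.0])
y_true = np.array([2.1, 3.9, 6.2, 8.1])
y_pred = 2.0 * x + 0.5

residuals = y_true - y_pred    # actual minus predicted, per point
mse = np.mean(residuals ** 2)  # mean squared error; lower means a better fit
print(mse)                     # 0.1925
```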
Check out the image/example below; it gives us an idea of the different kinds of linear relationships:

Now let's jump into Python and try using it to create predictive linear regression models.
Linear Regression using Python:
There are two popular ways to apply linear regression in Python:
- scikit-learn
- statsmodels
I will be using an example from the notes/exercises that I attempted. Let's start with sklearn (scikit-learn).
Scikit-learn:

First we import our dataset from the sklearn library:
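The original code is from a screenshot; a minimal version of the import would look like this (note that load_boston shipped with older scikit-learn versions; it was deprecated in 1.0 and removed in 1.2, so this assumes an older install):

```python
# Assumes scikit-learn < 1.2, where load_boston was still available.
from sklearn.datasets import load_boston

boston = load_boston()  # Bunch object with .data, .target, .feature_names, .DESCR
```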

Dataset description: this is the Boston house prices dataset, which can be used to test machine learning models. Let's have a quick look at the description/features of the dataset below:
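The description screenshot is not reproduced here; assuming the boston object loaded above, it can be printed with:

```python
print(boston.DESCR)  # full dataset description, including the 13 feature definitions
```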

Now let's import pandas and work with this dataset in a DataFrame. Note we use MEDV (median value of owner-occupied homes in $1000s) as our target or y value:
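A sketch of that step, assuming the boston object from above:

```python
import pandas as pd

# Features become columns; the target MEDV is added as its own column.
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target
```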

An overview of the data itself:
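The screenshot is not shown here, but a quick look can be taken with:

```python
df.head()      # first five rows
df.describe()  # summary statistics per column
```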

Let us consider x to be RM (average number of rooms per dwelling), which will be the independent variable. Now let's plot x and y using seaborn.
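A sketch of the plot, assuming the df built above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of rooms per dwelling (x) against median home value (y).
sns.scatterplot(x='RM', y='MEDV', data=df)
plt.show()
```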

There is a positive, moderate linear relationship between the variables.
Now let us fit a model to the plot using sklearn functions:
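A minimal version of the fitting step:

```python
from sklearn.linear_model import LinearRegression

X = df[['RM']]  # 2-D feature matrix, as sklearn expects
y = df['MEDV']

model = LinearRegression()
model.fit(X, y)
```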

Now let's get the predictions (which give the line equation for the line of best fit) and the score:
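A sketch of those two calls, using the model fitted above:

```python
yhat = model.predict(X)   # predicted MEDV values along the line of best fit
print(model.score(X, y))  # R², roughly 0.48 when using RM alone
```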
Note: the score here is the R².
R² (the coefficient of determination) measures how much of the variation in y our model explains; it is computed from the sum of squared residuals we discussed earlier (R² = 1 − SSres/SStot). R² is the most common metric to evaluate a regression and is the default scoring measure in sklearn, which helps us understand how well a model fits.
Then we plot the predicted y values (let's call them yhat on the graph) against the actual y values to see how well the model worked.
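A sketch of that comparison plot:

```python
plt.scatter(y, yhat, alpha=0.5)  # actual vs. predicted MEDV
plt.xlabel('Actual MEDV ($1000s)')
plt.ylabel('Predicted MEDV (yhat)')
plt.show()
```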


From the score, which is basically the R², we understand that the best-fit model explains about 48% of the variance in y (R² ≈ 0.48).
If all the points were plotted in a nice diagonal line, then it would have been a perfect fit; if the plot were very cloud-like, then this would be a very bad fit.
From the above plot we can say that the observations and prediction points are fairly far apart.

Now, going back to the line equation:
y = mx + b
model.coef_ will give you the slope (m), which we can interpret as: for every increase of the x value (RM, average number of rooms per dwelling) by 1 unit, the y value (MEDV, median value of owner-occupied homes in $1000s) will increase by 9.10210898.
model.intercept_ will give you the intercept (b), which can be interpreted as: when the value of x (RM, average number of rooms per dwelling) is 0, y (MEDV, median value of owner-occupied homes in $1000s) would be −34.67.
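For reference, those attributes can be inspected directly:

```python
print(model.coef_)       # array([9.10210898]), the slope m
print(model.intercept_)  # about -34.67, the intercept b
```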
I personally prefer this method over statsmodels, but either one of them gets the job done. Now let's have a look at statsmodels.
Statsmodels:
The steps are mostly similar and quite straightforward:
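The original screenshots are not shown; a minimal sketch of the statsmodels fit is below. One caveat: sm.OLS does not add an intercept automatically, and the 3.6534 RM coefficient reported later is consistent with a no-intercept fit (wrap the features in sm.add_constant to include one):

```python
import statsmodels.api as sm

# No-intercept fit of MEDV on RM, assuming the df built earlier.
# Use sm.OLS(df['MEDV'], sm.add_constant(df[['RM']])) to add an intercept.
sm_model = sm.OLS(df['MEDV'], df[['RM']]).fit()
```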


Using .summary():
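Assuming the sm_model from the sketch above:

```python
print(sm_model.summary())  # regression table: coefficients, R², standard errors, etc.
```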

Interpretation:
There is a lot of description here, so let us consider the items we discussed earlier. First we have the dependent variable, the model, and the method: OLS stands for Ordinary Least Squares, and least squares means we are trying to fit a regression line with minimal squared residuals. The coefficient of 3.6534 means that as the RM variable increases by 1, the predicted value of MEDV increases by 3.653. (Note this slope differs from sklearn's 9.10 above; the 3.6534 is consistent with a fit that has no intercept term, since sm.OLS does not add one automatically.)
Conclusion:
This blog touched on many concepts that deserve further exploration beyond its scope, but it is definitely a nice way to get introduced to basic machine learning models and how to use the concepts in Python. In my next couple of blogs, I will try explaining more concepts, again with Python.
