Linear Regression: A Beginner’s Approach

Abhishek Chougule · Analytics Vidhya · May 3, 2020

Before starting the blog, let us refresh some basic definitions.

What is Machine Learning?

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision.

Source: Wikipedia

As per the above definition, the models require training data to compute predictions. Depending upon the data provided, machine learning algorithms can be broadly classified into three categories: supervised, unsupervised, and semi-supervised learning.

In this blog, we are going to learn about one such supervised learning algorithm.

Supervised algorithms can broadly be divided into two categories:

Classification:

In this type of problem, the machine learning algorithm predicts which class a data point belongs to. The classes can be binary or multi-class.

E.g. Logistic Regression, Decision Tree, Random Forest, Naïve Bayes.

Regression:

In this type of problem, the machine learning algorithm tries to find the relationship between continuous variables. The modelling approach relates a dependent variable (response variable) to a given set of independent (feature) variables. Just FYI, I will be using “response” and “target” variable interchangeably, along with LR for Linear Regression.

E.g. Linear Regression, Ridge Regression, Elastic Net Regression.

Today we are going to learn Simple Linear Regression.

Linear Regression:

I personally believe that to learn any machine learning model, or any concept related to it, we should understand the geometric intuition behind it: it will stay with us longer, as visuals help sustain memory better than a purely mathematical representation.

So, to proceed with the blog, we will cover the following aspects of LR:

1. What is Linear Regression?

2. Geometric Intuition

3. Optimization Problem

4. Implementation with the sklearn library.

What is Linear Regression?

As stated above, LR is a regression technique which tries to find the relationship between a continuous set of variables in any given dataset.

So, the problem the algorithm tries to solve is to best fit a line/plane/hyperplane (as the dimension goes on increasing) to any given set of data. Yes, it’s that simple.

We will understand this with the help of the scatter plot below.

Geometric Intuition:

To represent this visually, let’s take a look at the scatter plot below, which is derived from the Boston Housing dataset that I have used as an example in the latter part of the blog.

The coordinates are denoted as below:

X-Axis -> Actual House Prices

Y-Axis -> Predicted House Prices

Figure 1: Scatter plot of actual vs. predicted house prices

Folks, I want you to stay with me on this, as we need to understand what LR is doing visually. Looking at the scatter plot, if we imagine a line passing from the origin, it will pass through most of the predicted points, which shows that the model has correctly predicted most of the response variables.
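To make this concrete, here is a minimal matplotlib sketch of that kind of plot. The actual and predicted values below are made-up placeholder numbers, not the Boston results; the dashed line is where a perfect model would place every point.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up actual/predicted prices, purely to illustrate the plot shape
actual = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
predicted = np.array([12.0, 14.0, 21.0, 23.0, 31.0, 36.0])

plt.scatter(actual, predicted)
# A perfect model would put every point on the 45-degree line y = x
plt.plot([actual.min(), actual.max()],
         [actual.min(), actual.max()], "r--")
plt.xlabel("Actual House Prices")
plt.ylabel("Predicted House Prices")
plt.show()
```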

Optimization Problem:

Now that we are getting hold of what LR does visually, let’s have a look at the optimization problem we are going to solve, because every machine learning algorithm boils down to an optimization problem, and we can understand its crux through a mathematical equation.

For any machine learning algorithm, the ultimate goal is to reduce the errors on the dataset so that it can predict the target variables more accurately.

With that said, what is the optimization problem for Linear Regression? It is to minimize the sum of errors across the training data. The next question that will pop up in your mind is: why the sum of errors?

Figure 2: A line through the origin, with correctly predicted points (green crosses) and error points (red crosses) above and below it

Figure 2, which is a lighter version of the scatter plot, answers that question perfectly. We have drawn a line passing through the origin, which contains the points that are correctly predicted, denoted by green crosses. But if we look closely, there are a few points which are not on that line, denoted by red crosses. These can be divided into points above and below the line; they are the data points which the model was not able to predict correctly. And the optimization problem is about reducing the distance of these error points from the line.

The error points above the line can be represented as:

Error₁ = Y₁ − Y₁^  {Y₁: actual point, Y₁^: predicted point}

Error₁ will have a positive value, as Y₁ is greater than Y₁^.

The error points below the line can be represented as:

Error₂ = Y₂ − Y₂^

Error₂ will have a negative value, as Y₂ is smaller than Y₂^.

Error₃ = Y₃ − Y₃^

Error₃ will be zero, as the point lies on the line: the actual and predicted values are the same for all the points represented by green crosses.
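To make the sign behaviour concrete, here is a tiny sketch with three hypothetical points, one for each case above:

```python
import numpy as np

# Three hypothetical points: one above the line, one below, one on it
y_actual = np.array([30.0, 18.0, 25.0])   # Y1, Y2, Y3
y_pred   = np.array([27.0, 21.0, 25.0])   # Y1^, Y2^, Y3^

errors = y_actual - y_pred
print(errors)  # [ 3. -3.  0.] -> positive, negative, and zero error
```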

So, if we try to visualize Figure 2 in terms of Figure 1, we will realize that there will be many positive and negative errors, and simply summing them would let the positive and negative values cancel each other out rather than give us the true magnitude of the error. So, to mitigate this issue, we square those values, so that every error contributes a positive quantity.

The equation of the n-dimensional plane is:

πₙ : WᵗX + W₀ = 0  {where W is the weight vector and W₀ is a scalar intercept}

I believe you must be familiar with the basics of linear algebra to know how the equation of a plane is derived. But as a quick mapping for now, I would suggest relating this equation to the slope-intercept form of a line, y = mx + c {m ↔ Wᵗ and c ↔ W₀}.

That brings us to the optimization problem which we are going to solve:

Argmin over (W, W₀) of ∑ (Yᵢ − Yᵢ^)²  {i goes from 1 to n}

Replacing the Yᵢ^ term with the equation of the plane:

Argmin over (W, W₀) of ∑ (Yᵢ − (WᵗXᵢ + W₀))²  {since Yᵢ^ = WᵗXᵢ + W₀}

Now, there might be confusion about why we substitute the plane equation π for Yᵢ^. It is because, as the image above shows, the actual optimization problem is to reduce the distance of the points which are not present on the line, and the plane equation is what generates those predictions.

Also, one thing worth mentioning here w.r.t. LR: as we take the square of the linear errors, LR is also referred to as Ordinary Least Squares (OLS) or Linear Least Squares.

And the whole (Yᵢ − (WᵗXᵢ + W₀))² loss term is called the squared loss.
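As a quick illustration, here is how that squared loss could be computed by hand with NumPy. The data points and weights below are hypothetical, chosen only to show the mechanics:

```python
import numpy as np

# Hypothetical data and weights, just to show the mechanics of the loss
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])        # three points, two features each
y = np.array([5.0, 11.0, 18.0])   # actual targets
W = np.array([1.0, 2.0])          # weight vector
W0 = 0.5                          # intercept

y_hat = X @ W + W0                       # Yi^ = W^T Xi + W0
squared_loss = np.sum((y - y_hat) ** 2)  # sum of (Yi - Yi^)^2
print(squared_loss)                      # 0.75
```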

Deriving the solution to this optimization problem is mathematically exhaustive, and I don’t intend to solve it here in the blog, as I am trying to cover only the basics of LR.

But I have provided a link where you can do further reading on Linear Regression.

From an intuition perspective, though, I would like to explain the parabolic graph below, which shows the error vs. the squared loss term.

Figure 3: Error vs. squared loss

In Figure 3, F(X) = WᵗX + W₀. We can see that as the error grows along the X-axis, toward either the positive or the negative end, the square of the error term increases accordingly, while the error term at the origin is zero. So the main objective is always to keep the error term as close to zero as possible.
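A few lines of matplotlib are enough to reproduce the shape of this parabola:

```python
import numpy as np
import matplotlib.pyplot as plt

# Squared loss grows quadratically as the error moves away from zero
error = np.linspace(-5, 5, 200)
plt.plot(error, error ** 2)
plt.xlabel("Error  (Y - F(X))")
plt.ylabel("Squared Loss")
plt.title("Figure 3 sketch: error vs. squared loss")
plt.show()
```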

I hope you now have a basic understanding of what linear regression is and how it works.

So now, let’s try out how linear regression works via the sklearn library. If you have just started to explore machine learning, you must get hold of this brilliant library. For Linear Regression specifically, we can also refer to the Sklearn Linear Regression documentation.

Let’s code.

Before starting, let me give an overview of the dataset we are going to use.

Boston Dataset: This dataset is a copy of the UCI Machine Learning repository version, with 506 instances (i.e. data points) and 13 features to work with. The data was collected by the U.S. Census Service concerning housing in the area of Boston, Mass. From the Sklearn Boston Dataset link we can get further information about the dataset. The reason to use this dataset is that it has been a part of many machine learning papers addressing regression problems.

Import the required libraries

Sklearn provides a couple of built-in datasets to try algorithms on. As stated earlier, I have used the Boston dataset for this explanation.
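A minimal sketch of the setup (note that load_boston shipped with scikit-learn when this blog was written, but it was deprecated in version 1.0 and removed in 1.2):

```python
import pandas as pd
from sklearn.datasets import load_boston  # deprecated in 1.0, removed in 1.2
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the Boston housing data as a Bunch (dict-like) object
boston = load_boston()
print(boston.data.shape)  # (506, 13): 506 instances, 13 features
```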

The dataset contains the features below. Details about each feature can be obtained from the link mentioned above.
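A quick way to list them:

```python
print(boston.feature_names)
# ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX'
#  'PTRATIO' 'B' 'LSTAT']
```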

Below are a few of the target variables, aka the response variable (the actual Yᵢ), for these features.
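For example, continuing from the snippet above:

```python
# First few target values: median house prices in $1000s (MEDV)
print(boston.target[:5])
```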

Here, we load the boston_data into a pandas DataFrame and create two variables: X, containing the features, and Y, containing the target variable. We also split the data in roughly 70-30 fashion, meaning about 70% of the data is used for training and about 30% for testing (unseen data for the trained model), using the train_test_split utility.
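A sketch of this step. The exact test_size is my assumption: 0.33 reproduces the 339/167 split quoted below, whereas a strict test_size=0.3 would give 354/152 instead.

```python
X = pd.DataFrame(boston.data, columns=boston.feature_names)
Y = boston.target

# test_size=0.33 yields 339 training and 167 test samples
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42
)
print(X_train.shape[0], X_test.shape[0])  # 339 167
```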

After partitioning the data, we have 339 samples for training and the remaining 167 samples for testing out our model.

A LinearRegression() object needs to be created in order to apply the fit method to our training data. The fit method determines how well we can approximate the response variable, which depends on the independent variables.
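A minimal sketch of that step:

```python
# Instantiate the model and fit it on the training data
lr = LinearRegression()
lr.fit(X_train, Y_train)  # learns the weights W (lr.coef_) and W0 (lr.intercept_)
```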

Now it’s time to check how the algorithm has performed on our dataset, i.e. to calculate the squared error. Sklearn readily provides the mean squared error metric for regression (see the metrics link). The model gives an error of 24.07, which looks good in terms of the data on which it was trained.
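A sketch of the evaluation; the exact number depends on the random split, and the 24.07 quoted above is from the original run:

```python
# Mean squared error on the held-out test data
Y_pred = lr.predict(X_test)
print(mean_squared_error(Y_test, Y_pred))
```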

To conclude the blog, I hope the reader has got a basic flavour of what Linear Regression is and will be curious to learn about it in more detail.

Further Exploration:

1. Try the model with different error metrics for Linear Regression, like Mean Absolute Error and Root Mean Squared Error.

2. Try the algorithm with a large dataset, and with imbalanced and balanced datasets, so that you get all the flavours of regression.

Thank you, folks, for reading!
