# Linear Regression in Python from Scratch

Linear regression is one of the most basic and popular algorithms in machine learning. When any aspiring data scientist starts off in this field, linear regression is inevitably the first algorithm they come across. It’s intuitive, has a good range of uses, and is fairly straightforward to understand.

In this article we will build a simple Univariate Linear Regression Model in Python from scratch to predict House Prices.

First, let me explain the basic idea behind Linear Regression. Suppose you have a dataset (the training set) of house prices (the target variable) together with some feature of each house, such as its size or its number of rooms (the predictor, or feature). Your job is to predict the price of any other house, given the same feature that appears in the training set (for example, the number of rooms).

Linear Regression finds the line that best fits the training set, and then uses that line to predict the price of any unseen house (i.e., one that was not in the training set).

Linear Regression finds the parameters of that line which best fits the data, i.e., slope (theta1) and intercept (theta0) in this case.

This might seem very simple, but it is the most basic form of the algorithm: it assumes a linear relationship between the feature and the target variable, and it uses only a single variable, although it can easily be extended to multiple features. Most problems we face are not like this. In real-world situations, you will often find that the target varies non-linearly with the feature and that more than one feature is needed to predict the target variable. In such cases, a curve with higher-order terms of the features is used, and a model with more than two features is also not convenient to show graphically.

Still, there are a large number of problems in which Linear Regression is useful. We will start with the very basics, i.e., the univariate case in which a single feature has a fairly linear relationship with the target, and we will implement it on the house price problem discussed above.

First, let’s import some libraries:

NumPy is a Python library that makes numeric computations easy, especially on arrays. Pandas helps to structure and clean data for analysis in the form of Series and DataFrames. Matplotlib and Seaborn are for visualisation.

“%matplotlib inline” ensures that graphs are shown inside the Jupyter Notebook.
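A notebook cell matching the description above might look like the following. This is a sketch, not necessarily the original cell: the aliases `np`, `pd`, `plt` and `sns` are the usual conventions and are assumed here.

```python
# Jupyter notebook cell (the %matplotlib line is an IPython magic, not plain Python)
import numpy as np                 # numeric computations on arrays
import pandas as pd                # Series and DataFrames
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # statistical visualisation (heatmap)

%matplotlib inline
```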

Now we will read the data using the pandas read_csv function. Our housing price dataset looks like this:

Next, we give names to the columns:

Now the dataset looks good and informative.

Here is a utility function that uses matplotlib to plot the relationships between the features and the target variable as scatter plots:

So this gives the following graphs.

These graphs give a clear intuition of the relationship between the target variable and the features. We can see that RM (average number of rooms per dwelling) has a clearly linear relationship with MEDV.

Another way to see the relationships is to plot the correlation of the features and the target variable with each other using seaborn's heatmap. This is also much more descriptive.

MEDV has a correlation coefficient of 0.7 with RM, which indicates a stronger linear relationship than most of the other features show. So we will consider only this feature, because we are not going to use any higher-order terms.
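The number read off a correlation heatmap is normally the Pearson correlation coefficient (this is what pandas' `DataFrame.corr` computes by default). As a rough illustration of what it measures, the sketch below computes it by hand with NumPy. The arrays `rooms` and `price` are synthetic stand-ins invented for this example, not the housing dataset, and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

# Synthetic stand-in data (NOT the housing dataset): a noisy linear trend.
rng = np.random.default_rng(0)
rooms = rng.uniform(4.0, 9.0, size=200)
price = 9.0 * rooms - 34.0 + rng.normal(0.0, 5.0, size=200)

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    xc = x - x.mean()   # centre each variable on its mean
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

r = pearson_r(rooms, price)
print(f"Pearson r = {r:.2f}")
```

A value near +1 or -1 means the points lie close to a straight line, while a value near 0 means there is little linear relationship.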

How do we find the best parameters?

Now we know that we have to find the parameters that best fit the training data, but so far we don't know how to do that. Let's unfold that part too.

First we will initialize the parameters to some starting value (such as 0). Then we will apply gradient descent to them.

Technique to find the best parameters for our line (or curve)

After initializing the parameters, we find the value of the cost function at those parameters. The cost function calculates the error between the house prices (target) predicted by our hypothesis h and the actual prices of the houses in the training set. The goal of gradient descent is to minimize this cost. There are many ways to define a cost function; the most common is to take the difference between the predicted and actual values, square it, and take the mean over the training set.
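Written out, this squared-error cost takes the following form. It is reconstructed from the surrounding description, with $x^{(i)}$ and $y^{(i)}$ assumed as notation for the feature and the actual price of the i-th training record:

$$
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2
$$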

Here, h is the predicted value of any house given its feature x (RM in this case), and m is the number of records in the training set.

The division by 2 only simplifies the math: it cancels the factor of 2 that appears when the squared term is differentiated.

So h can be replaced by:
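For a single feature, the hypothesis is the straight line with intercept theta0 and slope theta1, so the substitution reads as follows (a reconstruction from the definitions in the text, with $x^{(i)}$ and $y^{(i)}$ assumed as notation for the i-th training record):

$$
h(x) = \theta_0 + \theta_1 x
\qquad\Longrightarrow\qquad
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2
$$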

To minimize the cost function we will use calculus, which tells us how to minimize a function. At a local minimum of a differentiable function, the slope is 0. If you don't know calculus, the idea is simply this: we find the slope of the cost function with respect to theta0 and theta1 at their current values, and keep subtracting that slope (scaled by a learning rate alpha) from theta0 and theta1 respectively. Each iteration shifts theta0 and theta1 so that the cost moves closer to its minimum.
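Differentiating the squared-error cost (half the mean of the squared differences) with respect to each parameter gives the standard gradient descent update rules below. This is a reconstruction rather than the article's original figure; $x^{(i)}$ and $y^{(i)}$ are assumed notation for the i-th training record, and both parameters are updated simultaneously in each iteration:

$$
\theta_0 := \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)
$$

$$
\theta_1 := \theta_1 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)}
$$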

alpha is the learning rate, i.e., how fast we want our parameters to be updated. It should not be too small, as this makes gradient descent slow, and not too big either, as that can make gradient descent overshoot the minimum and diverge instead of converging. It is often kept between 0.01 and 0.5, but every problem needs a different alpha. You can find it by trial and error: start from a small value and increase it as long as the cost still decreases on every iteration. If the cost increases in any iteration, alpha is too large and you should decrease it.
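To see this effect in practice, the sketch below runs gradient descent on four made-up points lying on the line y = 1 + 2x and records the cost after every step for two learning rates. The helper `cost_history`, the data, and the two alpha values are all assumptions chosen for illustration; the threshold at which divergence starts depends on the data.

```python
import numpy as np

def cost_history(x, y, alpha, iterations=50):
    """Run gradient descent from theta = (0, 0); return the cost after each step."""
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    history = []
    for _ in range(iterations):
        error = theta0 + theta1 * x - y      # h(x) - y for every record
        grad0 = error.sum() / m              # slope of the cost w.r.t. theta0
        grad1 = (error * x).sum() / m        # slope of the cost w.r.t. theta1
        theta0 -= alpha * grad0              # both gradients were computed
        theta1 -= alpha * grad1              # before either parameter changed
        error = theta0 + theta1 * x - y
        history.append((error ** 2).sum() / (2 * m))
    return history

# Made-up data on the line y = 1 + 2x (not the housing dataset).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

small = cost_history(x, y, alpha=0.05)   # cost falls on every iteration
large = cost_history(x, y, alpha=0.5)    # cost grows on every iteration

print("alpha=0.05:", small[0], "->", small[-1])
print("alpha=0.5: ", large[0], "->", large[-1])
```

Printing or plotting such a cost history is a cheap way to check a candidate alpha before committing to a long run.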

Now let's implement all this in code. We have a separate function for each task: predictPrice, calculateCost and gradientDescentLinearRegression. All of the code is implemented with theta as a vector.

`abline` is a utility function that plots the fitted line after every 1000 iterations so we can visualise it.

We then call the gradientDescentLinearRegression function with its default arguments.

So we have found the line that best describes the relationship between the feature and the target. This line can now be used to predict the prices of other houses.
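A minimal sketch of how these three functions could fit together is shown below. It is not the article's original code: the function names come from the text, but the signatures, the default `alpha` and `iterations`, and the synthetic check data are assumptions, and the plotting helper is left out.

```python
import numpy as np

def predictPrice(x, theta):
    """Hypothesis h(x) = theta[0] + theta[1] * x for a single feature x."""
    return theta[0] + theta[1] * x

def calculateCost(x, y, theta):
    """Half the mean squared error between predictions and targets."""
    m = len(y)
    return np.sum((predictPrice(x, theta) - y) ** 2) / (2 * m)

def gradientDescentLinearRegression(x, y, alpha=0.05, iterations=5000):
    """Fit theta by batch gradient descent, starting from zeros.

    The default alpha and iterations are placeholders, not the article's values.
    """
    m = len(y)
    theta = np.zeros(2)                      # theta[0]: intercept, theta[1]: slope
    costs = []
    for _ in range(iterations):
        error = predictPrice(x, theta) - y   # h(x) - y for every record
        gradient = np.array([error.sum(), (error * x).sum()]) / m
        theta = theta - alpha * gradient     # update both parameters together
        costs.append(calculateCost(x, y, theta))
    return theta, costs

# Synthetic check data (not the housing dataset): points on y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
theta, costs = gradientDescentLinearRegression(x, y)
print("intercept, slope:", theta)
print("final cost:", costs[-1])
```

On real data the same call would take the RM column as `x` and the MEDV column as `y`, and `alpha` would need tuning as described above.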

You can find the notebook and dataset here:

https://github.com/Nimishkhurana/Linear-Regression

Get in touch with me:-

An interactive Website to learn ML: https://pytholabs.com/?utm_source=LKDN&utm_medium=MYNK&utm_campaign=personal

Pursuing CSE @UIET, Panjab University. Competitive Programmer and Developer.