ML Basics — Linear Regression

Bhuwan Bansal
Living Inside Terminal
9 min read · Jul 7, 2020

Introduction

This article is part of a series on machine learning models, where we discuss both the theory and the practical implementation in the same article.

We start with linear regression, the most basic model you will learn in a data science class or any machine learning crash course.

Machine learning models are built for prediction. In a simple linear regression model we use x, the independent variable, to predict the corresponding y variable; in other words, we try to figure out a relationship between the two variables. The model has one coefficient and one intercept. This is just a recap of what you might have learnt in middle school or even in high school.

y = β₀ + β₁x

Here, β₀ is the intercept of the line and β₁ is the coefficient of the independent variable x. Any linear equation can be written in the form shown above, so we build our model on this concept.

Data exploration

Before we move on to predicting values we will explore a data set of houses.

import numpy as np  # numpy for matrices and transformations
import pandas as pd  # pandas for importing and manipulating data
import matplotlib.pyplot as plt  # for plotting graphs
import seaborn as sns  # data exploration
from seaborn import heatmap  # making a heat map
from sklearn.linear_model import LinearRegression  # the regression model
Exploring the data set

Here we import numpy, pandas, matplotlib and seaborn. These libraries are essential in data science. In brief, numpy helps us with matrix calculations and transformations, pandas helps us read and manipulate the data, matplotlib, as the name suggests, helps us plot graphs, whereas seaborn has data exploration features that the other libraries are not capable of, as we will see in the exploration part.

In the above code we read the data with pandas: pd.read_csv(path to the csv file). CSV stands for comma-separated values and is considered semi-structured data. Semi-structured data differs from structured data because the values in a column are not constrained by a fixed schema the way they are in a relational database.
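The read step might look roughly like this (a minimal sketch; the file name houses.csv and the DataFrame name df are assumptions, since the original code is shown as a screenshot):

df = pd.read_csv("houses.csv")  # read the housing data set into a DataFrame

print(df.head())  # quick look at the first few rows and column names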

Heat-map

Now the next step is to choose independent and dependent variables with a correlation of 0.5 or more. This is so that when we draw the scatter plot we can eyeball the relationship, regardless of whether the correlation is positive or negative. Here I have used the built-in corr() function to make a data frame of the numerical data and then used that data frame as the argument for the heat map. The scale on the right of the plot shows the correlation: the lighter the colour of a box, the higher the correlation. I have marked GrLivArea and LotArea as the variables I will explore in this article.
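A sketch of that heat map, assuming the DataFrame is called df as above:

corr = df.select_dtypes(include=[np.number]).corr()  # correlation matrix of the numerical columns

plt.figure(figsize=(12, 10))
sns.heatmap(corr)  # heat map of the correlations between columns
plt.show()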

Simple Linear Regression

Plotting the graph

The code for this section gives us a simple graph with a scatter plot and a regression line going through the data.

Positive correlation graph

In the code we have converted the SalePrice and GrLivArea columns into series with df['SalePrice'] and df['GrLivArea'], and then reshaped each one into a column vector of shape (number of inputs) x 1, essentially creating an array of values.

plt.scatter is a matplotlib function which plots a scatter plot of the arrays of values we assigned to our variables. Here, GrLivArea is the independent variable and SalePrice is the dependent variable. We will be predicting SalePrice in the next section with GrLivArea as the predictor.
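The reshaping and the scatter plot might look like this (a sketch; the variable names x and y are assumptions):

x = df["GrLivArea"].values.reshape(-1, 1)  # independent variable as an (n, 1) column vector
y = df["SalePrice"].values.reshape(-1, 1)  # dependent variable as an (n, 1) column vector

plt.scatter(x, y)  # scatter plot of living area against sale price
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()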

Next I have imported the linear regression model from sklearn. sklearn (scikit-learn) is a library made especially for machine learning; you can read more about it in the scikit-learn documentation. model is a LinearRegression object created so that we can use the functions related to linear regression. As discussed earlier (y = mx + c), our c is the intercept: c = model.intercept_.
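Creating and fitting the model is only a couple of lines (a sketch, assuming the x and y arrays defined above):

model = LinearRegression()  # create the regression object
model.fit(x, y)             # fit it to the data

print(model.intercept_)  # c in y = mx + c
print(model.coef_)       # m in y = mx + c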

Now we come to plotting the function we get from the model. We have created the two end points of the regression line using maximum and minimum value of our independent variable GrLivArea.

y1 = model.intercept_ + (x.min() * model.coef_)

y2 = model.intercept_ + (x.max() * model.coef_)

Now we use these values for x_arr and y_arr. Making arrays is not strictly necessary as there are only two values in each of them, but I have done it here to make the code easier to follow.

plt.plot(x_arr, y_arr, color="red")

This is the red line seen on the graph above. As we can see, this might not be the best regression line. Linear regression is supposed to minimise the residuals: the distance between the predicted y value (regression line) and the actual y value (scatter plot) on the graph.

Showing residuals

Derivation and Sum of Squares (Mean Squared Error)

After many iterations we get the line of best fit, after which we use the sum of squared residuals to measure the mean squared error, which will be explained later with code.

Now, we need to look at the derivation of y = mx + c. The scatter plot points are the actual data points recorded in the data set and the regression line gives the predicted values.

Any data point gives us a tuple (xᵢ, yᵢ), and the corresponding point on the line gives the predicted value ŷᵢ = a₀ + a₁xᵢ:

Defining line of ‘best’ fit

Residuals: observed y - predicted y

The predicted values lie on the line of best fit, shown in red in the graph above, while the data points in blue are the actual y values. Our cost function is the sum of squared residuals, which, averaged over the data points, is the mean squared error. We find the line of "best fit" as the one with the minimum cost function.

Residual

We square the differences because data points below the regression line give negative differences.

Substitution

Now we will use derivatives to find both a0 and a1:

Differentiating with respect to a0, the inner term gives us a factor of -1 and the exponent 2 comes down, giving -2 which is taken outside of the summation.

We set the derivative to zero and divide both sides by -2 to get rid of it, then split the expression into three parts. Now we have three summations which can be further broken down. We know that the intercept a0 is a constant, therefore summing it over all n data points gives a0*n.

We have to make na0 the subject to get to the last expression shown above. If we separate the fractions we end up with two sums, both divided by the total number of input values, thus giving us the averages of y and x.
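Written out, the step this paragraph describes is the standard least-squares intercept (a sketch of the result, in the a0/a1 notation used here):

\frac{\partial}{\partial a_0}\sum_{i=1}^{n}(y_i - a_0 - a_1 x_i)^2 = 0
\;\Rightarrow\; n a_0 = \sum_{i=1}^{n} y_i - a_1 \sum_{i=1}^{n} x_i
\;\Rightarrow\; a_0 = \bar{y} - a_1 \bar{x}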

Now we have the intercept of the line of best fit. The next step is to find a1 (the coefficient). I won't go through each and every step for this one. We take the derivative with respect to a1.

This time the inner term gives us -xi, and the constant -2 is again taken outside the summation. As with the previous derivative, we can divide both sides by -2.

After substituting a0 we separate the single summation into two and make a1 the subject.
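Carrying the same steps through for a1 gives the familiar closed form (again the standard least-squares result, stated here as a sketch):

\frac{\partial}{\partial a_1}\sum_{i=1}^{n}(y_i - a_0 - a_1 x_i)^2 = 0
\;\Rightarrow\; a_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}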

Now we have a basic understanding of how linear regression works. Here's a little gif to visualise how linear regression actually behaves when we make a model in Python.

Iterations of linear regression

Model

After minimizing the cost function we reach the line of best fit. As we can see from the gif above, we end up with a line that has minimal total distance between the scattered points and itself.

Let’s jump into some code…

There are a few more imports required when we analyse the performance of the model itself, such as the mean squared error which we just used in the derivation:
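A sketch of those extra imports (mean_squared_error and train_test_split are the pieces used below):

from sklearn.model_selection import train_test_split  # splitting the data into train and test sets
from sklearn.metrics import mean_squared_error        # the cost function from the derivation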

(yᵢ - ŷᵢ) squared, summed over all data points and divided by the number of points, is our mean squared error, which we need to minimise for the given data.

Moving on, after reshaping and assigning our data to the independent and dependent variables, we split the data into two parts, training and testing, using the built-in scikit-learn function train_test_split. Here I have split the data 80:20 for training and testing. We usually also split off part of the data for cross-validation, but I won't be using it for this example.
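The split itself is one line (a sketch, assuming the x and y arrays from before):

# hold back 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)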

We use our training data set to fit the model. Fitting the model is essentially the process shown in the gif above: after many iterations it finds the best fit for the data. Now with our model variable we can print the equation of the line y = mx + c, with model.intercept_ as c and model.coef_ as m.
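Fitting on the training set and printing the fitted line might look like this (a sketch):

model = LinearRegression()
model.fit(x_train, y_train)

# y = mx + c with m = model.coef_ and c = model.intercept_
print("coefficient (m):", model.coef_)
print("intercept (c):", model.intercept_)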

Next, we need to visualise the line of best fit and plot the scatter plot for our training data set (x_train, y_train).
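A sketch of that visualisation, assuming the variables above:

plt.scatter(x_train, y_train)                          # scatter plot of the training data
plt.plot(x_train, model.predict(x_train), color="red") # fitted line on top
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()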

Linear Regression Example
equation and errors

Here the dotted line is plt.plot(x_test, y_pred). The model variable gives us access to the prewritten linear regression functions such as predict, used here with our test values as shown in the code above. y_pred is essentially an array of predicted values. Just eyeballing this graph, we can see there are many outliers which increase our cost function and will return a poor test score. Therefore our training and testing scores for the model differ significantly.
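The prediction, the dotted line and the two scores discussed below might look like this (a sketch, assuming the variables above):

y_pred = model.predict(x_test)  # predict sale prices for the held-out test data

plt.scatter(x_test, y_test)                          # actual test values
plt.plot(x_test, y_pred, linestyle="--", color="red") # dotted line of predictions
plt.show()

print("train score:", model.score(x_train, y_train))  # R^2 on the training set
print("test score:", model.score(x_test, y_test))     # R^2 on the test set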

The training score of the model will usually be higher than the test score. We did not shuffle our data here before making the model, which can change the train and test scores. In upcoming articles I will show how to shuffle the data, either with random_state while splitting the data or with a shuffle function during preprocessing. I have deliberately shown a test score of less than 0.5 to illustrate that most of the relationships we model are not truly linear in nature. Even though there is enough evidence in the training score, testing our model gives another perspective.

Towards the end I have created a data frame with y_pred and y_test to show the differences between our actual y values and predicted y values. The sum of the squared differences divided by the total number of values is our mean squared error. It is large here mainly because our data values are in the hundreds of thousands.
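A sketch of that comparison and the error (the column names in the comparison DataFrame are assumptions):

# side-by-side comparison of actual and predicted sale prices
comparison = pd.DataFrame({"actual": y_test.flatten(),
                           "predicted": y_pred.flatten()})
print(comparison.head())

# mean squared error: large because sale prices are in the hundreds of thousands
print("MSE:", mean_squared_error(y_test, y_pred))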

I hope you found this article helpful. Next article will contain a link to the github repository of the code as well as the data set.

About Me

I'm a second year software engineering student majoring in Advanced Intelligent Systems at the Australian National University. If you want to reach out to me you can check the links below.

Twitter

Github
