Simple Linear Regression in a Comprehensive Way

Regression is a form of predictive modelling technique that investigates the relationship between independent and dependent variables. There are many types of regression, and Linear Regression is one of them. Linear Regression predicts the dependent variable by assuming that its relationship with the independent variable is a straight line.

Akhil Reddy Mallidi
#ByCodeGarage
8 min read · Aug 28, 2019

What is Simple Linear Regression?

It’s the simplest of the regression models. Simple Linear Regression is applied only when our data has a single independent variable, and it predicts the dependent variable by modelling the relationship between the independent and dependent variables as a straight line, in the form below.

y = mx + c

where
y is the dependent variable
x is the independent variable
c is a constant, also called the bias, which is added to the line
m is the slope, the rate at which y changes with x

Mathematically, c is the y-intercept, which determines the value of y when x is 0, and m is the slope, which determines the angle of the line.
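To make the roles of m and c concrete, here is a tiny Python sketch with made-up numbers:

# Toy example with made-up values: slope m = 2, intercept c = 3
m, c = 2, 3
for x in [0, 1, 2]:
    y = m * x + c
    print(f'x = {x} -> y = {y}')
# Prints y = 3, 5, 7: y equals c when x is 0, and grows by m for every unit increase in x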

How does it work?

The Simple Linear Regression model fits a regression line in such a way that the line is as close as possible to all the data points in the dataset. To get a clearer idea of how it works, let’s go through an example. We have a salary dataset comprising Years of Experience and Salary. The dataset is as follows.

Overview of the dataset

Here the dependent variable is Salary and the independent variable is YearsExperience. If the dependent variable increases when the independent variable increases, there is a positive correlation between them. If it decreases, there is a negative correlation between them.

Now let’s draw a scatter plot of Salary against YearsExperience.

Scatter plot Experience VS Salary

The best fit line for the data is the one that produces the least error, i.e., the least squared approximation error, among all the regression lines that could be drawn. This method of finding the best fit line is called the Least Squares Approximation Method.

Now let’s get started by drawing a regression line among the data points in the scatter plot, using the means of the independent and dependent vectors. We draw the line through the point whose coordinates are those two means.

Xm (mean) = sum of all experience values / total number of experience values
Ym (mean) = sum of all salary values / total number of salary values

Now plot a line, which will be our assumed regression line, over the data points in the scatter plot.

Regression line plotted using mean
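As a rough sketch of this step in Python (assuming the X_train and y_train arrays and the matplotlib import that appear later in the article, plus a made-up candidate slope for illustration):

import numpy as np
import matplotlib.pyplot as plt

Xm, Ym = np.mean(X_train), np.mean(y_train)
plt.scatter(X_train, y_train, color='blue')            # the data points
plt.scatter([Xm], [Ym], color='green', marker='x')     # the mean point the line passes through
m_guess = 5000                                         # made-up candidate slope, for illustration only
plt.plot(X_train, Ym + m_guess * (X_train - Xm), color='red')  # a candidate line through the mean point
plt.show()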

From the above plot, we can observe that the regression line is somewhat far from some of the data points. This whole process is iterative and continues until the best fit line, the one with the least squared approximation distance, is obtained.

The values on the regression line corresponding to the original values are called predicted values. The least squares approximation distance can be calculated as follows.

Distance approximation between an original value and its predicted value:

Distance = Σ (Yi − Yp)²

where Yi is an original value and Yp is the corresponding predicted value on the regression line.
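In code, that distance can be computed like this (a sketch, assuming y_pred holds the values the current regression line predicts for the training points):

import numpy as np

# Sum of squared distances between the original and predicted values
distance = np.sum((y_train - y_pred) ** 2)
print(distance)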

Next, we look for a regression line whose approximation distance is smaller than that of the current regression line.

The slope or multiplier of the new regression line can be calculated as follows:

m = Σ (Xi − Xm)(Yi − Ym) / Σ (Xi − Xm)²

where Xi and Yi are the values of the independent and dependent vectors. Each value is subtracted from its corresponding mean, and the slope of the new regression line is calculated from these deviations.

Let’s calculate the slope for the new regression line. You can view the whole calculation in the following table.

Hence the summations of d*e and d*d are 1463496.36666667 and 3.366111111. The slope of the new regression line can be calculated from the above values, and it is 0.00016808.

I have performed all these operations in Python code. Have a look at it.

import numpy as np

# Means of the independent and dependent vectors
Xm = np.mean(X_train)
Ym = np.mean(y_train)
sum1 = 0  # running sum of d*e
sum2 = 0  # running sum of d*d
print('Experience Salary d=Xi-Xm e=Yi-Ym d*e d*d')
print('------------------------------------------------------------------------------------------------')
for pos in range(0, len(X_train)):
    d = X_train[pos] - Xm   # deviation of experience from its mean
    e = y_train[pos] - Ym   # deviation of salary from its mean
    sum1 = sum1 + d*e
    sum2 = sum2 + d*d
    print(f'{str(X_train[pos]):{10}} {str(y_train[pos]):{10}} {str(d):{20}} {str(e):{20}} {str(d*e):{20}} {str(d*d):{20}}')

From the slope, we can calculate the y-intercept, or bias, by substituting the mean point (Xm, Ym) into the equation y = mx + c and solving for c, since the regression line passes through the mean point.
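Continuing the loop above, the slope and intercept follow from the accumulated sums (a sketch using the sum1, sum2, Xm and Ym computed there):

m = sum1 / sum2    # slope: sum of d*e divided by sum of d*d
c = Ym - m * Xm    # intercept: the line passes through the mean point (Xm, Ym)
print(f'Regression line: y = {m} * x + {c}')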

The obtained line equation is the new regression line, and this process continues over the regression lines that can possibly be drawn through our scatter plot. The regression line with the minimum least squares approximation error is called the best fit line.

Don’t worry, Python’s scikit-learn library does all this hectic work for us.

R-square Regression Analysis

To check how well our model fits the data, we can use the R-square regression analysis method. This metric is also called the coefficient of determination. The higher the R-square value, the better the model fits. But a model with a low R-square value is not always bad; it depends on the problem statement.

We can find the R-square value of the model in the following way:

R² = 1 − Σ (y − Yp)² / Σ (y − Ym)²

where Yp is the predicted dependent variable, y is the actual or original dependent variable, and Ym is the mean of the dependent variable. As in the least squares approximation method, we can calculate the R-square value from these sums.
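A manual computation might look like this (a sketch, assuming y holds the actual values and Yp the corresponding predictions):

import numpy as np

# R-square = 1 - (sum of squared residuals / total sum of squares)
ss_res = np.sum((y - Yp) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r_square = 1 - ss_res / ss_tot
print(r_square)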

Let’s implement Simple Linear Regression on the salary data. First, import all the necessary libraries and load the dataset.

# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Loading the dataset
df = pd.read_csv('Salary_Data.csv')

Now go through the data and perform some EDA (Exploratory Data Analysis) to understand and get familiar with the dataset.

# Viewing a few rows of data
print('----- Few rows of data -----')
print(df.sample(10))
print('\n\n')
print('----- Features in the dataset ----')
print(df.columns)
print('\n\n')
print('---- Shape of the dataset -----')
print(df.shape)

Some insights about the dataset

There are 30 observations in our dataset, with two columns, namely YearsExperience and Salary. Our problem statement is to predict a person’s salary based upon the experience (in years) he/she has. So YearsExperience is the independent variable and Salary is the dependent variable, as per our problem statement.

Let’s have a look over our dataset for any missing values.

# Check for null values
df.isnull().sum()

Insights about missing values in the dataset

Hurrah..!! No missing values are present in the dataset, so no data preprocessing is needed. Let’s jump directly into splitting our dataset into independent and dependent vectors.

# Converting dataset to dependent and independent vectors
# YearsExperience
X = df.iloc[:, :-1].values
# Salary
y = df.iloc[:, 1].values

Now split the data into training and test sets using scikit-learn’s train_test_split method. I have split the data so that 80 percent is the training set and 20 percent is the test set.

# Splitting the dataset into testing and training sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here are some insights about the dimensions of the training and test data after splitting them.

# Dimensions of the datasets after splitting into testing and training sets
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Dimensions of the resultant datasets

Now the training and test datasets are ready. Let’s import scikit-learn’s LinearRegression model and instantiate it.

# Fitting Simple Linear Regression model to the training data
from sklearn.linear_model import LinearRegression

# Instantiating LinearRegression model
linear_regression = LinearRegression()

# Fitting to the training data
linear_regression.fit(X_train, y_train)
Output showing our Linear Regression model is trained

The LinearRegression model we instantiated has now been fitted to our training data. That means a regression line (a best fit line) with the minimum least squares approximation distance has been identified. From that regression line, our model can start predicting outputs.

Now pass the test data to the model to see the salaries it predicts.

# Predicting dependent variable using independent variable
predictions = linear_regression.predict(X_test)

Our model has predicted the salaries of persons with respect to experience. Let’s view the predicted values and the original values together.

# Let's view predicted and original salaries
print('Predicted             -    Original')
for pos in range(0, len(predictions)):
    print(f'{predictions[pos]:<{25}} {y_test[pos]:<{15}}')
Predicted and original values

In some cases the predicted values are very close to the original values, and in other cases they are somewhat, but not too far, off. This is because our regression model learns the correlation among the variables by expressing it as a straight line, so not all data points pass through the regression line, due to outliers and other factors. These cause the differences between the predicted and original target values, and that’s the reason Linear Regression is not 100 percent accurate.
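One quick way to inspect those differences is to look at the residuals (using the predictions and y_test from above):

# Residuals: differences between original and predicted salaries
residuals = y_test - predictions
print(residuals)
print('Mean absolute residual:', np.mean(np.abs(residuals)))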

Let’s visualize the relationship between the independent variable and both the predicted and original values, along with the regression line, to get a clearer idea.

Now let’s plot a scatter plot between Experience and Salary for the training dataset, along with the regression line.

# Training data VS Regression line
# Regression line is drawn using predicted values for training set
plt.scatter(X_train, y_train, color='blue')
plt.plot(X_train, linear_regression.predict(X_train), color='red')
plt.title('Years VS Salary')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
Original and Predicted values of training set

Since it’s the best fit line, almost every data point lies very close to the regression line. Now let’s plot the same scatter plot between Experience and Salary for the test dataset, along with the regression line.
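This plot can be produced the same way as the training one; a minimal sketch (the regression line is still the one fitted on the training data):

# Test data VS Regression line
plt.scatter(X_test, y_test, color='blue')
plt.plot(X_train, linear_regression.predict(X_train), color='red')
plt.title('Years VS Salary')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()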

Original vs predicted values of test data

The regression line is very close to almost every point in the test dataset as well.

Mean squared error and R-square analysis can be performed in the following way.

# Import libraries
from sklearn.metrics import mean_squared_error, r2_score

# Model evaluation for training set
y_train_predict = linear_regression.predict(X_train)
rmse = (np.sqrt(mean_squared_error(y_train, y_train_predict)))
r2 = r2_score(y_train, y_train_predict)
print("The model performance for training set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
print("\n")
# Model evaluation for testing set
y_test_predict = linear_regression.predict(X_test)
rmse = (np.sqrt(mean_squared_error(y_test, y_test_predict)))
r2 = r2_score(y_test, y_test_predict)
print("The model performance for testing set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
Mean squared error and R2 score for train and test data

The complete Jupyter notebook can be found below.

You can find the GitHub repository here

Thanks for reading..!!

Hope you liked my article. Do share the article if you find it useful to your peers.

Let me know if you have anything to ask in the comments section :)

Reach out to me here
