Simple Linear Regression

Pankaj kanyal
5 min read · Aug 19, 2022


Intuition | Mathematical Approach | In Python | Assumptions


In this blog, we will try to understand the very first algorithm most people learn on their Data Science or Machine Learning journey: Linear Regression.

Topics Covered in this blog

  1. Intuition
  2. Mathematical Approach
  3. Simple Linear Regression in Python
  4. Conditions required to implement Linear Regression

Intuition Behind Linear Regression

Linear Regression is a type of supervised machine learning algorithm. It is used to predict a continuous output variable, and it works on the concept of dependent and independent variables.

Let's think about this with a real-life example: demand, supply, and their effect on the cost of a product. If a product is in very high demand but has a limited supply, its cost rises in the market. Similarly, if the demand falls, the cost decreases.

For a second example, consider ice cream sales and temperature. As the temperature rises, ice cream sales increase, and vice versa.

Wait! Did you notice what is happening? I think you got it. Demand and supply directly affect the cost, and in the second example the temperature directly affects ice cream sales. So demand, supply, and temperature are the independent features, while cost and sales are the dependent features.

Mathematical Approach

Correlation Concept

If you already know the concept of correlation, you can jump straight to the next section. Correlation is a statistical value used to describe the relationship between two variables.

The relationship can be described as a positive, negative, or neutral (zero) correlation.


In the case of linear regression, we can only make useful predictions when there is a relationship between X and Y, meaning their relationship must show either a positive or a negative correlation.

The strength of the correlation is measured by the Pearson correlation coefficient r, which always lies between −1 and +1:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

where x̄ and ȳ are the means of X and Y.
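The correlation coefficient can be computed directly with NumPy. Here is a quick sketch using made-up sample data (the values are illustrative only):

```python
import numpy as np

# Hypothetical sample data: hours studied vs. exam score
x = np.array([1, 2, 3, 4, 5])
y = np.array([52, 60, 71, 80, 92])

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r between x and y
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))
```

A value this close to +1 indicates a strong positive correlation, so a linear model would fit this data well.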

Now, we are pretty much clear about the correlation coefficient and understand what it really does. Let’s now move to the equation that our models try to fit in the data.

Simple Linear Regression

Simple Linear Regression is a method for predicting a quantitative response using a single independent feature.

The mathematical equation for simple linear regression is the equation of a straight line:

y = m·x + b

where y is the predicted response, x is the independent feature, m is the slope, and b is the intercept.

In simple linear regression, we estimate the slope of the line (m). Mathematically, the slope can be found as

m = (mean(X·Y) − mean(X)·mean(Y)) / (mean(X²) − mean(X)²)

To find the intercept of the regression line, we use

b = ȳ − m·x̄

that is, the mean of Y minus the slope times the mean of X.
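To verify these two formulas numerically, here is a tiny worked example. The data are made up so that y = 2x + 1 exactly, so the formulas should recover a slope of 2 and an intercept of 1:

```python
import numpy as np

# Toy data generated from y = 2x + 1 (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# Slope: m = (mean(XY) - mean(X)*mean(Y)) / (mean(X^2) - mean(X)^2)
m = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x**2) - np.mean(x) ** 2)

# Intercept: b = mean(Y) - m * mean(X)
b = np.mean(y) - m * np.mean(x)
print(m, b)  # m ≈ 2.0, b ≈ 1.0
```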

Simple Linear Regression in Python (Sample Code)

Importing Libraries

import numpy as np
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Generating Random data as X and Y

X = [random.randint(0,100) for i in range(10)]
Y = [random.randint(0,1000) for i in range(10)]

Sort both lists X and Y so the randomly generated data show a roughly linear, increasing trend

X.sort()
Y.sort()

Creating DataFrame Using pandas

df = pd.DataFrame({'X':X, 'Y':Y})

Calculating the X square column and XY Column

df['X^2'] = df['X']**2
df['XY'] = df['X']*df['Y']

Getting all the values required to find the slope and the intercept for the regression line.

meany = np.mean(df['Y'])
meanx = np.mean(df['X'])
meanxy = np.mean(df['XY'])
meanx2 = np.mean(df['X^2'])
meanx_sq = meanx**2

Calculating Slope and Intercept

slope = (meanxy - (meanx*meany)) / (meanx2 - meanx_sq)
b = meany - slope*meanx

In my case, the slope came out to around 8.2 and the intercept to around −53. Your values will differ because the data is randomly generated.
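As a sanity check, the hand-computed slope and intercept should match what NumPy's built-in least-squares fit (np.polyfit with degree 1) produces on the same data. A sketch with a fixed random seed so the comparison is reproducible:

```python
import random
import numpy as np

random.seed(0)  # fixed seed so the check is reproducible
X = sorted(random.randint(0, 100) for _ in range(10))
Y = sorted(random.randint(0, 1000) for _ in range(10))

x = np.array(X, dtype=float)
y = np.array(Y, dtype=float)

# Manual formulas, as derived above
slope = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x**2) - np.mean(x) ** 2)
b = np.mean(y) - slope * np.mean(x)

# np.polyfit with degree 1 fits the same least-squares line
fit_slope, fit_b = np.polyfit(x, y, 1)
print(np.isclose(slope, fit_slope), np.isclose(b, fit_b))  # True True
```

Both approaches solve the same ordinary-least-squares problem, so they agree up to floating-point precision.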

Generating the x and y coordinates for drawing the line

x = [i for i in range(1,100)]
y = [i*slope+b for i in range(1,100)]

Plotting the Line

plt.plot(x, y, color='black')
sns.scatterplot(x=df['X'], y=df['Y'], color='green')  # seaborn >= 0.12 requires keyword arguments
plt.show()

Here, the green points show the data points and the black line is our regression line which gives us the predicted points for different values of our independent variable.

Assumptions for Linear Regression

To make your model effective on a dataset and achieve good accuracy, the data should satisfy certain conditions; based on the behaviour of the data, we choose different machine learning algorithms to produce predictions.

  1. Linearity: There should be a linear relationship between the target and the independent features.
  2. Homoscedasticity: The error terms should have constant variance.
  3. Independence of features, i.e. mutually exclusive features (no multicollinearity).
  4. The mean of the residuals should be zero.
  5. The error terms should be uncorrelated with each other.
  6. The error terms should be normally distributed.
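Some of these assumptions can be checked numerically after fitting. For example, the mean of the residuals from a least-squares fit is zero by construction. A quick sketch with synthetic data (the generating line y = 3x + 5 and the noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
y = 3 * x + 5 + rng.normal(0, 1, size=x.size)  # linear signal + Gaussian noise

# Fit a line by least squares and compute the residuals
m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

# Assumption 4: mean of residuals is (numerically) zero
print(abs(residuals.mean()) < 1e-9)  # True
```

Plotting the residuals against x is a common next step: a random, constant-width band suggests the linearity and homoscedasticity assumptions hold.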

Thanks for reading this far. Do comment below to let me know about your views.


Pankaj kanyal

Data Science Enthusiast, Learning to walk in the field of Computer Science and Machine Learning.