Simple Linear Regression
Intuition | Mathematical Approach | In Python | Assumptions
In this blog, we will try to understand the very first algorithm that most people learn during their Data Science or Machine Learning journey: Linear Regression.
Topics Covered in this blog
- Intuition
- Mathematical Approach
- Simple Linear Regression in Python
- Assumptions required for Linear Regression
Intuition Behind Linear Regression
Linear Regression is a supervised machine learning algorithm used to predict a continuous output variable. It works on the concept of dependent and independent variables.
Let's think about this with a real-life example: demand, supply, and their effect on the cost of a product. If a product is in very high demand but has a limited supply, its cost rises in the market. Similarly, if demand falls, the cost drops as well.
As a second example, take ice cream sales and temperature. As the temperature rises, ice cream sales increase, and vice versa.
Wait! Did you just observe what is happening? I think you got it. Demand and supply directly affect the cost, and in the second example the temperature directly affects ice cream sales. So we can say that demand, supply, and temperature are the independent features, while cost and sales are the dependent features.
Mathematical Approach
Correlation Concept
If you already know the concept of correlation, you can jump straight to the next section. Correlation is a statistical measure used to describe the relationship between two variables.
The relationship can be described as a positive, negative, or neutral (zero) correlation.
In the case of linear regression, we can only make useful predictions when there is a relationship between X and Y, which means they must show either a positive or a negative correlation.
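The strength of a correlation is given by the correlation coefficient, a value between -1 and +1. Values near +1 or -1 indicate a strong positive or negative relationship, and values near 0 indicate little or no linear relationship.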
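As a minimal sketch, assuming the coefficient meant here is the Pearson correlation coefficient, r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)), it can be computed by hand in a few lines of Python. The function name pearson_r and the temperature and sales numbers are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (used in the denominator)
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data: temperature and ice cream sales
temperature = [20, 22, 25, 27, 30, 32, 35]
sales = [110, 125, 150, 160, 190, 205, 230]

r = pearson_r(temperature, sales)
print(round(r, 3))
```

A value close to +1, as this invented data should give, is the kind of strong positive correlation that makes a straight-line fit reasonable.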
Now that we are clear about the correlation coefficient and what it does, let’s move on to the equation that our model tries to fit to the data.
Simple Linear Regression
Simple Linear Regression is a method for predicting a quantitative response using a single independent feature.
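The equation for simple linear regression is the equation of a straight line:
y = m*x + b
where y is the dependent variable, x is the independent feature, m is the slope, and b is the intercept.
In simple linear regression, we estimate the slope (m) and the intercept (b) of the line from the data. Mathematically, the formula for the slope can be given as
m = (mean(x*y) - mean(x) * mean(y)) / (mean(x^2) - mean(x)^2)
To find the intercept of the regression line we use the formula
b = mean(y) - m * mean(x)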
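The slope and intercept calculations can be checked on a tiny dataset in plain Python. The four points below are invented for illustration and lie exactly on the line y = 2x + 1, so the calculation should recover a slope of 2 and an intercept of 1.

```python
def mean(values):
    return sum(values) / len(values)

# Invented points that lie exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

mean_x = mean(xs)                                  # 2.5
mean_y = mean(ys)                                  # 6.0
mean_xy = mean([x * y for x, y in zip(xs, ys)])    # 17.5
mean_x2 = mean([x * x for x in xs])                # 7.5

# slope m = (mean(xy) - mean(x) * mean(y)) / (mean(x^2) - mean(x)^2)
m = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
# intercept b = mean(y) - m * mean(x)
b = mean_y - m * mean_x

print(m, b)  # 2.0 1.0
```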
Simple Linear Regression in Python: Sample Code
Importing Libraries
import numpy as np
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline
Generating Random data as X and Y
X = [random.randint(0,100) for i in range(10)]
Y = [random.randint(0,1000) for i in range(10)]
Sort both lists X and Y so that the data follows an increasing, roughly linear trend
X.sort()
Y.sort()
Creating DataFrame Using pandas
df = pd.DataFrame({'X':X, 'Y':Y})
Calculating the X squared column and the XY column
df['X^2'] = df['X']**2
df['XY'] = df['X']*df['Y']
Getting all the values required to find the slope and the intercept for the regression line.
meany = np.mean(df['Y'])
meanx = np.mean(df['X'])
meanxy = np.mean(df['XY'])
meanx2 = np.mean(df['X^2'])
meanx_sq = meanx**2
Calculating Slope and Intercept
slope = (meanxy - (meanx*meany)) / (meanx2 - meanx_sq)
b = meany - slope*meanx
In my case, the slope was around 8.2 and the intercept was around -53. Your values will differ because the data is randomly generated.
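As a sanity check, the hand-computed values can be compared with NumPy's built-in least-squares fit, np.polyfit with degree 1, which returns the slope followed by the intercept. The sketch below regenerates data with the same recipe as the walkthrough. The fixed seed is an addition so the run is repeatable, and the values it produces will not match the 8.2 and -53 quoted above.

```python
import random
import numpy as np

random.seed(0)  # assumption: fixed seed, only so the check is repeatable

# Same recipe as the walkthrough: random values, sorted to get an upward trend
X = sorted(random.randint(0, 100) for _ in range(10))
Y = sorted(random.randint(0, 1000) for _ in range(10))
x = np.array(X, dtype=float)
y = np.array(Y, dtype=float)

# Slope and intercept from the mean-based formulas
slope = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
b = y.mean() - slope * x.mean()

# Degree-1 least-squares fit; coefficients come back highest power first
fit_slope, fit_intercept = np.polyfit(x, y, 1)

print(np.isclose(slope, fit_slope), np.isclose(b, fit_intercept))
```

If both comparisons print True, the manual calculation agrees with the library fit up to floating-point error.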
Generating the x and y coordinates for drawing the line
x = [i for i in range(1,100)]
y = [i*slope+b for i in range(1,100)]
Plotting the Line
plt.plot(x,y,color='black')
# Pass x and y as keywords; recent seaborn versions do not accept them positionally
sns.scatterplot(x=df['X'], y=df['Y'], color='green')
plt.show()
Here, the green points are the data points and the black line is our regression line, which gives the predicted value for each value of the independent variable.
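Once the slope and intercept are known, predicting is just a matter of evaluating the line at a new x. Here is a minimal sketch that uses the approximate values reported in this post (slope 8.2, intercept -53) purely as example numbers. The predict helper is a hypothetical name, not part of any library.

```python
def predict(x, slope, intercept):
    """Return the value on the regression line y = slope * x + intercept."""
    return slope * x + intercept

# Example numbers taken from the run described in this post; yours will differ
slope, intercept = 8.2, -53.0

for x_new in (10, 50, 90):
    print(x_new, predict(x_new, slope, intercept))
```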
Assumptions for Linear Regression
For the model to be effective on a dataset and give good accuracy, the data should satisfy certain conditions. Which machine learning algorithm we choose for making predictions depends on how our data behaves.
- Linearity: There should be a linear relationship between the target and the independent features.
- Homoscedasticity: The error term should have a constant variance.
- Independence of features: the independent features should not be highly correlated with each other (no multicollinearity).
- The mean of residuals should be zero.
- The error terms should be uncorrelated with each other.
- The error term should be normally distributed.
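A few of these assumptions can be eyeballed from the residuals (actual minus predicted values). The sketch below is a rough diagnostic under stated assumptions, not a formal statistical test. The synthetic data, the noise level, and the idea of comparing residual spread between the lower and upper half of x are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Synthetic data: a straight line plus constant-variance Gaussian noise
x = np.linspace(0, 100, 200)
y = 3.0 * x + 10.0 + rng.normal(0.0, 5.0, size=x.size)

# Fit the line and compute residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Mean of residuals: essentially zero for a least-squares fit with an intercept
print("mean residual:", residuals.mean())

# Rough homoscedasticity check: residual spread in the lower vs upper half of x
half = x.size // 2
print("std (low x): ", residuals[:half].std())
print("std (high x):", residuals[half:].std())

# Rough independence check: correlation between consecutive residuals
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print("lag-1 autocorrelation:", lag1)
```

Similar spreads in the two halves and a lag-1 autocorrelation near zero are consistent with constant variance and uncorrelated errors. A large gap between the two spreads, or a strong autocorrelation, would be a reason to look more closely.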
Thanks for reading this far. Do comment below to let me know about your views.