Linear Regression in Detail
In this blog, I have explained Linear Regression in detail. You will get a complete understanding of this algorithm, from the basics to the advanced level. I have covered the theoretical and mathematical concepts along with a practical implementation. After reading this blog, you will be confident enough to dive deeper into more complex algorithms. So, let's move further.
What is Linear Regression?
It is a supervised machine learning algorithm used for predicting continuous values. Continuous values are values that lie within a range and have infinitely many possibilities, e.g. weight, height, temperature. Linear regression models the linear relationship between the independent variables (x-axis) and the dependent variable (y-axis).
Types of Linear Regression
Simple Linear Regression: there is only a single independent variable. Ex: in our dataset, cgpa is the independent variable and we have to predict package (the dependent variable).
Multiple Linear Regression: more than one independent variable is present in the dataset. E.g. cgpa and studytime are two independent variables, and even more can exist (a short sketch contrasting the two follows below).
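To make the distinction concrete, here is a minimal sketch; the numbers below are made up, and only the column names cgpa, studytime, and package come from our examples. In practice, the only difference is how many feature columns we pass to the model.
import pandas as pd
from sklearn.linear_model import LinearRegression
# hypothetical toy data using the columns from this blog
toy_df = pd.DataFrame({
    'cgpa':      [6.5, 7.2, 8.1, 8.9, 9.3],
    'studytime': [2.0, 3.5, 4.0, 5.5, 6.0],
    'package':   [3.0, 4.1, 5.2, 6.8, 7.5],
})
# simple linear regression: a single independent variable
simple_lr = LinearRegression()
simple_lr.fit(toy_df[['cgpa']], toy_df['package'])
print(simple_lr.coef_)  # one slope
# multiple linear regression: more than one independent variable
multi_lr = LinearRegression()
multi_lr.fit(toy_df[['cgpa', 'studytime']], toy_df['package'])
print(multi_lr.coef_)   # one slope per feature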
A regression line (or curve) is the line that passes as close as possible to all the data points on the target-predictor graph, i.e. with the shortest vertical distance between the data points and the regression line.
This is perfectly linear data, but in the real world it is difficult to find perfectly linear data; we usually have only roughly linear data. You can see the image below.
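If you want to generate such roughly linear data yourself, here is a quick sketch using an entirely made-up linear trend plus random noise:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, size=x.shape)  # linear trend + noise
plt.scatter(x, y)
plt.title("approximately linear data")
plt.show()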
y = mx + b is the equation of the line, where:
y = dependent variable
m = slope
x = independent variable (data point)
b = intercept
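As a quick worked example (m and b below are made-up values for illustration, not fitted ones), the line turns an x value into a prediction like this:
# hypothetical slope and intercept
m = 0.55   # slope
b = -0.90  # intercept
x = 8.0    # an independent-variable value, e.g. a cgpa
y = m * x + b
print(y)   # 3.5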
Mathematical Intuition of Linear Regression
In linear regression, our main goal is to find the best-fit line. To do that, we have to reduce the distance between the predicted and actual values. Here we take only nine data points for ease of understanding, so the distances will be d1 (for the first data point), d2 (for the second data point), and so on up to d9. I can take the sum of these distances as:
E = d1 + d2 + d3 + d4 + d5 + d6 + d7 + d8 + d9
We have to minimize the value of E to find the best-fit line, the line that tries to pass as close as possible to all the data points so that our loss is minimum.
Each distance di is the gap between the actual value yi and the predicted value ŷi, so I can write the above equation as:
E = (y1 − ŷ1)² + (y2 − ŷ2)² + … + (y9 − ŷ9)² = Σ (yi − ŷi)²
Why are we squaring it?
Because some distances are above the line and some are below it. A distance below the line is negative, and when we add positive and negative values, they cancel out.
We could also take the modulus (absolute value) to make the negative terms positive, so why didn't we do that?
The reason is that the graph of the modulus function is not differentiable at zero, whereas the graph of the square function is differentiable everywhere.
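A tiny numerical sketch (with made-up residuals) shows the cancellation problem and why squaring fixes it:
import numpy as np
# hypothetical distances: some points above the line (+), some below (-)
d = np.array([1.5, -1.5, 2.0, -2.0])
print(d.sum())          # 0.0  -> raw distances cancel out
print(np.abs(d).sum())  # 7.0  -> modulus works, but |x| is not differentiable at 0
print((d ** 2).sum())   # 12.5 -> squaring works and is differentiable everywhere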
We can also write the equation in terms of the line itself. Since each prediction comes from the line, ŷi = m·xi + b, so:
E = Σ (yi − (m·xi + b))²
The equation of the line is y = mx + b. If we have a dataset with a single independent variable, cgpa, and we have to predict package (the dependent variable), then we can modify the equation as:
package = m * cgpa + b
If we know the values of m and b, we can easily calculate the package of a student.
So we need the values of m and b for which the error function (E) is minimum.
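For simple linear regression, setting the derivatives of E with respect to m and b to zero gives the well-known closed-form least-squares solution. Here is a minimal sketch of it; the cgpa and package numbers are made up for illustration:
import numpy as np
# hypothetical cgpa and package values
x = np.array([6.5, 7.2, 8.1, 8.9, 9.3])
y = np.array([3.0, 4.1, 5.2, 6.8, 7.5])
# m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
# b = y_mean - m * x_mean
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - m * x.mean()
print(m, b)  # the slope and intercept that minimize E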
Now let's calculate the values of m and b on a sample dataset using scikit-learn.
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split # for splitting the data into training and testing
from sklearn.linear_model import LinearRegression # LinearRegression model
# import dataset
df=pd.read_csv("placement.csv")
df.head()
# check rows and columns
df.shape
# check null values
df.isnull().sum()
#check duplicate values
df.duplicated().sum()
# check whether the data is linear using a scatterplot
plt.figure(figsize=(15,9))
plt.scatter(x=df['cgpa'],y=df['package'])
plt.xlabel("cgpa")
plt.ylabel('package')
plt.show()
X = df['cgpa']
y = df['package']
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(y_train.shape)
# we have a single feature and LinearRegression expects a 2-D array
X_train = np.array(X_train).reshape(-1, 1)
X_test = np.array(X_test).reshape(-1, 1)  # reshape the test set too, so we can predict on it later
lr=LinearRegression()
lr.fit(X_train,y_train) # fit the model using training data
plt.figure(figsize=(15,9))
plt.scatter(df['cgpa'],df['package'])
plt.plot(X_train,lr.predict(X_train),color='red')
plt.xlabel("cgpa")
plt.ylabel('package')
plt.show()
# slope value
print(lr.coef_)       # this is the m value
# intercept
print(lr.intercept_)  # this is the b value
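As an extra step (my addition, not required to obtain m and b), we can sanity-check the fitted line by predicting on the held-out test set and computing the R² score:
from sklearn.metrics import r2_score
y_pred = lr.predict(X_test)      # predictions on unseen data
print(r2_score(y_test, y_pred))  # closer to 1.0 means a better fit
# predict the package for a hypothetical student with a cgpa of 8.0
print(lr.predict(np.array([[8.0]])))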
In the next article, I will discuss the Assumptions of Linear Regression, so don't forget to check that blog. Have a nice day!