Regression: Linear Regression
In this blog, we will discuss our first machine learning model: regression.
Regression is a supervised learning technique for modeling the relationship between features (the independent variables in the data) and the target (the dependent variable in the data). It helps us understand how the value of the dependent variable changes with the independent variables, and it predicts continuous values.
For example, to predict the temperature, we take past data containing independent variables such as altitude, location, and the month of the year, along with the dependent variable (temperature), model the relationship between them, and then predict the temperature for a new set of independent variables.
Regression fits a line or a curve to the plotted dependent and independent variables, and we use this fitted line or curve to make predictions about new data.
Underfitting
- It is a condition in which the model is unable to capture the relationship between the given dependent and independent variables. It occurs when the model is too simple for the data or there is too little data to learn from.
Overfitting
- It is a condition in which the model tries to fit every data point of the training data so closely that it fails to perform well on unseen data, i.e. the test data.
LINEAR REGRESSION
Linear regression is the simplest and most common supervised learning model. As the name suggests, it is a regression model that finds a linear relationship between the dependent variable (target) and the independent variables (features).
It fits a straight line to the data points when the dependent and independent variables are plotted on a Cartesian plane.
Simple Linear Regression
Consider the case where there is only one independent variable (feature) and we need to predict a value (target). Mathematically, if we want to relate the feature and target linearly, we write f(x) = mx + c.
With a small change of notation, we can write f(x1) = (w1)(x1) + (w0), where x1 is the independent variable (feature), w1 is termed the weight of the feature, and w0 is the intercept, i.e. the value of f(x1) when the feature is zero.
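As a quick illustration, the relation above can be evaluated in plain Python. The weight and intercept values here are made-up numbers chosen only for this example:

```python
# simple linear model f(x1) = w1*x1 + w0
# w1 and w0 are hypothetical values chosen for illustration
w1 = 2.0   # weight of the feature
w0 = 5.0   # intercept (value of f when the feature is zero)

def f(x1):
    return w1 * x1 + w0

print(f(0))   # at x1 = 0 the prediction equals the intercept: 5.0
print(f(3))   # 2.0*3 + 5.0 = 11.0
```

Note that at x1 = 0 the prediction is exactly the intercept w0, which is why w0 is described that way above.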
Multiple Linear Regression
In this case there is more than one feature, and the model relates the target to them linearly as
f(x1, x2, x3, x4, …) = w0 + (w1)(x1) + (w2)(x2) + (w3)(x3) + …
here w1, w2, w3, … wn are the weights of the features x1, x2, x3, … xn, and w0 is the intercept.
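The multiple-feature form is just a dot product of the weight vector with the feature vector, plus the intercept. A minimal sketch, with made-up weights and feature values chosen only for illustration:

```python
import numpy as np

# hypothetical weights and features, chosen only for illustration
w0 = 1.0                          # intercept
w = np.array([0.5, -2.0, 3.0])    # w1, w2, w3
x = np.array([4.0, 1.0, 2.0])     # x1, x2, x3

# f(x1, x2, x3) = w0 + w1*x1 + w2*x2 + w3*x3
prediction = w0 + np.dot(w, x)
print(prediction)  # 1.0 + 2.0 - 2.0 + 6.0 = 7.0
```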
COST FUNCTION
As we know, different values of the weights give us different linear relations, so our task is to find the best-fit line, i.e. the one with the least error.
The cost function is used to optimize the coefficients or the weights (w) and give the measure of how our model is performing on data. For linear regression, we will use the Mean Squared Error cost function.
Residual: the vertical distance between the actual value y and the predicted value f(x).
If the data points are far from the regression line, the residuals are large and so is the error; if the data points are close to the regression line, the residuals are small and so is the error. This is how we choose a regression line.
We use the Mean Squared Error function to measure the accuracy of the linear relation the model has fit. As its name suggests, it is the mean of the squared residuals.
MSE = (sum of squared residuals) / N, where N is the total number of data points.
The line with the least Mean Squared Error is considered the best-fit line, and along it our model is most accurate.
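The MSE formula above can be written directly with NumPy. The actual and predicted values here are made-up numbers for illustration:

```python
import numpy as np

# hypothetical actual and predicted values, for illustration only
y_actual = np.array([3.0, 5.0, 7.0])
y_pred   = np.array([2.5, 5.0, 8.0])

# residual = actual value minus predicted value
residuals = y_actual - y_pred        # [0.5, 0.0, -1.0]

# MSE = (sum of squared residuals) / N
mse = np.mean(residuals ** 2)        # (0.25 + 0.0 + 1.0) / 3
print(mse)
```

A line whose predictions give a smaller value here fits the data better, which is exactly the criterion used to pick the best-fit line.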
LINEAR REGRESSION USING PYTHON
We will use the scikit-learn library, a freely available machine learning library that features various classification, regression, and clustering algorithms.
- Importing the library and linear regression model
from sklearn.linear_model import LinearRegression
- Importing tools for manipulating the dataset
#import pandas and numpy to use and manipulate the dataset in your
#program
import pandas as pd
import numpy as np

#this is a function that splits the data into two subsets, i.e.
#training data and test data
from sklearn.model_selection import train_test_split as tts

#importing the mean squared error function
from sklearn.metrics import mean_squared_error as mse
- For example, suppose we have a CSV file
data = pd.read_csv("<file_path>") #assuming the data is already cleaned

y = data["<target_column>"]
x = data[["<features_required>"]] #double brackets keep x two-dimensional,
#as scikit-learn expects

#splitting the data into train and test data
x_train, x_test, y_train, y_test = tts(x, y, test_size=0.2, random_state=42)
#the parameter 'test_size' defines the size of the test dataset;
#here it is 20 percent of the original
#the parameter 'random_state' seeds the shuffle so the same split is
#returned on every execution; without it, a different split is produced
#each time
- Training the model
#creating a Linear Regression model named 'model'
model = LinearRegression()

#training the model
model.fit(x_train, y_train)
- Predicting the target values
prediction = model.predict(x_test)
- Finding the accuracy
#finding the mean squared error between the correct and predicted
#values
error = mse(y_test, prediction)

#find the square root of the mean squared error to get a more
#concrete idea about the accuracy
r_error = np.sqrt(error)
- The root mean squared error is in the same units as the target, so from its value we can judge the accuracy of our model: the lower it is, the better the fit.
I hope you all now understand what linear regression is and how we measure the accuracy of the model.
And stay connected to know more about Machine Learning models.