Linear regression is one of the most well-known and well-understood algorithms in statistics and machine learning. Before diving into linear regression, let's understand what regression is.
What is Regression?
Regression falls under the supervised learning category. The main goal of regression is to construct an efficient model that predicts a dependent attribute from a set of attribute variables. A regression problem arises when the output variable is a real or continuous value, e.g. salary, score, or weight. Regression tries to draw the line that best fits the data gathered from several points.
Common Types Of Regression
The following are common types of regression.
- Linear Regression
- Polynomial Regression
- Support Vector Regression
- Decision Tree Regression
- Random Forest Regression
What is Linear Regression?
Linear regression is a regression technique in which the dependent variable has a linear relationship with the independent variable. The main goal of linear regression is to consider the given data points and plot the trend line that fits the data in the best way possible.
Let's say we have a dataset that contains information about the relationship between X and Y. A number of observations of X and Y are made and recorded; this is our training data. Our goal is to design a model that can predict the value of Y when a value of X is provided. Using the training data, a regression line is obtained that gives the minimum error. This linear equation is then applied to new data: if we give X as an input, the model should be able to predict Y with minimum error.
The simple linear regression model is represented by the following equation:
Y = b0 + b1 * X
where Y is the dependent variable, X is the independent variable, b1 is the slope of the line, and b0 is the intercept.
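The equation can be sketched directly in code; the coefficient values below are made up purely for illustration:

```python
# Simple linear regression: Y = b0 + b1 * X
b0 = 1.0   # intercept (hypothetical value)
b1 = 1.8   # slope (hypothetical value)

def predict(x):
    """Evaluate the linear model at a given X."""
    return b0 + b1 * x

print(predict(4))  # 8.2
```

Fitting a model means finding the values of b0 and b1 that minimize the error on the training data.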
Linear regression most often uses the mean squared error (MSE) to measure the error of the model.
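A minimal sketch of computing the MSE by hand, using made-up observed values and predictions:

```python
import numpy as np

# Hypothetical observed values and model predictions
y_true = np.array([2.0, 5.0, 6.0, 8.0])
y_pred = np.array([2.5, 4.5, 6.5, 7.5])

# MSE: mean of the squared differences between observed and predicted values
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.25
```

Squaring the differences penalizes large errors more heavily and keeps positive and negative errors from canceling out.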
How Does Linear Regression Work?
Let us consider that there is a connection between the number of hours a student studies and their marks; regression analysis can help us understand that connection. Regression analysis provides a relation that can be visualized as a graph and used to make predictions about the data.
The goal of regression analysis is to create a trend line based on the data. This then allows us to determine whether factors other than hours of study, such as level of stress, affect a student's marks. Before taking that into account, we need to look at these factors and attributes and determine whether there is a correlation between them. Linear regression can then be used to draw a trend line that confirms or denies the relationship between the attributes.
How do we determine the line that best fits the data?
A line is considered the best fit if the predicted values are approximately the same as the observed values. In simple words, the line for which the sum of the distances of the data points from the line is minimal is the best-fit line.
This line is also called the regression line, and the errors are known as residuals. A residual can be visualized as the vertical line from a data point to the regression line. The error, in this case, is an aggregate (such as the sum or mean of the squared residuals) of the distances of the points from the chosen line.
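The residuals for a candidate line can be sketched as follows; the data points and the line y = 2x are made up for illustration:

```python
import numpy as np

# Hypothetical data and a candidate line y = 2x (slope 2, intercept 0)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
y_line = 2.0 * x

# Residuals: vertical distances from each data point to the line
residuals = y - y_line
print(residuals)               # [0. 1. 0. 1.]
print(np.sum(residuals ** 2))  # 2.0 (sum of squared residuals)
```

A different slope or intercept would produce a different sum of squared residuals; the best-fit line is the one that minimizes it.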
After the model is built, we need to check the difference between the predicted values and the actual data; if it is small, the model is considered good. Below is a metric we can use to measure the error of the model.
R-Squared (R²) Score:
Total Sum of Squares (TSS): The measure of how a data set varies around a mean. The TSS tells us the variation in the dependent variable.
TSS = Σ (Y - mean(Y))²
Residual Sum of Squares (RSS): the sum of the squared differences between the actual Y and the predicted Y. The RSS tells us how much variation in the dependent variable is not explained by our model.
RSS = Σ (Y - f(X))²
where f(X) is the value of Y predicted by the model.
(TSS - RSS) measures the amount of variability in the response that is explained by performing the regression. The R² score is therefore
R² = (TSS - RSS) / TSS = 1 - RSS / TSS
and the closer it is to 1, the more of the variation the model explains. The R² score can be used to check the performance of any regression model.
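A minimal sketch of computing TSS, RSS, and the R² score by hand; the observed values and predictions are made up:

```python
import numpy as np

# Hypothetical observed values and model predictions
y = np.array([2.0, 5.0, 6.0, 8.0, 9.0, 12.0])
y_pred = np.array([2.5, 4.4, 6.3, 8.2, 10.1, 12.0])

tss = np.sum((y - np.mean(y)) ** 2)  # total variation around the mean
rss = np.sum((y - y_pred) ** 2)      # variation left unexplained by the model
r2 = 1 - rss / tss
print(r2)
```

This is the same quantity that `LinearRegression.score` in scikit-learn reports for a fitted model.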
A Simple Linear Regression Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5, 6]).reshape((-1, 1))
y = np.array([2, 5, 6, 8, 9, 12])

model = LinearRegression()
model.fit(x, y)           # fit the regression line to the training data

y_pred = model.predict(x)
r_sq = model.score(x, y)  # R² score of the fitted model
print('coefficient of determination:', r_sq)

plt.scatter(x, y)                 # observed data points
plt.plot(x, y_pred, color='red')  # fitted regression line
plt.show()
Linear regression project link: