Linear Regression — Icebreaker to Machine Learning Algorithms.

Rishi Kumar · Nerd For Tech · May 21, 2021

Linear Regression is the first step on the ladder of machine learning algorithms. It falls under supervised learning: we train the Linear Regression model on known data so that it can predict values for new data.

Introduction:

  • Francis Galton studied the relationship between parents and children (in particular, their heights) in the 1800s.
  • He observed a correlation between the two, and this study is where regression gets its name.

The goal is to find the best-fit line by minimizing the vertical distance, i.e. the error between the predicted and actual values.

Linear regression is used for finding a linear relationship between a target and one or more predictors. There are two types of linear regression: simple and multiple.

→ Input, Predictor or Independent Variable X

→ Output, Response or Dependent Variable Y.

To identify the best fit line:

  • It has to pass through the centroid, the point (mean of X, mean of Y).
  • It should have the least error.
  • Then we should find slope (m) and intercept (b).

Simple Linear Regression:

Simple linear regression is useful for finding a relationship between two continuous variables: one is the predictor or independent variable and the other is the response or dependent variable. The relationship between two variables is said to be deterministic if one variable can be exactly expressed by the other; for example, a temperature in degrees Celsius can be converted exactly to Fahrenheit. A statistical relationship, by contrast, does not determine one variable exactly from the other; the relationship between height and weight is an example.

The core idea is to obtain a line that best fits the data. The best-fit line is the one for which the total prediction error (over all data points) is as small as possible. Error is the vertical distance from a point to the regression line.

Figure 1: Simple Linear Regression

Real Time Example:

We have a dataset that contains information about the relationship between ‘number of hours studied’ and ‘marks obtained’. Many students have been observed, and their hours of study and grades recorded. This will be our training data. The goal is to design a model that can predict the marks obtained given the number of hours studied. Using the training data, a regression line is obtained that gives the minimum error. This linear equation is then used for any new data: if we give the number of hours studied by a student as input, our model should predict their mark with minimum error.

Simple Linear Regression equation: Y(pred) = m * x + b

where m = slope, x = predictor and b = intercept.

To derive Slope (m):

Figure 2: Steps for calculating slope (m) and intercept (b) for simple linear regression.
Figure 3: Intercept Calculation
Figure 4: Slope Calculation
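As a quick illustration of these steps, here is a minimal NumPy sketch that computes the slope and intercept using the closed-form formulas outlined in Figures 2 to 4; the hours/marks values used here are made up purely for illustration.

```python
import numpy as np

# Hypothetical training data: hours studied vs. marks obtained
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([40.0, 55.0, 65.0, 85.0, 92.0])

x_mean, y_mean = x.mean(), y.mean()

# Slope: m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: b = y_mean - m * x_mean (the line passes through the centroid)
b = y_mean - m * x_mean

print(f"Y(pred) = {m:.2f} * x + {b:.2f}")
```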

Exploring Slope (‘b1’ or ‘m’):

  • If slope > 0, then X (predictor) and y (target) have a positive relationship; that is, an increase in X increases y.
  • If slope < 0, then X (predictor) and y (target) have a negative relationship; that is, an increase in X decreases y.
  • If slope = 0, then X (predictor) and y (target) have no relationship.

Exploring y-intercept (‘b0’ or ‘c’):

  • If the model's data does not include x = 0, then a prediction based on b0 alone is meaningless. For example, take a dataset relating height (x) and weight (y): setting x = 0 (a height of zero) leaves only the b0 value in the equation, which is meaningless because in real life height and weight can never be zero. This happens when the model is used beyond the scope of its data.
  • If the model does include x = 0, then b0 is the average of all predicted values when x = 0. Even then, setting all predictor variables to zero is often impossible.
  • The b0 term guarantees that the residuals have mean zero. If there is no b0 term, the regression line is forced to pass through the origin, and both the regression coefficients and the predictions will be biased.

Multiple Linear Regression:

Multiple Linear Regression is useful for finding the relationship between n independent variables and one dependent variable. We need to find the optimized slopes and intercept, i.e. the values that give the lowest error between the actual and predicted values. To find these optimized values, gradient descent, the heart of the linear regression algorithm, is used.

Figure 5: Multiple Linear Regression formula.
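In its standard form, with n predictors, the equation is: Y(pred) = b0 + b1*x1 + b2*x2 + ... + bn*xn, where b0 is the intercept and b1 to bn are the slopes (coefficients) of the individual predictors.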

Gradient Descent:

Gradient Descent is a first-order iterative optimization algorithm for finding the minimum of a function. It works iteratively to find the optimized slope (m) and intercept (b), and at each step the cost function (error) should be reduced.
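A common choice of cost function for linear regression, and the one assumed in the sketch later in this section, is the mean squared error: Cost = (1/n) * Σ (y_actual - y_pred)², averaged over all n data points.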

Figure 6: Cost Function (Error)
  • Now we will derive the optimized slope and intercept using partial derivatives.
  • Partial differentiation is a method of differentiating a function with respect to one independent variable while treating the others as constants. It is represented as
Figure 7 : Partial Derivative Formula.
  • The gradient descent curve appears when we plot the cost function against the slope.
Figure 8: Gradient Descent.
  • Partial derivative equation w.r.t slope and intercept.
Figure 9: Partial Derivative
  • Example for calculating minimum error using Gradient Descent.
Figure 10: Example problem.
Figure 11: Formula to calculate the new intercept.
  • To calculate the new intercept value, the step size has to be subtracted from old_b.

→ Step_size = Partial Derivative * Learning Rate

Figure 12: Expanded Formula.
  • The learning rate determines how far to move, and the partial derivative determines in which direction to move.
  • A larger learning rate will overshoot (deflect more), so choosing a small value is better.
Figure 13: Steps in calculating optimized intercept.
Figure 14: Optimized Intercept Value.
  • We have to iterate this process until we reach the global minimum point, which has the lowest error; however, the error will not become exactly zero.
Figure 15: Program to find Gradient Descent.

Gradient Descent Code : https://github.com/Rishikumar04/Data-Science-Training/blob/main/Linear_Regression/01_Gradient%20Descent.ipynb
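For readers who prefer to see it inline, here is a minimal sketch of the gradient descent loop described above; the data, learning rate and number of iterations are illustrative assumptions, and the linked notebook contains the full worked version.

```python
import numpy as np

# Hypothetical data: hours studied vs. marks obtained
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([40.0, 55.0, 65.0, 85.0, 92.0])

m, b = 0.0, 0.0        # start the slope and intercept at zero
learning_rate = 0.01   # small value so the updates do not overshoot
n = len(x)

for _ in range(10000):
    y_pred = m * x + b
    error = y - y_pred
    # Partial derivatives of the MSE cost w.r.t. slope and intercept
    dm = -(2 / n) * np.sum(x * error)
    db = -(2 / n) * np.sum(error)
    # new value = old value - (partial derivative * learning rate)
    m = m - learning_rate * dm
    b = b - learning_rate * db

print(f"optimized slope = {m:.3f}, optimized intercept = {b:.3f}")
```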

Assumptions of Linear Regression:

  • Linear Relationship
  • No or little multicollinearity
  • Homoscedasticity
  • Normality in residuals

Linear Relationship:

→ There should be a relationship between X and y, whether positive or negative.

→ Linear Regression is sensitive to outliers.

Figure 16: Scatter plot.
  • Figure 16 shows a positive linear relationship between one of the independent variables (X) and the dependent variable (y).

Multicollinearity:

  • Multicollinearity means that there is a strong relationship between two independent variables.
  • If such a relationship exists, the coefficients will change and their variance will increase. So multicollinearity has to be removed from the dataset before feeding it to the model, for example by dropping one of the correlated predictors (a simple check is sketched below).
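A minimal sketch of such a check, using pandas and hypothetical predictor columns for a marks-prediction dataset:

```python
import pandas as pd

# Hypothetical predictors for a marks-prediction model
df = pd.DataFrame({
    "hours_studied":  [2, 4, 6, 8, 10],
    "practice_tests": [1, 2, 3, 4, 5],
    "hours_slept":    [9, 8, 7, 6, 7],
})

# Correlations close to +1 or -1 between two predictors indicate
# multicollinearity; one of the pair can then be dropped.
print(df.corr())
```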

Homoscedasticity:

Homo = Same, Scedasticity = Variance

  • The residuals (errors) should have the same variance.
  • The spread of the errors is the same across the data.
  • To check this, plot a scatter plot of the predicted values against their errors; the resulting plot should show the same variance of error throughout (see the sketch after Figure 17).
Figure 17: Homoscedasticity and Heteroscedasticity
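A minimal sketch of such a residual plot, assuming matplotlib and the toy hours/marks data used earlier:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([40.0, 55.0, 65.0, 85.0, 92.0])

m, b = np.polyfit(x, y, deg=1)   # fit the regression line
y_pred = m * x + b
residuals = y - y_pred

# An even band of points around zero suggests homoscedasticity;
# a funnel shape suggests heteroscedasticity.
plt.scatter(y_pred, residuals)
plt.axhline(0, color="red")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```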

Normality in Residuals:

  • Errors should follow a normal distribution.
Figure 18: KDE plot illustrating normality in residuals.
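A minimal sketch of this check, assuming seaborn for the KDE plot and the same toy fit as above:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([40.0, 55.0, 65.0, 85.0, 92.0])

m, b = np.polyfit(x, y, deg=1)
residuals = y - (m * x + b)

# The KDE curve should look roughly bell-shaped and centred on zero.
sns.kdeplot(residuals)
plt.xlabel("Residuals")
plt.show()
```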

Other solved examples for Linear Regression: https://github.com/Rishikumar04/Data-Science-Training/tree/main/Linear_Regression.

This is an educational post compiled from materials that helped me in my journey, such as Jose Portilla’s Udemy course and guidance from my mentor Mohamed Imran (https://www.linkedin.com/in/imohamedimran/).

Thanks for Reading.
