Machine Learning 101

Part 4: Linear Regression

Bzubeda
4 min read · Dec 8, 2023

In the previous part — Part 3: Types of Machine Learning Models — we went through the different types of Machine Learning models with related examples.

From that part, we know that when a Machine Learning algorithm trains a model to predict a target/output from existing feature columns, it is called a Supervised Machine Learning algorithm.

Using an example, let’s understand what Linear Regression is and how it works.

Image Source — Cricket Sports

In simple words, Linear Regression is a Supervised Machine Learning algorithm that fits a mathematical function to the training data in order to predict a numerical target value.

Note: Mathematical function —

y = β0 + β1x1 + β2x2 + … + βnxn + ε

Image Source — Linear Regression Equation

Here, x1 … xn denote the independent variables (features), y denotes the dependent variable (target), β0 … βn are the coefficients learned during training, and ε is the error term.

Suppose we have a cricketer named Sachin, and we want to predict his performance in the current match and stadium. The target value is a numerical score between 1 and 10, where 1 means the worst performance and 10 means the best.

Here, the features on which the Machine Learning model bases its prediction are: the average runs Sachin made previously, the average number of wickets he took, and whether he made any partnerships during his batting innings in the same stadium.

Let’s say the Machine Learning model gave us a target prediction of 7 based on Sachin’s previous performance features (average runs — 70, average wickets — 2, made partnership — Yes). Looking at this result, we can conclude that Sachin’s performance in the current match will be fairly good. In this case, Linear Regression can be used to predict the numerical target value (the performance score). Being a Parametric Machine Learning algorithm (refer: Types of Machine Learning Algorithms), it makes some assumptions about the data that you must know before using it.
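The prediction above can be sketched with scikit-learn. The training data below is entirely made up for illustration — each row is a hypothetical past match with features [average runs, average wickets, made partnership (1/0)]:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical past matches: [average runs, average wickets, made partnership (1/0)]
X_train = np.array([
    [30, 0, 0],
    [45, 1, 0],
    [55, 1, 1],
    [70, 2, 1],
    [85, 3, 1],
])
# Hypothetical performance scores (1-10) for those matches
y_train = np.array([3.0, 4.5, 5.5, 7.0, 8.5])

model = LinearRegression()
model.fit(X_train, y_train)

# Sachin's features for the current match: 70 avg runs, 2 avg wickets, partnership = Yes
prediction = model.predict(np.array([[70, 2, 1]]))
print(round(prediction[0], 1))
```

With this toy data, the model returns a score in the neighborhood of 7, matching the example above.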

The following are the assumptions made by Linear Regression:

1) Linearity

Image Source — Linearity

In our current example, an increase in the runs or wickets features results in an increase in the overall performance score. This shows that the target performance score is correlated with the independent feature variables. When we plot the dependent target against the independent features, we get an approximately straight line, so there is a linear relationship between the features and the target.
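One quick way to eyeball this assumption is to compute the Pearson correlation between a feature and the target — values close to +1 or −1 suggest an approximately linear relationship. The numbers below are synthetic, for illustration only:

```python
import numpy as np

# Hypothetical feature (average runs) and target (performance score)
runs = np.array([30, 45, 55, 70, 85])
score = np.array([3.0, 4.5, 5.5, 7.0, 8.5])

# Pearson correlation close to +1 or -1 suggests an approximately linear relationship
r = np.corrcoef(runs, score)[0, 1]
print(round(r, 3))
```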

2) No Auto-correlation

This assumption means that the runs made by Sachin are not affected by or dependent on the runs made by any other cricketer, nor on the observations of other features such as wickets taken and partnerships made by him or any other cricketer. The error terms (predicted − actual value) must also be uncorrelated with each other.

Note: Observations are nothing but the values in the feature columns.
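The error-term part of this assumption is commonly checked with the Durbin–Watson statistic, where values near 2 suggest no autocorrelation. A sketch using statsmodels on made-up residuals:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Independent, identically distributed residuals (synthetic, for illustration)
rng = np.random.default_rng(42)
residuals = rng.normal(0, 1, size=200)

# Durbin-Watson statistic: values near 2 indicate no autocorrelation,
# values toward 0 or 4 indicate positive or negative autocorrelation
dw = durbin_watson(residuals)
print(round(dw, 2))
```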

3) Homoscedasticity

Image Source — Homoscedasticity

Variance — a measure of how spread out a set of values is from its mean (average).

Suppose, after the match, it was observed that Sachin performed better than expected, with a performance score of 8.5. Predictions are not always 100% accurate; they are only approximately accurate. When we calculate the difference between the actual and predicted values, we get error terms, or residuals.

Homoscedasticity assumes that when we plot the residuals against the predicted values in a scatter graph, they are evenly or uniformly spread out, so their variance stays consistent regardless of the values of the feature variables (runs, number of wickets, etc.).

4) Normality in Error terms

Image Source — Residual distribution

It assumes that the error terms follow a normal distribution. When we plot the distribution of the error terms (for example, as a histogram), it shows a bell-shaped curve centered at a mean of zero.
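A quick numerical check of this assumption is the Shapiro–Wilk test on the residuals; the residuals below are synthetic and normally distributed by construction:

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic residuals, normally distributed by construction
rng = np.random.default_rng(7)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Shapiro-Wilk: W statistic near 1 and a large p-value
# mean no evidence against normality
stat, pvalue = shapiro(residuals)
print(round(stat, 3))
```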

5) No Multi-collinearity

Here, it is assumed that the number of runs made by Sachin does not strongly depend on any other feature variable. When we plot the feature variables against each other, they must not show a linear relationship with one another.
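Multicollinearity is often quantified with the variance inflation factor (VIF); values near 1 indicate little multicollinearity, while values above roughly 5–10 are a warning sign. A sketch with statsmodels, using made-up independent features:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up features, generated independently of each other
rng = np.random.default_rng(1)
runs = rng.uniform(20, 90, size=100)
wickets = rng.uniform(0, 4, size=100)
partnerships = rng.integers(0, 2, size=100)

# Include an intercept column, then compute VIF for each real feature
X = np.column_stack([np.ones(100), runs, wickets, partnerships])
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print([round(v, 2) for v in vifs])  # values close to 1: little multicollinearity
```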

6) No Endogeneity

It assumes that the runs made, wickets taken, or partnerships made by Sachin must not be correlated with the error terms — that is, the features must have no linear relationship with the errors.

Conclusion:

Let’s say you want to predict a numerical target value, like the performance score in the example above. If all of the assumptions are met, it is safe to say that Linear Regression is a go-to Machine Learning algorithm.

Stay tuned: in the next part we will understand what Logistic Regression is and how it works. Please share your views, thoughts, and comments below, and feel free to ask any queries.
