Understanding Linear Regression: Everything You Need to Know!

Kriti Yadav
7 min read · Feb 8, 2023


Supervised machine learning algorithms can be broadly divided into two groups: classification and regression algorithms. An algorithm is a regression algorithm if the output variable of the model is continuous, and a classification algorithm if the output is categorical.

Linear Regression is a statistical method that models a linear relationship between a dependent variable (output) and one or more independent variables (inputs). It is used to make predictions about the dependent variable based on the values of the independent variables. The goal is to find the best-fitting line (regression line) that minimizes the sum of the squared differences between the observed and predicted values. The line can be represented by an equation of the form

Y = β0 + β1X1 + β2X2 + … + βnXn + ε

where Y is the dependent variable, X1, X2, …, Xn are the independent variables, β0, β1, …, βn are the regression coefficients, and ε is the error term. Once the model is trained, it can be used to predict the dependent variable for new data points by plugging the values of the independent variables into the linear equation.
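To make this concrete, here is a minimal sketch of fitting and predicting with scikit-learn. The two-feature synthetic dataset and the coefficient values below are made up purely for illustration.

```python
# A minimal sketch: fitting a linear regression on synthetic data
# (the data and coefficients here are invented for illustration only).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))                              # two independent variables
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 1, 100)    # Y = b0 + b1*X1 + b2*X2 + noise

model = LinearRegression()
model.fit(X, y)

print("Intercept (b0):", model.intercept_)
print("Coefficients (b1, b2):", model.coef_)

# Predict the dependent variable for a new data point by plugging in
# the values of the independent variables.
X_new = np.array([[4.0, 7.0]])
print("Prediction:", model.predict(X_new))
```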


Linear Regression assumes that the relationship between the dependent variable and the independent variables is linear and has some limitations in modelling complex relationships. However, it is still widely used as a simple and effective tool for predictive modelling in a variety of applications.

Cost Function

A cost function is a mathematical representation of the difference between the predicted values and the actual values in a model. In linear regression, the cost function measures the difference between the predicted values (obtained from the regression line) and the actual target values. The goal is to minimize the cost function to obtain the best-fit line that accurately represents the relationship between the independent and dependent variables.

The most commonly used cost function for linear regression is the mean squared error (MSE) which calculates the average of the squares of the differences between the predicted and actual values. The best-fit line is determined by finding the values of the model’s parameters that minimize the MSE.

MSE = (1/n) Σ (yᵢ − ŷᵢ)², where yᵢ are the actual values, ŷᵢ are the predicted values, and n is the number of observations.

To minimize MSE, you need to find the values of the coefficients (slope and intercept) that minimize the sum of squared residuals (differences between the observed and predicted values). This can be done using analytical methods, such as the normal equation, or numerical optimization methods, such as gradient descent.
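As a rough illustration, the NumPy sketch below minimizes the MSE for a one-feature model with batch gradient descent and compares the result with the closed-form normal equation. The synthetic data, learning rate, and iteration count are illustrative choices, not recommendations.

```python
# A small NumPy sketch of minimizing MSE with batch gradient descent.
# The data, learning rate, and iteration count are illustrative, not tuned.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 4.0 + 2.5 * X[:, 0] + rng.normal(0, 1, 100)

Xb = np.c_[np.ones(len(X)), X]        # add a column of ones for the intercept
theta = np.zeros(2)                   # [intercept, slope]
lr = 0.01

for _ in range(5000):
    residuals = Xb @ theta - y                      # predicted minus actual
    gradient = 2 / len(y) * Xb.T @ residuals        # gradient of MSE w.r.t. theta
    theta -= lr * gradient

print("Gradient descent estimate:", theta)

# The normal equation gives the same minimizer in closed form:
theta_closed = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
print("Normal equation estimate:", theta_closed)
```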

Evaluation Metrics for Linear Regression

The strength of any linear regression model can be assessed using various evaluation metrics. The following are common metrics used to evaluate the performance of a linear regression model (a short code sketch computing them follows the list):

1. Mean Squared Error (MSE): measures the average squared difference between the predicted values and the actual values.

2. Mean Absolute Error (MAE): measures the average absolute difference between the predicted values and the actual values.

3. R-squared (R2): represents the proportion of the variance in the dependent variable that is explained by the independent variables. It ranges from 0 to 1, with 1 indicating a perfect fit.

4. Adjusted R-squared: adjusts R-squared for the number of independent variables in the model, providing a more nuanced view of the model’s performance.

5. Root Mean Squared Error (RMSE): this is the square root of the MSE, providing a more interpretable error measure in the same units as the dependent variable.
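The sketch below computes these metrics with scikit-learn and NumPy. The y_true and y_pred arrays are placeholder values standing in for the actual and predicted outputs of any fitted model, and the predictor count p used for adjusted R-squared is an assumed value.

```python
# A short sketch computing the evaluation metrics above; y_true and y_pred
# are placeholders for the actual and predicted values of a fitted model.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.5, 7.2, 9.1, 11.0])
y_pred = np.array([2.8, 5.9, 7.0, 9.5, 10.4])

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mse)

# Adjusted R-squared penalizes R-squared for the number of predictors p.
n, p = len(y_true), 2                 # p = 2 is an assumed predictor count
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE={mse:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}  AdjR2={adj_r2:.3f}")
```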

Assumptions of Linear Regression

The following are the assumptions for a linear regression model:

(1) Linearity: The relationship between the independent and dependent variables should be linear. However, this assumption is not always met in real-world data, and it can lead to incorrect predictions if the true relationship between the variables is non-linear. For Example:

Predicting Housing Prices: If the relationship between the square footage of a house and its price is non-linear, a simple linear regression model would not be appropriate. For example, the price of a house might increase dramatically with a small increase in square footage in a high-end housing market, but not in a more affordable housing market.
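One simple way to probe this assumption is to fit a straight line and inspect the residuals. In the sketch below, the synthetic housing data is generated with a quadratic term on purpose, so the residuals of a linear fit stay correlated with the squared feature instead of looking like noise.

```python
# One way to spot a violated linearity assumption: fit a straight line to
# data with a curved (here, quadratic) relationship and examine the residuals.
# The housing data is synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
sqft = rng.uniform(500, 4000, size=(200, 1))
price = 50_000 + 30 * sqft[:, 0] + 0.05 * sqft[:, 0] ** 2 + rng.normal(0, 20_000, 200)

lin = LinearRegression().fit(sqft, price)
residuals = price - lin.predict(sqft)

# If the relationship were truly linear, the residuals would be unstructured
# noise; here they track the squared term instead.
print("Correlation of residuals with sqft^2:",
      np.corrcoef(residuals, sqft[:, 0] ** 2)[0, 1])
```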

(2) Independence: The observations should be independent of each other. If this assumption is not met and there is some form of dependence between observations, the results of linear regression can be biased and unreliable. For Example:

Sales of ice cream with temperature: Consider a linear regression model that tries to predict the number of ice creams sold at a store based on the temperature on a given day. If we collect data over a period of time and observe a strong correlation between temperature and ice cream sales, we may assume that higher temperatures lead to higher ice cream sales. However, observations collected on nearby days, or only on weekends, are not independent of one another: on weekends, more people are off work and out of the house, so sales are higher regardless of temperature, and sales on adjacent days tend to move together. In this case, the apparent relationship between temperature and ice cream sales is confounded by the day of the week, and the results of a linear regression would be biased.
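For time-ordered data, one common (though not exhaustive) diagnostic is the Durbin-Watson statistic computed on the residuals. The sketch below uses statsmodels on made-up daily temperature and sales figures.

```python
# A sketch of one common independence check for time-ordered data:
# the Durbin-Watson statistic on the residuals (values near 2 suggest little
# autocorrelation; values near 0 or 4 suggest dependence between observations).
# The daily sales data below is synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
temperature = rng.uniform(15, 35, size=(120, 1))
sales = 20 + 5 * temperature[:, 0] + rng.normal(0, 10, 120)

model = LinearRegression().fit(temperature, sales)
residuals = sales - model.predict(temperature)

print("Durbin-Watson statistic:", durbin_watson(residuals))
```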

(3) Homoscedasticity: Homoscedasticity refers to the property of having equal variances of the error terms across all levels of the independent variable in a linear regression model. When homoscedasticity is not present, it is referred to as heteroscedasticity. For Example:

Predicting Housing Prices: Imagine we’re trying to predict housing prices based on square footage. In a homoscedastic model, the error in the prediction would be consistent for all sizes of houses, regardless of square footage. This would mean that the variance of the error terms would be the same for small and large houses.

However, in a heteroscedastic model, the variance of the error terms would be unequal. For example, the variance of the error terms for large houses may be much higher than for small houses. This means that our model would be less reliable for making predictions about large houses, as the error in our predictions would be much higher. Heteroscedasticity can result in biased and inefficient parameter estimates, making the linear regression model less useful for making accurate predictions. Hence, it is important to check for homoscedasticity in a linear regression model and take steps to correct it if necessary.
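A common way to test this assumption is the Breusch-Pagan test from statsmodels. In the sketch below, the synthetic housing data is constructed so that the error variance grows with square footage, which the test should flag with a small p-value.

```python
# A sketch of a homoscedasticity check using the Breusch-Pagan test;
# a small p-value is evidence of heteroscedasticity.
# The housing data below is synthetic, for illustration only.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
sqft = rng.uniform(500, 4000, 300)
# Error variance grows with square footage -> heteroscedastic by construction.
price = 50_000 + 120 * sqft + rng.normal(0, 0.05 * sqft)

X = sm.add_constant(sqft)                 # design matrix with an intercept column
ols = sm.OLS(price, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)
```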

(4) Normality: The normality of errors is a key assumption in linear regression, as it allows us to make statistical inferences about our model. If the errors are not normally distributed, it can negatively impact the validity of our results. Here are two examples to illustrate this point:

(1) Confidence Intervals: In linear regression, we often use confidence intervals to quantify the uncertainty of our model’s predictions. If the errors are not normally distributed, the confidence intervals calculated from the regression may not be accurate, leading to incorrect inferences about the uncertainty of our predictions.

(2) Hypothesis Tests: Another important use of linear regression is to test hypotheses about the relationship between the independent and dependent variables. If the errors are not normally distributed, it can affect the validity of hypothesis tests, such as the t-test, leading to incorrect decisions about the significance of the relationship between variables.
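A quick, rough check of this assumption is the Shapiro-Wilk test on the residuals (a Q-Q plot is another option). The sketch below deliberately uses skewed, non-normal errors in the synthetic data, so the test should report a small p-value.

```python
# A sketch of checking the normality of residuals with the Shapiro-Wilk test
# from SciPy; a small p-value suggests the residuals are not normally distributed.
# The data is synthetic, for illustration only.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.exponential(2.0, 200)   # skewed, non-normal errors

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)   # expect a small value here
```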

(5) No multicollinearity: The independent variables should not be highly correlated with each other. Multicollinearity is a drawback in linear regression because it can lead to unreliable and unstable estimates of the regression coefficients. Consider the following example:

Suppose we have a linear regression model to predict the price of a house based on two predictor variables: the square footage of the house (sqft) and the number of rooms (rooms). However, sqft and rooms are highly correlated, since larger houses typically have more rooms. In this case, we would expect to see a high degree of multicollinearity between the two predictor variables.

If we fit a linear regression model with sqft and rooms as the predictors, the coefficient estimates for sqft and rooms would likely be very sensitive to the exact data that we use to fit the model. This means that if we use a different dataset, or if we make small changes to the original dataset, the coefficient estimates for sqft and rooms could change dramatically.

As a result, it would be difficult to trust the results of the regression analysis. For example, the coefficient for sqft might be positive in one analysis, but negative in another, depending on the data used. This makes it difficult to interpret the effect of sqft on the price of a house with any degree of confidence.
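Multicollinearity can be quantified with the variance inflation factor (VIF) from statsmodels. In the sketch below, the synthetic rooms variable is constructed to track sqft closely, producing VIF values far above the rough rule-of-thumb threshold of 5 to 10.

```python
# A sketch of quantifying multicollinearity with the variance inflation
# factor (VIF); values well above ~5-10 are a common warning sign.
# The sqft / rooms data below is synthetic, for illustration only.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
sqft = rng.uniform(500, 4000, 300)
rooms = sqft / 400 + rng.normal(0, 0.5, 300)      # strongly tied to sqft

X = pd.DataFrame({"sqft": sqft, "rooms": rooms})
X = sm.add_constant(X)                            # add an intercept column

for i, col in enumerate(X.columns):
    if col != "const":
        print(col, "VIF:", variance_inflation_factor(X.values, i))
```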

These assumptions help ensure that the results of the linear regression model are valid and reliable. Violations of these assumptions can lead to incorrect conclusions and invalid inferences.

Happy Learning!!

About the Author: I am Kriti Yadav, a Data Scientist. My current work focuses on computer vision, deep learning, natural language processing, and machine learning. Please reach out to me via my LinkedIn profile if you have any questions.
