Linear Regression pt. 1: Linear Regression and its Assumptions

Anwita Ghosh
4 min read · May 21, 2023


Suppose a company is setting up a new analytics department and wants to staff it with data analysts. To decide how much to pay these new analysts, the company considers some questions:

  1. How much are analysts in other companies earning per month?
  2. What are the factors affecting their monthly pay?
  3. How are these factors related to monthly pay, and how strong is this relationship?
  4. How accurately would these factors predict pay? — And so on.

Suppose the company would like to use the data it has about analysts in other firms to predict how much it would need to pay its own analysts. A simple way to do this would be through Linear Regression.

Linear Regression tries to predict the values of a continuous numeric outcome or response variable (like salary, sales or price) based on the values of one or more predictors or independent variables, while assuming a linear relationship between the response and predictor(s). The goal is to estimate the response based on a straight line that passes as close to the data as possible — i.e. the difference between the actual and the predicted values of the response is as small as possible.

That is, if we’re trying to predict a response (say Y), based on a predictor, X, Linear Regression would try to fit a line (like the green one) to represent the relation between the two variables (as shown by the red stars).

[Figure: Linear Regression, fitting a line to the data]

That is, we’re assuming the relation between X and Y is of the form:

Y = β₀ + β₁X + ε,

where β₀ is the intercept, β₁ is the slope, and ε is a random error term capturing what the line cannot explain.
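To make the fitting step concrete, here is a minimal sketch in Python (NumPy only; the salary numbers are made up for illustration) that estimates β₀ and β₁ by ordinary least squares, i.e. by choosing the line that minimizes the sum of squared differences between the actual and predicted values of Y:

```python
import numpy as np

# Hypothetical data: years of experience (X) vs. monthly pay (Y, in thousands)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
Y = np.array([3.1, 3.9, 5.2, 5.8, 7.1, 7.9, 9.2, 9.8])

# Closed-form ordinary least squares estimates for Y = b0 + b1*X + error
x_bar, y_bar = X.mean(), Y.mean()
b1 = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

Y_hat = b0 + b1 * X      # predictions on the fitted line
residuals = Y - Y_hat    # the part of Y the line cannot explain

print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

np.polyfit(X, Y, 1) would return the same slope and intercept, since it solves the same least-squares problem.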

Linear regression can be of two kinds, depending on the number of predictors used to estimate the response. When we use a single predictor for estimation, we call the process simple linear regression, and when more than one predictor affects the outcome, we call it multiple linear regression.

Most differences between the two involve the number of predictors, what the equation of the line looks like, and the mathematical estimation of the model’s parameters if we are doing it by hand. As for getting software to run a linear regression for us, the code for simple and multiple linear regression looks similar (or is even almost identical), as the sketch below shows.
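As an illustration, here is a hedged sketch with scikit-learn on hypothetical experience/education/pay data (the column names and numbers are invented for this example). Moving from simple to multiple regression only changes the width of the predictor matrix:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: experience (years), education (years), monthly pay (thousands)
experience = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
education = np.array([12, 16, 12, 16, 18, 16, 18, 21], dtype=float)
pay = np.array([3.1, 3.9, 5.2, 5.8, 7.1, 7.9, 9.2, 9.8])

# Simple linear regression: one predictor (a single column)
simple = LinearRegression().fit(experience.reshape(-1, 1), pay)

# Multiple linear regression: two predictors -- same API, wider X
X_multi = np.column_stack([experience, education])
multiple = LinearRegression().fit(X_multi, pay)

print("simple:  ", simple.intercept_, simple.coef_)
print("multiple:", multiple.intercept_, multiple.coef_)
```

Everything else, fit, predict, coef_ and intercept_, works identically in both cases.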

Assumptions of Linear Regression:

  1. The response, Y, has a linear relationship with the predictor(s), X, where X can either stand by itself, as in the case of simple linear regression, or be a vector of length ‘p’, i.e. X = (X₁, X₂, … , Xₚ), when we have p > 1 predictors (i.e. multiple linear regression).
  2. No multicollinearity in the data:
    Multicollinearity refers to the problem where the predictors (X) are correlated with each other.
    While some correlation is a feature of almost all real-life datasets, multicollinearity implies that predictors which vary together (are highly correlated) add more or less the same information to the model.
    In other words, if two predictors are strongly correlated, keeping the second column adds more to model complexity than it adds information about the outcome (response variable). A common diagnostic is the variance inflation factor (VIF), shown in the sketch after this list.
  3. Homoscedasticity of the Residuals:
    Linear Regression assumes the ‘spread’ (variance) of the residuals is even across all values of the predictors, i.e. the errors do not fan out as the predictions grow.
  4. The residuals are normally distributed, which makes it easier to conduct statistical tests and construct confidence intervals for the model’s coefficients if we need to.
  5. No endogeneity in the data:
    The problem of endogeneity occurs when one or more predictors are correlated with the residuals (in other words, the residuals are not independent of the predictors). This means that more of the variability in the response could have been explained by the predictors, but it ended up lumped into the residuals instead.
    Linear regression assumes that any variation that can be explained by the predictors is explained by the predictors, and the residuals cover only what the predictors cannot capture.
    Endogeneity in regression models can be handled with the help of instrumental variables, in a method known as instrumental variable regression, but more on that later.
  6. The observations are independent of each other:
    That is, the values of the variables in a given row are not affected by the values in the rows above and below it.
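Several of these assumptions can be checked directly from a fitted model. Below is a hedged sketch using statsmodels and SciPy on the same hypothetical data as in the earlier sketches; the thresholds in the comments are common rules of thumb, not hard cut-offs.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from scipy.stats import shapiro

# Hypothetical data, as in the earlier sketches
experience = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
education = np.array([12, 16, 12, 16, 18, 16, 18, 21], dtype=float)
pay = np.array([3.1, 3.9, 5.2, 5.8, 7.1, 7.9, 9.2, 9.8])

X = sm.add_constant(np.column_stack([experience, education]))
model = sm.OLS(pay, X).fit()
resid = model.resid

# Assumption 1 -- linearity: usually assessed visually, e.g. by plotting
# residuals against fitted values and looking for curvature.

# Assumption 2 -- multicollinearity: VIFs above roughly 5-10 are a red flag
for i, name in enumerate(["experience", "education"], start=1):
    print(f"VIF({name}) = {variance_inflation_factor(X, i):.2f}")

# Assumption 3 -- homoscedasticity: Breusch-Pagan test
# (a small p-value suggests the residual variance is not constant)
_, bp_pvalue, _, _ = het_breuschpagan(resid, X)
print(f"Breusch-Pagan p-value = {bp_pvalue:.3f}")

# Assumption 4 -- normality of residuals: Shapiro-Wilk test
# (a small p-value suggests the residuals are not normal)
print(f"Shapiro-Wilk p-value = {shapiro(resid).pvalue:.3f}")

# Assumption 5 -- endogeneity: cannot be tested from OLS residuals alone,
# since they are orthogonal to the included predictors by construction;
# it is usually addressed with instrumental variables instead.

# Assumption 6 -- independence: Durbin-Watson statistic
# (values near 2 suggest no autocorrelation between consecutive residuals)
print(f"Durbin-Watson = {durbin_watson(resid):.2f}")
```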

Now, we come to the Math behind Linear Regression, which I have covered in this post.
