**Regression**

Chandler: Hi Joey!

Joey: Hi man... (burying his head in his monitor and working seriously)

Chandler: Are you ok?

Joey: Yeah, I am completely fine. I had a conversation with our vice president this morning and he gave me a task which I feel is impossible, so I am collecting relevant information to support my argument, but I am not getting anything.

Chandler: May I know what was the task given to you?

Joey: He asked me to come up with a formula that would predict the number of products that a person would buy, given his age, the number of clicks on our website and the number of products added to the cart. How can someone build a model for that?

Chandler: Actually Joey, the answer to that question lies in the domain of predictive analytics.

Joey: So you mean to say that it is possible?

Chandler: Yeah, it’s absolutely possible. We use a technique called Regression for this.

Joey: Oh… Can you explain more about it?

Chandler: Sure, let me explain it through a small story.

Case 1: A professor gave the table below to a student and asked him to find the relationship between the two variables, Celsius and Fahrenheit, assuming both are deterministic.

After visualizing the data through a scatter plot, the student figures out that the variables are linearly related.

The linear relation between Fahrenheit and Celsius is given by

Fahrenheit = Slope × Celsius + Intercept

From the graph, it is easy to calculate the slope and the intercept.

Take any two points (C1, F1) and (C2, F2) on the graph to calculate the slope:

Slope = (F2 − F1) / (C2 − C1)

The intercept of the line is given by

Intercept = F1 − Slope × C1

Finally, the student identifies the relation between Fahrenheit and Celsius as

Fahrenheit = 1.8 × Celsius + 32
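The two-point calculation can be sketched in a few lines of Python. This is only an illustration: the function names `line_through` and `to_fahrenheit` are made up for this example, and the two points used are the freezing and boiling points of water rather than rows from the professor's table.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# Two well-known (Celsius, Fahrenheit) pairs: water freezes and boils.
slope, intercept = line_through((0.0, 32.0), (100.0, 212.0))


def to_fahrenheit(celsius):
    """Predict Fahrenheit for any Celsius value using the fitted line."""
    return slope * celsius + intercept
```

Because both variables are deterministic, any two distinct points give the same slope and intercept.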

Joey: Okay, but what is the use of this function? Why did the professor ask him to find the relation between those variables in the first place?

Chandler: If we know the relation between those variables,

- We can predict the value of Fahrenheit for any Celsius value.
- We can also tell how Celsius and Fahrenheit are related.

The above variables are deterministic in nature, so it is easy to find the relationship between them through simple algebraic calculations. What if one or both of the variables are random variables? How do we find the relationship between them? (To know more about deterministic and random variables, please read the blog on the sampling distribution of sample means and the central limit theorem.)

Case 2: Suppose the same professor asked the student to conduct an experiment designed to measure the volume of one mole of helium gas at different temperatures. Assume that the temperature is deterministic and the volume of helium is a random variable. The student conducts the experiment and obtains the table below. He has measured the volume of helium twice for each temperature. He was not surprised to see different values of volume for the same temperature, as he was aware that some other factor might have influenced the experiment when he conducted it the second time.

Now, the student cannot use simple algebra to figure out the relationship between the temperature and the volume of helium, since the volume of helium is a random variable. But assume that he knows that there is a linear relationship between the volume of helium and its temperature.

Volume of helium = Slope × Temperature + Intercept

Joey: Then how do we find the slope and the intercept for the above function?

Chandler: And that’s where Regression comes in handy.

Let me tell you another case before explaining the steps involved in regression.

Case 3: The same professor asked the student to find the relation between the number of hours spent by the students before the exam and the marks obtained by the students, assuming both are random variables.

Since both are random variables it’s not possible to find the relation between the variables through simple algebra. But, we can use Regression to find it.

Let me introduce some basic terminology which is required to explain regression.

Independent variable: The independent variable is either deterministic or random in nature, and it does not depend on any other variable. If it is deterministic, it is manipulated in the experiment to observe the response; in the second case, for example, the temperature is manipulated to observe the volume of helium. A deterministic independent variable is also called a fixed regressor. Likewise, if the independent variable is a random variable, it is called a random regressor.

Dependent variable: As the name suggests, it depends on the independent variable(s). For example, the volume of helium is a dependent variable whose value depends on the temperature. Similarly, student marks are a dependent variable whose value depends on the number of hours spent by the student. Usually the researcher or analyst chooses the variable they want to predict as the dependent variable.

Parameters: The slope and the intercept are called the parameters of the regression model.

The general simple linear regression model is given by

$$Y = \beta_0 + \beta_1 X + \epsilon$$

X = Independent variable

Y = Dependent variable

β = Parameters

ϵ = Error

In the second and the third case, unlike the first, it is not possible to predict the exact value of the dependent variable from the independent variable, since some randomness is involved. The part of the dependent variable that the model leaves unexplained is called the error. β0 and β1 are the true parameter values of the above relation. Since we observe only sample data for X and Y, it is not possible to identify these true values in real life. The primary objective of the regression model is to find estimates for those parameters, which define the fitted line

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$$

where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the estimates of the parameters β0 and β1.

In simple terms, we can say that regression identifies the best line (estimates) for the data. For cases 2 and 3, we can draw infinitely many lines. Now you may have the following questions in mind.

- When do we call a particular line the best line in regression?
- What performance measure does it use to call a particular line the best line?

It uses a performance measure called the residual sum of squares (RSS), which is defined as the sum of the squared residuals over all n data points:

$$\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2$$

yi = Actual value of the dependent variable for the ith data point

f(xi)= Predicted value for the ith data point

The difference between the actual and the predicted value gives the residual.

Regression chooses the line which minimizes the residual sum of squares. The objective function of the regression is given by

$$\min_{\beta_0, \beta_1} \; \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$

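To find the pair of parameters which minimizes the RSS, we take the partial derivatives of the RSS with respect to β0 and β1 and equate them to zero. The resulting equations give a closed-form solution for the parameter estimates:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y.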

This method is called Ordinary Least Squares (OLS). OLS is a method to find the estimator for the parameters in a linear regression model with the objective of minimizing the residual sum of squares.
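A minimal sketch of OLS for one independent variable is shown below, using the standard closed-form estimates (slope = covariance of X and Y divided by variance of X, intercept = mean of Y minus slope times mean of X). The function names `ols_simple` and `rss` and the hours/marks numbers are made up for illustration; they are not data from the professor's experiment.

```python
import numpy as np


def ols_simple(x, y):
    """Closed-form OLS estimates (intercept, slope) for one regressor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope


def rss(x, y, intercept, slope):
    """Residual sum of squares of the line y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))


# Hypothetical sample: hours spent before the exam and marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [35, 45, 50, 62, 68]

b0, b1 = ols_simple(hours, marks)
best_rss = rss(hours, marks, b0, b1)
```

Any other line, for example one with a slightly different slope, gives a larger RSS on the same data, which is what "best line" means here.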

Joey: It means that if we find a line which minimizes the RSS, then that line is called the best-fitting line. Am I right?

Chandler: Not exactly. In addition to that, the OLS estimate has to be the Best Linear Unbiased Estimate (BLUE). "Best" in BLUE stands for minimum variance, so a BLUE estimate is unbiased and has the minimum variance among all linear unbiased estimates. The Gauss-Markov theorem states that the OLS estimate is BLUE if the following assumptions are satisfied. These assumptions are called the classical linear regression assumptions.

**Classical linear regression assumptions**

- The population regression function (PRF) has to be linear in its parameters (β0, β1). The population regression function describes the actual relationship between the dependent and the independent variables. This assumption is violated when the PRF itself is nonlinear in its parameters.
- The independent variables of the population regression function should be additive in nature.
- Realizations of X and Y (samples) from the process should be random. Usually, time series data doesn’t satisfy this assumption, since the value which the variable takes at any time instance *t* depends on its value at time instance *t-1*.
- The expected value of error is zero.

Zero conditional mean of error. It means that the expected value of the error at any point in the range of the independent variable should be zero.

Don’t confuse error with residual. Error is defined as the difference between the true value of the process (defined by the PRF) and the observed value (realization) of the process. Residual is the difference between the observed value and the value predicted by the model. The zero conditional mean assumption gets violated in the scenarios below:

- If an important variable which affects the dependent variable is missing from the model.
- If a relation exists between any one of the independent variables and the error.
- If there is measurement error in the independent variable.
- If reverse causality exists, i.e. a closed-loop relation exists between the dependent variable and any one of the independent variables.

Multicollinearity shouldn’t exist among the regressors. Multicollinearity is a phenomenon in which a relationship exists between some of the independent variables.

There should be constant variance in the error term, i.e., the variance of the error at any value of the independent variable should be the same. Technically we call it homoscedasticity. Mathematically we can say that

$$\mathrm{Var}(\epsilon \mid X) = \sigma^2$$

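Error should follow a normal distribution. (Strictly speaking, the Gauss-Markov theorem does not need this assumption for OLS to be BLUE; normality is what makes the t-tests and confidence intervals exact in small samples.)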

Joey: What if one or more of the assumptions are not satisfied? What will be the repercussions?

Chandler: Whenever we violate any of the linear regression assumptions, the regression coefficients produced by OLS will either be biased or have a high variance. Let me walk you through the assumptions one by one.

**Population regression function (PRF) parameters have to be linear in parameters.**

**Consequences if violated:** Violation of this assumption is serious. When the true mapping between the dependent and the independent variables is nonlinear and we try to fit a linear function to it, the result gives us a completely wrong picture. Violating this assumption produces serious errors, especially when we try to extrapolate using our assumed model.

**Population regression function independent variables should be additive in nature.**

**Consequences if violated:** Violating this assumption also has serious repercussions, because a non-additive relationship between the dependent and the independent variables produces unreasonable predictions, especially for data points beyond the range of the sample data.

**Realization from the process should be random.**

**Consequences if violated:** Usually, this assumption gets violated when the data is a time series. When this assumption gets violated, serial correlation occurs between the errors, i.e. there is room for improvement in the specified function.

**Zero Conditional mean of error**.

This assumption gets violated in the scenarios below.

- Important variable(s) are not included in the model: A variable which is actually present in the PRF is absent from the model defined by us. The consequence of not including such a variable depends on the correlation between the omitted variable and the other independent variables. If a correlation exists between the omitted variable and any one of the independent variables, it will bias the regression coefficients produced by OLS. If no correlation exists between the omitted variable and any of the independent variables (orthogonal), it will not produce any serious problem, since the effect of the omitted variable is absorbed into the intercept and the error term.
- Measurement error in the independent variable: The realization of an independent variable contains some error, for example due to imperfect measurement. The rule mentioned for the omitted variable applies to measurement error too, i.e. if the measurement error is correlated with any one of the independent variables, the estimator will be biased. If not, it doesn’t produce serious consequences.

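
The omitted-variable rule above can be checked with a small simulation. Everything in this sketch is an assumption made for illustration: the true model y = 1 + 2·x1 + 3·x2 + noise, the sample size, and the helper name `slope_on_x1_only`. The model we fit deliberately leaves x2 out. When x2 is correlated with x1, the estimated slope on x1 drifts to roughly 2 + 3 × 0.8 = 4.4; when x2 is unrelated to x1, the slope stays close to the true value 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2_correlated = 0.8 * x1 + rng.normal(scale=0.6, size=n)  # depends on x1
x2_orthogonal = rng.normal(size=n)                        # unrelated to x1


def slope_on_x1_only(x2):
    """Generate y from the assumed true model, then fit y on x1 alone."""
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1])  # x2 is omitted on purpose
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


slope_biased = slope_on_x1_only(x2_correlated)    # far from the true 2.0
slope_unbiased = slope_on_x1_only(x2_orthogonal)  # close to the true 2.0
```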
**Homoscedasticity**

**Consequences if violated**: The violation of homoscedasticity (called heteroscedasticity) makes the regression coefficients produced by OLS less reliable, because the data points across the range of the independent variable do not carry equally reliable information. Heteroscedasticity also makes it difficult to estimate the true standard error, which makes the confidence intervals of the regression coefficients unreliable (too wide or too narrow).

If this assumption fails (unequal variance across the levels of the independent variable, i.e. heteroscedasticity), then the estimates produced by OLS (Ordinary Least Squares) will no longer be minimum variance estimates.

If the variance of an estimate is higher than the variance of the best estimate (i.e. the minimum variance estimate), the t-statistic will be smaller and may make the coefficient look insignificant. A coefficient that appears insignificant may actually be significant once we obtain the best estimate.

The main reason OLS produces high-variance estimates for heteroscedastic data is that OLS gives equal weight to all data points. This problem can be solved by the Weighted Least Squares (WLS) estimate. WLS gives more weight to data points which are closely clustered around the regression line (low error variance) and much less weight to data points which are far away from it (high error variance). The weight is given by

$$w_i = \frac{1}{\sigma_i^2}$$

where $\sigma_i^2$ is the error variance of the ith data point.
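As a rough sketch of WLS, the code below uses the standard trick of multiplying every row of the data by the square root of its weight and then running ordinary least squares. The helper name `wls` and the simulated data are made up for illustration, and the sketch assumes the error standard deviation of each point is known (here it grows with x); in practice it usually has to be estimated.

```python
import numpy as np


def wls(X, y, weights):
    """Weighted least squares via OLS on rows scaled by sqrt(weight)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    sqrt_w = np.sqrt(np.asarray(weights, dtype=float))
    coef, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef


rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 200)
sigma = 0.5 * x  # assumed: error spread grows with x (heteroscedastic)
y = 2.0 + 3.0 * x + rng.normal(size=x.size) * sigma

X = np.column_stack([np.ones_like(x), x])   # intercept column + regressor
beta_wls = wls(X, y, 1.0 / sigma ** 2)      # weights = 1 / variance
beta_ols = wls(X, y, np.ones_like(x))       # equal weights reduce to OLS
```

With equal weights the function returns exactly the OLS estimates, which is a convenient sanity check.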

**Multicollinearity shouldn’t exist between regressors**

**Consequences if violated**: If correlation exists between the regressors, the variance of the regression coefficients will increase. Multicollinearity means that some of the regressors (independent variables) are highly correlated with each other. It makes the estimates highly unstable, and this instability increases their variance. It means that a small change in X produces a large change in the estimates of the coefficients.

Effects of Multicollinearity

- It will be difficult to find the correct predictors from the set of predictors.
- It will be difficult to find out precise effect of each predictor.

The intuition behind increase in variance due to Multicollinearity is explained below (this segment is optional, so please feel free to skip it)

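The least squares estimate in matrix form is given by

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

The variance is given by

$$\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$$

The variance of any single estimate is given by

$$\mathrm{Var}(\hat{\beta}_j) = \sigma^2 \left[(X^T X)^{-1}\right]_{jj}$$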

Multicollinearity means that some of the regressors (independent variables) are highly correlated with each other. If the regressors are perfectly correlated (the columns of the X matrix are not linearly independent), then the rank of matrix X is less than p+1 (where p is the number of regressors). So, the inverse of the $X^T X$ matrix doesn’t exist. But we already know that the closed-form equation of the regression estimate needs $(X^T X)^{-1}$.
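To make the rank argument concrete, the sketch below builds a tiny made-up design matrix in which one regressor is an exact multiple of another. The numbers are purely illustrative. `np.linalg.pinv` (a pseudo-inverse) is used here only to obtain one coefficient vector; a second, different vector is then shown to give exactly the same fitted values, which is what a non-unique estimate means.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1                                     # perfect collinearity
X = np.column_stack([np.ones_like(x1), x1, x2])   # p + 1 = 3 columns
y = 1.0 + 3.0 * x1

rank = np.linalg.matrix_rank(X)                   # fewer than 3 independent columns

# One solution, obtained through the pseudo-inverse.
beta_a = np.linalg.pinv(X) @ y
# Another solution: X @ [0, 2, -1] = 2*x1 - x2 = 0, so adding it changes nothing.
beta_b = beta_a + np.array([0.0, 2.0, -1.0])

fit_a = X @ beta_a
fit_b = X @ beta_b
```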

Even when multicollinearity exists between the regressors, we can find estimates for the regression by some other methods (pseudo-inverse, etc.). But those methods won’t produce unique estimates for the coefficients like OLS does. It means that the estimates of β are highly unstable. This instability increases the variance of the estimates. We can measure the increased variance of an estimate due to multicollinearity using the formula given below.

$$\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2} \cdot \mathrm{VIF}_j, \qquad \mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

Where,

VIF = Variance Inflation Factor.

$R_j^2$ tells how much of the variance of feature xj can be explained by the other features. If all the features are orthogonal to each other, then $R_j^2$ will be zero. If 90% of the variance of xj is explained by the other features, then the variance of βj will be inflated 10 times. As I mentioned before, the inflation of the estimate’s variance is caused by the instability of the estimates. If βj is highly unstable, then even a small change in X has a large impact on the solution. Usually, the stability of an estimate is measured by the condition number.

If the condition number is very large, then βj will be highly unstable. If the regressors are highly correlated with each other, then the columns of $X^T X$ are nearly linearly dependent; as a result, the smallest eigenvalue of $X^T X$ will be close to zero (exactly zero for perfectly correlated regressors). The usual way to handle a variable with a high inflation factor is to drop it from the model. Otherwise, it will increase the variance of the regression coefficient, which leads to a large confidence interval, which in turn leads to a high probability that the coefficient is judged insignificant.
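Both diagnostics, the variance inflation factor and the condition number, can be sketched with NumPy. The helper names and the simulated regressors are made up for illustration. Two assumptions are worth flagging: the VIF is computed as 1 / (1 − R²) from regressing each column on the others, and the condition number is taken as the square root of the ratio of the largest to the smallest eigenvalue of XᵀX, which is one common definition and may differ from the formula the original figure showed.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an OLS fit of y on the columns of X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def vif(X):
    """Variance inflation factor 1 / (1 - R_j^2) for every column of X."""
    X = np.asarray(X, dtype=float)
    return np.array([
        1.0 / (1.0 - r_squared(np.delete(X, j, axis=1), X[:, j]))
        for j in range(X.shape[1])
    ])


def condition_number(X):
    """sqrt(largest / smallest eigenvalue) of X^T X (assumed definition)."""
    X = np.asarray(X, dtype=float)
    eigvals = np.linalg.eigvalsh(X.T @ X)
    return float(np.sqrt(eigvals.max() / eigvals.min()))


rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)              # unrelated to a
c = a + 0.1 * rng.normal(size=n)    # nearly a copy of a

X_ok = np.column_stack([a, b])      # no multicollinearity
X_bad = np.column_stack([a, b, c])  # a and c are highly correlated

vif_ok, vif_bad = vif(X_ok), vif(X_bad)
cond_ok, cond_bad = condition_number(X_ok), condition_number(X_bad)
```

In this setup the VIFs of `a` and `c` in `X_bad` come out far above those in `X_ok`, while `b` stays near 1, and the condition number rises accordingly.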

**Error distribution should be normal**

**Consequences if violated:** If the error distribution is not normal, then some of the data points have more influence than others, which in turn makes the regression coefficients less reliable.

Joey: What do you mean by the term linear in linear regression?

Chandler: Let me give you some examples of linear and nonlinear regression to understand the term.

RSS is a convex function for linear regression, so we can find a closed-form equation for its minimum. A regression equation is nonlinear when it is nonlinear in the parameters, not in the variables considered. The RSS function in nonlinear regression may not be convex as it is in linear regression. It may have multiple local minima, so it is generally not possible to find a closed-form equation for the parameter values. Usually, numerical optimisation algorithms are applied to determine nonlinear regression parameters. But that’s far beyond the scope of this topic; let’s take up the discussion some other time.
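As an illustration of "linear in the parameters", consider a model such as y = β0 + β1·x + β2·x². It is curved in x, yet each β enters only as a multiplier, so ordinary least squares still applies. By contrast, a model like y = β0·exp(β1·x) is nonlinear in β1 and needs numerical optimisation. The sketch below fits the first kind of model; the coefficients 1, −2 and 0.5 and the noise-free data are made up for the example.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 25)
y = 1.0 - 2.0 * x + 0.5 * x ** 2   # made-up, noise-free quadratic data

# The design matrix has columns 1, x and x^2. The model is nonlinear in x
# but linear in the parameters, so a plain least-squares solve is enough.
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Here `coef` recovers the three coefficients used to generate the data.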

Joey: Will hypothesis testing be of any use in linear regression?

Chandler: It depends. If we know (from theory or by some other means) that a set of independent variable(s) influences our dependent variable, then our only job is to find the coefficient values by the OLS method, and we don’t need to do hypothesis testing in this kind of scenario.

But if we have a set of independent variable(s) and are not sure which of them influence our dependent variable, then we need hypothesis testing or interval estimation to decide which independent variables are important. If we use hypothesis testing to decide on the significance of the independent variables, then based on the p-values we can decide the set of independent variable(s) that can be used to model the dependent variable. The null and alternative hypotheses for the coefficient βj of each independent variable are

$$H_0: \beta_j = 0 \qquad H_1: \beta_j \neq 0$$

Here, the null hypothesis states that the independent variable has no influence on the dependent variable, and the alternative hypothesis states that it does.
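A minimal sketch of the test statistic behind those p-values: each coefficient is divided by its standard error to get a t-statistic, and the p-value is then read from a t distribution with n − k degrees of freedom (that last step is left out here to keep the example NumPy-only). The function name `ols_t_statistics` and the simulated data are made up; `x_real` truly drives y, `x_noise` does not.

```python
import numpy as np


def ols_t_statistics(X, y):
    """Return OLS coefficients, standard errors and t-statistics.

    An intercept column is added; index 0 of each result is the intercept.
    """
    X1 = np.column_stack([np.ones(len(y)), X])
    n, k = X1.shape
    xtx_inv = np.linalg.inv(X1.T @ X1)
    coef = xtx_inv @ X1.T @ y
    resid = y - X1 @ coef
    sigma2 = (resid @ resid) / (n - k)        # estimated error variance
    se = np.sqrt(sigma2 * np.diag(xtx_inv))   # standard error per coefficient
    return coef, se, coef / se


rng = np.random.default_rng(1)
n = 200
x_real = rng.normal(size=n)     # influences y
x_noise = rng.normal(size=n)    # has no influence on y
y = 1.0 + 2.0 * x_real + rng.normal(size=n)

coef, se, t_stats = ols_t_statistics(np.column_stack([x_real, x_noise]), y)
```

As a rule of thumb for large samples, an absolute t-statistic above roughly 2 corresponds to rejecting the null hypothesis at the 5% level.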

Joey: Okay Chandler, now let’s get down to business. Can you help me solve the problem given by the vice president?

Chandler: Sure Joey, we can solve it together.

So, our task here is to predict the number of products that a person would buy, given his age, the number of clicks on our website and the number of products added to the cart. Here our dependent variable is the number of products purchased by the customer, and all the rest are independent variables.

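So we will be finding the coefficients for the equation below:

No. of products purchased = β0 + β1 * Age of the customer + β2 * Number of clicks on our website + β3 * Number of products added to the cart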

After running an OLS (any statistical tool like R, Python or even Excel can be used) we get the following

From the first table we read the coefficients of the independent variables and their corresponding p-values. We find that the variables Age and clicks per session are insignificant, since their p-values are high.

So the final formula can be written as

**No. of products purchased = 2.37 + 0.0034 * Age of the customer + 0.0022 * Number of clicks in our website + 0.243 * Number of products added to the cart**
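The fitted formula translates directly into a small prediction function. The coefficients are the ones quoted above; the function name and the example customer are made up for illustration.

```python
def predict_products_purchased(age, clicks, cart_items):
    """Predicted number of products purchased, using the fitted coefficients."""
    return 2.37 + 0.0034 * age + 0.0022 * clicks + 0.243 * cart_items


# Hypothetical customer: 30 years old, 10 clicks, 2 products in the cart.
example_prediction = predict_products_purchased(age=30, clicks=10, cart_items=2)
```

For this hypothetical customer the prediction works out to 2.37 + 0.102 + 0.022 + 0.486 = 2.98 products.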

The above formula can be used for predicting the number of products purchased by a customer. The second table gives us the performance of the regression model; we will concentrate on R square for now. R square is a measure of how good the linear fit is. Its value is between 0 and 1, where a higher value indicates a better fit. Here the value of R square is 0.238, which is actually low: it means that the model explains only 23.8% of the variance in the number of products purchased. We might have ignored some variable which is crucial in the prediction of the dependent variable, thus resulting in a low R square. Also, among the three variables considered, the number of products added to the cart is the most important in deciding the number of products purchased: it has the highest coefficient value (0.243) and, unlike Age and clicks per session, it was not found insignificant. Keep in mind that raw coefficient values depend on the units of each variable, so the p-values are the safer guide.
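R square itself is simple to compute: it is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses a made-up set of actual and predicted values, not the data behind the tables above.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, the share of variance explained by the model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - rss / tss)


# Hypothetical actual and predicted values.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
example_r2 = r_squared(actual, predicted)
```

In this example RSS is 1 and TSS is 5, so R square is 0.8. Always predicting the mean gives 0, and perfect predictions give 1.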

*The author of this blog is Balaji P who is pursuing PhD in reinforcement learning at IIT Madras*