# A Deeper Dive Into Linear Regression!

Do we really know what a Linear Regression is? Let’s find out.

**Linear Regression:** a linear approximation of the causal relationship between two or more variables.

What does this definition mean?

Let us try to answer this question first: does **year of birth** affect our **present age in years**? Of course it does!

The greater the **year of birth**, the smaller our **present age in years**; and the smaller the **year of birth**, the greater the **present age in years**.

Hence, there is a causal relationship between **year of birth** and **present age in years**.

**INDEPENDENT VARIABLE:** The variable(s), feature(s) or properties that cause or give meaning to another variable. Its value is not determined by the other variables in the model; hence the name ‘independent’.

**DEPENDENT VARIABLE:** The variable, feature or property that is caused by a variable or set of variables. Its value changes with the respective independent variables that it is ‘dependent’ upon.

**RELATION BETWEEN DEPENDENT AND INDEPENDENT VARIABLES:**

Dependent variable = f {Independent variable(s)}

Note: Here, ‘year of birth’ is the independent variable and ‘present age in years’ is the dependent variable.

If we denote the dependent variable as **y** and the independent variable as **x**, we can write y as a function of x. Because only one independent variable x is present, this is a **UNIVARIATE LINEAR REGRESSION**:

y = f(x)

If there are more independent variables x1, x2, x3, … that affect the dependent variable y, the regression is called **MULTIVARIATE LINEAR REGRESSION** (more precisely, multiple linear regression) and the function can be re-written as:

y = f(x1, x2, x3, …)

Note: There can be only one dependent variable y and any number of independent variables x1, x2, x3 …

# In a nutshell, the aim of a Linear Regression model is to predict the dependent variable y when the independent variable(s) are given as input.

Let’s jump back to the example we talked about previously. The independent variable is ‘**year of birth**’ and the dependent variable is ‘**present age in years**’. Once we give a random ‘**year of birth**’ as input, the model should be able to predict ‘**present age in years**’.

x = year of birth

y = present age in years

Since there is only one independent variable, this is univariate regression. The relation can be written as :

present age in years = f {year of birth}

A simple univariate linear regression model, with error term ‘E’ and constants ‘a’ and ‘b’, looks like this:

y = ax + b + E

We assume the error term E has zero mean, so for prediction the model reduces to

y = ax + b

Now, if we provide a data set of birth years and the corresponding ages, then with a number of iterations of gradient descent, the computer will find optimal values for the coefficients ‘a’ and ‘b’.

So, the model becomes

y = (-1) * x + (2019)

Or in other words,

present age in years = 2019 - year of birth

So, if I input ‘**year of birth**’ as 2001, ‘**present age in years**’ will be calculated as 2019 minus 2001, i.e. the predicted value will be **18**.
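The fitting described above can be sketched with plain NumPy gradient descent. The birth-year data below is made up for illustration, and centring x before the descent is an extra step (not part of the original description) that keeps a simple learning rate stable:

```python
import numpy as np

# Hypothetical training data: year of birth and age in 2019.
x = np.array([1990.0, 1995.0, 2000.0, 2005.0, 2010.0])
y = 2019 - x  # present age in years

# Centre x so gradient descent converges with a simple learning rate.
x_mean = x.mean()
xc = x - x_mean

a, b = 0.0, 0.0  # coefficients of y = a * xc + b
lr = 0.01        # learning rate
for _ in range(5000):
    err = a * xc + b - y
    a -= lr * 2 * np.mean(err * xc)  # gradient of mean squared error w.r.t. a
    b -= lr * 2 * np.mean(err)       # gradient w.r.t. b

# Undo the centring: y = a*x + (b - a*x_mean)
intercept = b - a * x_mean
print(round(a, 2), round(intercept, 1))  # -1.0 2019.0
```

The descent recovers a ≈ -1 and an intercept of ≈ 2019, matching the model stated above.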

Note: The above model is effective for finding ‘present age in years’ only in the year 2019. If we want it to be effective for all the years to come, what extra information do you think the computer needs to know? Yes, it is the current year! To incorporate it, we can make this a multivariate regression with ‘present_year’ as a new feature. So, the resultant model will be:

present_age_in_years = (1) * present_year + (-1) * year_of_birth

# STATISTICAL SCORES AND COMPARISON OF OUR MODEL

If you are more interested in the statistical details of your model, **statsmodels.api** is what you are looking for.

**statsmodels.api** has a beautifully arranged *summary table* that provides a set of undeniably helpful information about our model.

**1. T-STATISTIC**

It is reported for each coefficient in our hypothesis. The t-statistic is the coefficient estimate divided by its standard error.

The significance of the t-statistic is that, if our regression is based on a sample of 30 or more observations, the coefficient is significant with >95% confidence when its **t-statistic** is less than -2 or greater than 2 (i.e. |t| > 2).
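With made-up numbers for a coefficient estimate and its standard error, the ratio and the |t| > 2 rule look like this:

```python
# Hypothetical coefficient estimate and its standard error.
coef = 2.5
std_err = 0.8

t_stat = coef / std_err
print(t_stat)           # 3.125
# |t| > 2 -> significant at roughly the 95% level (for n >= 30)
print(abs(t_stat) > 2)  # True
```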

**2. P > | t |**

It represents the P value from hypothesis testing. If the **P-VALUE < 0.05**, the predicting variable is **significant**.

**3. DETERMINANTS OF A GOOD REGRESSION — ANOVA**

**i. Sum of Squares Total (SST):** the sum of squared differences between each observation and the mean, *i.e. the dispersion of the observations around the mean (similar to variance).*

**ii. Sum of Squares Regression (SSR):** the sum of squared differences between each predicted value and the mean of the dependent variable, *i.e. it describes how well the line fits our data.*

If SSR = SST, the model captures all the observed variability and is perfect!

**iii. Sum of Squares Error (SSE):** the sum of squared differences between the observed and predicted values. *The smaller the error, the better our model.*

**RELATION BETWEEN THE THREE:** for a fixed SST, the more we decrease SSE, the more SSR increases, and vice versa.

SST = SSR + SSE
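The decomposition can be checked numerically on a toy data set; `np.polyfit` stands in here for any ordinary least-squares fit (the identity holds for least-squares fits that include an intercept):

```python
import numpy as np

# Toy observations and an OLS line fitted with np.polyfit.
x = np.arange(5.0)
y = np.array([3.0, 5.0, 4.0, 8.0, 10.0])
slope, const = np.polyfit(x, y, 1)
y_hat = slope * x + const

sst = np.sum((y - y.mean()) ** 2)      # dispersion around the mean
ssr = np.sum((y_hat - y.mean()) ** 2)  # variability the line explains
sse = np.sum((y - y_hat) ** 2)         # variability left unexplained

print(np.isclose(sst, ssr + sse))      # True
```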

**4. R-SQUARED VALUE**

It is an important indicator to say how good our regression is.

R-squared measures how much of the data’s variability our model explains (how well it fits the data). It lies between 0 and 1, but there is no rule of thumb that a fixed R-squared value is a good score!

*For example, in chemistry an R-squared of approximately 0.7 is considered excellent, while in social science 0.4 is considered to be fantastic.*

An R-squared value of 1 implies that the model explains the entire variability, whereas an R-squared value of 0 implies that it explains none of the variability.

**5. ADJUSTED R-SQUARED VALUE**

It is used for comparing the performance of different models.

Adj. R-Squared Value < R-Squared Value

Because it penalizes the excessive use of variables, it is good for comparing models.

Conditions for using the Adjusted R-Squared Value:

i. The same dependent variable y should be predicted.

ii. Same data set should be used.
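The usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), makes the penalty visible; the R-squared value and sample sizes below are invented for illustration:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw R-squared, but more variables -> lower adjusted score.
print(round(adjusted_r2(0.80, 50, 2), 4))   # 0.7915
print(round(adjusted_r2(0.80, 50, 10), 4))  # 0.7487
```

Note how the model with 10 variables scores lower than the one with 2, even though both have the same raw R-squared.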

**6. COMPARISON : GOOD MODEL vs BAD MODEL**

- Adjusted R-squared scores: the higher the score, the better the model
- Coefficients with P values < 0.05 are significant
- F-statistic: the lower the F-statistic, the closer the model is to being non-significant
- Prob(F-statistic): the lower it is, the better the model. For example, 0.0000000000000000000000017

An Invaluable Tip: we can add up to 100 independent variables and the predicting power of the model will probably be outstanding, BUT THIS MAKES THE REGRESSION FUTILE! Simplicity is rewarded better than raw explanatory power. This introduces us to the concept of feature selection.

# CONDITIONS FOR LINEAR REGRESSION

The biggest mistake we can ever make is to perform a regression that violates one of these assumptions:

- LINEARITY
- NO ENDOGENEITY
- NORMALITY AND HOMOSCEDASTICITY
- NO AUTO-CORRELATION OF ERRORS
- NO MULTI-COLLINEARITY

**I. LINEARITY**

Linearity is an essential assumption because the regression equation itself is linear.

*How to verify if relationship between two variables is linear?*

Plot an independent variable x1 against the dependent variable y on a scatter plot. If the points roughly follow a straight line, linear regression is appropriate. If the plot traces a curve, linear regression should not be used.

FIXES: We can transform quadratic, logarithmic or exponential relationships into linear equations and so satisfy the condition of linearity.
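As a sketch of the log-transformation fix (with data generated purely for illustration): an exponential relation y = c·e^(kx) becomes linear after taking log(y), so an ordinary linear fit recovers the parameters:

```python
import numpy as np

# Exponential relationship: y = 2 * exp(0.5 * x) is not linear in x...
x = np.linspace(1.0, 10.0, 30)
y = 2.0 * np.exp(0.5 * x)

# ...but log(y) = log(2) + 0.5 * x is, so fit on the transformed target.
k, log_c = np.polyfit(x, np.log(y), 1)
print(round(k, 3), round(np.exp(log_c), 3))  # 0.5 2.0
```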

**II. NO ENDOGENEITY**

There must be no link between the independent variables and the errors.

**Mathematically**: rho(x, E) = 0 (no correlation)

E → error

x → independent variable

If **rho(x, E) != 0**, we have an **OMITTED VARIABLE BIAS (OVB)**.

Omitted variable bias occurs when we forget to include a relevant variable, and it degrades the model!

For example, suppose we forget to include a relevant variable x* and include some other relevant variable x. Now,

x and y are somewhat CORRELATED,

x* and y are somewhat CORRELATED,

and this implies,

x and x* are somewhat CORRELATED.

Here, since we did not include x* in the model, everything that x* would have explained goes into the error term (the model cannot explain the data correctly because there is no x* in it).

Hence, x* becomes part of **THE ERROR**!

This implies x and E are correlated, and endogeneity is present!

FIXES: OVB is always sneaky and takes a different form each time. Only experience and advanced knowledge will help to fix it.

**III. NORMALITY AND HOMOSCEDASTICITY**

**i. Normality:** we assume that the error term is normally distributed (this is not important for creating the regression, but for drawing inferences).

*For example, the t-statistic and F-statistic work because we assume the error term to be normally distributed.*

FIXES: If the error term is not normally distributed, the Central Limit Theorem helps: for large enough samples, the sampling distribution of the coefficient estimates is approximately normal, so the usual inferences still work.

**ii. Zero mean of error terms:** if the mean of the errors is not zero, the line is not the best-fitting one.

FIXES: Including an intercept solves the problem; hence, this issue rarely occurs.

**iii. Homoscedasticity:** the error term must have equal variance across observations.

*If the variances are not equal and follow some pattern (for instance, errors fanning out as the fitted values grow), the regression won’t work properly.*

FIXES: i. Look out for OVB and try to fix it with the help of field experts.

ii. Look out for outliers and exclude them while training the model.

iii. A log transformation may remove such patterns.

**IV. NO AUTOCORRELATION OF ERRORS aka NO SERIAL CORRELATION**

It is highly prevalent in time-series data rather than in regular cross-sectional data.

**Detecting autocorrelation:**

i. Plot the residuals on scatter plots and try to find patterns. If there are no patterns, we are safe.

ii. Durbin-Watson Test: values fall between 0 and 4. A value of 2 means no autocorrelation; less than 1 or greater than 3 indicates high autocorrelation.
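The Durbin-Watson statistic itself is simple to compute by hand (statsmodels also provides it as `statsmodels.stats.stattools.durbin_watson`); the residual series below are simulated purely for illustration:

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(0)
e_independent = rng.normal(size=1000)             # no autocorrelation
e_random_walk = np.cumsum(rng.normal(size=1000))  # heavy autocorrelation

print(round(durbin_watson(e_independent), 1))  # close to 2
print(round(durbin_watson(e_random_walk), 1))  # close to 0
```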

FIXES: There are no fixes! We cannot apply linear regression to this kind of data.

**Alternative models :**

i. Auto Regressive Model

ii. Moving Average Model

iii. ARMA

iv. ARIMA : Auto Regressive Integrated Moving Average Model

**V. MULTICOLLINEARITY**

There should be no multicollinearity between the independent variables.

If rho(x1, x2) = 1, or even 0.87, linear regression should not be applied. Multicollinearity shows up in the summary table as inflated p-values for the affected coefficients.

**Prevention:** before starting to build a model, try to find the correlation between each pair of independent variables.
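A quick sketch of that pairwise check, with made-up features in which x2 is deliberately an almost exact copy of x1:

```python
import numpy as np

# Hypothetical features: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)  # almost collinear with x1
x3 = rng.normal(size=100)

corr = np.corrcoef([x1, x2, x3])  # pairwise correlation matrix
print(np.round(corr, 2))
# A pairwise correlation near 1 (here x1 vs x2) flags multicollinearity.
```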

FIXES: i. Drop one of the two variables.

ii. Transform the two variables into one variable; for example, take their average.

iii. You may keep them both. But proceed with extreme caution!