Concept of Linear Regression Model for Newbies!
What is Linear Regression?
Linear Regression is a simple statistical regression method used in predictive analytics. As the name suggests, it models the linear relationship between two or more continuous numeric variables plotted on the X-axis and Y-axis. The variables are split into two categories: a single response variable, the output of the data, sits on the Y-axis as the dependent variable, while all the remaining variables sit on the X-axis as the independent variables that influence it.
For example, suppose we are predicting this month's sales for a particular brand and want to know which marketing platforms (say News, Radio, and Newspaper) have contributed to those sales. The gross sales amount will then be on the Y-axis as the dependent variable, and the marketing platforms we have invested in will act as independent variables on the X-axis. The best way to see which independent variable has a linear relationship with the dependent variable is to draw a pair plot using the seaborn library.
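As a rough illustration, such a pair plot could be drawn as in the sketch below. The file name advertising.csv and the column names News, Radio, Newspaper, and Sales are assumptions made for this example, not a real dataset:

```python
# A minimal sketch, assuming a hypothetical advertising.csv with columns
# 'News', 'Radio', 'Newspaper' (spend) and 'Sales' (response).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("advertising.csv")
# One scatter panel per marketing platform against Sales.
sns.pairplot(df, x_vars=["News", "Radio", "Newspaper"], y_vars=["Sales"])
plt.show()
```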
If there is a single predictor variable against the target variable, the model is called simple linear regression (SLR); if there is more than one predictor variable, it is called multiple linear regression (MLR). The relationship between these variables can be zero, positive, or negative. SLR is mainly for building your basic understanding, as one independent variable is rarely enough to capture all the uncertainty in the target variable; to make accurate predictions, you usually need multiple variables. Let's start by building your foundation through Simple Linear Regression.
Simple Linear Regression:
An SLR attempts to explain the relationship between a dependent variable and an independent variable using a straight line on a graph. The independent variable is also known as the "predictor variable", as it predicts the highs and lows of the dependent variable. The dependent variable, on the other hand, is also known as the "target variable".
Statistics behind it:
To calculate the best-fitting straight line, the linear method uses the slope-intercept form. The statistical formula for a simple linear regression is
y = m*x + c; where
y is the dependent/target variable on the Y-axis;
x is the independent/predictor variable on the X-axis;
m is the slope of the line, denoting the linear relationship between x and y (the slope can be positive or negative). The m value is the coefficient of the predictor variable x; and
c is the constant, i.e. the intercept: the point at which the line crosses the y-axis when x is 0.
Interpretation of the Equation:
The m and c values are called the model coefficients or model parameters. For this regression line, c is the intercept and m is the slope. The slope means that a one-unit increase in x changes y by m units, so the target variable is positively or negatively impacted by the predictor variable depending on the sign of m. And when x is 0, the value of y is the point where the line intercepts the y-axis.
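As a rough illustration of what m and c look like in code, here is a minimal sketch that fits a straight line to synthetic data with scikit-learn; the data and the "true" parameters are made up for this example:

```python
# A minimal sketch: fit y = m*x + c on synthetic data and read off m and c.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(100, 1))                # single predictor
y = 3.0 * x[:, 0] + 5.0 + rng.normal(0, 1, 100)      # "true" m = 3, c = 5, plus noise

model = LinearRegression().fit(x, y)
print("slope m:", model.coef_[0])        # change in y per unit increase in x
print("intercept c:", model.intercept_)  # value of y when x = 0
```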
Best Fit Line (Residual Analysis):
In regression, the best fit line is a line that fits the linear regression plot in the best way. To get the best fit line, we calculate the Residuals. Let’s understand more about Residuals and see how you define a notion of a best-fit line.
The residual is the difference between the actual value and the value predicted by the straight line:
e = Yi - Ypred, where
e denotes the residual, as residuals are nothing but the distance between the actual and predicted values, i.e. the error terms,
Yi is the actual (measured) value, and
Ypred is the value predicted by the straight line.
The idea is to minimize the total error: we square the individual residuals and add them up, and the line that makes this sum as small as possible is the best-fit line. So if e is the error for a particular data point, we square and sum the errors over all data points. Mathematically:
Residual (Error) = Actual Value - Predicted Value
Residual Sum of Squares (RSS) = Sum((Actual - Predicted)^2) = Σ (Yi - Ypred)^2
Our objective is to minimize the Residual Sum of Squares (RSS), which lets us pick the straight line with the best parameters c (y-intercept) and m (slope). If you have a way to minimize the RSS, you will be able to find the best possible values for the straight line.
This minimization of the RSS is done with a method called Gradient Descent, using the concept of a Cost Function. Here, our cost function is the RSS, and gradient descent is an iterative, computational approach to minimizing it. Gradient descent can be implemented with a short Python routine. If you want to understand the concept of gradient descent in detail, please refer to the article linked here.
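To make the idea concrete, here is a minimal sketch of gradient descent minimizing the RSS for a straight line on synthetic data; the learning rate, iteration count, and "true" parameters are arbitrary choices for illustration:

```python
# A minimal sketch of gradient descent on RSS for y = m*x + c (toy data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.5 * x + 4.0 + rng.normal(0, 1, 200)   # "true" slope 2.5, intercept 4.0

m, c = 0.0, 0.0            # start from arbitrary parameters
lr = 0.001                 # learning rate
for _ in range(20000):
    y_pred = m * x + c
    error = y - y_pred                  # residuals e = Yi - Ypred
    dm = -2 * np.sum(error * x)         # derivative of RSS w.r.t. m
    dc = -2 * np.sum(error)             # derivative of RSS w.r.t. c
    m -= lr * dm / len(x)               # scaled step to keep the updates stable
    c -= lr * dc / len(x)

print(m, c)   # should approach the true slope 2.5 and intercept 4.0
```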
Strength of SLR:
After determining the straight fit line, there are a few critical questions that you need to answer.
- How well does the best fit line represent the graph?
- How well does the best fit line predict the new data?
So what usually happens? The value we get for RSS is in absolute terms, but we need it in relative terms: an absolute number is raw and non-comparative, whereas a relative number can be compared against a reference. To convert the RSS into a relative quantity, we first compute the Total Sum of Squares (TSS), which is the sum of squared distances between the actual points and the average line, i.e. the horizontal line at the mean of the actual y-values.
The equation of TSS is:
TSS = Σ (Yi - ȳ)^2, where
ȳ is the average of all the actual y-values (the average line), and
Yi represents each data point with respect to the y-axis.
So the whole idea is: if you had a linear model with no independent variables at all, just an intercept, you could build a very basic model in which the y-intercept c = ȳ (the average of all the data points). A model built using the independent variables should do better than this basic model, so it becomes our reference, and we compare RSS against TSS to get a normalized quantity that tells us how good the model is. This is the R-squared (r2) value:
r2 = 1 - RSS / TSS,
where RSS = Residual Sum of Squares, and
TSS = Total Sum of Squares.
Conclusion: the higher the R-squared value (the closer it is to 1), the better the model.
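Here is a minimal sketch of computing RSS, TSS, and R-squared by hand on synthetic data, checked against scikit-learn's built-in score; the data itself is made up:

```python
# A minimal sketch of computing RSS, TSS, and R-squared by hand on toy data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(150, 1))
y = 1.8 * x[:, 0] + 2.0 + rng.normal(0, 2, 150)

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

rss = np.sum((y - y_pred) ** 2)       # Residual Sum of Squares
tss = np.sum((y - y.mean()) ** 2)     # Total Sum of Squares
r2 = 1 - rss / tss
print(r2, model.score(x, y))          # the two values should match
```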
Assumptions of Linear Regression:
While building the model, you are making inferences about the population using a sample dataset. In doing so, we assume that the target variable and the input variables are linearly related. But this assumption alone is not enough to generalize the results, because the population is much larger in reality than the sample on which the model is built. Thus, you need certain assumptions in place to make valid inferences.
1. There is a linear relationship between x and y:
- If no linear relationship is observed, there is no point fitting a linear model between the two.
2. Error terms are normally distributed with a mean equal to zero:
- If the error terms are not normally distributed, the p-values obtained during model building become unreliable.
3. Error terms are independent of each other:
- They should not be dependent on one another, as in a time series.
4. Error terms have constant variance (homoscedasticity):
- The variance should not increase or decrease as the error values change.
- The variance should also not follow any pattern as the error terms change.
- Changing variance is known as heteroscedasticity.
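A quick way to eyeball the error-term assumptions is to plot the residuals. The sketch below uses synthetic data; in practice you would use the residuals of your own fitted model:

```python
# A minimal sketch of checking residual assumptions visually (toy data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=(200, 1))
y = 4.0 * x[:, 0] + 1.0 + rng.normal(0, 1, 200)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(residuals, bins=20)               # should look roughly normal, centred at 0
ax1.set_title("Distribution of residuals")
ax2.scatter(model.predict(x), residuals)   # should show no pattern (homoscedasticity)
ax2.axhline(0, color="red")
ax2.set_title("Residuals vs fitted values")
plt.show()
```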
Multiple Linear Regression:
As seen in SLR, the model is built using one independent variable. But what if you have more than one independent variable? Multiple linear regression is needed when one variable is not sufficient to create a good model and make accurate predictions. MLR represents the relationship between two or more independent input variables and a response variable.
MLR also uses a linear model that can be formulated in a very simple way. In MLR, the formula gets transformed as follows:
Ideal Equation of MLR:
y = c + m1*x1 + m2*x2 + m3*x3 …… + mn*xn, where
- y is the response/target variable.
- c is the y-intercept or you can consider it as a constant.
- m1 is the coefficient for the first feature.
- mn is the coefficient for the nth feature.
Interpretation of the Equation:
The m values are called the model coefficients or model parameters.
For any particular variable x, its coefficient m is the amount of increase/decrease in E(y), the expected (mean) value of y, per unit increase in x, when the other predictors are held constant.
For example, if you have two coefficients m1 and m2 along with two independent variables x1 and x2, so that the model contains
m1*x1 + m2*x2, then
m2 is the amount of change in the mean response E(y) per unit increase in x2, provided x1 is held constant.
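As a sketch of how this interpretation plays out in code, here is a small multiple linear regression fitted with statsmodels on synthetic data; the column names x1 and x2 and the true coefficients are made up:

```python
# A minimal sketch of multiple linear regression with statsmodels (toy data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "x1": rng.uniform(0, 10, 300),
    "x2": rng.uniform(0, 5, 300),
})
df["y"] = 2.0 * df["x1"] + 3.0 * df["x2"] + 1.5 + rng.normal(0, 1, 300)

X = sm.add_constant(df[["x1", "x2"]])   # adds the intercept term c
model = sm.OLS(df["y"], X).fit()
print(model.params)   # c, m1, m2 -- each m is the change in E(y) per unit
                      # increase in its x, holding the other x constant
```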
New considerations in MLR:
A. Too many cooks spoil the broth:
Model complexity due to overfitting:
A model is said to overfit when the training accuracy is very high while the test accuracy is very low. Such a model does not generalize; it ends up memorizing the training data.
A well-fitted model generalizes well on the dataset: if more such data is introduced, the accuracy will not drop. Hence, it is a good fit.
An overfitted model, in contrast, appears to memorize all the data points in the dataset, which will not hold up well when new data is introduced.
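One common way to spot overfitting is to compare the model's score on the training data against a held-out test set. The sketch below deliberately overfits synthetic data with an overly flexible feature set; all data and settings are made up for illustration:

```python
# A minimal sketch: deliberately overfit toy data and compare train/test scores.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=(40, 1))
y = 2.0 * x[:, 0] + rng.normal(0, 0.3, 40)

X = PolynomialFeatures(degree=15).fit_transform(x)   # far too flexible for this data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("train R2:", model.score(X_train, y_train))    # near-perfect on training data
print("test  R2:", model.score(X_test, y_test))      # typically much lower: overfitting
```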
Multicollinearity:
Multicollinearity contradicts the interpretation of the MLR equation given above, where E(y) changes by the coefficient amount when one predictor variable (x) increases and the other predictors are held constant. This notion breaks down because the x-values (predictor variables) often have a strong correlation or association with each other, so it becomes difficult for them to stay constant while one of them changes.
This hurts our ability to interpret the model correctly, as the p-values become unreliable.
Detecting multicollinearity:
In simple terms, it means detecting strong associations or correlations between the predictor variables. The most common way is to visualize them by drawing pair-wise scatter plots or a correlation plot between all the predictor variables (including the target variable). But that is not enough, because sometimes a single variable can be associated with two or three other variables taken together. So how do we assess this?
The basic idea is to measure how well one independent variable (x1) is explained by all the other independent variables combined (x2, x3, x4, ……, xN). For the i-th variable, we fit a model of that variable on the remaining predictors, take its R-squared value R_i^2, and compute
VIF_i = 1 / (1 - R_i^2).
This is known as checking the variance inflation factor (VIF).
So, once you have the r2 of that auxiliary model, you can compute the VIF. If the VIF is high, we can be confident that the variable is associated with the other variables. As per common industry practice, a VIF value > 5 is considered high and should be inspected.
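In practice, the VIF for each predictor can be computed with statsmodels, as in the sketch below; the data is synthetic, with one column deliberately constructed from the others to inflate its VIF:

```python
# A minimal sketch of computing VIF for each predictor with statsmodels (toy data).
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
X = pd.DataFrame({
    "x1": rng.normal(0, 1, 500),
    "x2": rng.normal(0, 1, 500),
})
X["x3"] = 0.7 * X["x1"] + 0.3 * X["x2"] + rng.normal(0, 0.1, 500)  # built from x1, x2

vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)   # the collinear columns should show values well above the usual threshold of 5
```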
Once detected, multicollinearity can be dealt with by,
a. Dropping variables that are highly correlated with each other, or keeping the one that matters most for your business and dropping the other.
b. Transforming the original variables into new ones, for example by consolidating interacting variables into a single feature through feature engineering.
B. Feature Selection:
Feature selection is the repetitive, often manual, task of choosing an optimal set of predictor variables from a pool of given features, many of which might be redundant, until we achieve a well-fitting model.
Creating Dummy Variables:
You will often have a dataset with certain categorical columns. Such a column has n predefined levels (categories). But to fit a regression line we need numeric values, not string values, so we convert the categories into 1s and 0s, where 1 means 'Yes' and 0 means 'No'. When you have a categorical variable with, say, n levels, the idea of dummy-variable creation is to build n-1 indicator variables.
Let’s understand this with a help of an example.
Suppose we have a categorical variable, 'relationship_status', with three levels, namely 'Single', 'In a relationship', and 'Married'.
You would first create three new dummy variables, one per level, each holding 1 where the row belongs to that level and 0 otherwise.
As you can see, there is no need to keep all three dummy variables. If you drop one of them, say 'Single', you can still explain all three levels, because of how the remaining columns encode the levels.
The pattern of numeric values that remains is the encoding, and it maps onto the categorical levels as follows:
Single -> 00
In a relationship -> 10
Married -> 01
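In pandas, this dummy-variable creation can be sketched as follows; the values simply mirror the relationship_status example above:

```python
# A minimal sketch of dummy-variable creation with pandas.
import pandas as pd

df = pd.DataFrame({
    "relationship_status": ["Single", "In a relationship", "Married", "Single"]
})

dummies = pd.get_dummies(df["relationship_status"])  # one 0/1 column per level
dummies = dummies.drop(columns=["Single"])           # n-1 columns are enough;
                                                     # 'Single' is encoded as 0, 0
print(dummies)
# In practice, pd.get_dummies(..., drop_first=True) drops one level automatically.
```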
Model Selection:
The fundamental idea behind selecting the right model is that you need to maintain a balance between keeping the model simple and explaining the highest variance. This means you would like to keep only those predictor variables that genuinely contribute to the increase or decrease of the target variable.
Suppose you have two models with many features, say 12 variables in one and 18 in the other. How do you compare them? Is it fair to use r2, given that the model with 18 variables will almost always show a higher r2? In some cases, there will also be concerns about multicollinearity.
Hence, when choosing between two such models, there will always be a trade-off between explaining greater variance and keeping the model simple. There are multiple ways to handle this in practice, but we will look at the most common one, "Adjusted r2". The key idea is to penalize the model for using a higher number of predictors.
Suppose you have r2, but you also account for the number of variables the model is built on, penalizing the use of insignificant or irrelevant predictors. The formula for Adjusted r2 is:
Adjusted r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1),
where p = number of predictors, and
n = sample size.
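Since the formula is simple, Adjusted r2 can be computed directly. The sketch below compares a hypothetical 12-variable model against an 18-variable one with a slightly higher r2; the numbers are made up for illustration:

```python
# A minimal sketch of Adjusted r2; the r2 values, n, and p below are made up.
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Penalize r2 for the number of predictors p, given sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(r2=0.85, n=100, p=12))   # ~0.829
print(adjusted_r2(r2=0.86, n=100, p=18))   # ~0.829: the extra predictors wipe out the small r2 gain
```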
End Notes:
Linear Regression is the foundation for every Machine Learning enthusiast, and it is also the right place for beginners to start. It is a simple but genuinely useful algorithm.
This is an educational post compiled from materials from my master's at IIIT Bangalore that helped me in my journey. If you would like to take a glance at a Linear Regression project, a complete case study can be found in this GitHub repo.
To be continued