Understanding Linear Regression
In the machine learning world, linear regression is a parametric regression model that makes a prediction by taking a weighted sum of the independent features (variables) of an observation and adding a constant value to it, called the intercept or bias term.
This means that regression models of this kind have a fixed number of parameters (weights) that depends on the number of features, and they output a numeric prediction, for example the price of a house.
The prediction takes the form ŷ = Θ0 + Θ1x1 + Θ2x2 + … + Θnxn, where:
- ŷ is the predicted value.
- n is the total number of variables/features in the dataset.
- xi is the value of the ith variable.
- Θi are the parameters of the model, where Θ0 is the bias term.
This is essentially the equation of a line. The core idea is to obtain the line that best fits the data: the best-fit line is the one for which the residuals, also called the total prediction error over all data points in the dataset, are as small as possible, ideally close to 0. The error for a single point is the distance between the predicted value on the regression line and the actual value.
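As a minimal sketch of this prediction (in NumPy, with made-up parameter and feature values purely for illustration), the output is just the bias term plus the dot product of the weights with the features:

```python
import numpy as np

# Illustrative parameters: theta_0 is the bias term, theta_1..theta_n are the weights.
theta_0 = 2.0                          # bias / intercept (made-up value)
theta = np.array([0.5, -1.2, 3.0])     # weights for 3 features (made-up values)

x = np.array([4.0, 1.0, 0.5])          # one observation with 3 feature values

# y_hat = theta_0 + theta_1*x_1 + ... + theta_n*x_n
y_hat = theta_0 + np.dot(theta, x)
print(y_hat)  # 2.0 + 0.5*4.0 - 1.2*1.0 + 3.0*0.5 = 4.3
```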
Understanding through an Example
Consider the relationship between monthly e-commerce sales and online advertising costs. Suppose we have survey results for 7 online stores for the most recent year. The aim is to find the equation of the straight line that fits the data best. The following table shows the survey results from the 7 online stores.
We can see that there is a positive relationship between the monthly e-commerce sales (Y) and online advertising costs (X).
The positive correlation means that the values of the dependent variable (y) increase when the values of the independent variable (x) rise.
Thus, if we want to predict monthly e-commerce sales from advertising costs, the higher the advertising spend, the higher our predicted sales. If we plot the data mentioned above, we observe the following.
The scatter plot shows how much one variable influences another. In our example, the scatter plot shows how strongly online advertising costs influence monthly e-commerce sales, i.e. their relationship. If we fit a best-fit line on it, a regression line produced by the algorithm, the chart turns out as shown below.
Linear regression aims to find the best-fitting straight line through the data points; this best-fitting line is known as the regression line. If the plotted data points lie close to a straight line, the relationship between the two variables is strong. In our example, the relationship is strong and positive, but a relationship can be negative as well.
The blue diagonal line in the graph above is the regression line; it shows the predicted e-commerce sales for every possible value of the advertising costs.
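As a quick sketch of fitting such a line with scikit-learn (the advertising-cost and sales figures below are placeholder values, not the actual survey data from the table above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data for 7 stores: advertising costs (X) and monthly sales (y).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])   # advertising costs
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9])              # monthly sales

model = LinearRegression().fit(X, y)
print("intercept (b0):", model.intercept_)
print("slope (b1):", model.coef_[0])

# Predicted sales for a new advertising spend of 8.0
print("prediction:", model.predict([[8.0]])[0])
```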
Getting into the Math
The model can be written as y = b0 + b1x for simple linear regression, or more generally y = b0 + b1x1 + b2x2 + … + bpxp for multiple regression. In these equations,
y = the target variable.
x = the input variable (x1, x2, …, xp in the case of multiple regression).
b0 = the intercept value, also called the bias term.
b1, b2, …, bp = coefficients describing the linear relationship between the input variables and the target variable.
Effect of b0 on x-y relationship:
The value of b0 defines the predicted mean of the target (y) when the input (x) is 0. If we force b0 to be 0, we force the line to pass through the origin, where both x and y are 0. This may be useful in some cases, while reducing accuracy in others. It is therefore worth experimenting with b0, for example by fitting with and without an intercept, depending on the requirements of the dataset you are working with.
Effect of b1 on x-y relationship:
If b1 is greater than 0, the input variable has a positive effect on the target: an increase in one leads to an increase in the value of the other. If b1 is less than 0, the input and output variables have an inverse, or negative, relationship, so an increase in one leads to a decrease in the value of the other.
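To illustrate the effect of b0, scikit-learn lets you fit with or without an intercept via the fit_intercept flag; the small dataset below is made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])           # made-up data, exactly y = 1 + 2x

with_intercept = LinearRegression(fit_intercept=True).fit(X, y)
through_origin = LinearRegression(fit_intercept=False).fit(X, y)   # forces b0 = 0

print(with_intercept.intercept_, with_intercept.coef_)   # ~1.0 and ~2.0
print(through_origin.intercept_, through_origin.coef_)   # 0.0 and a distorted slope
```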
Suppose we are given a dataset such as the one plotted by the 'x' marks in the plot above. The aim of the regression line is to find a line, like the blue line in the earlier plot, that fits the given set of training examples best. This line is determined by the parameters θ0 and θ1, so the objective of the learning algorithm is to find the best parameters to fit the dataset. In other words, pick θ0 and θ1 so that hθ(x) is close to y for the training examples (x, y).
The learning objective is then to minimise the value of the cost function, i.e. J(θ0, θ1) = (1/2m) Σ (hθ(x⁽ⁱ⁾) - y⁽ⁱ⁾)², summed over the m training examples.
This cost function is also called the squared error function, for obvious reasons. It is the most commonly used cost function for linear regression because it is simple and performs well.
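A minimal NumPy sketch of this squared-error cost for the single-feature case, where hθ(x) = θ0 + θ1x (the data below is made up):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) = (1 / (2*m)) * sum((h(x) - y)^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x        # h_theta(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])                # made-up data, exactly y = 2x
print(cost(0.0, 2.0, x, y))                  # 0.0: a perfect fit
print(cost(0.0, 1.0, x, y))                  # > 0: worse parameters give higher cost
```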
Some Major Metrics for Model Evaluations
Recall that the regression line tries to fit a line that produces the smallest difference between predicted and actual values, and that these differences should be unbiased. This difference, or error, is also known as the residual.
Residual = actual value - predicted value
It is important to note that before assessing or evaluating our model with metrics such as R-squared, we should look at residual plots.
Residual plots reveal a biased model more readily than any other evaluation metric. If your residual plots look fine, go ahead and evaluate your model with the other metrics.
Residual plots show the residuals on the y-axis and the predicted values on the x-axis. If the residuals scatter randomly around zero, the model is behaving well; if there is any sign of a systematic pattern, the model is biased and you cannot trust its results.
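A short matplotlib sketch of such a residual plot, assuming you already have arrays of actual and predicted values (the ones below are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

y_actual = np.array([3.1, 4.0, 5.2, 6.1, 6.8, 8.2])      # made-up values
y_predicted = np.array([3.0, 4.2, 5.0, 6.3, 7.0, 8.0])   # made-up values

residuals = y_actual - y_predicted                         # residual = actual - predicted

plt.scatter(y_predicted, residuals)
plt.axhline(0, linestyle="--")        # residuals should scatter randomly around 0
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residuals vs predicted values")
plt.show()
```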
Mean Squared Error (MSE)
The most widely used metric for regression tasks is MSE: the mean of the squared differences between the actual and predicted values. Because it is differentiable and convex in shape, it is convenient to optimise. MSE punishes large errors heavily.
Mean Absolute Error (MAE)
This is simply the mean of the absolute differences between the target values and the values predicted by the model. Because it does not square the errors, MAE does not punish large errors heavily, so it is not the preferred metric when you want outliers to weigh strongly in the evaluation.
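Both metrics are available as helpers in scikit-learn; a quick sketch with made-up arrays, where the last point acts like an outlier:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.0, 20.0])    # made-up values; 20.0 behaves like an outlier
y_pred = np.array([2.5, 5.5, 6.5, 10.0])

print("MSE:", mean_squared_error(y_true, y_pred))    # squaring inflates the large error
print("MAE:", mean_absolute_error(y_true, y_pred))   # the large error is counted only once
```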
R-squared or Coefficient of Determination
This metric represents the proportion of the variance in the target variable that is explained by the independent variables of the model. It measures the strength of the relationship between your model and the dependent variable.
To understand what R-squared really represents, compare the error of the model with and without the information in the independent variables: R² = 1 - (sum of squared residuals of the model) / (sum of squared deviations of y from its mean).
If R² is high (close to 1), the model explains most of the variance of the dependent variable.
If R² is low, the model explains little of the variance of the dependent variable, and the regression line is no better than simply predicting the mean, because you are not using any information from the independent variables.
Root Mean Squared Error (RMSE)
This is the square root of the mean of the squared differences between the actual values and the values predicted by the model.
R-squared is often easier to interpret than RMSE, because R-squared is a relative measure of fit while RMSE is an absolute measure expressed in the units of the target variable.
Fundamentally, RMSE is just the square root of the average of the squared residuals. Since residuals measure how far the points are from the regression line, RMSE measures the spread of these residuals.
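Both values are easy to compute; a sketch using scikit-learn and NumPy (with made-up arrays again):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # made-up values
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # RMSE = sqrt(MSE)

print("R-squared:", r2)    # relative measure of fit (unitless)
print("RMSE:", rmse)       # absolute measure, in the units of the target
```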
Adjusted R-squared
The main difference between adjusted R-squared and R-squared is that R-squared describes the amount of variance of the dependent variable explained by all of the independent variables together, while adjusted R-squared accounts only for the independent variables that actually improve the model.
The formula is Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of data points and k is the number of variables in your model, excluding the constant.
R² tends to increase as the number of independent variables increases, which can be misleading. Adjusted R-squared therefore penalises the model for adding independent variables (k) that do not improve the fit.
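Adjusted R-squared is not built into scikit-learn, but it is straightforward to compute from R² using the formula above; a small sketch (the values of R², n, and k are made up):

```python
def adjusted_r2(r2, n, k):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    # n: number of data points, k: number of predictors (excluding the constant)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Made-up example: R^2 = 0.85 with 100 observations and 5 predictors
print(adjusted_r2(0.85, n=100, k=5))   # slightly lower than 0.85
```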
Assumptions taken for Linear Regression
1. The regression model is linear in the coefficients.
This assumption addresses the functional form of the model. In statistics, a regression model is linear when every term in the model is either the intercept or a parameter multiplied by an independent variable, and the model is built solely by adding these terms together. These rules constrain the model to one form:
y = β0 + β1x1 + β2x2 + … + βkxk + ε
In the equation, the betas (βs) are the parameters that OLS estimates, and epsilon (ε) is the random error.
2. The error term has a population mean of zero
The error term accounts for the variation in the target variable that the independent variables do not explain; it should represent purely random variation. For your model to be unbiased, the mean of the error term must equal zero.
Suppose the average error is +7. This non-zero average error indicates that our model systematically under-predicts the observed values. Statisticians refer to systematic error like this as bias, and it means that our model is inadequate because it is wrong on average.
3. The error term has a constant variance (no heteroscedasticity)
The variance of the errors should be constant across all observations; in other words, it should not change from observation to observation or across a range of observations. This condition is known as homoscedasticity. If the variance varies, we refer to that as heteroscedasticity.
The easiest way to check this assumption is to create a residuals versus fitted value plot. On this type of graph, heteroscedasticity appears as a cone shape where the spread of the residuals increases in one direction. In the graph below, the spread of the residuals increases as the fitted value increases.
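Besides inspecting the residuals-versus-fitted plot, you can test for heteroscedasticity formally; one option is the Breusch-Pagan test from statsmodels. A sketch on simulated data whose noise grows with x (so it is heteroscedastic by construction):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, 0.5 * x, 200)   # noise spread increases with x

X = sm.add_constant(x)                         # add the intercept column
model = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)     # a small p-value suggests heteroscedasticity
```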
4. All independent variables are uncorrelated with the error term
If an independent variable is correlated with the error term, we could use it to predict the error, which contradicts the idea that the error represents unpredictable random noise. In that case, we need to find a way to include that information in the regression model itself.
This assumption is also referred to as exogeneity; when this kind of correlation exists, there is endogeneity. Violations of this assumption can occur because of simultaneity between the target and the independent variables, or because of measurement error in the independent variables.
What are some advantages and disadvantages of linear regression?
One advantage of linear regression is its interpretability: in a regression with multiple independent variables, each coefficient tells you how much the dependent variable is expected to change when that independent variable increases by one unit, holding all the other independent variables constant.
One major issue with the linear regression algorithm is that it is quite prone to the problems of under-fitting and overfitting. To understand this, we need to understand bias and variance in detail.
Bias — What exactly do you mean by Bias?
Bias is the difference between the mean prediction of our model and the correct value we are trying to predict. A model with high bias pays almost no attention to the training data and oversimplifies the model. It generally leads to high error on both the train and test sets.
Variance — What exactly do you mean by Variance?
Variance is the variation or spread of the model's predictions for a given data point across different training sets. A model with high variance depends heavily on the training set and does not generalise to data it has not seen before. Such models perform very well on training data but have high error rates on test data.
These 2 factors result in what we understand as fitting problems in the world of machine learning.
- In supervised learning, under-fitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data. Models that are too simple to capture complex patterns in the data, such as linear and logistic regression, are prone to it.
- In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train the model extensively on a noisy dataset. These models have low bias and high variance. Very flexible models, such as decision trees, are prone to overfitting.
Bias-Variance Tradeoff
If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it will tend to have high variance and low bias. So we need to find the right balance without overfitting or under-fitting the data. This tradeoff in complexity is why there is a tradeoff between bias and variance: an algorithm cannot be both more and less complex at the same time.
Total error = Bias² + Variance + Irreducible error
An optimal balance of bias and variance would never overfit or under-fit the model.
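One way to see this tradeoff in action is to fit polynomial regressions of increasing degree and compare train and test errors; a sketch using scikit-learn on simulated nonlinear data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, 200)   # nonlinear data with noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):                          # low, moderate, and high complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```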
Why is the regularisation process significant?
If our model's complexity exceeds this sweet spot, we are in effect overfitting the model, while if its complexity falls short of the sweet spot, we are under-fitting it. With all of that in mind, regularisation is simply a useful technique to apply when we think our model is too complex (low bias, but high variance). It is a method for "constraining" or "regularising" the size of the coefficients ("shrinking" them towards zero). The specific regularisation techniques we'll be discussing are Ridge Regression and Lasso Regression. The most common regularisation techniques used to address over-fitting and feature selection are:
1. L1 Regularisation
2. L2 Regularisation
A regression model that uses the L1 regularisation technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression.
The key difference between these two is the penalty term.
Ridge Regression
Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function: Loss = Σ (yᵢ - ŷᵢ)² + λ Σ βⱼ². The second term is the L2 regularisation element.
Here, if lambda is zero, we get back ordinary least squares (OLS). However, if lambda is very large, the penalty dominates, shrinking the coefficients too far and leading to under-fitting. It is therefore important how lambda is chosen. This technique works very well to avoid overfitting.
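A small scikit-learn sketch of ridge regression on simulated data; note that scikit-learn calls the penalty strength alpha rather than lambda:

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(0, 0.5, 100)   # simulated target

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)        # alpha plays the role of lambda

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)  # shrunk towards zero, but not exactly zero
```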
Lasso Regression
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function: Loss = Σ (yᵢ - ŷᵢ)² + λ Σ |βⱼ|.
Again, if lambda is zero we get back OLS, whereas a very large value drives the coefficients to zero and the model under-fits.
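An analogous sketch with scikit-learn's Lasso, showing how the coefficients of unimportant features are typically driven exactly to zero (simulated data again; alpha plays the role of lambda):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
# Only features 0 and 2 actually matter in this simulated target
y = X @ np.array([4.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(0, 0.5, 100)

lasso = Lasso(alpha=0.5).fit(X, y)
print("Lasso coefficients:", lasso.coef_)   # irrelevant features usually end up at 0.0
```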
ElasticNet Regression
Elastic Net first emerged in response to criticism of the lasso, whose variable selection can be too dependent on the data and therefore unstable. The solution is to combine the penalties of ridge regression and lasso to get the best of both worlds. Elastic Net aims at minimising the following loss function: Loss = (1/2n) Σ (yᵢ - ŷᵢ)² + λ [ (1 - α)/2 Σ βⱼ² + α Σ |βⱼ| ],
where α is the mixing parameter between ridge (α = 0) and lasso (α = 1).
Now, there are two parameters to tune: λ and α. In R, the glmnet package allows λ to be tuned via cross-validation for a fixed α, but it does not support tuning α itself, so we turn to the caret package for that job.
The key difference between these techniques is that lasso shrinks the coefficients of the less important features to zero, removing some features altogether, so it works well for feature selection when we have a huge number of features. Traditional methods such as cross-validation and stepwise regression handle overfitting and feature selection well with a small set of features, but these regularisation techniques are a great alternative when we are dealing with a large set of features.
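In Python, scikit-learn's ElasticNetCV offers a comparable workflow, cross-validating the penalty strength for one or more values of the mixing parameter (called l1_ratio there); a sketch on simulated data:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
true_coefs = np.array([3.0, 0.0, 0.0, -1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ true_coefs + rng.normal(0, 0.5, 200)   # only a few features matter

# l1_ratio is the mixing parameter: values near 0 are ridge-like, 1.0 is the lasso.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)

print("chosen l1_ratio:", enet.l1_ratio_)
print("chosen alpha (lambda):", enet.alpha_)
print("coefficients:", enet.coef_)
```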
Conclusion
In this blog, we explored the concepts of linear regression in depth through an intuitive example. We dived deeper into the math behind it and understood the bias term, the parameters associated with each feature, cost function minimisation, and so on. We then went on to understand the various metrics used to evaluate our models. While covering the advantages and disadvantages, we came across the concepts of bias and variance, which are the major sources of error in any linear regression model, and saw how to optimise our models using the regularisation techniques applied in industry.
