Overfitting… you have probably already heard of it somewhere, right? It happens when your model tries so hard to fit your training data that it ends up capturing all the noise (points that do not represent the true properties of your data, such as outliers).
If you have already heard about overfitting, you have probably also heard of the trade-off between bias and variance (check this post for more details).
Regularisation: why do I use you?
Ordinary Least Squares
Before we go further, let me just make it clear that Ordinary Least Squares (OLS) is not a regularisation method; it is a linear least squares method for estimating the unknown coefficients in a Linear Regression model. The problem with this method is that, when the model has more than one feature, the features may be highly correlated with each other. This gives the coefficient estimates very high variance and makes the model overfit our training data. Check this post for more details on the negative aspect of high variance.
Hence, we wish to control our parameter values; we do not wish them to grow without bound. This is the issue with OLS when the features are correlated: the coefficients can become abnormally big. A good solution is to put a limit on the growth of these coefficients, i.e. to regularise our model.
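To make this concrete, here is a minimal sketch (not a definitive experiment) that refits a two-feature model on many synthetic data sets in which the second feature is almost a copy of the first. The data, the helper `fit_linear` and the penalty strength `alpha=1.0` are assumptions made up for illustration; the squared penalty used here is the Ridge penalty discussed later in this post.

```python
import numpy as np

def fit_linear(X, y, alpha=0.0):
    """Least squares without intercept; alpha > 0 adds a squared (ridge) penalty."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
ols_coefs, ridge_coefs = [], []
for _ in range(200):
    x1 = rng.normal(size=50)
    x2 = x1 + rng.normal(scale=0.01, size=50)  # almost identical to x1
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(scale=0.5, size=50)  # true coefficients are (1, 1)
    ols_coefs.append(fit_linear(X, y, alpha=0.0))
    ridge_coefs.append(fit_linear(X, y, alpha=1.0))

# Spread of each coefficient across the 200 refits.
ols_std = np.std(ols_coefs, axis=0)
ridge_std = np.std(ridge_coefs, axis=0)
print("OLS coefficient std:  ", ols_std)
print("Ridge coefficient std:", ridge_std)
```

The unpenalised coefficients swing widely from one data set to the next, while the penalised ones stay close together, which is the high-variance problem described above.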
Regularisation: Here’s why I need you!
So, a good way to reduce overfitting/variance is to regularise the model (i.e. constrain it). Regularisation is a form of regression that constrains your model to fewer degrees of freedom, shrinking the coefficient estimates towards zero so that the model has a harder time overfitting the data. For example, a way to regularise a polynomial model is to reduce its polynomial degree.
This technique discourages learning a more complex or flexible model, so as to avoid overfitting, by adding to the cost function a penalty that gets larger as the coefficients θ get larger. The more importance is given to the penalty term, the more we discourage large coefficients. Before presenting the types of regularisation, let’s recall a few points about linear regression that are important for the following sections.
So, as you probably know, a linear model makes a prediction by simply computing a weighted sum of the input features (the weights are the model’s coefficients, or parameters), plus a constant called the bias term or intercept term.
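Written as an equation, using the symbols listed just below, the prediction takes the standard form:

```latex
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
```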
- ŷ is the predicted value
- n is the number of features
- xi is the ith feature value
- θj is the jth model parameter (including the bias term θ0 and the feature weights θ1, θ2, ⋯, θn).
Now, if you remember your past lectures on Linear Regression, you know that training a model means setting its parameters so that it best fits the training set. To measure how well our approximation matches reality we need what we call a cost function. The most common performance measure of a Regression model is the Root Mean Square Error (RMSE). As you might have guessed, in order to find the best Linear Regression model you need to find the coefficients that minimise the RMSE. In practice, it is much simpler to minimise the Mean Square Error (MSE) than the RMSE, and it leads to the same result.
This cost function allows us to adjust the coefficients based on our training data. Now, imagine that our data has noise; then the coefficient estimates won’t generalise that well to future data. It is at this moment that regularisation takes action and shrinks these estimates towards zero.
Ridge Regression (also called Tikhonov regularisation) is a regularised version of Linear Regression. For this type of regularisation a penalty term is added to the cost function, so that the algorithm not only fits the data but also keeps the model’s coefficients as small as possible.
Remember that this regularisation term must only be added to the cost function during training. Once your model has been trained, you should evaluate its performance using the unregularised performance measure.
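For reference, the two measures can be written as follows. The symbol m (number of training instances) and the superscript (i) (the i-th training instance) are notation assumed here, not taken from the text above:

```latex
\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2,
\qquad
\mathrm{RMSE}(\theta) = \sqrt{\mathrm{MSE}(\theta)}
```

Because the square root is an increasing function, the coefficients that minimise the MSE also minimise the RMSE.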
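A common way to write the Ridge cost function is sketched below. Take the exact scaling as an assumption: some texts put a factor of 1/2 in front of the penalty, and some libraries use the sum of squared errors instead of the MSE, which only changes the scale of α.

```latex
J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{j=1}^{n} \theta_j^2
```

The sum starts at j = 1, so the bias term θ0 is not penalised.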
The hyperparameter α controls how much you wish to regularise the model, in other words how strongly you penalise the model’s flexibility. An increase in flexibility shows up as an increase in the coefficients, so if we wish to minimise the regularised cost function, these values need to be small.
With α = 0 the model reduces to ordinary Linear Regression while, on the other hand, if α → ∞ (very large values) the coefficient estimates end up very close to zero and the result is a flat line through the data’s mean.
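The effect of α can be checked numerically. The sketch below uses the closed-form Ridge solution on centred data; `ridge_fit` and the synthetic data are hypothetical helpers written for this illustration. Note that this form penalises the sum of squared errors rather than the MSE, so its α is on a different scale from a cost function built on the MSE.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on centred data, so the intercept is not penalised."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    theta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - X_mean @ theta
    return intercept, theta

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

for alpha in [0.0, 1.0, 100.0, 1e6]:
    intercept, theta = ridge_fit(X, y, alpha)
    print(f"alpha={alpha:>9}: intercept={intercept:.3f}, theta={np.round(theta, 4)}")
```

As α grows the weights shrink towards zero and the intercept approaches the mean of y, which is the flat line mentioned above.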
One important note is that you should scale the data (e.g. using StandardScaler()) before running Ridge Regression, since it is sensitive to the scale of the input features.
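With scikit-learn, scaling and Ridge can be chained in a pipeline. This is a minimal sketch on made-up data with two features on very different scales; the value `alpha=1.0` is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features: one with unit scale, one roughly 1000 times larger.
X = np.column_stack([rng.normal(size=200), rng.normal(scale=1000.0, size=200)])
y = 3.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Without scaling, the penalty treats the two coefficients very differently.
raw = Ridge(alpha=1.0).fit(X, y)
print("raw coefficients:   ", raw.coef_)

# With scaling, both features are penalised on a comparable footing.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print("scaled coefficients:", model.named_steps["ridge"].coef_)
```

Both features contribute equally to y here, which only becomes visible in the coefficients once the features are standardised.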
Ridge Regression has one main disadvantage: it includes all n features in the final model. The hyperparameter α will shrink the coefficients towards zero but will not set any of them exactly to zero (unless α = ∞). This is not good when dealing with data sets with a large number of variables, because Ridge Regression will always create a model involving all features.
On the other hand, the Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso Regression) overcomes that disadvantage. It differs from Ridge Regression only in its penalty: it uses the absolute value |θj| (modulus) of each coefficient instead of its square. In Statistics, this is known as the L1 norm (check this video for more enlightenment).
In contrast to Ridge Regression, Lasso has the effect of forcing some coefficient estimates to be exactly zero when the hyperparameter α is sufficiently large. Therefore, one can say that Lasso performs variable selection, producing models that are much easier to interpret than those produced by Ridge Regression.
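By analogy with Ridge, the Lasso cost function can be sketched as follows. The scaling is again an assumption, since libraries and textbooks differ by constant factors:

```latex
J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{j=1}^{n} \lvert \theta_j \rvert
```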
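The difference is easy to see on synthetic data where only two of ten features actually matter. This is a hedged sketch: the data and the value `alpha=0.5` are made up for illustration, and scikit-learn scales the penalty differently for `Lasso` and `Ridge`, so the two values of α are not directly comparable.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # features are already on a unit scale
true_theta = np.zeros(10)
true_theta[:2] = [3.0, -2.0]    # only the first two features carry signal
y = X @ true_theta + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso exact zeros:", int(np.sum(lasso.coef_ == 0.0)))
print("Ridge exact zeros:", int(np.sum(ridge.coef_ == 0.0)))
```

Lasso drops irrelevant features by setting their coefficients to exactly zero, whereas Ridge keeps small but non-zero values for all of them.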
Another Formulation for Ridge and Lasso Regression
Let’s take a look at the above methods from a different perspective. Let’s consider that we have two coefficients to estimate in a given problem. The Lasso Regression coefficient estimates have the smallest MSE out of all points that lie within the diamond defined by |θ1| + |θ2| ≤ s. On the other hand, the Ridge Regression coefficient estimates have the smallest MSE out of all points that lie within the circle defined by θ1² + θ2² ≤ s.
The image below describes the mentioned equations:
Looking at the above image, in green, we have the constraint regions (see the vector norm video to understand), for Lasso on the left and Ridge on the right, along with contours of the MSE (red ellipses). If s is large enough that the green areas contain the centre of the ellipses, then for both regularised regressions the results will be equal to the ordinary least squares estimates. However, in the example of this image, both the Lasso and the Ridge coefficient estimates are given by the first point at which an ellipse touches the constraint region (green area).
Now, as we know, the great difference between both methods lies in Ridge’s disadvantage of keeping all predictors. This is due to the fact that Ridge Regression has a circular constraint with no sharp points, hence the intersection between the ellipse and the constraint area will usually not occur on an axis, so θ1 and θ2 will usually both be different from zero. In contrast, the Lasso constraint has corners on each of the axes, making it more common for the ellipse to touch the constraint region on an axis. Thus, one of the coefficients will often be equal to zero. In higher dimensions (we only looked at the two-dimensional case), many of the coefficients can be exactly zero at the same time.
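Put as constrained optimisation problems, the two formulations read:

```latex
\begin{aligned}
\text{Lasso:} \quad & \min_{\theta}\ \mathrm{MSE}(\theta) \quad \text{subject to} \quad |\theta_1| + |\theta_2| \le s \\
\text{Ridge:} \quad & \min_{\theta}\ \mathrm{MSE}(\theta) \quad \text{subject to} \quad \theta_1^2 + \theta_2^2 \le s
\end{aligned}
```

A smaller budget s plays the same role as a larger α in the penalised form: the tighter the constraint, the stronger the regularisation.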
Regularisation: It’s cool but be aware!
As stated in the beginning, a standard least squares model often has considerable variance, so it does not generalise well to data sets other than its training set. This is where regularisation comes in handy, since it significantly reduces the variance of the model without a substantial increase in the bias.
Nevertheless, when tuning the hyperparameter α we should also be aware of the impact this might have on the final model. For example, as α increases, the values of the coefficients shrink, and so does the variance. However, if we continue to increase α, there comes a point where the model starts to lose important properties, giving rise to bias and therefore underfitting. Hence, one should be careful when selecting the value of the hyperparameter α.
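One simple way to pick α is to compare the error on held-out data for a handful of candidate values. The sketch below does this with the closed-form Ridge solution; the data, the train/validation split and the list of candidates are all assumptions chosen for illustration, and in practice cross-validation is usually preferred over a single split.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on centred data, so the intercept is not penalised."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    theta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - X_mean @ theta, theta

def mse(X, y, intercept, theta):
    """Unregularised mean squared error of the fitted model."""
    return float(np.mean((intercept + X @ theta - y) ** 2))

rng = np.random.default_rng(1)
n_features = 30
X = rng.normal(size=(80, n_features))
true_theta = np.zeros(n_features)
true_theta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = X @ true_theta + rng.normal(scale=2.0, size=80)

X_train, y_train = X[:40], y[:40]  # used to fit the coefficients
X_val, y_val = X[40:], y[40:]      # held out to compare values of alpha

alphas = [0.0, 0.1, 1.0, 10.0, 100.0, 1e4]
results = {}
for alpha in alphas:
    intercept, theta = ridge_fit(X_train, y_train, alpha)
    results[alpha] = (mse(X_train, y_train, intercept, theta),
                      mse(X_val, y_val, intercept, theta))
    print(f"alpha={alpha:>8}: train MSE={results[alpha][0]:.3f}, "
          f"validation MSE={results[alpha][1]:.3f}")

best_alpha = min(results, key=lambda a: results[a][1])
print("alpha with lowest validation MSE:", best_alpha)
```

The training error can only go up as α increases, so it is the validation error that tells you where extra regularisation stops helping and underfitting starts.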