ML Core Concepts: Ridge Regression
On Day 1, we discussed a limitation of Linear Regression: it needs more rows of training data than independent variables. However, in many real-world scenarios you may not have enough data relative to the amount of information you have about each data point. All of that information may be relevant and independent of the rest, so even after feature selection and feature reduction (both of which will be explored in later posts) you may still have more features than rows. Enter: Ridge Regression.
What Problem Does It Solve?
Ridge Regression helps prevent overfitting by applying a penalty to the size of the regression coefficients. This is known as regularization. As the penalty increases, the coefficients shrink toward zero. This is useful for small or noisy data sets (remember that every feature adds another opportunity for noise) because it improves how well the model generalizes to new input data.
We can also phrase this as reducing the variance. Model variance comes from the fact that you are always training the model on a sample of all the data it could (in theory) be trained on. When the sample changes, the model changes, causing variability in the predictions it makes. More complex models, such as those with a large number of features, have greater variance. By using regularization, Ridge Regression trades a small increase in bias for a reduction in variance, which typically improves how accurately the model predicts on new data.
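As a quick illustration, here is a minimal sketch of that shrinkage using scikit-learn's Ridge class; the synthetic data and the alpha values are made up for this example and are not from the original post:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                        # a small sample: 20 rows, 5 features
true_coefs = np.array([3.0, -2.0, 1.0, 0.5, 0.0])
y = X @ true_coefs + rng.normal(scale=0.5, size=20)

# As the regularization strength (alpha) grows, the fitted coefficients shrink toward zero.
for alpha in [0.01, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 2))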
The Math
Because Ridge Regression is simply Linear Regression with a penalty applied to the coefficients, the mathematical model for calculating predictions is the same. The difference lies in how the coefficients are calculated. Specifically, the identity matrix multiplied by the regularization strength λ is added to the XᵀX term before inverting: β = (XᵀX + λI)⁻¹ Xᵀ y. In code this is:
import numpy as np

# X is the feature matrix, y the target; lam is the regularization strength ("lambda" is a reserved word in Python)
XTX = np.dot(X.T, X) + lam * np.identity(X.shape[1])
beta = np.dot(np.linalg.inv(XTX), np.dot(X.T, y))
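As a sanity check, here is a hedged sketch showing that this closed form matches scikit-learn's Ridge when the intercept is turned off; the toy X, y, and lam values below are invented for the example:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 3))                        # toy feature matrix
y = rng.normal(size=10)                             # toy target vector
lam = 1.0                                           # regularization strength

XTX = np.dot(X.T, X) + lam * np.identity(X.shape[1])
beta = np.dot(np.linalg.inv(XTX), np.dot(X.T, y))
y_pred = np.dot(X, beta)                            # predictions use the usual linear form

# scikit-learn's Ridge solves the same problem (here without an intercept term).
sk_beta = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta, sk_beta))                   # True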
This approach implements what is known as the L2 norm. The L2 norm, or Euclidean norm, is one method for calculating the magnitude of a vector. For a simple 2D space, it is the hypotenuse of the triangle formed by the values on the X and Y axes: the familiar c = sqrt(a² + b²). Extending that to a higher-dimensional space, it becomes the square root of the sum of all of the squared values of the vector.
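For concreteness, a tiny sketch of that calculation in NumPy (the vector values are arbitrary):

import numpy as np

v = np.array([3.0, 4.0, 12.0])
print(np.sqrt(np.sum(v ** 2)))   # square root of the sum of the squared values: 13.0
print(np.linalg.norm(v))         # NumPy's default vector norm is the L2 norm: 13.0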
How is Ridge Regression related to the L2 norm? At a high level, regression models are fit by minimizing a loss function. One loss metric used in regression problems is the "root mean squared error" (RMSE), or the square root of the mean of the squares of the error in each prediction. Ridge Regression keeps that error term but adds a penalty: the regularization strength multiplied by the squared L2 norm of the coefficient vector. Minimizing the combined objective rewards the model for keeping both the prediction errors and the coefficients small.
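Written as a small helper, the objective being minimized might look like the following sketch; the function name and arguments are illustrative, not from any library:

import numpy as np

def ridge_objective(beta, X, y, lam):
    # Sum of squared prediction errors plus the regularization strength
    # times the squared L2 norm of the coefficient vector.
    errors = y - np.dot(X, beta)
    return np.sum(errors ** 2) + lam * np.sum(beta ** 2)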
Assumptions and Limitations
Ridge Regression has the same fundamental assumptions as other linear algorithms — in particular that there is a linear relationship between the independent and dependent variables. This can cause it to perform poorly on data sets that have non-linear relationships. We call this a biased model — its strict assumptions add a systematic error to the predictions.
The regularization term can be tricky to tune. Scikit-Learn has a RidgeCV class that you can use to select the value through cross validation. In their example, the candidate values range from 0.000001 to 1,000,000, and in real-world applications the term can have that much variation between different use cases. However, generally the more training data available, the lower the regularization strength can be. As the term approaches infinity, the coefficients approach zero, and as the term approaches zero, the coefficients approach what would be arrived at with Linear Regression.
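A minimal sketch of that cross-validated search, assuming a feature matrix X and target vector y are already defined:

import numpy as np
from sklearn.linear_model import RidgeCV

# Candidate strengths spanning many orders of magnitude, as described above.
alphas = np.logspace(-6, 6, 13)
model = RidgeCV(alphas=alphas).fit(X, y)   # X and y assumed to exist already
print(model.alpha_)                         # the strength chosen by cross validation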
Though this isn't necessarily a limitation of Ridge Regression, a strength of the next algorithm in this series, Lasso, is that it can be used to reduce the number of features. With Ridge Regression, no features are eliminated; their coefficients are simply shrunk toward zero.
Further Resources
Penn State STAT 508: Applied Data Mining and Statistical Learning Course Notes
Machine Learning Mastery: Gentle Introduction to Vector Norms in Machine Learning
Towards Data Science: Visualizing regularization and the L1 and L2 norms
Ridge regression and other key concepts for linear models in real-world settings are covered in my Machine Learning Flashcards: Linear Models deck. Check it out on Etsy!