MODELING THE SIGNAL: REGULARIZATION IN MACHINE LEARNING

Airton Kamdem
5 min read · Feb 28, 2022

--

The balance between accounting for signal and noise in a dataset is a tricky and surprisingly artistic endeavor that any good data scientist should be able to navigate. In this case, we are defining noise as the random data points within a dataset that do not align with the broader patterns or trends in the data. These data points tend to have minimal predictive power and can often derail or distract a model, which is why they are called noise. The visual above will be a lasting fixture in this discussion as we explore the tools we have at our disposal to minimize noise in modeling.

Regularization, in machine learning, is the process of adjusting your model's fit to the training data. There are a handful of good reasons why machine learning engineers should want to tinker with this as a tool. In many regards, it is tantamount to tuning a guitar so that you can achieve the right sound for any given occasion. In machine learning, as you employ different types of models and engineer different types of features in service of distinct business and research goals, it becomes clear over time that no single out-of-the-box model will respond perfectly to every dataset or problem you pose, and therein lies an artistic opportunity in a data scientist's job.

As with most linear regression models, there is a set of coefficients connected to the model's equation and output. These coefficients are essentially the variables we are managing and tuning as we try to engineer a model that adequately fits our training data. The tuning process requires trading off bias and variance to adjust the model's fit. The graph above provides a visual representation of this regularization concept, while the equation below presents a corresponding quantitative outlook.

Y ≈ β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ

In this context, we'll discuss regularization as it relates to Ridge and Lasso specifically. When regularizing with Ridge, our coefficients continuously shrink towards zero but never actually reach zero. With Ridge, you can also import and use RidgeCV and have it test a wide range of alphas by passing something like np.linspace(0.000001, 5, 1000) as the alphas, forcing the model to evaluate 1,000 alpha values between (nearly) 0 and 5 in this case.
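As a rough illustration, here is a minimal sketch of that alpha search, assuming a scikit-learn workflow and a synthetic feature matrix X and target y standing in for your own data:

```python
# Minimal sketch: cross-validated alpha search for Ridge.
# X and y are synthetic placeholders for your own features and target.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=42)

# Test 1,000 alpha values between (nearly) 0 and 5, as described above.
ridge = RidgeCV(alphas=np.linspace(0.000001, 5, 1000))
ridge.fit(X, y)

print(ridge.alpha_)  # the alpha selected by cross-validation
print(ridge.coef_)   # coefficients shrink toward zero but never hit exactly zero
```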

Lasso, which stands for Least Absolute Shrinkage and Selection Operator, follows a slightly different regularization path: coefficients approach zero, and as soon as one reaches zero, that feature is effectively removed from your model, which is a key distinction from Ridge. In practice, the more you increase the alpha, the more features you effectively cut out of the model, and a quick look at lasso.coef_ gives you a real-time view of how many coefficients have been zeroed out as your alpha shifts.
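A small sketch of that behavior, reusing the synthetic X and y from the Ridge example above (an assumption made purely for illustration):

```python
# Count how many coefficients Lasso zeroes out as alpha grows.
from sklearn.linear_model import Lasso

for alpha in (0.01, 0.1, 1.0, 10.0):
    lasso = Lasso(alpha=alpha)
    lasso.fit(X, y)
    zeroed = (lasso.coef_ == 0).sum()
    print(f"alpha={alpha}: {zeroed} of {len(lasso.coef_)} coefficients zeroed out")
```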

Towards this end, it is important to keep in mind that the fitting process minimizes a loss function, the Residual Sum of Squares (RSS), through the coefficients we choose. When our model chases noise or random inputs that don't follow a broader pattern in our data, the fitted coefficients struggle to generalize: they may look fine on the training set and perform far worse on a testing set. In these scenarios, we want to constrain the flexibility of the model so that it stops chasing the 'noisy' data points. This is done by shrinking the coefficients, and by extension the overall fitted function, to weaken the association between individual variables and the response. When the penalty is zero, the penalized estimates are exactly the least-squares estimates, but as we increase the penalty, its impact on the overall fitting function becomes more pronounced. Because selecting a healthy value for the penalty can be difficult, it is often valuable to leverage cross-validation, fitting similar models across folds and keeping the penalty that generalizes best. Within scikit-learn's logistic regression, for example, the default penalty is L2 and we adjust the 'C' value (the inverse of regularization strength) to control this process.
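To make the penalty explicit, the objectives Ridge and Lasso minimize can be sketched in the same notation as the equation above, where λ is the penalty strength:

Ridge: minimize RSS + λ(β₁² + β₂² + … + βₚ²)
Lasso: minimize RSS + λ(|β₁| + |β₂| + … + |βₚ|)

With λ = 0 both reduce to ordinary least squares; as λ grows, the coefficients are pushed harder toward zero.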

Regularization can look and behave differently within different types of models. Previously we touched on how it plays out within most regression models, and in another blog post we'll touch on the process and importance of this ritual within outstandingly powerful models such as Neural Nets. To revisit the point on variance versus bias in yet another format: within Lasso and Ridge models, the lambda parameter ends up being the knob that needs to be tuned. As its value increases, it shrinks the coefficients and reduces variance, thereby increasing bias on a model that may be too well adjusted to the training data; this curtails overfitting and facilitates better generalization on new data. Overdoing it, however, can cause the model to lose significant context, re-introduce a hefty amount of bias, and therefore underfit.
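One rough way to watch this tradeoff in action is a validation curve over alpha, again reusing the synthetic X and y from the earlier examples (an assumption for illustration):

```python
# Sketch of the bias/variance tradeoff as alpha (lambda) grows.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

alphas = np.logspace(-3, 3, 20)
train_scores, test_scores = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=alphas, cv=5
)

for alpha, tr, te in zip(alphas, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    # Very small alphas fit the training folds best (high variance);
    # very large alphas hurt both scores (high bias, underfitting).
    print(f"alpha={alpha:8.3f}  train R^2={tr:.3f}  test R^2={te:.3f}")
```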

As you may have gathered so far, when it comes to understanding the relationship between variance and bias in regularization, the general rule regardless of the model is that high variance and low bias usually lead to a model that is far too closely aligned with the data points in the training data, whereas low variance and high bias usually mean that the model has not been able to learn much of anything from the training set.

Outside of regularization, other approaches one can take to minimize overfitting include reducing the number of features broadly, or more selectively removing the features that contribute little signal or influence to the model. Additionally, you could simply use a less powerful model, or limit the range of the dataset in a way that maintains its integrity but perhaps prevents overfitting. All of these steps are essentially manual versions of what the regularization techniques discussed here would do for you automatically; perhaps it's best to trust the machines.
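For instance, here is a hypothetical sketch of the 'manual' route, keeping only the top-scoring features instead of letting Lasso prune them, again reusing the synthetic X and y from above:

```python
# Manual feature reduction: keep only the k features that score highest
# against the target. The choice of k=10 is arbitrary, for illustration only.
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=10)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # same rows as X, but fewer columns
```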
