An abstruse problem with an easy solution

Overfitting And How to Tackle It

Understanding overfitting in the simplest way, and its solution: regularization

Siddharth Varangaonkar
8 min read · May 16, 2019
source:euclidean.com

Oh! How difficult it would be to fit into that tight shirt! We have to find a way to tackle it, right? Yes, we do, and we have to build our machine learning model so that it lands in the Goldilocks zone.

So let's understand it more through real-world problems…

Suppose you worked hard studying your syllabus. You memorized every single line of every textbook, so you knew everything in the syllabus and scored 99% in your final exam. But when you got a job at a new company, you couldn't cope with real-world problems and ended up delivering only average output as an employee. This real-life example sums up overfitting: memorizing like a bookworm instead of generalizing to real-world problems.

Overfitting is one of the trickiest and most easily overlooked problems in machine learning. It teaches us how important generalizing a problem is, in the real world and in development alike. Overfitting happens in the real world all the time; just switch on the television and you may find it right there.

source:devrant.com

"With great precision comes great overfitting!", as Uncle Ben said…

Hence, rather than chasing the best precision, we should focus more on generalizing the problem.

Let's understand it a bit more in machine learning terms, so we can reach the 'just right' level and get the best accuracy on unseen data too.

towardsdatascience.com

Generalization refers to how well the concepts learned by a machine learning model apply to specific examples the model did not see while it was learning.

The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen.

Assume we have trained a model on data from 10,000 students to predict their future scores from their past scores. After training the model, we test it and it gives us 99% accuracy… wow, that's awesome! The trained model is visualized in image[1], which shows a classification model separating the red and blue points of our dataset. The green line is the overfitted model, which does not generalize, while the black line fits the data well and gives the best predictions.

image[1] source:towardsdatascience.com

When we try to predict on unseen labels or new data with the same trained model, we get only 50% accuracy. That's where the tricky problem of overfitting arises.

Now a big question arises: why does this happen? Let's find out…

To understand the reason more clearly, let's briefly go over two basic terms: bias and variance.

Bias tells us how close our model's average prediction is to the training data. Quite simply, think of bias as being biased towards some people or some opinion: if we are highly biased, we are going to make wrong assumptions. A model with high bias pays very little attention to the training data and oversimplifies, which leads to underfitting; it has no flexibility to learn. High bias mostly comes from having too few features to explain the data.

Variance is the variability of the model's predictions for a given data point; it tells us how spread out our predictions are. A high-variance model focuses more on the spread of the data than on generalizing from the given features: it follows the noise in the data too closely rather than the real signal. So we can say that high variance leads to overfitting. High variance mostly comes from having too many features, or too complex a model, for the amount of data.

source:EDS

So if you understand bias and variance, you know that high variance and low bias lead to overfitting.
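To make the underfitting/overfitting trade-off concrete, here is a minimal sketch (not from the original article) that fits polynomials of different degrees to noisy data with NumPy; the sine signal, the noise level and the chosen degrees are all illustrative assumptions.

```python
# A minimal sketch of high bias vs. high variance using polynomial fits.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy training data

x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)  # the true signal we hope to recover

for degree in (1, 3, 15):
    coeffs = np.polyfit(x, y, degree)                         # fit a polynomial of this degree
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)     # error on seen data
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)  # error on unseen data
    print(f"degree {degree:2d}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")

# Degree 1 underfits (high bias), degree 15 overfits (high variance),
# and degree 3 lands in the Goldilocks zone.
```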

To detect Overfitting :

  1. Split your initial dataset into training and test subsets, i.e. divide the dataset in a particular ratio for validation and accuracy measurement.
  2. Use the training data to train and tune your models. Don't touch the test data until the end.
  3. If our model does much better on the training set (say 99% accuracy) than on the test set (say 50% accuracy), then we're likely overfitting. Comparing the test set and training set performance is what lets you detect overfitting (see the sketch below).
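As a rough illustration of these three steps, here is a minimal sketch assuming scikit-learn is available; the synthetic dataset and the unconstrained decision tree are my own choices for demonstration, not part of the article.

```python
# Detecting overfitting by comparing training accuracy against test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 1. Split into training and test subsets (80/20 here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2. Train only on the training data; the test set stays untouched until the end.
model = DecisionTreeClassifier(max_depth=None, random_state=42)  # an unconstrained tree tends to overfit
model.fit(X_train, y_train)

# 3. A large gap between training and test accuracy signals overfitting.
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy: ", model.score(X_test, y_test))
```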

To learn more about the problem of overfitting in terms of the cost function, refer to "The Problem of Overfitting" by Andrew Ng. Now that we know this troublesome problem, let's solve it by putting our hand into the snake pit and eliminating it.

Regularization: the fix for overfitting

Let's balance our seesaw and make our model stronger to get the most accurate results.

We need to choose the right model, somewhere between too simple and too complex. Regularization helps us choose the preferred model complexity, so that the model is better at predicting. Regularization is nothing but adding a penalty term to the cost function and controlling the model complexity through that penalty term. It can be used with many machine learning algorithms. That's how regularization helps us.

So let's understand this in an easy way.

For a machine learning algorithm we have a cost function, gradient descent, features, datasets and more. We just need to remould the cost function of our machine learning model. The cost function is how you measure your model's error: it compares your predicted values with the original values and expresses how far off they are. It is also known as the loss function.

Let our cost function be C(X, Y), where X is our design matrix (the design matrix is the basic data object on which machine learning algorithms operate) and Y is our target vector, which stores the actual results used to train the model.

Cost Function = C(X, Y) … (i)
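As a concrete (and assumed) example of such a cost function, here is mean squared error written out in NumPy; the tiny design matrix, target vector and weights below are made up purely for illustration.

```python
# A minimal sketch of a cost function C(X, Y): mean squared error between
# the predictions of a linear model on the design matrix X and the target vector Y.
import numpy as np

def cost(X, Y, w):
    predictions = X @ w                      # predicted values from a linear model
    return np.mean((predictions - Y) ** 2)   # average squared error: how wrong we are

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # design matrix (bias column + one feature)
Y = np.array([5.0, 7.0, 9.0])                        # target vector (actual results)

print(cost(X, Y, np.array([1.0, 2.0])))  # 0.0: these weights fit the data perfectly
print(cost(X, Y, np.array([0.0, 1.0])))  # much larger: a worse fit means a higher cost
```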

Now we train our model using this cost function and gradient descent. We get results that stick too closely to the training data and are inaccurate on the test set: it's overfitting. This is where the regularization technique comes in handy to generalize our model.

What we will do is penalize our loss function by adding a multiple of the summation of the weights. Weights are the values assigned to the features by training the model; they answer the question of how much a given feature affects the model. For example, if you have trained a model for face recognition, you will have features such as complexion, width, height, position of the eyes, jaw size, and so on. After training, every feature is assigned a weight. If width and height matter the most, those features will carry the largest weights. That's what a weight is.

Now let's get back to Regularization…

So, to penalize our cost function, we add a multiple of the summation of all the weights. Adding this penalty to the cost function reduces the degree of our fitting polynomial. In simpler terms, it keeps the important features and pushes the unnecessary ones towards zero, helping us overcome overfitting.

So now let's bring in that multiple, which we call the regularization parameter, λ.

Model = ∑ C(actual, predicted(Model))

This is our model so far. After adding the penalty term, it becomes:

Model = ∑ C(actual, predicted(Model)) + λ ∑ W(Model)
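Here is a minimal sketch of that regularized cost in NumPy. The article writes the penalty as λ∑W; below I use the common concrete choice of summing the squared weights (an L2 penalty), and the data and numbers are illustrative assumptions.

```python
# Regularized cost: the data-fit term plus lambda times the sum of squared weights.
import numpy as np

def regularized_cost(X, Y, w, lam):
    data_fit = np.mean((X @ w - Y) ** 2)   # the original cost C(X, Y): how well we fit the data
    penalty = lam * np.sum(w ** 2)         # lambda * sum of (squared) weights: discourages large weights
    return data_fit + penalty

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
Y = np.array([5.0, 7.0, 9.0])
w = np.array([1.0, 2.0])

print(regularized_cost(X, Y, w, lam=0.0))   # 0.0: no penalty, pure data fit
print(regularized_cost(X, Y, w, lam=0.1))   # 0.5: the same fit now pays a price for its weights
```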

So let's understand in a bit more detail how λ helps us and what the methodology of regularization looks like:

  1. The data we fit is described by a polynomial equation. The polynomial on the right in image[1] is the overfitted, higher-degree polynomial, so we will try to reduce its degree, or bring it to an optimal degree, with the help of λ.
image[1] source: Andrew Ng, Coursera

2. Now we have to select the value of λ for our model. One approach is to randomly subsample your data a number of times and look at the variation in your estimate, then repeat the process with a slightly larger value of λ and see how it affects the variability of your estimate. Keep in mind that whatever value of λ turns out to be appropriate for your subsampled data, you can likely use a smaller value to achieve comparable regularization on the full dataset.

3. Multiplying λ with the summation of the weights pushes the weights towards zero. As we know, gradient descent tries to minimize the cost function as much as possible, so when λ makes the weight term large, minimizing the cost drives the weights of unnecessary features towards zero (see the sketch after this list). If you know how the cost function and gradient descent work, you will understand this.
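To see λ shrinking weights in practice, here is a minimal sketch using scikit-learn's Ridge (where λ is called alpha), together with cross-validation as one way to compare λ values; the regression dataset and the specific λ values are assumptions for illustration, not the article's own procedure.

```python
# How increasing lambda shrinks the learned weights towards zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

for lam in (0.1, 10.0, 1000.0):
    model = Ridge(alpha=lam).fit(X, y)                              # alpha plays the role of lambda
    score = cross_val_score(Ridge(alpha=lam), X, y, cv=5).mean()    # one way to compare lambda values
    print(f"lambda={lam:>7}: mean |weight|={np.mean(np.abs(model.coef_)):.2f}, CV score={score:.3f}")

# Larger lambda pushes the weights closer to zero; cross-validation (or the
# subsampling idea above) helps pick a lambda that generalizes instead of overfitting.
```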

This is all you need to fix overfitting with the help of regularization, explained in the simplest way. It is a useful technique that can help improve the accuracy of our model. There are some other methods you can use against overfitting too:

  1. Cross-validation
  2. Pruning — Mostly used in decision trees
  3. Early stopping for Deep Neural Networks

So getting into the Goldilocks zone was quite easy, right?

This article was meant to give a general, basic understanding of overfitting, its effects and how to fix it using regularization, explained with a supervised learning example in the simplest way. The concept of regularization goes much deeper than this, but just to remove the problem you can refer to this article. To get into the deeper concepts of regularization, you can refer to L1 and L2 regularization.
