Detecting and Resolving Overfitting

Seth Larweh Kodjiku
Published in unpack
3 min read · Oct 26, 2020

Overfitting is a common term in machine learning, statistics and neural networks. It is one of the most important concepts in data science, and one we encounter after a model has been trained. Overfitting is a modelling error that occurs when the model learns the details, patterns and noise of a dataset so closely that it negatively impacts the model's performance on new data.

This means that the noise and random fluctuations in the training set are picked up and learned as if they were genuine concepts. The problem is that these learned concepts do not apply to other datasets, so the model fails to generalize its predictive power to new data.

To put it another way, an overfitting model will frequently show extremely high accuracy on the training dataset yet low accuracy on new data run through the model later on. That is the meaning of overfitting in a nutshell, but it is worth going over the idea in more detail: how overfitting happens and how it can be avoided.
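The train-versus-test accuracy gap described above is easy to reproduce. The sketch below (assuming scikit-learn is available; the synthetic dataset and the decision tree are illustrative choices, not from the article) trains an unconstrained tree on noisy data and compares its accuracy on the training set against a held-out set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic dataset: only 5 of 20 features are informative,
# and 20% of the labels are randomly flipped.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# An unconstrained tree memorizes the training set, noise included.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # perfect fit
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower
```

The tree scores perfectly on the data it memorized, but the flipped labels it learned as "patterns" drag down its accuracy on the held-out half.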

Fitting and Underfitting

Before we dig deeper into overfitting, it is useful to look at the idea of underfitting, and of "fit" in general. When we train a model, we are trying to build a system that can predict the class, or nature, of items in a dataset based on the features that describe those items. A model should be able to explain a pattern within a dataset and predict the classes of future data points based on this pattern. The better the model explains the relationship between the features of the training set and the target, the more "fit" our model is.

A model that poorly explains the relationship between the features of the training data, and therefore fails to accurately classify future examples, is underfitting the training data. If you were to plot the predictions of an underfitting model against the actual relationship between the features and labels, the predictions would be far off the mark. On a graph with the labelled data points of a training set, a severely underfitting model would drastically miss most of the data points. A better-fitting model would carve a path through the middle of the data points, with individual points only slightly off the predicted values.

Underfitting often happens when there is not enough data to create an accurate model, or when attempting to fit a linear model to non-linear data. More training data or more features will often help reduce underfitting.

Controlling Overfitting

There are a few different ways to control overfitting. One technique for reducing it is to use a resampling strategy, which works by estimating the accuracy of the model. You can also use a validation dataset in addition to the test set, and plot the training accuracy against the validation set rather than the test dataset; this keeps your test data unseen. A popular resampling strategy is K-fold cross-validation. This procedure splits your data into subsets that the model is trained on, and then the model's performance on the held-out subsets is examined to estimate how the model will perform on outside data.
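K-fold cross-validation is a one-liner in scikit-learn. A minimal sketch (the iris dataset and logistic regression are illustrative choices, not part of the article):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# With cv=5, the model is trained on 4 folds and scored on the
# held-out 5th, rotating through all 5 folds.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())
```

The spread of the fold scores is as informative as the mean: a large gap between folds suggests the model's performance depends heavily on which data it happened to see.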

Cross-validation is one of the best ways to estimate a model's accuracy on unseen data, and when combined with a validation dataset, overfitting can usually be kept to a minimum.
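The validation-set workflow mentioned above can be sketched as a three-way split: tune on the validation set, and touch the test set only once at the end. (Again a scikit-learn sketch with illustrative choices; the 60/20/20 split and the tree-depth search are assumptions for the example.)

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

# Pick the tree depth that does best on the validation set...
best = max(range(1, 6),
           key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
                         .fit(X_train, y_train).score(X_val, y_val))

# ...and report the test score only for that final choice, so the
# test set never influences any modelling decision.
final = DecisionTreeClassifier(max_depth=best, random_state=0).fit(X_train, y_train)
print("chosen depth: ", best)
print("test accuracy:", final.score(X_test, y_test))
```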
