Overfitting is a problem that comes up when a model is used to make predictions. I will first describe model fit, explain how fit relates to model prediction, then give an example and finish with a few attributes of the overfitting problem.

Let’s say we have a dataset on which we generate a model estimating one of the variables with the others. Then we have a model created on our “generation dataset” which gives expected values of our “estimating variable.” The more our model accurately estimates the estimating variable (or stated differently, gives accurate expected values) the better the model fits the generation dataset. The model could be created a billion different ways using techniques from statistics, machine learning or theoretical knowledge of the phenomena captured in the variables. Some methods may give us better fitting models on our generation dataset than others. Machine learning techniques will give models that better fit the generation dataset than models derived from theory, but as we will see below machine learning techniques will also leave us with more overfitting than theory.

Regardless of “how” we created the model, we probably did so to predict the estimating variable given different data or combinations of variable values we might see in the future. In other words, we want to use our model to predict the estimating variable outside the generation data. We will find the model valuable if it can accurately predict the estimating variable in new situations because it gives us more knowledge on what we should expect. This is fundamentally how all models make their living 😊

For example, a model generated on data related to last month’s weather which predicts the temperature at 5pm is helpful when we use it to get an informative expectation of tomorrow’s or the following day’s 5pm temp. But because the situations in which we want to predict and those on which our generation data are based are different, we will find our model performs differently between the two. In particular, the model will be much better at estimating accurate values of the estimating variable (or we would say, has better fit) on the generation dataset than on any other outside dataset. The degree to which the model is better on the generation dataset than on any outside dataset can vary and when the model is much better on the generation dataset we say we have overfit the model or have an overfitting problem.

We care about the overfitting problem because having accurate predictions on outside data is valuable and overfitting means your predictions could not be all that accurate. We may try to increase the fit of the model on estimating in our generation data but doing this will only create a poorer fit on outside data (unless the outside data is perfectly captured in the generation data). We will always have less predictive power and worse fit on new data than we have on the generation dataset. While this is a problem itself, we also do not know how much less predictive power (or how much worse the fit) in new data. If we don’t know how much the outside data will differ from the generation data then we cannot tell how much we overfit our model.

Originally published at ablifeing.blogspot.com.

Interests are statistics, data, the organization of work, evolution and using science practically.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store