Overfitting: an essential problem in the development of AI systems
In previous articles on ensemble learning and cross-validation, I have repeatedly emphasized the importance of increasing model robustness and preventing overfitting in AI systems.
In this article, I’d like to briefly discuss what “overfitting” is. It is an essential and vexing problem in building machine-learning-based AI systems, and one you should always think carefully about how to prevent.
Overfitting is a common problem that occurs in both of the two main branches of machine learning, “supervised learning” and “unsupervised learning” (for example, when the data contains noise such as duplicated values). However, overfitting is easier to understand in the supervised case, so I will use supervised learning as the example.
In supervised learning, laws are derived from sample data given in advance, which is regarded as “information from the teacher.” Deriving these laws is generally called “model fitting,” and overfitting literally means fitting the model too much: it derives laws that are extremely faithful to the sample data, i.e., it creates a model that fits the sample data too well. Such a model can process the sample data perfectly, but it is not flexible enough to deal with the data that will be generated in real production, so the AI will not be accurate at all in the real business. To use the example of studying for an entrance exam: you can solve 100% of the problems in a reference book because you have memorized them by heart, but if the real university entrance exam asks a more in-depth question to test your comprehension, you will not be able to handle it at all. (This example is actually more than just illustrative; it is a useful perspective for thinking about strategies to prevent overfitting.)
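As a minimal sketch of this “memorizing the reference book” behavior (a toy illustration of my own in Python with scikit-learn, not tied to any particular project), the snippet below gives a model enough freedom to memorize a handful of noisy sample points: it scores almost perfectly on the sample data and far worse on data drawn from the same underlying law.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Tiny "sample data": 15 noisy observations of an underlying law (a sine wave).
X_train = np.sort(rng.uniform(0, 1, 15)).reshape(-1, 1)
y_train = np.sin(2 * np.pi * X_train).ravel() + rng.normal(0, 0.1, 15)

# "Production" data generated by the same underlying law.
X_prod = np.linspace(0, 1, 200).reshape(-1, 1)
y_prod = np.sin(2 * np.pi * X_prod).ravel()

# A degree-14 polynomial has enough freedom to memorize all 15 sample points.
memorizer = make_pipeline(PolynomialFeatures(degree=14), LinearRegression())
memorizer.fit(X_train, y_train)

print("R^2 on sample data:    ", memorizer.score(X_train, y_train))  # close to 1.0
print("R^2 on production data:", memorizer.score(X_prod, y_prod))    # far worse
```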
Overfitting is also a matter of how well the sample data covers the distribution of data that can occur in real production. If the sample data has a similar composition and distribution to the production data, overfitting will not occur. In reality, however, it is difficult to prepare sample data in such a perfect state; by its very nature, sample data almost always falls short of the amount that would be necessary and sufficient. Even if you borrow a pre-trained model or expand the sample data set with the increasingly popular data augmentation techniques as a countermeasure, the result will still be overfitting if that data deviates significantly from the distribution of the production data.
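One hedged way to sanity-check that coverage, assuming you can collect at least some production values for a feature, is simply to compare the two distributions before trusting the model. The sketch below uses a two-sample Kolmogorov–Smirnov test from SciPy; the data is synthetic and purely illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def coverage_check(sample_values, production_values, alpha=0.05):
    """Compare one feature's distribution in the sample data vs. production data.

    A very small p-value from the two-sample KS test is a warning sign that the
    sample data does not cover the production distribution well.
    """
    result = ks_2samp(sample_values, production_values)
    return {
        "ks_statistic": result.statistic,
        "p_value": result.pvalue,
        "looks_covered": result.pvalue >= alpha,
    }

# Illustrative only: the sample is narrower and shifted relative to production.
rng = np.random.RandomState(0)
sample = rng.normal(loc=0.0, scale=1.0, size=500)
production = rng.normal(loc=0.5, scale=1.5, size=500)
print(coverage_check(sample, production))
```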
Even if the sample data cannot be made perfect, it would seem to be enough for the model to fit both the sample data (i.e., low bias) and the actual production data (i.e., low variance). Sadly, there is a trade-off between the two, and whether you get both at once is largely a matter of luck. That is why overfitting can be an inevitable and deep-rooted problem.
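The textbook way to state this trade-off (a standard result, not anything specific to this article) is the bias–variance decomposition of the expected squared error of a trained model:

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making the model more complex pushes the bias term down but the variance term up; making it simpler does the reverse; and the noise term never goes away. That is the precise sense in which you cannot drive both errors to zero at once.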
In actual projects, the sample data is sometimes unavoidably very small, depending on the circumstances of the project and the approaches available, and there are also cases where the model cannot respond to changes in the external environment because factors outside the model were underestimated and never incorporated into it. For example, if you build a model to predict the price of a financial product, the amount of sample data is far smaller than the amount of image data available for object recognition. Moreover, disasters, political trends, and new financial regulations can have a big impact, and such events can happen even if they have never happened before. So inevitably, even if the back-testing looks perfect, the actual predictions do not work out the way they are supposed to.
Another interesting case is when the input data in the production environment changes to become higher quality: the model can start behaving as if it were overfit. A medical AI trained on low-resolution image data could start misdiagnosing people once the input becomes high-resolution as cameras evolve. That is overfitting, too, and in some cases it could lead to invasive tests and procedures that put lives at risk. It is a kind of blind spot that improving the quality of the input data can push a model into an overfit state.
Sadly, overfitting can also happen in real projects when the sample data set somehow includes data that will never exist in the production environment (i.e., mistakes). In other cases, the distribution of the production data may be so unusual (or messy) that it is hard to know how, and to what extent, to collect representative sample data. You may then reach for a complex model to cover such idiosyncrasies and fail to train it well. Either path leads to overfitting in basic machine learning. Conversely, aiming for too simple a model can cause underfitting, which is the opposite of overfitting.
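A hedged sketch of guarding against the “data that will never exist in production” case is to validate the sample data against known production constraints before training. The column names and valid ranges below are purely hypothetical placeholders, not from any real project.

```python
import pandas as pd

# Hypothetical constraints that production data is known to satisfy.
# The column names and ranges are placeholders, not from any real project.
VALID_RANGES = {
    "age": (0, 120),            # ages outside this range never occur in production
    "order_amount": (0, None),  # negative order amounts are impossible in production
}

def drop_impossible_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove sample rows containing values that cannot appear in production."""
    keep = pd.Series(True, index=df.index)
    for column, (low, high) in VALID_RANGES.items():
        if low is not None:
            keep &= df[column] >= low
        if high is not None:
            keep &= df[column] <= high
    return df[keep]

# Illustrative only: the second row has an impossible negative amount and is dropped.
sample = pd.DataFrame({"age": [34, 51, 29], "order_amount": [120.0, -5.0, 80.0]})
print(drop_impossible_rows(sample))
```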
Therefore, when applying supervised learning, it is important to have a solid understanding of the actual production data that will occur. The degree of freedom and complexity of the model, as well as the training results, should then be checked through cross-validation, using test data prepared separately from the sample data. If necessary, change how the data is represented, for example by collapsing or discretizing it. In this way you can moderate the complexity of the model and prevent overfitting as much as possible.
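As a minimal sketch of that workflow, assuming scikit-learn (the choice of a polynomial-plus-Ridge pipeline as the “complexity” knob is mine, not part of the article), you hold out test data, cross-validate on the rest to choose the complexity, and only then look at the held-out score:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 200)

# Test data kept separate from the sample data used for fitting and cross-validation.
X_sample, X_test, y_sample, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ("poly", PolynomialFeatures()),
    ("ridge", Ridge()),
])

# Search over model complexity (polynomial degree) and regularization strength,
# scoring each candidate by cross-validation rather than by its fit on the sample data.
search = GridSearchCV(
    pipeline,
    param_grid={"poly__degree": [1, 3, 5, 9, 15], "ridge__alpha": [0.001, 0.1, 1.0]},
    cv=5,
)
search.fit(X_sample, y_sample)

print("chosen complexity:", search.best_params_)
print("held-out test R^2:", search.score(X_test, y_test))
```

If you also need the discretization mentioned above, scikit-learn’s KBinsDiscretizer can slot into the same pipeline and be tuned in the same way.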
(By the way, in training with deep learning, when you can secure large amounts of sample data, model complexity does not always imply a risk of overfitting. Even in cases like those described above, a larger model may turn out to be more robust and to generalize better, depending on the application. That, too, is the depth of the deep learning world of nonlinear, parameter-rich learning.)