Ambiguities between the “Quality” and the “Accuracy” of your machine-learning model: what should be the priority?

Demystifying a few reasons behind the inaccurate predictions of highly accurate models

Amol Marathe
Data Science Insights and Predictions
3 min read · Sep 4, 2020


Many budding data scientists get trapped in the vicious circle of the “quality” and the “accuracy” of a machine-learning model. It is not uncommon for a model of poor quality to deliver illusory high accuracy (over 98%) when verified on the test data, yet produce dramatically inaccurate results when confronted with new data. In this article, we will uncover the intricacies of maintaining both the quality and the accuracy of a model by keeping it simple by design.

Firstly, a model is considered to have good quality when it fits the data points in the training set optimally rather than exactly. To put it another way, the regression function should generalize the data points instead of chasing each and every input. When a regression function tries to pass through the maximum number of data points, it inherently becomes a complex model and is therefore not preferred; such a model is termed an overfit model. Our intention is to find the simplest possible model, but not one so simple that it sits very close to the mean of the data; such an oversimplified model is termed an underfit model. The following graphs illustrate this very well.

[Figure: Regression function curves generalizing the input data points]
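To make this concrete, here is a minimal sketch of the same idea, assuming scikit-learn and a synthetic noisy sine wave as the data (both are my choices for illustration, not something from the article). Fitting polynomials of increasing degree shows how the training error keeps shrinking even as the model starts chasing the noise:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)

# Synthetic data: a sine wave plus noise stands in for a real-world signal
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)

# Degree 1 underfits (barely better than the mean), degree 4 generalizes,
# degree 15 passes near every point, noise included
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    mse = mean_squared_error(y, model.predict(X))
    print(f"degree={degree:2d}  training MSE={mse:.4f}")
```

The training error alone rewards the degree-15 model, which is exactly why it is a misleading measure of quality.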

Secondly, when we compare the accuracies of these models, the overfitted model shows illusory high accuracy (typically more than 90–95%); however, this holds only for the training data, or in some cases even for the test data if the sample is biased. One reason for this disparity is that the overfitted model memorizes the data points along with the noise in the data. When such a model is fed unseen data, its accuracy is nowhere close to what it boasted on the test data. For this reason, we need an optimized model that maintains a balance between complexity and accuracy. The following image highlights the major points that differentiate an underfit model from an overfit one.

[Figure: Comparison of an overfitted vs an underfitted model]
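The train/test gap described above is easy to reproduce. Below is a small sketch, again assuming scikit-learn and synthetic data (my assumptions, not the article's setup): an unbounded decision tree memorizes the training set almost perfectly, while a depth-limited tree gives up some training accuracy and holds up better on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic classification data (flip_y injects label noise)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

for name, depth in (("overfit (unbounded depth)", None),
                    ("regularized (max_depth=4)", 4)):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"{name}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```

The exact depth is problem-specific; the point is that the gap between the two scores, not the training score itself, is the symptom to watch.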

Finally, you must remember that model complexity and the variance of the sample data are intimately tied to each other. As the variety of the input data points and the sample size increase, you can build a more complex model without the risk of overfitting. In the real world, you should adopt a strategy of collecting more data instead of tweaking and tuning your model for the limited data available. In conclusion, in many business scenarios it is ‘OK’ to sacrifice some accuracy for the quality of the model, because ultimately the high-quality model wins by generalizing optimally to unseen data. A strong inclination to achieve high accuracy at any cost can put the business in jeopardy and yourself at risk of failure.
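One way to see this relationship, under the same assumed tooling (scikit-learn, synthetic data), is a learning curve: for a deliberately complex model, the gap between training and validation accuracy, i.e. the overfitting symptom, narrows as the sample size grows:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1,
                           random_state=0)

# A fairly complex model; watch the train/validation gap shrink
# as the training-set size increases
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=12, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.2f}  validation={va:.2f}")
```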

Amol Marathe
Author, Explorer and Researcher in Data Science