Explaining what learned models predict: In which cases can we trust machine learning models and when is caution required?

Berk Sudan
Jul 22, 2021 · 7 min read


Introduction

Machine learning models have become increasingly popular in many fields, and in today’s world many decisions are made, directly or indirectly, based on their results. However, certain precautions need to be taken before fully relying on machine learning models. Trusting a machine learning model can, in general, be interpreted as building a robust model that gives largely accurate results and generalizes well to unseen data.

Classification models learn from past experience, so the robustness of a model primarily depends on the given training set. In most scenarios, feature engineering and preprocessing can deal with problems in the data. Some of the most common of these problems are Insufficient Data, Class Imbalance, the Missing Value Problem, and the Curse of Dimensionality. Moreover, even if the input data is well prepared, issues such as Overfitting and Misleading Performance Metrics may arise during the learning phase. This essay discusses these problems, when they occur, and possible solutions.

Possible Problems in Data

Insufficient Data

In certain cases, the data fed into a statistical learning model can be unrepresentative of the entire data (i.e., the full population) due to data insufficiency. Since the main goal of training is to classify unseen data using parameters (or values) learned from the available data, insufficiency can lead to high systematic errors in predictions. To put it another way, even though the model has learned the training set successfully, it can still perform poorly due to the discrepancy between the seen and unseen datasets*. As Niyogi et al. [1] argue, any probabilistic learning machine needs enough examples in the training set; otherwise, the generalization error increases regardless of how well the target function is chosen. Artificial data creation methods (usually based on over-sampling) can combat this problem [1, 2].

*Note that splitting the labeled data into training and test sets incorrectly can also lead to the same problem.
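As a rough illustration of the artificial-data (virtual-example) idea, the sketch below over-samples a small numeric dataset by adding Gaussian jitter to existing rows. The function name, the noise scale, and the number of copies are my own illustrative choices, not taken from [1] or [2]:

```python
import numpy as np

def augment_with_jitter(X, y, n_copies=3, noise_scale=0.05, seed=0):
    """Create virtual examples by adding small Gaussian noise to existing rows.

    X: (n_samples, n_features) numeric feature matrix
    y: (n_samples,) labels, copied unchanged for each virtual example
    """
    rng = np.random.default_rng(seed)
    std = X.std(axis=0, keepdims=True)  # per-feature scale of the added noise
    X_aug, y_aug = [X], [y]
    for _ in range(n_copies):
        X_aug.append(X + rng.normal(0.0, noise_scale * std, size=X.shape))
        y_aug.append(y)
    return np.vstack(X_aug), np.concatenate(y_aug)
```

Such label-preserving perturbations only make sense when small changes to the features should not flip the class; for more structured data, techniques along the lines of the geometric SMOTE algorithm studied in [2] are usually a better fit.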

Class Imbalance

Imbalanced (or unbalanced) datasets are very common in real-life problems and should be addressed carefully. The issue occurs when some classes of the target variable predominantly outnumber the others, which results in the model learning the majority classes far better than the minority classes. Batista et al. [3] clearly state that imbalanced datasets cause several difficulties in learning and, if not addressed, can make the evaluation misleading, e.g., the accuracy of the model appears high. According to Mountassir et al. [4], there are two main techniques to tackle this problem: the first modifies the classifier (e.g., weighting the classes inversely proportional to their frequencies [5]), while the second modifies the data using under-sampling or over-sampling. Both approaches are sketched below.
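A minimal sketch of the two approaches on a synthetic dataset, assuming scikit-learn and the separate imbalanced-learn package are available; the toy dataset and the choice of logistic regression are only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler  # from the imbalanced-learn package

# A deliberately imbalanced toy dataset (roughly 90% / 10%).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# (1) Classifier-side: weight classes inversely proportional to their frequencies.
weighted_clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# (2) Data-side: over-sample the minority class, then fit an unweighted model.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
resampled_clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```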

Missing Value Problem

Most statistical learning algorithms are incapable of dealing with the missing value (or missing data) problem, although some novel tree-based algorithms (e.g., LightGBM or XGBoost) can handle it. Furthermore, because unseen data from the same data-generating process are likely to contain missing values as well, the new data must be preprocessed before prediction. Mitigating missingness in data is not an easy task. MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random) are defined as the main types of missingness, and each type requires a different solution [6]. For instance, when the data are missing completely at random, the instances containing missing values can simply be deleted [7, 8]. In other cases, various imputation techniques based on statistical and machine learning methods can be applied [9].
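A minimal sketch of both options with pandas and scikit-learn; the tiny DataFrame and the median strategy are just illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50_000, 62_000, np.nan, 58_000]})

# Listwise deletion: reasonable mainly when values are missing completely at random (MCAR).
df_dropped = df.dropna()

# Imputation: fill each missing entry with a per-column statistic (here, the median).
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```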

Curse of Dimensionality

The curse of dimensionality is another well-known problem that occurs when the dataset contains too many dimensions (or columns). As the number of dimensions increases, the space containing the data becomes sparser and the data points drift further apart. Machine learning algorithms thereby tend to miss patterns in the data, since learning depends on the data points and their interactions. Jianqing Fan and Yingying Fan [10] explain that noise features that do not reduce the classification error are the underlying cause of the difficulties of high dimensionality. They furthermore state that the misclassification rate increases when only a subset of the dimensions is responsible for the variation in the data. Similarly, according to Pedro Domingos [11], high dimensionality makes generalization exponentially harder for machine learning models. Applying dimensionality reduction methods (e.g., principal component analysis) or dropping the least important features is highly recommended.
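A short sketch of PCA-based reduction with scikit-learn; the digits dataset and the 95% variance threshold are illustrative choices, not requirements:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)    # 64-dimensional pixel features

# Keep just enough principal components to explain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # far fewer than 64 columns remain
```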

Possible Problems in Learning

Overfitting

In the learning phase, a labeled (and prepared) dataset is used to construct a hypothesis function. The essential goal, however, is to predict new, unlabeled data. If the hypothesis function becomes too specific and memorizes almost all patterns in the training set, including random noise, the model can perform poorly on future data. Moreover, the evaluation results can be misleading: high accuracy values on the training data may be mistaken for evidence of a robust machine learning model.

Overfitting often occurs in neural networks and high-order polynomial models. As an illustration, Fig. 1 demonstrates the concept with polynomial interpolation [12]. Srivastava et al. [13] offer a solution called dropout to avoid overfitting in neural networks. Tom Dietterich [14] suggests combating the general overfitting problem with techniques such as adding penalty terms to the objective function (regularization), minimum description length, and cross-validation.

Fig. 1. Beyond order 16, the fitted polynomial becomes a high-variance model and overfits [12].
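For neural networks, dropout is typically added as a layer between dense layers. Below is a minimal Keras sketch, assuming a binary classification problem with 20 input features; the layer sizes and the dropout rate of 0.5 are illustrative choices, not values prescribed by [13]:

```python
import tensorflow as tf

# A small fully connected network with dropout between the hidden layers.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # randomly zero half of the activations during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Keras disables dropout automatically at inference time, so no extra handling is needed when making predictions.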

Misleading Performance Metrics

After training, it is necessary to check performance (or evaluation) metrics to see whether the selected objective function is a good fit. Choosing the wrong metrics can misguide the researcher, so appropriate performance criteria should be chosen carefully. There are numerous performance metrics commonly used in classification, including Accuracy, Precision, Recall (or Sensitivity), F-measure, and Area under the ROC Curve. The selection of such metrics often depends on the project’s problem definition, i.e., what the researcher considers more important. For instance, if the aim is to avoid false negatives as much as possible, recall can be a good choice. Note that using a single metric might not be sufficient: Seliya et al. [15] recommend using multiple performance metrics that are not highly correlated, so that each metric represents a different aspect of classifier performance.
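A small sketch of how several of these metrics can be computed side by side with scikit-learn; the labels and scores below are made-up toy values:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Toy ground truth, hard predictions, and predicted probabilities for the positive class.
y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))
```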

Conclusion

Predictions of machine learning models directly affect critical decisions in the real world. However, building a reliable machine learning model is a challenging task, and researchers need to scrutinize their models before trusting their predictions. Various problems can arise from the characteristics of the data or from the techniques used in the learning process. This essay has discussed trusting a machine learning model in a technical context and has explained in which cases the researcher should be careful. If one pays attention to the points discussed here and mitigates the problems mentioned, machine learning models can be trusted for generalization and accuracy.

References

[1] P. Niyogi, F. Girosi, and T. Poggio, “Incorporating prior information in machine learning by creating virtual examples,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2196–2209, 1998.

[2] M. Lechleitner, “Small data oversampling: improving small data prediction accuracy using the geometric smote algorithm,” Ph.D. dissertation, NOVA Information Management School (NIMS), 2020.

[3] G. E. Batista, R. C. Prati, and M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data,” ACM SIGKDD explorations newsletter, vol. 6, no. 1, pp. 20–29, 2004.

[4] A. Mountassir, H. Benbrahim, and I. Berrada, “An empirical study to address the problem of unbalanced data sets in sentiment classification,” in 2012 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, 2012, pp. 3298–3303.

[5] J. Brank, M. Grobelnik, N. Milic-Frayling, and D. Mladenic, “Training text classifiers with SVM on very few positive examples,” Microsoft Corp., Tech. Rep. MSR-TR-2003-34, 2003.

[6] R. J. Little and D. B. Rubin, Statistical analysis with missing data. John Wiley & Sons, 2019, vol. 793, pp. 11–12.

[7] M. Nakai and W. Ke, “Review of the methods for handling missing data in longitudinal data analysis,” International Journal of Mathematical Analysis, vol. 5, no. 1, pp. 1–13, 2011.

[8] M. Soley-Bori, “Dealing with missing data: Key assumptions and methods for applied analysis,” Boston University, vol. 23, 2013.

[9] K. Lakshminarayan, S. A. Harp, R. P. Goldman, T. Samad et al., “Imputation of missing data using machine learning techniques.” in KDD, 1996, pp. 140–145.

[10] J. Fan and Y. Fan, “High dimensional classification using features annealed independence rules,” Annals of statistics, vol. 36, no. 6, p. 2605, 2008.

[11] P. Domingos, “A few useful things to know about machine learning,” Communications of the ACM, vol. 55, no. 10, pp. 78–87, 2012.

[12] S. Lawrence, C. L. Giles, and A. C. Tsoi, “Lessons in neural network training: Overfitting may be harder than expected,” in AAAI/IAAI. Citeseer, 1997, pp. 540–545.

[13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.

[14] T. Dietterich, “Overfitting and undercomputing in machine learning,” ACM computing surveys (CSUR), vol. 27, no. 3, pp. 326–327, 1995.

[15] N. Seliya, T. M. Khoshgoftaar, and J. Van Hulse, “A study on the relationships of classifier performance metrics,” in 2009 21st IEEE international conference on tools with artificial intelligence. IEEE, 2009, pp. 59–66.
