Avoiding Machine Learning Pitfalls: From a practitioner’s perspective — Part 4

Abinaya Mahendiran
Published in WiCDS · Aug 9, 2022
Image Credit: user3000877 (https://stats.stackexchange.com/users/221237/user3000877), https://i.stack.imgur.com/QtygM.png, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=113194144

This blog focuses on how to compare the models you have built fairly. As part of any ML project, a data scientist starts out with a baseline model and conducts a variety of experiments with different models that may suit the problem. Once the experiments are complete, they need to compare all the models fairly, i.e., in the same context, and pick the best model for deployment.

Stage 4: How to compare models fairly
To compare models effectively, use proper statistical tests. Fair model comparisons also help other data scientists reproduce your results. It is easy to compare models unfairly and report wrong or useless metrics, so always make sure the comparison is carried out in the same context.

i. Don’t assume a bigger number means a better model — In applied research, you may end up fine-tuning a pretrained model or building a model from scratch, depending on the use case. In either case, when experimenting with multiple models, always evaluate them under the same settings: use the same datasets for all experiments, take the train and test splits from the same partition of the data, and perform equivalent hyperparameter optimization for each algorithm. When in doubt about the reported results, rerun the experiments, evaluate them multiple times, and then use the relevant statistical tests to assess whether the differences in performance are significant. A minimal sketch of such a like-for-like setup follows.
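The snippet below is only an illustration of evaluating two candidate models under identical conditions: the same data, the same fixed cross-validation splits, and the same metric. The synthetic dataset and the two model choices are placeholders, not a recommendation.

```python
# Sketch: score two models on exactly the same cross-validation splits.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)

# Fixing the splitter (and its seed) guarantees both models see exactly
# the same train/test partitions in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")
scores_b = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv, scoring="accuracy")

print(f"Model A: {scores_a.mean():.3f} +/- {scores_a.std():.3f}")
print(f"Model B: {scores_b.mean():.3f} +/- {scores_b.std():.3f}")
# Don't stop at the means: feed these per-fold scores into a statistical
# test (see point ii below) before declaring a winner.
```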

ii. Use statistical tests when comparing models — To make sure that the results you report are trustworthy, use statistical tests. There are two broad types of test available (both are sketched after this list):
— Tests that compare individual instances of models, such as McNemar’s test. McNemar’s test compares the output labels of two model instances on the same test set and is commonly used to compare two classifiers.
— Tests that compare the score distributions of two different models. If the models are evaluated with cross-validation or repeated resampling, use Student’s t-test if the distributions are normal (which is rarely the case in real-world scenarios) or the Mann–Whitney U test (which does not assume the distributions are normal).
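As a minimal sketch of both kinds of test, the snippet below uses the statsmodels implementation of McNemar’s test and SciPy’s Mann–Whitney U test. The prediction arrays and per-fold scores are made-up values standing in for real experiment outputs.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.contingency_tables import mcnemar

# --- McNemar's test: compare two fitted classifiers on one test set ---
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
pred_a = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])  # model A's predictions
pred_b = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])  # model B's predictions

a_correct = pred_a == y_true
b_correct = pred_b == y_true
# 2x2 contingency table of agreement/disagreement between the two models.
table = [
    [np.sum(a_correct & b_correct), np.sum(a_correct & ~b_correct)],
    [np.sum(~a_correct & b_correct), np.sum(~a_correct & ~b_correct)],
]
print("McNemar p-value:", mcnemar(table, exact=True).pvalue)

# --- Mann-Whitney U test: compare score distributions from repeated runs ---
scores_a = [0.81, 0.83, 0.80, 0.84, 0.82]  # e.g. per-fold accuracies of model A
scores_b = [0.78, 0.79, 0.80, 0.77, 0.81]  # per-fold accuracies of model B
stat, p = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
print("Mann-Whitney U p-value:", p)
```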

iii. Do correct for multiple comparisons — Comparing more than two models on the same test set can produce overly optimistic conclusions. Pairwise tests are run at a chosen confidence level (e.g., 95%) to establish the significance of differences, so there is always a small probability that a test discovers a significant difference where none exists. If multiple pairwise tests are conducted at a 95% confidence level, the chance that at least one of them reports a spurious difference grows with the number of tests. This is referred to as the multiplicity effect and is an example of data dredging or p-hacking. To avoid it, apply a Bonferroni correction, which uses a lower significance threshold based on the number of tests being carried out, as sketched below.
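A minimal sketch of a Bonferroni correction, assuming you already have the p-values from a set of pairwise tests; the p-values below are purely illustrative.

```python
from statsmodels.stats.multitest import multipletests

# p-values from, say, three pairwise model comparisons at alpha = 0.05
p_values = [0.04, 0.010, 0.30]

reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print("Corrected p-values:", p_corrected)          # each raw p-value scaled by the number of tests
print("Significant after correction:", reject)

# Equivalently, compare each raw p-value against alpha / number_of_tests:
alpha_per_test = 0.05 / len(p_values)  # ~0.0167 instead of 0.05
```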

iv. Do not always believe results from community benchmarks — From a research perspective, repeatedly developing and evaluating models against the same benchmark dataset leads to over-optimistic estimates of their performance. Earlier models built on the dataset may have used an unrestricted test set, and there is no guarantee that it was not used in the training process (developing to the test set). Also, if the same test set is used to evaluate many models, there is a chance that they are all simply over-fitting the test data and will not generalize any better than the other models. Read benchmark results cautiously!

v. Do consider combinations of models — If comparing multiple models yields a single clear winner for the problem at hand, go ahead and use it. But in real-world scenarios, an ensemble of multiple models often achieves better results than any single model. Ensemble learning brings out the best of all the models by compensating for one model’s weaknesses with another model’s strengths. You can either create an ensemble of the same type of base model, using techniques like bagging and boosting for tree-based models, or create an ensemble of different base models, such as a tree-based model combined with a neural network, using stacking (stacked generalization), as sketched below. Remember, all of these choices should be driven by the target KPI or metric.
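As an illustrative sketch (not a prescribed pipeline), the snippet below stacks a tree-based model and a neural network with scikit-learn’s StackingClassifier on a synthetic dataset; the base estimators and meta-learner are arbitrary choices for demonstration.

```python
# Sketch: stacked generalization combining a random forest and an MLP.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("mlp", make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))),
    ],
    final_estimator=LogisticRegression(),  # meta-learner combines the base models' predictions
    cv=5,  # out-of-fold predictions are used to train the meta-learner
)
stack.fit(X_train, y_train)
print("Stacked ensemble accuracy:", stack.score(X_test, y_test))
```

Whether the ensemble is worth the extra complexity should still be judged against the target KPI, not accuracy alone.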

* Note: This is Part 4 of the 5-part series on “Avoiding Machine Learning Pitfalls: From a practitioner’s perspective”. Before you read this blog, please have a look at Part 3 to understand how to evaluate models robustly.

Thank you for reading; I appreciate your feedback. Stay tuned for the next part!
