Avoiding Machine Learning Pitfalls: From a practitioner’s perspective — Part 5

Abinaya Mahendiran
Published in WiCDS · 4 min read · Jan 19, 2023
Image credit: https://github.com/wandb/examples

The idea of this blog is to show how to report your results once you are done building models and have a way to compare them. It is imperative to report your results transparently so that others can reproduce and reuse your models without any issue.

Stage 5: How to report your results
The ultimate objective of research is for it to be consumed by others to expand knowledge. To do that effectively, the work should be well documented, explaining what was tried, what worked and what did not, the trade-offs of using one model over another, and the conclusions along with supporting evidence. The same is true for applied research. All experiments carried out by data scientists should be tracked at every level, from hyperparameters to metrics, so that the work can be easily consumed and reproduced by others.

i. Be transparent — Always make sure that every approach you tried is well documented and reproducible. Share the code and all the artifacts necessary for reproduction [use MLflow or W&B to track them]. Keep reusability as a core component of your work: no one should need to reimplement the same approach from scratch all over again. Maintain proper documentation and follow clean-coding habits so that it is easy for anyone (including yourself) to understand your work. Doing so will also help in publishing your work in the future if the necessity arises.
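To illustrate what a tracker should capture for each run, here is a minimal sketch in plain Python. In practice you would use MLflow's or W&B's logging APIs; the `log_run` helper and its JSON layout below are purely illustrative assumptions, not any library's interface:

```python
import json
import time
from pathlib import Path

def log_run(run_dir: str, params: dict, metrics: dict) -> Path:
    """Persist one experiment run (hyperparameters + metrics) as JSON.

    Hypothetical helper: real trackers such as MLflow or W&B do this
    (and much more), but the point is *what* gets captured so that a
    run can later be reproduced and compared against others.
    """
    record = {
        "timestamp": time.time(),
        "params": params,    # everything needed to rerun the experiment
        "metrics": metrics,  # everything needed to compare runs
    }
    out = Path(run_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"run_{int(record['timestamp'] * 1000)}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

A run could then be logged as `log_run("runs", {"lr": 3e-4, "epochs": 10}, {"f1": 0.87})`, leaving a self-describing artifact that anyone can inspect later.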

ii. Report performance in multiple ways — Use multiple test datasets to evaluate and compare models, and average the results. Community benchmarks should be taken with a pinch of salt, so run your own experiments on multiple datasets to evaluate the models robustly and report what you find. Also, use multiple metrics when reporting results, since no single metric fits all scenarios (keep the KPI in mind). Apart from the metric chosen for the problem, report other metrics too to provide a complete picture. For example, if you consider the F1 score the best metric for your problem, also report accuracy, precision, and recall to give different perspectives on the result. It is important to state the metrics clearly: if you use an F-score, specify whether it is F1 or F0.5. This keeps the reported metrics transparent.
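As a small sketch of reporting several metrics together rather than one in isolation, the helper below computes accuracy, precision, recall, and F1 for a binary classifier in plain Python. The function name is hypothetical; in practice `sklearn.metrics` provides these same quantities:

```python
def classification_report(y_true, y_pred):
    """Compute several binary-classification metrics at once, so a
    result can be reported from more than one angle (illustrative
    helper; sklearn.metrics offers the same quantities)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Reporting all four values side by side makes imbalances visible: a model can have high accuracy yet poor recall, which a single headline number would hide.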

iii. Don’t generalize beyond the data — If a model works well on a single dataset, it does not follow that the model generalizes to real-world data. Refrain from presenting invalid conclusions to the business about the model’s performance. There are multiple reasons to avoid this:
- Inherent sampling error or bias may be introduced during the data curation process,
- The same bias may be present in multiple datasets, even if you use more than one dataset to evaluate a model,
- And checking the quality of all the data used in deep learning is often impractical because of the sheer quantity required to train such models.
So, only report results that are backed by the data and nothing else.

iv. Be careful when reporting statistical significance — Statistical tests can be used to compare different machine learning models; however, they are not perfect and can under- or overestimate significance. Rather than relying on a pre-defined threshold to determine significance, report the p-values themselves and let the business interpret the results. Additionally, consider whether the difference between models is actually important: measure the effect size, for example with Cohen’s d statistic or the Kolmogorov-Smirnov statistic, to understand the magnitude of the difference rather than just its existence. Bayesian statistics is an option worth considering as well.
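Cohen’s d is straightforward to compute by hand. The sketch below assumes `scores_a` and `scores_b` are hypothetical per-fold scores (e.g. cross-validation accuracies) from two models; conventionally, |d| around 0.2 is read as a small effect, 0.5 medium, and 0.8 large:

```python
from statistics import mean, stdev

def cohens_d(scores_a, scores_b):
    """Effect size between two sets of model scores. Unlike a p-value,
    this measures *how large* the difference between the models is,
    not merely whether a difference exists."""
    na, nb = len(scores_a), len(scores_b)
    # pooled standard deviation of the two samples
    pooled = (((na - 1) * stdev(scores_a) ** 2 +
               (nb - 1) * stdev(scores_b) ** 2) / (na + nb - 2)) ** 0.5
    return (mean(scores_a) - mean(scores_b)) / pooled
```

A tiny p-value with a negligible d suggests a difference that is statistically detectable but practically irrelevant, which is exactly the distinction worth reporting to the business.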

v. Do look at your models — Models contain a lot of useful information. Examining a trained model and understanding how it reaches its decisions is important for gaining knowledge and understanding, which matters more than a slightly higher accuracy number. Doing so also helps the business make better decisions. For simple models such as decision trees, visualizations can be produced to make the model more understandable. For more complex models, explainable AI (XAI) techniques can be used to gain insight into why a model made a certain prediction.
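One simple, model-agnostic inspection technique is permutation importance: shuffle one feature at a time and measure how much the metric drops; a large drop means the model leans heavily on that feature. Here is a plain-Python sketch (the `predict` callable stands in for any fitted model’s predict function, and all names are illustrative):

```python
import random

def permutation_importance(predict, X, y, metric, n_repeats=10, seed=0):
    """Model-agnostic peek inside a trained model: permute each
    feature column in turn and record the average drop in the metric.
    `predict` maps a list of feature rows to predictions; `metric`
    maps (y_true, y_pred) to a score where higher is better."""
    rng = random.Random(seed)
    baseline = metric(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]   # fresh copy of column j
            rng.shuffle(col)
            X_perm = [row[:j] + [col[i]] + row[j + 1:]
                      for i, row in enumerate(X)]
            drops.append(baseline - metric(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature whose permutation leaves the score unchanged is one the model effectively ignores, which is often a more actionable insight for the business than the raw accuracy figure.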

* Note: This is the final part of the five-part series on “Avoiding Machine Learning Pitfalls: From a practitioner’s perspective”. Before you read this blog, please have a look at Part 4 to understand how to compare models fairly.

Thank you for reading; I appreciate your feedback. See you with another interesting topic soon!

Published in WiCDS: A collaborative community for Women in Data Science and Programming to learn and grow.

Written by Abinaya Mahendiran (CTO, Nunnari Labs and Program Manager, IITM).