Avoiding Machine Learning Pitfalls: From a practitioner’s perspective — Part 3

Abinaya Mahendiran
Published in WiCDS · Jul 12, 2022

Header image credit: Wikipedia

In this blog, let us try to understand how to robustly evaluate the models that you have built. Model building is an interesting process, but you cannot stop there. You need to evaluate the model fairly and derive insights from valid, reproducible results so that the business can act on them. Any ML or DL model that does not deliver the expected KPIs is useless in the real world.

Stage 3: How to robustly evaluate models
Since you are ultimately building a model to solve a business use case, you need to pay attention to how the data is used in your experiments, how the metrics that measure the generality of the model are chosen, and how insightful the reported results are. Here are some steps that can help in evaluating a model:

i. Use an appropriate test set to measure the generality of a model, i.e., how well it performs on unseen data. As discussed in Part 2, always keep aside a randomly chosen hold-out dataset (test data) and make sure that it is representative of the wider population. The test set should not overlap with the train set, and it has to reflect the real-world data distribution. Imagine training an image classification model only on pictures taken in proper lighting and evaluating it on images taken in the dark: the model will not generalize and is of no use. So, while collecting data for the business problem at hand, make sure the data is representative of (or close to) the real-world scenario, and always evaluate the model's performance on an appropriate test set.
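As a minimal sketch of setting the test set aside up front, here is one way to do it with scikit-learn; the synthetic data from make_classification is only a stand-in for your own features X and labels y:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data; replace with your own features X and labels y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Set aside a hold-out test set before any training or tuning.
# Stratifying on y keeps the class distribution in the test set
# representative of the overall data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# X_test / y_test should only be touched once, for the final evaluation.
```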

ii. Use a separate validation set to fine-tune the hyperparameters of the model and to evaluate its performance during training. The model should not be fine-tuned based on its performance on the test set; this leads to overfitting and prevents the model from generalizing. Instead, the validation set should guide the choice of hyperparameters that can in turn improve the model's performance on unseen data. If you observe that the validation accuracy is falling while the training accuracy keeps improving, the model is overfitting the train data and training should be stopped. This technique is called early stopping. [Practical tip: Once you have finalized a model based on its performance, you can use all the validation data to further train the model, ensuring that the labeled data at hand does not go unused.]
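A rough sketch of early stopping, assuming a scikit-learn model that supports incremental training (SGDClassifier here) and a patience of 5 epochs, both illustrative choices rather than fixed recommendations:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; the hold-out test set is assumed to be kept elsewhere.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Carve a validation set out of the training data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = SGDClassifier(loss="log_loss", random_state=0)
best_val_acc, patience, wait = -np.inf, 5, 0

for epoch in range(100):
    model.partial_fit(X_train, y_train, classes=np.unique(y))
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_val_acc, wait = val_acc, 0   # validation accuracy still improving
    else:
        wait += 1
        if wait >= patience:              # no improvement for `patience` epochs
            print(f"Early stopping at epoch {epoch}")
            break
```

Deep learning frameworks ship their own early-stopping callbacks; the point is the same, stop when the validation metric stops improving.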

iii. Evaluate the model multiple times. A single evaluation is unreliable because model performance can change noticeably when the model is retrained with small changes to the training data. Use techniques like k-fold cross-validation, where the training data is divided into k subsets (folds), the model is trained k times on k-1 folds, and evaluation is done each time on the single held-out fold. When reporting the results, report the mean and standard deviation of the chosen metric across folds; this gives a fair evaluation of the model. If the data is imbalanced, use stratification to ensure each class is adequately represented in every fold.
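A minimal sketch of stratified k-fold cross-validation with scikit-learn, using an illustrative imbalanced synthetic dataset and a RandomForestClassifier as a placeholder model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in imbalanced data (roughly 90/10 class split).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Stratified k-fold preserves the class ratio in every fold,
# which matters when the data is imbalanced.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=cv, scoring="f1")

# Report the mean and standard deviation across folds, not a single run.
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```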

iv. Save data to evaluate your final model instance. The generality of the final model should be tested on a test set that is kept aside at the beginning of the project and is representative of the wider population, as described in point i. You may use k-fold cross-validation to build multiple models on different subsets of the data and choose the best model instance based on its score on that fold's held-out data. However, the highest-scoring fold may simply contain the easiest (and smallest) portion of the data and may not be representative. So, as long as you have enough data to solve the problem at hand, always evaluate the final model on the held-out test set rather than relying on the score from a specific fold.
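One way this can look in practice: a sketch where cross-validation on the training data is used only to pick between two illustrative candidate models, and the untouched hold-out test set supplies the final reported score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in data; the test split plays the role of the saved hold-out set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Model selection uses cross-validation on the training data only...
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}
best_name = max(
    candidates,
    key=lambda name: cross_val_score(candidates[name],
                                     X_train, y_train, cv=5).mean(),
)

# ...and the final score is reported on the untouched hold-out test set.
final_model = candidates[best_name].fit(X_train, y_train)
print(best_name, final_model.score(X_test, y_test))
```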

v. Never rely on accuracy for imbalanced datasets, because accuracy is not a good measure there. For a classification problem with a balanced dataset, accuracy is a reasonable metric, but most problems you see in industry are imbalanced in nature, and you have to choose the right metric to report based on the business objective. For example, in a spam classification problem where 10% of the samples are spam and 90% are genuine mails, a model that always predicts "genuine" will be correct 90% of the time. Though this model has 90% accuracy, it is completely useless. In such a scenario, apart from trying to balance out the samples for the minority class, you also need to choose the right metric to report. If the problem you are dealing with values precision over recall, or vice versa, use those metrics directly. If not, use the F-score, which weighs precision and recall equally, or Cohen's kappa coefficient or the Matthews Correlation Coefficient (MCC), which are less sensitive to class imbalance.
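To make the spam example concrete, here is a small sketch (synthetic 90/10 data, and a dummy majority-class baseline standing in for the useless model) comparing accuracy with the metrics mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef)
from sklearn.model_selection import train_test_split

# Stand-in data: roughly 90% "genuine" vs 10% "spam".
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# A baseline that always predicts the majority class ("genuine").
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))    # ~0.90, looks great
print("F1:      ", f1_score(y_test, y_pred))          # 0.0, exposes the problem
print("kappa:   ", cohen_kappa_score(y_test, y_pred)) # 0.0
print("MCC:     ", matthews_corrcoef(y_test, y_pred)) # 0.0
```

The high accuracy hides the fact that the minority class is never predicted, while F1, Cohen's kappa and MCC all drop to zero.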

* Note: This is Part 3 of the five-part series on “Avoiding Machine Learning Pitfalls: From a practitioner’s perspective”. Before you read this blog, please have a look at Part 2 to understand how to reliably build models.

Thank you for reading; I appreciate your feedback. Stay tuned for the next part!
