A Tester's Guide to Testing Machine Learning Models

Mukund Billa
Published in Analytics Vidhya
7 min read · Oct 12, 2019


Machine learning is the study of applying algorithms and statistics to make a computer learn on its own, without being explicitly programmed. The computer relies on an algorithm that builds a mathematical model. This model uses a dataset, known as the "training dataset", to learn and to predict the desired outcome. There are multiple learning algorithms that can be used to solve a problem, but the concept remains the same. All of these algorithms fall into two categories: supervised learning and unsupervised learning.

Let's find out more about supervised learning, as it is much more widely researched and used in applications such as user profiling, product recommendations, etc. Supervised learning produces two types of output values and is accordingly classified into two model types: categorical (classification models), where the value comes from a finite set (male or female; t-shirt, shirt, or innerwear; etc.), and numerical (regression models), where the value is a real-valued scalar (income level, product ratings, etc.). These algorithms are trained on the dataset and then used to predict outputs.

Please note that a machine learning algorithm doesn't generate a concrete output; it provides an approximation, or a probability, of the outcome.

As a tester, have you ever wondered how to test an application that learns by itself and corrects its old mistakes? Don't worry! Hold on and read this article before you panic.

Without further ado, let's find out what testing approach one must take to test such learning algorithms.

Testing approach: The answer lies in the dataset. In order to test a machine learning algorithm, the tester defines three different datasets: a training dataset, a validation dataset, and a test dataset, all carved out of the same original data but kept separate from one another.

Please keep in mind that the process is iterative in nature, and it's better to refresh the validation and test datasets on every iteration.

Here is the basic approach a tester can follow in order to test the developed learning algorithm:

1. The tester first defines three datasets: a training dataset (65%), a validation dataset (20%), and a test dataset (15%). Randomize the data before splitting, and make sure no validation/test records end up in the training dataset (see the sketch after this step).
[Figure: Partition of the dataset, showing the different datasets fed to the ML models]
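Below is a minimal sketch of such a 65/20/15 split, assuming scikit-learn; the synthetic dataset, ratios, and random seed are illustrative stand-ins for whatever data the team actually has.

```python
# A hedged sketch of the 65/20/15 split described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Carve off the 65% training portion first (shuffle=True randomizes the rows).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.65, shuffle=True, random_state=42
)

# Split the remaining 35% into validation (20% of the total) and test (15%).
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.20 / 0.35, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # roughly 650 / 200 / 150
```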

2. Once the datasets are defined, the tester begins training the models with the training dataset. When training is done, the tester evaluates the models against the validation dataset. This step is iterative: any tweaks or changes a model needs based on the results can be made and the model re-evaluated. This ensures that the test dataset remains unused and can be reserved for testing the final, evaluated model (a code sketch follows the figure below).

[Figure: Phases of ML model evaluation (train, validate, evaluate), shown as an iterative process to pick the best model]
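As a rough sketch of this loop (continuing from the split sketch in step 1; the candidate models and the accuracy metric are illustrative choices), a tester might compare a few models on the validation set before touching the test set:

```python
# A hedged sketch: train candidate models on the training data only and
# compare them on the validation data; the test set stays untouched.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)                                  # train
    val_accuracy = accuracy_score(y_val, model.predict(X_val))   # validate
    print(f"{name}: validation accuracy = {val_accuracy:.3f}")
```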

3. Once all the models have been evaluated, the model the team feels most confident about (based on the lowest error rate and the best predictions) is picked and run against the test dataset, to ensure it still performs well and matches the validation results (a short sketch follows the figure below). If the accuracy looks suspiciously high, make sure the test/validation sets have not leaked into your training dataset.

[Figure: An iterative workflow of training, evaluating, and testing ML models]
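Continuing the earlier sketches (with best_model standing in for whichever candidate won the validation comparison), the final check against the held-out test set could look like this:

```python
# A hedged sketch of the final check: score the chosen model on the untouched
# test set and compare the result with its validation accuracy.
from sklearn.metrics import accuracy_score

best_model = candidates["random_forest"]           # hypothetical winner
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(f"test accuracy = {test_accuracy:.3f}")      # should be close to the validation score
```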

What if we train a model with incorrect data? If a model is trained on an incorrect or maliciously crafted dataset, the error rate increases; this is known as data poisoning. Models should therefore also be exercised with adversarial datasets, and the system should be capable of sanitizing the data before sending it to train the models.
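A minimal sketch of such a sanitization step might look like the following, assuming the training data arrives as a pandas DataFrame with a label column; the column names and the set of allowed labels are hypothetical, and a real pipeline would apply many more checks:

```python
# A hedged sketch of sanitizing training data before it reaches the model:
# drop duplicates, rows with missing values, and rows with unexpected labels.
import pandas as pd

ALLOWED_LABELS = {"rectangle", "circle", "square"}   # hypothetical label set

def sanitize(df: pd.DataFrame) -> pd.DataFrame:
    clean = df.drop_duplicates()
    clean = clean.dropna()
    clean = clean[clean["label"].isin(ALLOWED_LABELS)]
    return clean

raw = pd.DataFrame({
    "width":  [2.0, 2.0, 3.5, None, 1.0],
    "height": [1.0, 1.0, 3.5, 2.0, 1.0],
    "label":  ["rectangle", "rectangle", "square", "circle", "triangle"],
})
print(sanitize(raw))  # the duplicate, missing-value, and unknown-label rows are dropped
```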

With the above information, let’s understand an important concept called “Cross-Validation” that helps us to evaluate the model's average performance.

Cross-Validation

Cross-validation is a technique in which the dataset is split into multiple subsets, and the learning models are trained and evaluated on these subsets. One of the most widely used techniques is k-fold cross-validation. Here, the dataset is divided into k subsets (folds), which are used for training and validation over k iterations. Each subset is used exactly once as the validation dataset, with the remaining (k-1) folds used as the training dataset. Once all the iterations are completed, one can calculate the average prediction rate for each model.

Let’s understand with the below diagram:

[Figure: k-fold cross-validation, where each subset serves exactly once as the validation dataset across the k iterations]
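A small sketch of k-fold cross-validation, assuming scikit-learn and reusing the X, y arrays from the split sketch earlier; k=5 and the model are illustrative choices:

```python
# A hedged sketch of 5-fold cross-validation: cross_val_score trains and
# evaluates the model five times, each fold serving once as validation data.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)   # one accuracy score per fold
print("per-fold accuracy:", scores)
print("average accuracy:", scores.mean())
```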

Now that we know the testing approach, the main part is how to evaluate the learning models with the validation and test datasets. Let's dig into it and learn the most common evaluation techniques a tester must be aware of.

Evaluation Techniques:

There are certain terms we need to understand before diving into the evaluation techniques, so let's first define them using the running shapes example, with "rectangle" as the positive class:

True Positive (TP): the model predicted a rectangle and the shape really is a rectangle.
True Negative (TN): the model predicted something other than a rectangle and the shape really isn't one.
False Positive (FP): the model predicted a rectangle but the shape isn't one.
False Negative (FN): the shape is a rectangle but the model failed to predict it as such.

With these basic terms in place, let's dive into the techniques:

1. Classification Accuracy: This is the most basic way of evaluating a learning model. It is the ratio of correct predictions (TP + TN) to the total number of predictions. If the ratio is high, the model has a high prediction rate. The standard formula for the accuracy ratio is:
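Accuracy = (TP + TN) / (TP + TN + FP + FN)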

However, accuracy alone is not a good way to evaluate a model. For example, if only 10 out of 100 shape samples are rectangles, a model that never predicts "rectangle" is still 90% accurate: it gets all the true negatives right while having no success on the true positives. The ratio/prediction rate looks high, yet the model completely fails to identify the rectangular shapes.

2. Confusion Matrix: This is a square N*N table, where N is the number of classes the model needs to classify. It is best suited to classification models, which categorize an outcome into a finite set of values known as labels. One axis holds the labels the model predicted and the other the actual labels. To understand this better, let's categorize the shapes into 3 labels [Rectangle, Circle, and Square]. As there are 3 labels, we draw a 3*3 table (the confusion matrix), with the actual labels on one axis and the predicted labels on the other.

[Figure: Confusion matrix as a 3 (actual) * 3 (predicted) table; the remarks column is for explanation only]
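As a minimal, self-contained sketch (assuming scikit-learn; the two label lists are made up and do not reproduce the article's matrix), a tester could build such a table directly from the actual and predicted labels:

```python
# A hedged sketch: build a 3x3 confusion matrix from actual vs. predicted labels.
from sklearn.metrics import confusion_matrix

labels = ["rectangle", "circle", "square"]
y_actual    = ["rectangle", "circle", "square", "circle", "rectangle", "square"]
y_predicted = ["rectangle", "circle", "circle", "circle", "square", "square"]

matrix = confusion_matrix(y_actual, y_predicted, labels=labels)
print(matrix)  # rows = actual labels, columns = predicted labels
```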

With the above matrix, we can calculate two important metrics that describe how well the model handles the positive classes.

Precision: Precision measures how often the model was correct when it predicted the positive class, i.e. how trustworthy its positive predictions are. Let's calculate the precision of each label/class using the above matrix.

Precision is the ratio of true positives to all positive predictions: Precision = TP / (TP + FP)
[Figure: Precision calculations for each label/class]

With the above calculations, the model is correct 76% of the time when it predicts the rectangle shape. Likewise, it is correct 72% and 42% of the time when it predicts the circle and square shapes respectively.

Recall: This metric answers the question: out of all the actual positive labels, how many did the model correctly identify? In other words, recall is the number of correctly predicted positives divided by the number of positives that should have been predicted: Recall = TP / (TP + FN).

[Figure: Recall calculation for each label/class]

The above simply means that the model correctly identifies 66%, 53%, and 60% of the actual rectangles, circles, and squares respectively.

What if the classification threshold is increased? Then fewer samples are predicted as positive, which typically raises precision but lowers recall. Conversely, if the threshold is lowered, more samples are predicted as positive, which raises recall but lets in more false positives and so lowers precision. To balance the two, we can use the F1 measure, defined below. It gives a score between 0 and 1, where 1 means the model is perfect and 0 means it is useless. A good score tells us that the model has few false positives [other shapes predicted as rectangles] and few false negatives [rectangles not predicted as rectangles].

F1 measure formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
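As a small follow-up sketch, reusing the illustrative y_actual, y_predicted, and labels from the confusion-matrix example above (not the article's actual numbers), scikit-learn can report precision, recall, and F1 for every class in one call:

```python
# A hedged sketch: per-class precision, recall, and F1 in one report;
# zero_division=0 avoids warnings if a class is never predicted.
from sklearn.metrics import classification_report

print(classification_report(y_actual, y_predicted, labels=labels, zero_division=0))
```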

There is another evaluation technique, the ROC [receiver operating characteristic] curve and AUC [area under the ROC curve], which involves plotting two parameters [True Positive Rate (TPR, or recall) and False Positive Rate (FPR)] at various thresholds. However, we will cover this evaluation technique in a later article.

What is described above is a basic testing approach, along with evaluation techniques, for a system embedded with learning capabilities.
