Designing a QA Strategy for a Machine Learning application

Leonardo Pace
Published Nov 9, 2020 · 14 min read


Artificial Intelligence in Software Testing


The goal of this article is to serve as a guide for quality analysts who need to design a test plan for a system that implements an artificial intelligence model based on supervised learning. Given the breadth of the topic, some Machine Learning concepts need to be addressed, and for practical purposes they are explained at a very high level. The main objective is to describe the testing techniques applicable to each stage of the design.

AI-based systems vs “conventional” applications

AI-based implementations differ from traditional projects, and the testing strategy is no exception. A traditional “deterministic” testing approach based merely on comparing the actual result with the expected one may not be enough.

This doesn’t mean that there is no logical relationship between the input and output conditions, but the correlation is not always obvious or linear. Most importantly, we are not validating unequivocal results (e.g. password OK, user OK, then login succeeds), but the model’s ability to learn.

This is because what we test, how we test it, and when we test it are significantly different for AI-based systems. Their function is to predict or classify (any type of) data, instead of following a series of scripted actions whose expected result depends on the steps executed previously.

Remember: the most important functionality under test is the ability to learn.

Having said this, in AI systems, the role of the quality analyst becomes more strategic and analytical.

When to consider the implementation of AI?

  • When we cannot program the rules due to the complexity or multiplicity of variables that affect the result
  • When the result depends on a large number of factors
  • When the quantity of information to be processed is very high

Some examples:

  • An airline wants to automatically calculate the value of tickets for the coming years, based on historical data.
  • A manufacturer of computer security systems wants to develop software that distinguishes attacks or intrusions from genuine requests.
  • A real estate consultancy wants to calculate the sale price of a property based on historical data of houses with similar characteristics.
  • A streaming platform wants to implement customized marketing actions specifically oriented to those users who may cancel their subscription.

When (and where) should QA start?

Different stages of the development cycle of an AI-based system (and the role of quality in each of them)

Let’s start with a somewhat controversial statement that will allow us to take our first steps in the world of AI testing: to get involved in Artificial Intelligence from a QA perspective, you don’t necessarily need a deep understanding of the mathematical or algebraic foundations of machine learning models.

This way of understanding a system is well known to us: it is a black box testing strategy. Under this approach, the intelligent agent is a box, and we are not interested in what happens inside it (at least for now). We are only interested in how the system interacts with its environment. We need to know what it does, not how it does it. Starting from this premise, we are ready to define the bases of one of the many testing strategies that we can implement for an AI-based system.

In order to identify the role of the Quality Engineers and the expected outcomes during the different phases of development of an AI implementation, we can define the following stages:

(keep in mind that this classification is for illustrative purposes and includes the most common ones, but of course, it may vary depending on each particular implementation)

Setting Goals

This is where it all starts. Setting objectives is tied to the business strategy and to the kind of results the business wants to achieve. Artificial intelligence is not a viable and justifiable alternative for every implementation (many problems can be solved without this type of technology). But where large volumes of data are handled and valuable information can be obtained by analyzing and understanding them in depth, AI becomes crucial and can shorten both development time and the time needed to obtain results.

Collecting Data

Once the goals have been defined, the next step is to select the data that business analysts and product owners initially consider relevant for the prediction. This task can be carried out jointly with data scientists.

Let’s suppose a system designed to predict which users of a streaming platform could cancel their subscription and, based on this, implement some kind of predictive marketing actions. Perhaps a relevant piece of information would be to identify those users who have carried out many unsuccessful searches, who did not obtain results according to their interests or simply users who did not access the platform in recent months. This is considered relevant data since there could be a “logical” connection between these events and the possible cancellation of the service.

What can we contribute at this point from a QA perspective? The same as in a “traditional” project, where we define with business analysts the scope of testing, the suite of test cases, and the source of the data (and its characteristics). There is always some scenario that we can contribute from our own imagination. “What if…” is one of our favorite phrases as experienced quality analysts, and this simple question can change the course of a project.

Preparing Data

This is the point where the role of the quality analyst becomes more important, and you can already imagine why. Once we have selected all the relevant data that will serve as input variables to our model, we must ensure that their “presentation” is correct, because the data will then be processed. This is intrinsically related to big data testing in the more traditional sense of the term.

Here we must apply a testing approach similar to the one we use when working on an ETL process. We can start by standardizing the labels that refer to the same data. This could be a good starting point for our quality strategy.

This is because, as in all ETL processes, we must take into account that the same type of data may come from different sources. Let’s go back to the example of the users of a streaming platform. Perhaps the field that refers to the “user” comes from different developments and platforms and does not have a “standardized” label: for example, it can be sent as “user_id” from a game console, as “user” from a smart TV, and as “customer_id” from a mobile phone. The same care applies to the values themselves: exclude values that fall far from the mean (outliers), analyze singularities that are not representative of the sample, look for empty or “null” values, and check the format of the fields, their type, and the range of values expected. Always remember that our model is fed by data. No matter how well programmed it is, if the data we pass to it is incorrect, the model will fail in its goal, and we cannot blame it.
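The checks described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline of any real platform: the three source labels come from the example above, while the choice of “user_id” as the canonical label, the field `searches_without_results`, the sample values, and the 1.5 × IQR outlier rule are all assumptions made for the example.

```python
import statistics

# Hypothetical mapping from source-specific labels to one canonical label.
LABEL_ALIASES = {"user": "user_id", "customer_id": "user_id"}


def standardize_labels(record):
    """Return a copy of the record with aliased keys renamed to the canonical label."""
    return {LABEL_ALIASES.get(key, key): value for key, value in record.items()}


def find_missing(records, field):
    """Return the indexes of records where the field is absent, None, or empty."""
    return [i for i, r in enumerate(records) if r.get(field) in (None, "")]


def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (a common rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


# Invented sample records, one per source platform.
raw = [
    {"user_id": "a1", "searches_without_results": 3},    # game console
    {"user": "b2", "searches_without_results": 5},       # smart TV
    {"customer_id": "", "searches_without_results": 4},  # mobile phone, empty id
]
clean = [standardize_labels(r) for r in raw]
missing_ids = find_missing(clean, "user_id")
outliers = find_outliers([3, 5, 4, 2, 6, 4, 250])
```

Whether an outlier should be excluded or investigated is a decision to make with the data scientists; the sketch only flags the candidates.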

Selecting Algorithm

As we stated before, the goal of the implementation of machine learning is to create a model that allows us to solve a given task. Once the model is defined, it is trained using large amounts of data. The model learns from this data and then is capable of making predictions. Depending on the task you want to perform, it will be more appropriate to work with one algorithm or another.

The diversity of existing algorithms, their characteristics, the mathematical foundations that underlie them, and their use in machine learning form a fascinating world that I recommend researching, but it is beyond the scope of this article.

Some of the most important supervised-learning algorithms are:

  • Linear regression
  • Nonlinear regression
  • Generalized linear models
  • Decision trees
  • Neural networks

Typically, the quality engineer is not the one who defines the model to be implemented. This is the main responsibility of the data scientists and is part of the overall implementation design. But as we will see later, it is the responsibility of the quality analyst to measure the performance of the model using the different metrics we can obtain from it, and to verify that the model serves the business goals as well as possible. Therefore, it would not be unreasonable for us to give feedback to the data scientists if we see that the model does not behave as expected or that the data doesn’t match the expected input. Based on that feedback, the data scientists could “rethink” the strategy, including the use of another model. But I insist, this is not expected under normal conditions.

Training Model

How much data is necessary to train a machine learning model is a question that goes far beyond the scope of this article, and there are many articles on the topic. For those interested, I recommend the following post: “How do you know you have enough training data?”

Training the model is the primary and most critical task that the data scientist performs. Understanding how it is performed is essential for the quality analyst in charge of testing it.

As we said before, in order to train a model, we need to collect data. The quality, quantity, and accuracy of training data will determine how good your predictive model can be.

Usually, the dataset is split into training, validation, and test data. It is common to allocate 50 percent of the data to the training set, 25 percent to the test set, and the remaining 25 percent to the validation set. The training set contains the input “examples” from which the model learns by adjusting its parameters. The validation set is used to evaluate the model while it is being developed, and the model’s configuration is tuned based on those evaluation results. Finally, the test set is used for the final evaluation of the model, from which we obtain metrics such as accuracy and precision. For the test set, the value we want to predict is not passed to the model: the model “guesses” it based on what it learned from the training and validation data.
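To make the split concrete, here is a minimal sketch that uses the 50/25/25 proportions mentioned above. The function name `split_dataset`, the fixed random seed, and the list of integers standing in for labeled examples are assumptions made for the illustration.

```python
import random


def split_dataset(rows, train=0.5, test=0.25, seed=42):
    """Shuffle rows and split them into training, validation, and test sets.

    Whatever is left after the training and test fractions goes to validation.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return training, validation, testing


rows = list(range(100))  # stand-in for 100 labeled examples
training, validation, testing = split_dataset(rows)
```

From a QA perspective, a useful check on any split is that the three sets do not overlap, because examples shared between training and test data inflate the metrics.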

Evaluate Model

This is where our metrics appear. The three main ones used to evaluate a classification model are accuracy, precision, and recall. We will look at these metrics in detail in the section “Metrics we can obtain from the confusion matrix”.

Make Predictions

Once the model is ready, it is time to carry out the predictions, from the functional perspective.

Typically, from a QA perspective, this is the stage where the model is embedded within the application. We can access the embedded model through a user interface, or developers can provide us with an API that gives direct access to the system. One way or another, this is where the integration testing of our application begins.

As a final step, once the model is deployed to production, regression tests are executed whenever a change is released, to ensure that the existing model does not break and that its accuracy is maintained with “real world” data.
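One possible shape for such a regression test is sketched below. Everything in it is hypothetical: the function `accuracy_regression_check`, the stub model, the `months_inactive` feature, the baseline of 0.75, and the tolerance of 0.02 are invented for the illustration, and a real check would call the deployed model with a much larger labeled sample.

```python
def accuracy_regression_check(predict, labeled_examples, baseline_accuracy, tolerance=0.02):
    """Return (passed, observed_accuracy) for a model on a labeled sample.

    The check passes when accuracy has not dropped more than `tolerance`
    below the accuracy recorded for the previous release.
    """
    correct = sum(1 for features, expected in labeled_examples
                  if predict(features) == expected)
    observed = correct / len(labeled_examples)
    return observed >= baseline_accuracy - tolerance, observed


def stub_model(features):
    # Stand-in for the deployed model: predicts a cancellation
    # when the user has been inactive for three months or more.
    return features["months_inactive"] >= 3


# Invented labeled sample: (features, did the user actually cancel?)
examples = [
    ({"months_inactive": 4}, True),
    ({"months_inactive": 0}, False),
    ({"months_inactive": 5}, False),
    ({"months_inactive": 1}, False),
]
passed, observed = accuracy_regression_check(stub_model, examples, baseline_accuracy=0.75)
```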

Evaluating the application against the acceptance criteria is just one of the expected outcomes from QA. Another, and maybe the most important, is to evaluate the model in statistical terms: for example, being 90 percent certain that the application will produce the expected output within a given range. Later in this article we will look in more detail at the metrics that help us measure the effectiveness of our model.
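One common way to express a statement like “90 percent certain” is a confidence interval around the accuracy measured on the test data. The sketch below uses the normal approximation, which is only one of several possible methods and works best with large samples. The function name and the counts (850 correct predictions out of 1000) are invented for the illustration.

```python
import math
from statistics import NormalDist


def accuracy_interval(correct, total, confidence=0.90):
    """Normal-approximation confidence interval for an observed accuracy."""
    p = correct / total
    # Two-sided critical value, about 1.645 for 90 percent confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical test run: 850 correct predictions out of 1000 test examples.
low, high = accuracy_interval(850, 1000)
```

With these invented numbers the interval is roughly 0.83 to 0.87, so an acceptance criterion such as “accuracy of at least 0.80” would be met with some margin.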

Machine Learning Types

Now that we have an idea of the different stages that are part of the development cycle of an AI-based system, let’s look in more detail at the different types of machine learning and when to apply each of them.

We could summarize that artificial intelligence is based on three types of learning:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning.

In this article, we are going to focus on the first one, where the “human” analysis and classification (labeling, to avoid confusion) of data plays a relevant role.

Different quality techniques will apply to the various types of machine learning since the handling and classification of data (input variables to the system) is significantly different.

Supervised learning

This type of learning is based on discovering the relationship between input and output variables. By contrast, unsupervised learning takes place when “tagged” data is not available for training and we only know the input data (for example, looking for similarity patterns in the input data, which is known as clustering).

In a supervised learning model, the algorithm learns by being shown many examples, each one paired with the result we want to obtain for those input values.

Reinforcement Learning, on the other hand, is a subfield of machine learning where an “agent” learns how to choose an action from all the available possibilities, within a particular environment, to maximize “rewards” over time. (Think of a dog that learns to bring us a stick back because he knows that we will give him some food as a reward for his action.)

Returning to supervised learning: if the training conditions are met, the algorithm will be able to give the correct result even when you show it values it hasn’t seen before.

Let’s work on the following example: suppose we need to test a system that analyzes different variables related to commercial flights and, based on historical data provided as its “input”, predicts whether a flight will have some kind of delay.

Our business analysts and data scientists consider the following variables relevant for predicting delays because, by analyzing historical data over time, they have reached the preliminary conclusion that there is a direct relationship between these variables and the result they want to predict.

  • Departure Airport
  • Destination Airport
  • Weather conditions
  • Airline
  • Distance between airports
  • Departure date

(Of course, no predictive model can be trained with as little data as in this example, but for practical purposes we are going to use it to understand how the data is structured, why it matters, and what information can be obtained from it.)
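As a sketch of how one row of this data could be structured, the input variables listed above become the fields of a record, and the value we want to predict becomes its label. The field names, the unit of distance, and the sample values below are all invented for the illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlightRecord:
    # Input variables (features) listed above.
    departure_airport: str
    destination_airport: str
    weather_conditions: str
    airline: str
    distance_km: float
    departure_date: date
    # Output variable (label): known for historical flights, predicted for new ones.
    delayed: bool


def split_features_and_label(record):
    """Separate the inputs from the value the model has to predict."""
    features = {k: v for k, v in vars(record).items() if k != "delayed"}
    return features, record.delayed


# Invented historical flight, for illustration only.
record = FlightRecord("EZE", "MIA", "storm", "Example Air", 7100.0, date(2020, 1, 15), True)
features, label = split_features_and_label(record)
```

During training the model sees both parts; when we test it, we pass only the features and compare its prediction with the label.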

Performance evaluation of a machine learning classification model

Let’s continue with our example from the aeronautical industry, define what the external and internal KPIs are, and see how we can measure the effectiveness of our model from both perspectives.

External KPIs

We can define these metrics as those that emerge as a logical or expected consequence of the correct implementation of our model. We assume that if our model fulfills the purpose for which it was designed, we will be able to obtain good indicators at the operational or business level. These are exactly the “business goals” that we mentioned at the beginning of this article. Some examples are:

  • Operational costs related to delayed flights (airport fees)
  • Customer satisfaction metrics
  • Available seats for re-scheduled flights
  • Total Revenue
  • Labor costs, operating expenses

Internal model KPIs

Here is where we measure the efficiency of the model, using a series of metrics we can obtain for this purpose. From the functional point of view, we obtain a series of “numbers” that indicate how capable our model is to carry out the prediction.

Confusion Matrix

In the field of artificial intelligence and machine learning, a confusion matrix is a tool that allows us to visualize the performance of a supervised learning algorithm. Each column of the matrix represents the number of predictions of each class, while each row represents the instances in the real class. In practical terms, it allows us to see at a glance which types of successes and errors our model is making.

Although the confusion matrix is not a metric in itself, it describes the complete performance of the model in matrix form.

The four options in our example would be:

  1. Flights that were delayed and that the model classified as delayed. This is a true positive or TP.
  2. Flights that were not delayed and that the model classified as not delayed. This is a true negative or TN.
  3. Flights that were delayed and that the model classified as not delayed. This is a false negative or FN.
  4. Flights that were not delayed and that the model classified as delayed. This is a false positive or FP.

Now we can identify more clearly where the errors are located in our matrix: the false negatives and the false positives (the yellow boxes).
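The four options can be counted with a few lines of Python. This is a minimal sketch for the binary case only, where True means “delayed”; the function name and the eight sample flights are invented for the illustration.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for a binary classifier (True means 'delayed')."""
    names = {
        (True, True): "TP",    # delayed, classified as delayed
        (False, False): "TN",  # not delayed, classified as not delayed
        (True, False): "FN",   # delayed, classified as not delayed
        (False, True): "FP",   # not delayed, classified as delayed
    }
    counts = Counter(names[(a, p)] for a, p in zip(actual, predicted))
    return {name: counts[name] for name in ("TP", "TN", "FP", "FN")}


# Invented labels for eight flights, for illustration only.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, True, False, False, False, True, False, True]
counts = confusion_counts(actual, predicted)
```

For these eight flights the model is right six times (three TP, three TN) and wrong twice (one FN, one FP).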

Metrics we can obtain from the confusion matrix

Expected results, or business goals, are related to both external and internal KPIs. For instance, reaching a certain percentage of “Accuracy” can determine the quality sign-off. The most common metrics that we can obtain from our model and a brief explanation of each of them are mentioned below:


Accuracy

Accuracy is an indicator that measures the percentage of cases that the model has predicted successfully. It is the ratio of the correctly predicted classifications (True Positives and True Negatives) to the total size of the test dataset.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

In our example, this metric answers the question: out of all the flights, how many did the system classify correctly (both True Positives and True Negatives)?


Precision

Precision is the ratio of correctly predicted positive observations (True Positives) to all the observations that the system predicted as positive, both True Positives and False Positives.

This metric answers the question “what proportion of positive identifications was actually correct?”

Precision = TP / (TP + FP)


Recall

Recall is the ratio of correctly predicted positive observations (True Positives) to all the observations that were actually positive. It is the percentage of actual positives that the model is able to identify or, in other words: of all the flights that were actually delayed, how many did the system correctly classify as delayed?

Recall = TP / (TP + FN)
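Accuracy, precision, and recall can be computed directly from the four counts of the confusion matrix. The sketch below assumes hypothetical counts for 100 flights, and returning 0.0 when a denominator is zero is a convention chosen for the example, not a universal rule.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of predicted positives that were actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for 100 flights, for illustration only.
tp, tn, fp, fn = 45, 40, 5, 10

acc = accuracy(tp, tn, fp, fn)  # 85 correct out of 100
prec = precision(tp, fp)        # 45 of the 50 flights predicted as delayed
rec = recall(tp, fn)            # 45 of the 55 flights actually delayed
```

Note that the same model scores differently on each metric, which is why a sign-off criterion should state which one it refers to.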

F1 Score

This score takes both false positives and false negatives into account and is defined as the harmonic mean of precision and recall. While we could take the simple average of the two scores, the harmonic mean is pulled toward the lower of the two values, so a model cannot reach a high F1 score by doing well on only one of them.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
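A short sketch shows how the harmonic mean differs from the simple average. The precision and recall values are invented for the illustration, and returning 0.0 when both are zero is a convention chosen for the example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A model with balanced precision and recall.
balanced = f1_score(0.8, 0.8)

# A model with perfect precision that finds only 10 percent of the delayed flights.
lopsided = f1_score(1.0, 0.1)
simple_average = (1.0 + 0.1) / 2
```

For the lopsided model the simple average is 0.55, while the F1 score is about 0.18, which reflects much better how few delayed flights it actually finds.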


Designing a test plan for an artificial intelligence implementation is not limited to validating the test scenarios and comparing them with the expected results. We are evaluating the ability of a model to carry out the prediction for which it was designed. The preliminary results are good indicators, but not the only ones. The challenge is to validate the learning capacity when the model is deployed to production and begins to process real-world data.

The quality strategy starts with the definition of scenarios, the classification of the input variables, the validation of the data, and the final evaluation of the model.

How deeply we get involved in understanding how each model works depends entirely on us. In this article we have explained an approach based on “black box” testing for a supervised learning model, but understanding the mathematical foundations that underlie each machine learning algorithm will allow us to move to a “white box” approach, for example by improving the model’s performance through “tuning” the hyperparameters. This will allow us to provide more comprehensive feedback to the data scientists and developers.

Every step we can take in understanding the fundamentals of machine learning is a very important added value to the testing strategy and will help us to improve the quality of the delivery.