Testing AI: Supervised Learning

People are adding “AI” to everything these days, but the big question is how to test it. We are already seeing AI products go awry. Engineers working on AI fear the little Frankenstein minds they created might make some bad decisions. Like traditional software, this new AI-based software needs to be tested. Let’s explore how to test our future overlords.

There are three major phases to testing AI systems: training, relevance, and integration.

Testing Training

Training supervised AI systems is relatively straightforward. For example, say you decide to train a neural network that tells you how likely it is that an image is a picture of a button, a text box, etc. To train the network, you gather a lot of pictures of each type of thing you want it to recognize: lots of buttons, lots of text boxes, and so on. Then you feed them into the training system.

But wait, we already forgot to test something — the training data! Garbage in, garbage out. There are important questions to ask about the training data:

  1. Label Correctness: Training sets are often large, and labeling all those samples is done by fallible humans. Sometimes a button is labeled as a text box, or vice versa. Worse, sometimes a button may look a little like a text box. It pays to double check all those labels so we don’t confuse the AI.
  2. Labeling Relevance: This isn’t always obvious, but the humans labeling the data should be representative of the type of ‘thinking’ you are trying to train into the AI. E.g., if you train a search engine AI using only non-engineers, you end up with a search engine that works well for normal people, but not great for engineers.
  3. Good Sample for Training? Does the training data represent the types of data expected in production? For example, if the training data only had square buttons, but buttons in the real world also had rounded corners or some were even circles, the AI might not recognize many buttons in the real world after training. If the training sample is wrong, it is likely the resulting AI will be over- or under-fit.
  4. Proportional Sampling: If the AI is trained with say 1000 button images, but only 3 text boxes, the AI might just end up thinking those wide, square, white text boxes look like just another type of button. The training set of examples should avoid overweighting any individual type of training data.
  5. Variations: If it is difficult to get enough samples for training, or just in the interest of making the system more robust, great training sets often have variations of the inputs. Many things can be changed with the training images to create variations: random noise, stretching, DPI changing, or converting them to black and white. This helps the resulting AI generalize a bit better.
  6. Dataset Size: This is more of an art than science, but the training set should be large enough for the problem. Generally, the training set should be larger if there is a lot of variation in the training data, conflicting or doubly labeled data, complex features, or many layers in your ‘deep’ network.
  7. Maintenance: More on this later, but this training data isn’t static. Say the next UI design craze is translucent buttons — but you don’t have any of these in the older training set. The training set should be proactively and periodically updated or the AI brain can quickly grow old and miss the latest trends.
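
Some of these checks are easy to automate. As a minimal sketch of the proportional sampling check, the snippet below counts examples per label and flags any label that is badly outnumbered. The function name `label_balance_report` and the cutoff `max_ratio=10.0` are assumptions made for illustration; pick a cutoff that suits your own data.

```python
from collections import Counter


def label_balance_report(labels, max_ratio=10.0):
    """Count examples per label and flag labels that are badly outnumbered.

    max_ratio is an assumed cutoff: a label is flagged when the most common
    label has more than max_ratio times as many examples as it does.
    """
    counts = Counter(labels)
    if not counts:
        return counts, []
    largest = max(counts.values())
    underrepresented = sorted(
        label for label, n in counts.items() if largest / n > max_ratio
    )
    return counts, underrepresented


# Made-up label list mirroring the 1000 buttons vs. 3 text boxes example.
labels = ["button"] * 1000 + ["text_box"] * 3 + ["checkbox"] * 400
counts, flagged = label_balance_report(labels)
print(flagged)  # ['text_box']
```

A report like this only catches skew in the counts. Mislabeled or unrepresentative samples still need a human look.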

Now that we have tested the training data, the images can be fed into the AI training system. The AI training system has a set of ‘features’. These are little functions that look at the image for things such as its color, size, or how many letters are contained in the image, etc. These features convert an image into the values between zero and one that the training system needs. Wait, who is testing those features? Feature testing consists of:

  1. Functional Correctness: These features are often traditional software functions and should be tested as such. What if the feature computation for images returns the wrong size for an image? The wrong color? Does it crash, or return strange values with a 0x0 sized image? All those features need classic unit tests or risk the AI training on bad data.
  2. Feature Set: Are there missing features? How does a human tell the difference between a button and a text box? Perhaps we notice that text boxes are always far wider than they are tall. Maybe a new ‘feature’ should be added that calculates the ratio of the width to the height for every image so the AI can use that clue as well. Are some features duplicates? Are some features irrelevant? Testing the feature set makes sure the AI sees what is interesting, and doesn’t see confusing things during training.
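
To make the functional correctness point concrete, here is a sketch of a width-to-height feature with a few classic unit tests around it. Everything here is hypothetical: `aspect_ratio_feature`, the choice to squash the ratio into the zero-to-one range as width / (width + height), and the choice to return 0.5 for a 0x0 image are assumptions, not the features used by any real system. The tests are written pytest-style, as plain functions with bare asserts.

```python
def aspect_ratio_feature(width, height):
    """Hypothetical feature: the width's share of (width + height).

    Returns a value in [0, 1]: 0.5 for a square, close to 1 for a very wide
    image. A degenerate 0x0 image returns 0.5 (an assumed neutral value)
    instead of dividing by zero.
    """
    if width < 0 or height < 0:
        raise ValueError("image dimensions must be non-negative")
    if width == 0 and height == 0:
        return 0.5
    return width / (width + height)


def test_square_image_is_neutral():
    assert aspect_ratio_feature(40, 40) == 0.5


def test_wide_text_box_scores_high():
    assert aspect_ratio_feature(300, 30) > 0.9


def test_empty_image_does_not_crash():
    assert aspect_ratio_feature(0, 0) == 0.5


def test_negative_size_is_rejected():
    try:
        aspect_ratio_feature(-1, 10)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative width")


def test_output_stays_in_unit_range():
    for w, h in [(1, 1000), (1000, 1), (0, 5), (5, 0)]:
        assert 0.0 <= aspect_ratio_feature(w, h) <= 1.0
```

The point is less the specific feature than the habit: every feature function gets the same boring edge-case tests as any other piece of code.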

After all that testing work, it is time to send all that training data and feature computation to the AI training system. After a few billion calculations, and probably a few hours, the AI is now trained. Are we done testing?

Oops, we forgot something else. The ‘Test’ dataset. This is another, often smaller, set of images of buttons and text boxes, etc. The AI training system needs training data, but also a separate ‘test’ set that it uses to measure how well the AI is learning. More often than you would think, engineers create bad test sets. Testing the test datasets is a thing, and here are some general rules:

  1. Not Training Data: The test dataset should be different than that used for training. If the test data set is the same, the AI will incorrectly ace the exam because it had the answers in advance.
  2. Representative and Balanced: Like training datasets, the test dataset should also represent the types and proportion of images that will be found in production. The test dataset should also be balanced, meaning the test set should not be dominated in frequency by any single label. Generally speaking, avoid any single label having 10X the number of examples of any other label.
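
The ‘Not Training Data’ rule can be checked mechanically. The sketch below hashes the raw bytes of every image and reports test images that also appear in the training set. The function names and the name-to-bytes dictionaries are assumptions for illustration. Note the limitation: an exact-byte hash only catches identical files, so a resized or re-encoded copy of a training image would slip through.

```python
import hashlib


def fingerprint(image_bytes):
    """Hash the raw bytes of an image file."""
    return hashlib.sha256(image_bytes).hexdigest()


def find_leaked_examples(train_images, test_images):
    """Return names of test images whose bytes also appear in the training set.

    Both arguments are assumed to be dicts mapping a file name to file bytes.
    """
    train_hashes = {fingerprint(data) for data in train_images.values()}
    return sorted(
        name
        for name, data in test_images.items()
        if fingerprint(data) in train_hashes
    )


# Tiny made-up example: "c.png" is a byte-for-byte copy of "b.png".
train = {"a.png": b"\x01\x02", "b.png": b"\x03"}
test = {"c.png": b"\x03", "d.png": b"\x04"}
print(find_leaked_examples(train, test))  # ['c.png']
```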

Only now have we done enough testing to start training with confidence. Many stories of failed AI projects can be traced back to missing one of these test steps. Even the best teams can forget. A few years ago a web research team focusing on an AI for the Japanese language announced that they’d developed a 100% perfect search engine — which is near-impossible. It was my team’s job to test and debunk that claim with data. Now that you are an expert AI tester, you can probably guess they used the training set for the testing set. That team went from self-declared victory to a humble technical shame in a matter of hours. AI systems need testing too!

Testing Relevance

At this point, we have trained a little AI brain on well-tested data and features. The real testing starts now. This phase of testing is designed to test the relevance of the AI brain in isolation, and then to test the AI when integrated into the real-world product.

The trained AI can now determine if an image is either a button, text box, or something else. AIs are rarely perfect, so we need to understand the texture of the failure modes. Generally speaking, if the AI gets it wrong, it is one of these two types of failures:

  1. False Positive: This happens when the AI looks at an image of a text box and incorrectly determines that it is a button.
  2. False Negative: This happens when the AI looks at an image of a button and says it is not a button.

It is important to distinguish between these two types of failure modes. There is often an underlying tradeoff between precision and recall when training AI systems.

Recall vs Precision: https://en.wikipedia.org/wiki/Precision_and_recall

Recall is important for systems where false negatives are dangerous. The AI in a self-driving car is a great example. Generally, it is preferable for self-driving cars to stop if they see something that looks even close to being human in the road ahead. It is OK if the AI occasionally confuses a pothole, or branch, for a person on the road and brings the car to a stop. These false positives are OK for such a system because it means it is less likely to miss seeing a person on the road. Some systems need better recall than precision.

Precision is important for systems where false positives are dangerous. Consider the AI for an autonomous military drone. In this case, we’d prefer the system not attack something unless it is extremely confident that what it is attacking is an enemy drone and not a human. It is far better to miss an attack opportunity than accidentally attacking a human wearing a propeller-beanie. Some systems need better precision than recall.
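
Both measures fall out of simple counts of true positives, false positives, and false negatives. Here is a minimal sketch for a single label; the function name `precision_recall` and the example labels are made up for illustration.

```python
def precision_recall(actual, predicted, positive="button"):
    """Compute precision and recall for one label from paired label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    # Precision: of everything called a button, how much really was one?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of all the real buttons, how many did we find?
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


actual = ["button", "button", "button", "text_box", "text_box", "other"]
predicted = ["button", "button", "text_box", "button", "text_box", "other"]
precision, recall = precision_recall(actual, predicted)
# Here tp=2, fp=1, fn=1, so both precision and recall come out to 2/3.
```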

Not only can AI systems predict whether an image is a button or text box, they also provide a ‘confidence’ with their predictions. It is important to test the confidence of the system.

Isolated relevance testing focuses on understanding the texture of the AI’s output. Firstly, it is important to test the AI on real-world examples of input and determine whether its confidence values can be trusted. For each input, the AI will predict the label(s) with different confidence values. Confidence values are between 0 and 100%. So, if the trained AI is presented a picture of a button, it will say something like ‘90% sure it is a button, and 5% sure it is a text box and 5% it is neither of the two.’ Generally, the more a button looks like a text box, the higher the AI’s confidence that the button might actually be a text box, and the lower its confidence that it is truly a button.
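
One way to check whether those confidence values mean what they say is to group predictions by confidence and compare each group’s average confidence with how often the AI was actually right. The sketch below is an assumption about how such a check could look, not the tooling used at test.ai; `calibration_table` and its input format are hypothetical, and confidences are expressed as 0.0 to 1.0 rather than percentages.

```python
def calibration_table(predictions, num_bins=5):
    """Compare stated confidence with observed accuracy, bin by bin.

    predictions is a list of (confidence, was_correct) pairs, with
    confidence between 0.0 and 1.0. Returns one row per non-empty bin:
    (bin_start, bin_end, count, mean_confidence, accuracy).
    """
    bins = [[] for _ in range(num_bins)]
    for confidence, was_correct in predictions:
        # A confidence of exactly 1.0 goes in the last bin.
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, was_correct))

    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        rows.append(
            (index / num_bins, (index + 1) / num_bins,
             len(members), mean_confidence, accuracy)
        )
    return rows


# Made-up predictions: (confidence, was the top label correct?)
sample = [(0.95, True), (0.92, True), (0.91, False), (0.15, False), (0.10, True)]
for row in calibration_table(sample):
    print(row)
```

If the AI says ‘90% sure’ across a bin but is right only two times out of three, the confidence is overstated for that range, and any threshold built on it deserves suspicion.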

To test these confidence values, it is essential to measure recall vs precision. Below is an example from the testing of our AI systems at test.ai.

These two graphs plot precision and recall for correct labeling of images. The labels are for things like ‘buttons’ and ‘text boxes’, etc. Each grey line is the curve for a specific label, and the red lines are the average of the grey lines. With recall, you can see that as the confidence threshold goes up, the system is less likely to identify a button. The precision graph shows that as the confidence threshold goes up, the likelihood that what the system identifies as a button really is a button also goes up, as expected.

For one of the sub-systems at test.ai, we wanted to balance precision and recall equally — maximizing the product of the two values. The following graph shows the combination (product) of precision and recall versus confidence thresholds. Each arc below shows how the value of R*P changes as a function of confidence thresholds. Each grey line represents an individual label such as ‘button’, or ‘text box’. The red line is the ‘average’ of all lines. The gold star was the optimal value for the threshold based on human guesses. The red star shows the true ideal value for the threshold — the product relevance was improved by quantifying and testing the texture of the output data. Our original system used only one large classifier for all labeling, and we set the threshold to this new optimum threshold value of 0.42 to balance precision and recall, based on this testing.
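
The threshold search described above can be sketched in a few lines: for each candidate threshold, keep only the predictions at or above it, compute precision and recall, and remember the threshold with the largest product. The function name `best_threshold`, the input format, and the example scores are all made up for illustration; this is not the test.ai code or data.

```python
def best_threshold(scored_examples, thresholds):
    """Find the confidence threshold that maximizes precision * recall.

    scored_examples is a list of (confidence, is_positive) pairs for one
    label. Returns (threshold, product) for the best candidate.
    """
    total_positive = sum(1 for _, is_positive in scored_examples if is_positive)
    best = (None, -1.0)
    for threshold in thresholds:
        # Predictions at or above the threshold count as "accepted".
        accepted = [pos for conf, pos in scored_examples if conf >= threshold]
        true_positives = sum(accepted)
        precision = true_positives / len(accepted) if accepted else 0.0
        recall = true_positives / total_positive if total_positive else 0.0
        product = precision * recall
        if product > best[1]:
            best = (threshold, product)
    return best


# Made-up scores for a single label.
examples = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.4, False), (0.3, True), (0.2, False), (0.1, False),
]
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
threshold, product = best_threshold(examples, candidates)
# For this toy data the best threshold is 0.3, where precision is 4/6
# and recall is 4/4.
```

The same sweep works for any other way of combining the two numbers. Swapping the product for a weighted score is how you would bias the system toward recall or toward precision.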

The texture of the data showed us that there was a wide variation in precision and recall across different labels. In response, we built individual classifiers for each label. We had to build a separate training and relevance testing exercise for each of almost 300 labels. Testing these new neural networks, we found better precision and recall values for each individual label, and given that each label had its own AI brain, we could now set the threshold to the optimal value for each label.

The graph above shows the ideal thresholds for each individual label. There is a wide variety of optimal thresholds, and our ability to test and measure all this data meant our overall classification precision and recall dramatically improved. The columns on the right show the ideal threshold values for individual labels. Labels such as ‘fingerprint’, ‘google’, ‘barcode’ have high threshold values intuitively because the images of these things are quite unique and consistent within the training set. The lower thresholds also make sense, as things such as ‘up_arrow’, ‘down_arrow’, ‘share’ have a lot of variety within each label, and also are close variations of each other. An arrow could be mistaken for a sharing button, and vice-versa.
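
With a separate threshold per label, choosing a label at prediction time looks roughly like the sketch below. The threshold numbers are made up for illustration and are not the values from the test.ai graphs; `classify` and `default_threshold` are hypothetical names.

```python
def classify(scores, thresholds, default_threshold=0.5):
    """Pick the highest-scoring label among those that clear their own bar.

    scores maps label -> confidence from that label's classifier.
    thresholds maps label -> minimum confidence for that label.
    Returns None when no label is confident enough.
    """
    passing = {
        label: score
        for label, score in scores.items()
        if score >= thresholds.get(label, default_threshold)
    }
    if not passing:
        return None
    return max(passing, key=passing.get)


# Illustrative thresholds only: strict for a distinctive label,
# looser for labels with lots of visual variety.
thresholds = {"fingerprint": 0.9, "share": 0.3, "up_arrow": 0.3}

scores = {"fingerprint": 0.85, "share": 0.45, "up_arrow": 0.35}
print(classify(scores, thresholds))  # share
```

Notice that ‘fingerprint’ has the highest raw score here but is still rejected, because it has to clear a much higher bar than the arrow-like labels.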

Some ‘share’ buttons look like arrows.

Testing Integration

Now that we know the precision, recall, and threshold characteristics of the AI brain, we need to test how that is integrated into the product itself. There are many things to test:

  1. Product Integration: Test to make sure the AI is connected to the rest of the product correctly. Check that the image sent into the AI brain for classification is the correct image: not off-by-one in an index of images, not cropped or resized. Testing should also happen where the output of the AI classifier is fed back into the product code, making sure the threshold values are actually set as expected, the correct label value is fed back into the product, etc.
  2. Product Relevance: Perhaps the most important testing, and often the only testing done on AI projects, is testing the entire product end-to-end. Feeding real input data through the product UI/API and verifying that the output is correct, or at least statistically relevant and useful, is of utmost importance. Can the car drive itself for 1000 miles? Are the top 10 search results in the right order top to bottom? Can the AI testing bot execute a test case against an application it has never seen before? Often this means creating additional functional tests and measures of application-specific relevance.
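
The product integration checks are ordinary software tests once the classifier is swapped for a stand-in. The sketch below assumes a hypothetical piece of product code, `label_element`, that classifies one image from a list and applies a threshold; `FakeClassifier` is a test double that records what it was handed. None of these names come from a real product.

```python
class FakeClassifier:
    """Test double: records the images it receives, returns a canned answer."""

    def __init__(self, label, confidence):
        self.label = label
        self.confidence = confidence
        self.seen_images = []

    def predict(self, image):
        self.seen_images.append(image)
        return self.label, self.confidence


def label_element(classifier, images, index, threshold):
    """Hypothetical product code: classify images[index], or return None
    when the classifier is not confident enough."""
    label, confidence = classifier.predict(images[index])
    return label if confidence >= threshold else None


def test_correct_image_reaches_classifier():
    fake = FakeClassifier("button", 0.9)
    images = ["img_0", "img_1", "img_2"]
    label_element(fake, images, index=1, threshold=0.42)
    # Guards against off-by-one indexing: only img_1 should be classified.
    assert fake.seen_images == ["img_1"]


def test_threshold_is_applied():
    images = ["img_0"]
    assert label_element(FakeClassifier("button", 0.41), images, 0, 0.42) is None
    assert label_element(FakeClassifier("button", 0.42), images, 0, 0.42) == "button"
```

Tests like these say nothing about how good the AI is. They only confirm that the plumbing around it does what the relevance testing assumed.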

Functional tests and operational procedures should be put in place to make sure that the classifier isn’t used on data it wasn’t trained on. Applying AI trained on one dataset to a different type of data is fraught with trouble for even the best-trained networks today. Below is an example of the accuracy of general image classifiers built by Google, versus the specialized classifiers built by test.ai for identifying components of applications. test.ai thinks my profile icon is a profile icon. Google’s general image recognition AI thinks it is a toilet seat with my face in the middle….

Google’s AI Thinks I’m…

Testing AI-based products is very different than traditional software testing. It is far more focused on data and analytics than on coming up with creative test cases, manually tapping apps, or writing automation scripts. Most AI testing today is performed by the machine learning engineers, much as most testing was performed by engineers in the old school days of procedural programming. This is because testers with statistics and analytical backgrounds are rare. Even then, most AI projects only test a subset of the things they should be testing. As AI systems become more complex, more dangerous, and more widely deployed, the field will create a role for AI-testing specialists, and it’s likely they will be well paid.

— Jason Arbon, CEO test.ai