AI for Testing: Identifying App State
You know that feeling when you see a new app for the first time? That feeling of vague familiarity? That is because many apps look and behave pretty similarly, and your brain is pattern-matching against all the other apps it has seen before. "Yeah, that new page looks like a login page; I know what to do here." "Oh, that looks like a shopping cart page; I'll look for the checkout button." To teach a machine to use and test apps like a human, we need to instill this same sense of recognition and intuition. How can we do that?
To replicate the human tester's behavior, the machine needs to be taught two different things: to recognize the state of the application, and to know what inputs are valid or interesting for that state.
Recognizing the state of an application can be accomplished by giving the machine many screens and labeling the screens as a type of application state — login screen, search results, permissions dialog, etc. If we can teach the machine to recognize the state the application is in, much like a tester recognizes the state of the application they are testing, the AI testing machine can intelligently pick what types of inputs to apply to the application.
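The mapping described above, from recognized state to candidate inputs, can be sketched in a few lines. The state names and input lists here are illustrative assumptions, not the production taxonomy:

```python
from enum import Enum

# Hypothetical application-state labels; a real taxonomy is larger.
class AppState(Enum):
    LOGIN = "login"
    SEARCH_RESULTS = "search_results"
    PERMISSION_DIALOG = "permission_dialog"
    SHOPPING_CART = "shopping_cart"

# Once the state is recognized, pick state-appropriate test inputs.
INTERESTING_INPUTS = {
    AppState.LOGIN: ["type username", "type password", "tap sign-in"],
    AppState.SEARCH_RESULTS: ["scroll", "tap first result"],
    AppState.PERMISSION_DIALOG: ["tap allow", "tap deny"],
    AppState.SHOPPING_CART: ["tap checkout", "change quantity"],
}

def inputs_for(state: AppState) -> list:
    # Fall back to generic gestures for unrecognized states.
    return INTERESTING_INPUTS.get(state, ["tap", "swipe", "back"])
```

The table is the payoff of state recognition: the classifier only has to answer "which screen is this?", and the input-selection logic follows from the answer.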
First, we gather many thousands of screenshots of real-world applications, along with the metadata that describes the context of each screen, element, and data point. We then consider all the major classes of state that applications have. Luckily for machine learning, most apps share similar states, which means we can gather many examples for ML training data.
Second, we need humans to apply the labels to each screen. A simple application is built that shows one of these screens alongside a list of possible labels. The labeler (a human performing this labeling task) clicks the appropriate labels for each screen in sequence.
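The record the labeling app needs to save per screen is small. A minimal sketch, with made-up label names and field names standing in for the real tool's:

```python
from dataclasses import dataclass, field

# Illustrative label taxonomy; the real list is application-driven.
POSSIBLE_LABELS = ["login", "search_results", "permissions_dialog", "cart"]

@dataclass
class LabeledScreen:
    screenshot: str                          # path to the screenshot file
    labels: list = field(default_factory=list)  # labels the human clicked

def label_screen(screenshot: str, clicked: list) -> LabeledScreen:
    # Keep only labels from the known taxonomy, guarding against typos.
    return LabeledScreen(screenshot, [l for l in clicked if l in POSSIBLE_LABELS])

corpus = [label_screen("screen_001.png", ["login"])]
```

One record per screen, accumulated into a corpus, is all the downstream training step needs.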
Once all the labels for each screen have been saved, we have a corpus of labeled data with which to train the ML. Again, the detailed mechanics of this work are beyond the scope of this paper, but suffice it to say that thousands of screens are shown to the ML program, where the input is a combination of the pixels in the screenshot itself and the DOM elements. The training/test data is the set of labels the humans assigned to each screen.
ML Training Input and Output: (Image, DOM element information, expected labels) tuples
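One such tuple can be sketched concretely. The shapes here are assumptions for illustration (a fixed-size grayscale screenshot, a bag-of-tags DOM feature vector, and a 10-state label taxonomy); the production feature set is richer:

```python
import numpy as np

# Hypothetical DOM feature: counts of a few interesting element tags.
DOM_TAGS = ["input", "button", "img", "a", "select"]

def dom_features(tag_counts: dict) -> np.ndarray:
    # e.g. {"input": 2, "button": 1} -> [2, 1, 0, 0, 0]
    return np.array([tag_counts.get(t, 0) for t in DOM_TAGS], dtype=np.float32)

def make_example(pixels: np.ndarray, tag_counts: dict, label_ids: list) -> tuple:
    # Multi-hot encode the expected labels over an assumed 10-state taxonomy.
    labels = np.zeros(10, dtype=np.float32)
    labels[label_ids] = 1.0
    return (pixels, dom_features(tag_counts), labels)

# One (image, DOM info, expected labels) tuple for a login-like screen.
example = make_example(np.zeros((128, 128), dtype=np.float32),
                       {"input": 2, "button": 1}, [0])
```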
The training process can require hours of computation. This production system leveraged Convolutional Neural Networks (CNNs): the training system shows the network screenshot after screenshot, checking whether it labeled each screen correctly. Every time the network gets the label for a screen wrong, it adjusts its internal weights, and the process is repeated until the network does the best labeling job it can do.
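That "guess, check, adjust, repeat" loop is the heart of training. As a toy stand-in for the CNN (which works on raw pixels and is far larger), here is the same loop with a simple linear classifier over hand-made features, written in plain Python:

```python
# Toy stand-in for the CNN training loop: a linear classifier that is
# adjusted each time it mislabels a "screen". Features and labels are
# invented for illustration; the production system trains a CNN on pixels.
N_FEATURES, N_STATES = 4, 3
weights = [[0.0] * N_FEATURES for _ in range(N_STATES)]

def predict(features):
    scores = [sum(w * f for w, f in zip(ws, features)) for ws in weights]
    return scores.index(max(scores))  # pick the highest-scoring state

# (feature vector, correct state label) pairs standing in for screenshots.
data = [([1, 0, 0, 1], 0), ([0, 1, 0, 0], 1), ([0, 0, 1, 1], 2)]

for epoch in range(20):                 # repeat until the labels come out right
    for features, label in data:
        guess = predict(features)
        if guess != label:              # wrong label: adjust internal weights
            for i, f in enumerate(features):
                weights[label][i] += f  # reinforce the correct state
                weights[guess][i] -= f  # penalize the mistaken state
```

After a few passes this converges on these three examples; the CNN version does the same thing at vastly larger scale, with the weight updates computed by backpropagation.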
It is worth noting that training data that is sparse, incorrect, or inconsistent can prevent the ML from learning.
We can now determine the label for the screens we trained on and, more importantly, determine the correct label (application state) for screens the trained network has never seen before, just like a human tester.
Robustness of State Identification
When the ML training is done, the quality of the machine labeling is often visualized via a chart called a 'confusion matrix'. The confusion matrix simply plots each screen label on each axis and shows how often one screen type is mistaken for another. The results here are from early work after training with about 6,000 screenshots. The production version now leverages over 300,000 screenshots, some of which are training data generated by various morphs of the smaller training set to ensure the model is robust to color and size changes, etc. This approach works fairly well: lighter blues and smaller off-diagonal numbers mean better labeling.
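Building a confusion matrix is straightforward: tally (true label, predicted label) pairs into a grid. A minimal sketch with invented state names and counts:

```python
from collections import Counter

# Rows are true labels, columns are predicted labels; off-diagonal
# entries count how often one screen type is mistaken for another.
STATES = ["login", "search", "cart"]

def confusion_matrix(true_labels, predicted_labels):
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in STATES] for t in STATES]

# Five hypothetical screens: one login screen is mislabeled as search.
true = ["login", "login", "search", "cart", "cart"]
pred = ["login", "search", "search", "cart", "cart"]
matrix = confusion_matrix(true, pred)
```

A perfect classifier puts all its counts on the diagonal; any mass off the diagonal (here, `matrix[0][1]`) points at a specific pair of states the model confuses, which tells you where to add training data.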
I hope this visibility into real-world AI testing systems inspires more developers and testers to consider using AI in their daily work. A little secret is that it is far easier to build AI 'classifiers' for application state than to hand-craft heuristics in procedural code. Better, the AI versions get smarter over time as more samples are collected. Even better, the AI version effectively keeps all 300,000 images, from thousands of different apps, in its 'mind' at the same time when determining the state of an application. I know some very smart testers, but no human can do that. It is magical. It is like having a tester with the experience of testing thousands of different applications, where thousands of copies can work at the same time, and they never get tired and start watching YouTube between test cases.
Next up: How to teach AI to apply intelligent test inputs like taps, clicks, swipes, text input, given the current application state.
— Jason Arbon, CEO, Appdiff.com