Test Automation Demystified, Part 7: AI in Test Automation or When to Expect the Rise of the Machines

Alexey Grinevich
17 min read · Aug 7, 2019


We all know the table ‘Levels of Automation for On-Road Vehicles’. It introduces easy-to-understand, high-level grades that distinguish the maturity levels of self-driving cars. We hear a lot about AI in software testing. Many speakers talk about remarkable achievements and, while talking, point to robotic cars as examples. We think it is time to introduce a similar kind of scale for software testing. We will propose a model with 4 levels, where Level 1 is manual testing and Level 4 is fully automatic exploratory testing. We will also try to assess the current level of the industry and estimate when to expect Level 4.

Automation Levels

I believe you have heard this statement many times:

Statement 1: “AI is the future of testing”.

Somewhere in the same speech you probably also heard something like:

Statement 2: “Just 10 years ago driverless cars were at level 0 or 1 of automation, and now they are at level 4 (High Automation) or 5 (Full Automation)”

Our current level of AI-based testing is similar to that of driverless cars some time ago. Some technologies exist, some solutions are on the market, but where are we now? Level 1? 2? Or higher?

The value of such a scale is that it lets non-technical people easily understand the overall maturity level of the industry. We see a lot of speculation about AI in testing. We read many papers and see demonstrations. And in many cases we see that the words “AI” or “Machine Learning” are used as mere buzzwords for some old, trivial feature. Or it may be a minor feature (such as a syntax checker for code comments), yet the whole product is declared AI-enabled. This is a marketing trick that makes it really hard to understand how important AI is there and what role it actually plays.

Can we simply map the level definitions from cars to software? Perhaps we can then apply a similar scale and find where we are on this path. We want a bird’s-eye view to better understand how AI technologies are spreading into the software testing world.

Terms Mapping

So what intrinsic terms or notions do we have when we speak about driverless cars?

Basically, the car world has the following key notions:

The goal for the driver is to get from address 1 to address 2.

Maybe you already know what the mapping for “The Road” is. For me it is not so straightforward, but let’s try to figure it out.

What is the “Road”? It is something including:

1. Markup (lanes)
2. Road signs
3. Map

All three are essential, always present, and used by the robot to drive.

We may find software analogs like this:

1. Lanes are forms, controls, and API entry points.
2. Road signs and maps are specifications and documentation.
3. The route is the test scenarios.

So the testing goal is to complete all scenarios, checking all specifications and covering all forms, controls, and entry points.

The Road

We just mapped the definition of the Road to something in the software testing world. We know that there is a well-established road infrastructure around us, and we may rely on its existence while building the robotic driver.

How much of this exists in the software world?

1. Forms, controls, and APIs are an essential part of software. They exist in every application.

2. It is less clear with road signs. Sometimes we have partial specifications for some components. They may be human-readable (for programmers). It is good if we have requirement management in place and at least some requirements. Sometimes we are lucky enough to have formal specs, something closer to Cucumber or even Gherkin. In most cases the specification is partial and free-form (emails, paper sketches, or even verbal requests).

Imagine a similar situation for a car driver. There is a road, but there are no signs saying whether it is one-way or two-way. Nor do we know the speed limit (everyone in this area knows it). All the locals and the police know that the road is one-way, but this is not expressed anywhere. That would definitely make a robotic driver’s work more challenging.

3. Scenarios. Sometimes we have formalized scenarios. Sometimes only manual test cases (known to be not so good for automation) described as human-readable steps. We may also have use cases. There is also a history of closed bugs to use for regression.

To summarize: we have software that needs testing, but the presence of specifications and scenarios is not guaranteed. Exploratory testing is a good example: we should create the specification and scenarios automatically while testing the software.

AI Tester Model

Let’s try to proceed with the car driver analogy. We may visualize the driver like this:

It is the T800, and it has everything needed to work as a driver:

1. Arms to control the wheel and transmission.
2. Legs for pedals.
3. Head with sensors to collect road information.
4. Ears to get information about the destination.
5. CPU to perform all driving activities.
6. Maybe it does not have GPS in the default version, but let’s assume it has it in our modification.
7. Let’s also add 3G or LTE to access maps.

As we know from the movie, the T800 was able to use cars to find Sarah, so we consider it a proven model.

Now let’s see what we need for software testing, taking the T800 as a base. Basically, we need eyes to see the screen:

And an arm to use the keyboard and mouse:

So here we have it all assembled together:

Well, this should do what we need. It has eyes to see the screen output, and it may interact with software using its hand. And maybe we need just one more thing: we need to know the test verdict. Let’s add one more familiar detail to show the final status:

I’ll call it Arnie. Once Arnie knows what to do, it has everything it needs to proceed. We may even simplify this model. Here is a well-known form factor that has all we need:

It may also do API testing through software interfaces:

R2D2 is good for testing, but I still think it is not complete, and Arnie has one vital feature. In exploratory testing, the tester is expected to interact with users and developers to better understand the system and collect essential information:

This is where a hand is more useful. The same approach is also valid for resolving possible ambiguities.

AI Tester Goals

So what should an AI-based tester be able to do? It depends on what you have, and it depends on the task.

The core problem is that we don’t have a clear goal. It is very flexible, and in many cases it is implicit.

A simple example. The AI’s goal is to replace a human. So suppose you are standing in for an AI right now (just as you may stand in for a robotic driver by driving the same car yourself).

So here is a button. The goal is to test it:

I expect that you would ask: “What do you want me to test?” OK, the goal is to find bugs. Please find bugs in this button. Anything? Yes, it has 4 holes, and it is not round enough.

Are there any bugs in this metallic button? Perhaps it is not metallic? Or is it? Here we would need to interact with it to check.

Are there any bugs in this wooden-style metallic button?

Is it production ready?

Is it production ready as a 5-meter metallic memorial called “The Wooden Button”?

What is Testing

What is software testing? Is there an answer to that question? At a training session on formal testing we ran this experiment: we asked attendees to give their own answer to this question. The problem is that no common answer was ever found.

We all think we understand it, but if we are asked to give a formal definition and compare it with others’ answers, we see drastic differences. You may try it yourself.

So what is testing? Here are some of the most common answers:

1. Finding bugs.
2. Looking for crashes.
3. Checking that application can be used for its purpose.
4. Checking that the application conforms to the specification.
5. Deciding whether an application may go to production.
6. Validating that application is secure and does not expose private or restricted information outside (security testing).
7. Checking that the application can serve the required load (load testing).
8. Comparison testing (to similar product).
9. Regression testing (comparison to the previous version of the same product).
10. Localization testing (product works for multiple countries).
11. Usability testing.
12. Documentation testing.
13. … more …, much more.

Every type of testing makes no sense for some cases. Every type of testing makes perfect sense for some cases.

For example, a “crash” sounds like a problem. But how many times have we seen that crashes are not real problems for users? The user may refresh the page and move on, without thinking anything really bad happened.

Even this is testing:

It is also important to understand whether the system is a “white box”:

or a “black box” that should be controlled from the outside:

Since there are many different tasks related to testing and many different meanings, we still have to start by defining the word “testing”.

AI Tasks in Software Testing

There are lots of things that need to be automated, and many of them are not related to each other. We roughly divided all such activities into 4 distinct swim lanes.

We also assume that each swim lane has 3 levels:

1. Non-AI
2. Partial AI
3. Full AI

This set of automation levels could be much more detailed, and we could add more swim lanes. However, we want to stay reasonably rough so that minor details don’t hide the full picture we want to see.

Swim Lane 1: Software Interaction

1. Pure Manual Testing.
2. Play back recorded or hand-scripted interactions. Automatically recover locators of some elements (recovery, self-healing).
3. Interact based on high-level instructions or human-readable text (fill a form, log in, select a grid row), i.e. find elements like a real person, using visual methods.

This swim lane is the most developed. There are remarkable achievements here. We may state that the overall industry level is 2, and we are moving toward level 3.
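
To make level 2 more concrete, here is a minimal sketch of a “self-healing” locator in Python with Selenium. The fallback list and the find_with_healing helper are my own illustration rather than any particular vendor’s API: the idea is simply to try the recorded locator first and fall back to alternative attributes when it breaks.

```python
# Minimal sketch of a "self-healing" locator (illustrative, not a vendor API).
# Requires: pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def find_with_healing(driver, locators):
    """Try each (by, value) locator in order and return the first element found.

    The first entry is the originally recorded locator; the rest are
    fallbacks (name attribute, visible text) captured at record time.
    """
    for by, value in locators:
        try:
            element = driver.find_element(by, value)
            print(f"Located element via {by}={value!r}")
            return element
        except NoSuchElementException:
            continue  # locator broken by a UI change, try the next one
    raise NoSuchElementException(f"No locator matched: {locators}")

if __name__ == "__main__":
    driver = webdriver.Chrome()
    driver.get("https://example.com/login")  # hypothetical application under test
    login_button = find_with_healing(driver, [
        (By.CSS_SELECTOR, "#login-btn"),          # recorded locator
        (By.NAME, "login"),                       # fallback: name attribute
        (By.XPATH, "//button[text()='Log in']"),  # fallback: visible text
    ])
    login_button.click()
    driver.quit()
```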

Swim Lane 2: Scenario

1. Pure manual planning. In many cases this work is done at the test architect, management, or individual tester level.

2. Partial automation. Generation of some scenarios based on formalized specifications, traces, logs, and other machine-readable information. Partial input data generation (possibly model-based).

3. Full automation. Scenarios are automatically created from a model, from specifications and requirements expressed in arbitrary forms, and from interactions with the software.

The state of the art is that we are mostly at level 1. Some technologies reach level 2. In most cases scenarios are generated by analyzing a formal specification (VHDL or Verilog, for example) or by analyzing execution logs (web). Another known approach is monkey testing, but it is more of a random approach.
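
To show how crude the “random approach” really is, here is a minimal monkey-testing sketch against a web UI using Selenium. The URL, the choice of actions, and the simple “error in the page title” oracle are my own assumptions for illustration.

```python
# Minimal monkey-testing sketch: random clicks and keystrokes (illustrative only).
# Requires: pip install selenium
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import WebDriverException

RANDOM_TEXTS = ["hello", "12345", "", "<script>", "a" * 200]

def monkey_test(url, steps=100, seed=42):
    random.seed(seed)  # fixed seed so a failing run can be replayed
    driver = webdriver.Chrome()
    driver.get(url)
    try:
        for step in range(steps):
            elements = driver.find_elements(By.CSS_SELECTOR, "a, button, input")
            if not elements:
                break
            target = random.choice(elements)
            try:
                if target.tag_name == "input":
                    target.send_keys(random.choice(RANDOM_TEXTS))
                else:
                    target.click()
            except WebDriverException:
                pass  # element became stale or unclickable, the monkey moves on
            # A very crude oracle: watch for obvious error pages.
            if "error" in driver.title.lower():
                print(f"Possible failure after step {step}: {driver.current_url}")
    finally:
        driver.quit()

if __name__ == "__main__":
    monkey_test("https://example.com")  # hypothetical application under test
```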

Swim Lane 3: Specification

Implicitly, we always have a model in mind. So if we are asked to test this door:

We will see that the door itself is OK. The problem is around the door, and we see a collision with the internal model of a door that we have in our minds. In this example, and in the examples below:

We detect a problem based on our “common sense”, that is, on a model of how things should behave that is planted in our minds. So even if this model is implicit, it is required in order to test the applicability of something.

Swim Lane 4: Goals / Targets / Criteria

We need to know when to stop testing. And once we have stopped, we need to understand how well we tested the system.

1. Testing goals are defined by the test engineer.
2. Testing goals are partially detected (sub-tests that should pass, % code coverage, time limits), i.e. AI helps define the conditions for when to stop testing and release/launch/ship.
3. Testing goals are set completely automatically. AI gives the final verdict (release/launch).

On this swim lane we are between 1 and 2. There is a reason why this swim lane exists at all. The ultimate universal answer would be “test everything”: provide all inputs in all combinations. The reality is that exponential growth leaves us no chance of doing that.
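
Here is a back-of-the-envelope sketch of why “test everything” collapses, followed by the kind of level-2 stopping rule mentioned above (a coverage threshold plus a time budget). The field counts and threshold values are made up for illustration.

```python
# Why "provide all inputs in all combinations" is hopeless:
# a form with 10 independent fields, each with 20 meaningful values,
# already needs 20**10 combinations.
fields, values_per_field = 10, 20
total = values_per_field ** fields
print(f"Exhaustive combinations: {total:,}")                              # 10,240,000,000,000
print(f"At 100 tests/second: {total / 100 / 3600 / 24 / 365:.0f} years")  # roughly 3,247 years

# A level-2 style stopping criterion: stop when a coverage goal or a time
# budget is reached (both thresholds are arbitrary and chosen by a human).
def should_stop(coverage_percent, elapsed_minutes,
                coverage_goal=80.0, time_budget_minutes=60):
    return coverage_percent >= coverage_goal or elapsed_minutes >= time_budget_minutes

print(should_stop(coverage_percent=83.5, elapsed_minutes=12))  # True: goal reached
print(should_stop(coverage_percent=40.0, elapsed_minutes=75))  # True: out of time
print(should_stop(coverage_percent=40.0, elapsed_minutes=12))  # False: keep testing
```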

There are formal approaches to this problem (such as CTL, https://en.wikipedia.org/wiki/CTL ), where the program is proven to be correct, but we agreed not to dive into details…

Levels of AI in Testing

Here is my composition of testing levels, from Level 1 to Level 4, based on the progress in each of the swim lanes:

Or, in another form:

This is the current state of the art for existing solutions:

“Self-healing selectors”, “Visual testing”, and “Find elements like a real person” belong to the “Interaction” swim lane. We can see that it is the most developed one. Monkey testing belongs to “Scenarios”, while “Specification” and “Goals” are still defined manually.

We see a solid level 2 for the software testing industry as a whole, with perhaps 10%–30% of what Level 3 requires already covered. Here I take into account all AI-related developments and the recent achievements gained after the rise of machine-learning-based methods. Some of them are on the market, while others are about to appear on it. So unless we see another breakthrough in algorithms, the state described here will persist for the next 5–10 years.

Infrastructure

The “current” robotic car driver is less sophisticated than the T800. That is easy to show: the robot driver requires detailed maps, traffic information, and GPS to navigate successfully. The T800 was able to collect Sarah Connor’s address and find locations using only paper maps and a yellow pages book.

So not only has the robot driver become better, but the infrastructure has also been built up to facilitate its operation, and that is a major global change.

So the better the infrastructure, the more chances AI has to deal with it. For example, the first line to be operated with Automatic Train Operation (ATO) was the London Underground’s Victoria line, which opened in 1967. Another useful piece of information here is the well-known standard covering levels of ATO, from Grade of Automation 1 to Grade of Automation 4.

Another well-known example is the autopilot. The first autopilot system was tested back in 1930:

Autopilots in modern complex aircraft generally divide a flight into taxi, takeoff, climb, cruise (level flight), descent, approach, and landing phases. Autopilots exist that automate all of these flight phases except taxi and takeoff. An autopilot-controlled landing on a runway and controlling the aircraft on rollout (i.e. keeping it on the center of the runway) is known as a CAT IIIb landing or Autoland, available on many major airports’ runways today, especially at airports subject to adverse weather phenomena such as fog. Landing, rollout, and taxi control to the aircraft parking position is known as CAT IIIc. This is not used to date, but may be used in the future. An autopilot is often an integral component of a Flight Management System.

https://en.wikipedia.org/wiki/Autopilot

The last paragraph is a quotation from the Wikipedia article on “Autopilot”, and what we get from it is that avionics has greatly improved since 1930. There are lots of helper systems and standards. There are infrastructural things such as “Autoland”, available at major airports, that help with landing. However, what I was unable to find is some sort of “autopilot levels” scale helping me understand when a flight could be performed without a pilot at all. Even taking into account that the first parts of it were working back in 1930, there is still no near-future prospect of that. We already see success in reducing the crew to 1–2 pilots for big planes, but no sign of when we will fly without pilots. The major difference from trains and cars is the inability to stop in the middle of a flight in case of an unexpected situation or out-of-range values coming from the sensors.

So how does this compare to software testing?

In general, we see a tendency toward fully automatic control everywhere, whenever possible. However, that possibility may depend on key infrastructural features that make certain things possible. The more common standards and regulations are in place, the better results AI can produce.

It is also important to see how many distinct phases the software has. The system that automates the taxiing of a plane is different from the one involved in landing. And while one may already be declared 99% complete, the other may be at 80% or less, and may require the global infrastructure to grow all over the world just to improve from 80% to 90%.

That is why we hear so much about breakthroughs and AI-enabled solutions in various types of testing. But the sheer complexity and fragmentation of the task hidden under the word “testing” still keeps us very far from the destination point of “Level 4 — Fully Automated Exploratory-Style Testing”.

We think the industry is now at Level 2 on the proposed scale of AI in software testing. While there are remarkable achievements, and new technologies and standards appear every day, there is still a lot to do before we get there. In short: we will know we are close when there is an easy-to-understand, well-known standard for measuring the level of AI, just like the ones for trains and automobiles, i.e. a “bird’s-eye view” of the industry. Until then we are still in the global exploration phase. So every time a talk starts with “AI” and then narrows down to very specific terms (machine learning, deep learning) and examples from Google/Tesla/IBM/Uber/Amazon, most probably it is yet another machine-learning-based solution covering just one more task out of a thousand different ones.

The influence of AI on test automation differs depending on the swim lane considered. It is likely to improve the “interaction” part: object recognition, image comparison. It is not so good for scenarios and specifications, and much worse for goals and criteria. So the last three swim lanes still require significant manual involvement today.

Changing Environment

Consider one more characteristic feature of current software development: Waterfall, RUP, SCRUM, Agile, DevOps. Software development itself keeps changing. The typical development cycle has shrunk from years to months, then to days, and sometimes to hours. A ready-to-use product may appear as the result of a 24-hour hackathon.

In many cases testing is actually skipped and left to end users, thanks to simplified shipment and update processes. On the one hand, testing is thereby delegated to end users. On the other, frequent releases mean more testing work overall, so a lot of effort is still required.

This makes testing different. Also, the notion of a “flaky” test makes it questionable whether we can ever achieve 100% automation, even if each automated test is developed manually by the best possible engineers.
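
For readers unfamiliar with the term, here is a tiny illustration of what “flaky” means and of the usual workaround (rerunning). The example test and the retry helper are made up for illustration; they are not a recommendation.

```python
# A "flaky" test passes or fails without any change in the code under test;
# here the flakiness is simulated with randomness.
import random

def flaky_check():
    # Stands in for a test that depends on timing, network, or test order.
    return random.random() > 0.3  # passes roughly 70% of the time

def run_with_retries(test, attempts=3):
    for attempt in range(1, attempts + 1):
        if test():
            print(f"Passed on attempt {attempt}")
            return True
    print(f"Failed after {attempts} attempts")
    return False

if __name__ == "__main__":
    run_with_retries(flaky_check)
    # Rerunning hides flakiness but does not remove it, which is why
    # 100% automation remains questionable even with hand-written tests.
```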

Is Machine Learning Going to Help?

Sometimes it looks like machine learning and all the methods around it (Hidden Markov Models, neural networks) can do anything. It can hear and understand your voice, it can enable computer vision, it can play the game of Go better than a human. It can replace faces in a video! Recognize a human in a photo. All you need is a lot of data to train it.

Although it seems that AI can do everything, it is worth mentioning that it is not so universal. Take automatic stock exchange trading, for example. There are lots of methods and lots of historical stock price data, but still no ML-based solution that will make you rich, although it was one of the first tasks people tried to solve with ML.

Another case is a simple arithmetic calculator. It is easy to produce training data like:

1+2=3
2+2=4
….

And then train a neural network on it. Some ML courses use this as a learning exercise.

It is evident that a microprocessor can do this more effectively, where “more” means millions of times more. A regular algorithm easily outperforms the trained network.
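
A minimal sketch of that learning exercise, assuming a tiny linear model fitted with gradient descent on pairs like those above: it “learns” addition only approximately, while the one-line plain computation below it is exact and essentially free.

```python
# Learning a + b from examples vs. just computing it (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 2))   # training pairs (a, b)
y = X.sum(axis=1)                        # targets: a + b

# A linear model y_hat = w1*a + w2*b + c, fitted with plain gradient descent.
w = np.zeros(2)
c = 0.0
lr = 0.01
for _ in range(2000):
    err = X @ w + c - y
    w -= lr * (X.T @ err) / len(X)       # gradient step for the weights
    c -= lr * err.mean()                 # gradient step for the bias

print("learned weights:", w, "bias:", c)               # approaches [1, 1] and 0
print("model says 2 + 2 =", np.array([2, 2]) @ w + c)  # roughly 4, not exactly
print("plain code says 2 + 2 =", 2 + 2)                # exactly 4, no training needed
```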

Another well-known trick is music. You may have heard that “AI was used to write music”. In fact, algorithms for writing music are well known (since there is a theory behind it). It is impressive, but it is hard to say whether music written by a computer in 2020 is any better than music produced by a simpler algorithm in 1980.

What we do know is that neither then nor now has a computer been able to write a simple joke: something we could just quote here so that you would read it and sincerely laugh. There is neither a good theory for that, nor any AI methods that I am aware of.

So When Is the Rise of the Machines?

I think I know the date. It was stated by Sir Antony Hoare (the man who invented Quicksort, cross-process messaging, the “null” pointer, and many other more important but less known things). Anyone who has studied computer science should have heard his name, just like Donald Knuth, John von Neumann, or Alan Turing.

I attended his inspiring lecture on ‘The Verifying Compiler’:

Reference: ‘The Verifying Compiler: A Grand Challenge for Computing Research’.

A compiler that produces a verified program which is guaranteed to be correct.

The Q&A question was: so when do you expect the compiler that makes a proved program from roughly defined input request to happen? His answer was like “No one knows exactly, but somewhere around 50 years from now”. It was 2005.

Let’s take that as an estimated deadline from a world-class expert. So 2055 is when we may expect both the programmers who produce bugs and the testers who find them to step aside and let the machines do their work.

Test Automation Demystified Series

Part 1: From Manual to Automated Software Testing

Part 2: Is Application Ready for Test Automation?

Part 3: Choosing a Test Automation Tool: 8 Features That Matter

Part 4: Friends and Foes of Software Test Automation

Part 5: Codeless Test Automation

Part 6: Scenarios, or Why Some Automation Projects Fail

Part 7: AI in Test Automation
