Test Automation Demystified, Part 6: Scenarios, or Why Some Automation Projects Fail

Alexey Grinevich
19 min read · Jul 24, 2019


You have probably seen a graph like this before. It compares the efficiency of manual vs. automated testing.

It shows that test automation requires some additional initial effort, but afterwards you spend less time than with manual testing. This is a very rough approximation. In real life the picture is different:

You may notice that while each successive graph is more realistic, it also paints a worse picture for test automation. For example, the red line on the right graph tends to climb above the green (manual) line, making the total effect of the whole automation effort negative. So it is not uncommon for the real situation to look like this:

Luckily, this situation is avoidable. Today we are going to go through the typical reasons causing the loss and the ways to avoid them.

Graphs like these show overall automation efficiency. The time here is the sum of the time needed for each automation scenario. So what happens at the scenario level?

For manual tests the effort looks like this:

Here we display the effort per iteration for a single test scenario. We assume that tests are executed on some regular basis (daily, weekly, per sprint or even per commit), and 'Cost' means the combined effort and hardware costs required to execute a test case. For manual test cases the iterations are more or less similar. The first iteration may cost more, because you also need to write the test steps, but afterwards the costs stay close to each other and don't change radically over time.

Now let's see what happens with automated test scenarios. First, the ideal case:

We invest time in developing the scenario once and then it runs automatically. The initial investment is higher than for a manual test case, but then the test runs by itself, yielding a profit on each iteration compared to manual testing. But this is an idealistic situation. In reality it is still good if the effort distribution looks somewhat like this:

So you spend very little time maintaining the test case. The small spike is a trivial fix (a button was renamed in the application and the test was quickly updated). The next spike is a more significant change (application functionality changed in a way that required updating the test scenario).

And last but not least is an example of what unfortunately happens in real life. Here is a "troublemaker":

This test scenario requires constant maintenance. On some iterations the maintenance may take even more time than the initial test creation.

We want the test set to contain "Ideal" or "Good" test cases, and we need to avoid troublemakers to minimize the overall loss.

This is our primary goal when we plan the scenarios for the whole testing project.

So let's talk about the underlying reasons that make a test case fall into one of these categories.

Test Cases

Maintainability

What car would you prefer?

or

Well, unless you are a fan of vintage cars, the answer is evident.

But what if I change the question: which car would you prefer if you knew you would have to maintain and fix it yourself, in a rush and under unexpected circumstances?

Under those conditions the answer may be different, because the old Ford Model T is famously repairable. All of its parts and units are easily accessible, the engine has special service covers and sits at a comfortable height.

So in our case the whole test set should be as advanced and effective as the electric Tesla. But to get there we need each test scenario to be as simple, clear and fixable as the old gasoline Model T.

Is a Crash a Problem?

A crash? A 500 error? A 404? Is that a valuable result for a test to find?

Instinctively we feel that if someone found a crash, it was a great catch. A crash looks more important than a typo.

If you follow testing theory and read magazines and blogs, you will see that from time to time vendors of testing tools promote some new testing method: code analysis, memory checking, monkey testing and so on. To prove its efficiency they take some public product and report how many 'critical' issues were found. Usually they mean crashes, 500s or 404s. It looks like great work, but if you track the subsequent state of these issues, you may find that many of them are never fixed, or fixed with very low priority.

This means the crash is not critical for the end user. It may be easy to recover from by simply refreshing the page. Or it is so hard for an end user to reproduce, and has such a trivial workaround, that in practice it is no problem at all. Thus it gets marked as a "low severity" issue.

However, such a crash is a critical problem for automation. Your test may get stuck at the beginning and never reach its goal. Half of your tests may be blocked by a single issue like this.

The most annoying problems are intermittent issues. They reproduce with some probability and make tests flaky. Such problems are very hard to fix and may stay around for a very long time. Automated tests should be able to survive or bypass them using workarounds, as in the sketch below.
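
To make that concrete, here is a minimal sketch of such a workaround in Python. The retry helper and the page.click call in the usage comment are purely illustrative assumptions, not part of any particular tool.

```python
import time

def retry(action, attempts=3, delay=2.0):
    """Run a flaky step up to 'attempts' times before giving up."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as err:   # intermittent failure: wait and try again
            last_error = err
            time.sleep(delay)
    raise last_error               # still failing, so report the real error

# Usage: wrap only the step known to be flaky, not the whole test, e.g.
# retry(lambda: page.click("Submit"))
```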

Compare this to manual testing, where such a problem has the same severity for you as for the end user.

Base State

Once you automate something, it is always good to return the application to its initial state: the start page, the user home screen, the dashboard, etc.

If a test creates temporary data, such as a patient or a work order, that is not needed afterwards, it is good practice to clean it up at the end of the test case.

It is good to clear the state both before and after the test case, since a broken test case may leave the system in an intermediate state.

For example, a pizza ordering app may have 'create order' and 'cancel order' features. Suppose the test checking 'cancel order' failed for some reason. Now the order is left open and the application is locked until you complete or cancel it, which is typical for mobile apps. And since the application is locked, other tests may fail.

As we said, other tests expect that no orders are currently open, but one test has suddenly failed and didn't clean up after itself.

A good solution is to have one common 'clean up' function and call it at the beginning and at the end of each test. Its goal is to make sure the application is in the predefined state, and to return it there if it is not. If that is not possible, just fail the test without further execution, with a clear report message that the application is in a messed-up state and further checks make no sense. Once such a function is available, it is shared among all test cases. For example, Rapise has TestPrepare and TestFinish functions for this purpose.
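
As an illustration, here is how this pattern might look in a pytest-based suite. The app fixture and all of its methods are placeholders for whatever driver or page objects your project actually uses.

```python
import pytest

@pytest.fixture(autouse=True)
def base_state(app):
    """Bring the application to a known state before AND after every test."""
    clean_up(app)      # before: recover from whatever a broken test left behind
    yield
    clean_up(app)      # after: leave the system ready for the next test

def clean_up(app):
    if app.has_open_order():        # e.g. a failed 'cancel order' test
        app.cancel_open_order()
    app.go_to_dashboard()
    if not app.is_on_dashboard():
        pytest.fail("Application is in a messed-up state; further checks make no sense")
```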

Test Case Length

Each test case should be short. The shorter, the better.

The reason is simple: when a test fails, you should not have to spend too much time confirming it works again while fixing it. Assume that you will run the test scenario at least 20 times while creating it, and the same number of times when fixing it. So if the test case takes 1 minute, you will spend about 20 minutes running it during a fix. But what if the test case takes 10 or 20 minutes to run? The time you spend maintaining it grows accordingly, and that is a bad sign, except for the cases we mention in the next section.

Data Driven And Looped Test Cases

Sometimes length is not a problem. This is the case when a test is data driven and properly designed.

By 'properly designed' I mean that each iteration of input data should be clear and reproducible. If the test fails for some row of input data, that row must be obvious from the report, and we should be able to launch the test case for just that row to reproduce the issue.
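
With pytest this might look like the sketch below; the normalize_phone function and the data rows are invented purely to keep the example self-contained.

```python
import pytest

def normalize_phone(raw: str) -> str:
    """Tiny function under test, just to make the example self-contained."""
    return "".join(ch for ch in raw if ch.isdigit())

ROWS = [
    pytest.param("+1 (555) 123-4567", "15551234567", id="us-format"),
    pytest.param("555.123.4567",      "5551234567",  id="dotted"),
]

@pytest.mark.parametrize("raw,expected", ROWS)
def test_normalize_phone(raw, expected):
    assert normalize_phone(raw) == expected
```

If only the second row fails, the report names it explicitly (test_normalize_phone[dotted]) and you can re-run just that row with `pytest -k dotted`.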

Test Script

A test case script should be small enough that anyone can grasp its logic. Short, simple, stupid.

'Stupid' is last but not least. Some testers (and programmers) try to use advanced scripting constructs to make checks or operations shorter. A good example is regular expressions: it is far too easy to make them barely readable:

A test is the wrong place to demonstrate such skills. You never know who will have to read this test scenario, or when. After enough time has passed, it is usually very hard to read a regular expression and say what it actually means.

Of course, sometimes you have to use one to check something. In such cases it is always a good idea to wrap the check in a reusable function (e.g. checkValidEmail(addr)), so you can call this function on its own and debug it if you have doubts about its behavior.
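
A minimal Python sketch of that idea, with a deliberately simplified pattern (real e-mail validation is more involved):

```python
import re

# Kept in one readable, debuggable place instead of scattered through tests.
_EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def check_valid_email(addr: str) -> bool:
    """Return True if 'addr' roughly looks like an e-mail address."""
    return bool(_EMAIL_RE.match(addr))

# In the test script the intent stays obvious:
# assert check_valid_email("user@example.com")
```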

Let’s Click Everywhere

What happens when you have an application and try automation for the first time after many rounds of manual testing?

The first thing people see is speed: a UI test may do 10 clicks per second, and it all looks like it is working at the speed of light.

Those who are familiar with programming and scripting usually suggest trying to click all controls in all combinations, to find crashes, for example. What happens next? We quickly discover that 'all combinations' is more than we can realistically click, even at 100 or 1000 clicks per second. Even for a single-page application with 10 controls, that is already around 3.5 million combinations (on the order of 10! = 3,628,800 click orderings). That may not sound like much, but the test report will then contain 3.5 million lines. Suppose you find a crash reproduced somewhere around line 2,000,005. How do you reproduce it? How do you narrow it down? What tool do you even use to open such a report?

So, in our view, spending time on any 'click everywhere' solution is mostly pointless. Even if such an approach works and finds something, the results are very hard to analyze. In many cases what it finds is never fixed, because a real user could hardly ever trigger it.

Redundant Validations

There are many awesome things you may do in automated tests.

For example, we may implement invariants: after each script action, check that there are no exceptions in the logs, that there are no popup messages on the application screen, or that a user's account balance is never negative. A minimal sketch is shown below.
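
In this sketch the app object and its methods are illustrative assumptions; the point is only to show the shape of an invariant check wrapped around every action.

```python
def assert_invariants(app):
    """Checks applied after every action."""
    assert not app.read_new_log_errors(), "unexpected exceptions in the logs"
    assert not app.has_unexpected_popup(), "unexpected popup on the screen"
    assert app.account_balance() >= 0, "account balance must never go negative"

def click_checked(app, control):
    """Route every click through this wrapper so invariants are re-checked."""
    app.click(control)
    assert_invariants(app)
```

Note that routing every single action through such a wrapper is exactly the extra cost and complexity discussed next.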

This kind of testing is usually applied to very stable systems (hardware, compilers, mission-critical software, embedded systems, medical systems), i.e. in cases where bugs must be found at any cost.

Your system may contain a similar part. However, today's trend is fast delivery with continuous deployment, and in that context maintenance cost usually matters much more than the cost of any particular issue. This kind of testing will delay deployment and add complexity to every test. You need to be sure there is a real need for it, some strong business requirement. Otherwise it is better to stick to reasonable validations.

External Dependencies and APIs

Your application under test may (and in most cases does) depend on external systems.

In the basic case it is a web service or a DLL. In more advanced cases, 90% of your application is an external system, or a mix of systems (such as a 3rd-party CRM linked to a 3rd-party accounting package).

In such cases you need to understand which parts are stable and which are changing. For example, APIs are usually much more stable and maintain backward compatibility, while the UI is subject to change.

Also, the deployment cost in such cases is very high, so you either have to share the same staging system for all tests or even use production.

So you should invest more time in learning the available APIs to get stable test coverage. It is worth looking at the history: how old is a specific API? The older, the better: it is more likely to stay intact for backward compatibility.

An example is the online version of MS Exchange. It has the 'old' EWS SOAP-based API, which has worked since 2007, and more recently a REST-based API was introduced. If the question is which API to use, I would prefer SOAP. The reason is that it carries more legacy and is harder for the vendor to change. The REST API is young, and the vendor may find that it is not popular and simply drop it in some future version.

Test Set Level

System Deployment

Your situation may be radically different depending on the deployment options you have.

Sometimes you can install a fresh instance of the product for each test set or even for each test scenario. Usually this means the product itself is young and fresh, so it is highly likely to change rapidly. In this case you may want to focus on a few key tests plus some generic ones (like analyzing logs for errors and exceptions).

Sometimes deployment is a heavy, lengthy task and you have only one instance of the application for all testing efforts. In such cases your test set may need preparation and clean-up blocks to pre-fill data and ensure the right system state before and after test set execution. You also need planning: if other tests are executed in parallel, or the system is being maintained, you need to know about it and not let it overlap with your test execution.

Sometimes you have VMs, but the server running them also hosts VMs used for other purposes. Needless to say, you need to know what is happening on that server, because one VM under heavy load may affect the others and indirectly cause test cases to fail intermittently. It is better to completely 'own' such a server to avoid disappointing surprises.

Sanity Checks

Suppose all test scenarios require login. What happens if login is broken and we run a test set with 100 test cases?

Each test fails, and each failure takes some time (say 30 seconds per test case, a realistic delay for a UI test). So we get the final report in 50 minutes, and it is a big report containing 100 failures.

This means we needlessly lost 50 minutes and also got a report that is big, slow to work with and harder to understand.

Clearly we could avoid this if we had a very small, very basic test set checking the essentials, such as:

1. Common functionality

2. Database connection

3. Overall system status (active / under deployment)

4. External API availability

5. Free memory and disk space (!)

Such a test set runs very quickly and raises an alarm if the system is not ready for execution. If the sanity checks fail, there is no point in executing the whole test set until the showstoppers are fixed. A minimal sketch of such a gate is shown below.
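
All the helper calls on the app object here are assumptions standing in for your own checks; the point is the early exit before the full suite starts.

```python
import sys

def run_sanity_checks(app):
    """Fast smoke checks executed before the main test set."""
    checks = [
        ("login works",            app.can_login),
        ("database reachable",     app.database_is_reachable),
        ("system active",          app.is_active),
        ("external API available", app.external_api_is_up),
        ("enough memory and disk", app.resources_ok),
    ]
    for name, check in checks:
        if not check():
            print(f"Sanity check failed: {name}. Skipping the full test set.")
            sys.exit(1)   # abort early instead of producing 100 identical failures

# A CI pipeline would run this gate first and start the full suite
# only when it exits with code 0.
```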

System Maturity

If you know that something is under active development, there is no need to invest too much effort in a comprehensive automated test. It will break. The only exception is when developers are involved in writing such tests. Then the test works for mutual benefit: the developer knows they didn't break essential functionality, and QA gets the automation logic maintained.

In general it is worth having deeper automation for the mature parts of the system and more basic testing for the parts under active development.

The more mature the system, the deeper the tests.

Data Feed

Sometimes the test set depends on the data entered into the system. For example, specific patient data may be needed to test the system.

If you look at unit tests written by developers, you may notice that the data is in many cases very weak. A developer may not understand what the software as a whole is supposed to do, so the data is dummy data focused on making some specific method run.

So producing good, realistic data is a task for QA, the people who know how the system is used, and it is part of the art of testing. You may have 10k data entries for testing and catch, say, 1 issue, or you may have just 50 handcrafted rows catching 5 problems. The exact numbers may differ, but that ratio of effectiveness is not unrealistic.

The amount of test data also matters. There is such a thing as a 'good' amount: having just 1–2 contacts in the system is not enough, while having 100k contacts may be too much (search, filters and dropdowns all become slower). That amount is good for performance testing, but for other tests it is an example of redundant test data.

Typical Scenarios

There are some well-established testing patterns worth reusing. Their strengths are good performance, common procedures, shared code, and being easy to understand when you have to fix something. Once you have implemented such a test, you will probably reuse it in your next project.

Security Testing

If your system has ACLs and roles, it is worth having a single security test. You just need to know what it is and how to use it. We used this approach to test our own Spira family of tools with Rapise. We also had a webinar on how it is done with Dynamics AX:

And the framework is available for download from GitHub:

Load Testing

Checking concurrent execution is a common need, for example verifying that the system works with 100 concurrent users without crashes or delays. Some test engineers try to execute an automated UI test in parallel on 10, 50 or 100 hosts. Unfortunately this approach is heavy, expensive (the rig needs the equivalent of 100 cores to run 100 Selenium Grid nodes) and fragile, because the UI approach tends to be flakier.

Execution reports are also hard to analyze due to their size and single-threaded structure, and the approach lacks essential load-testing features such as sync points, e.g. 50 users filling in their initial information and then pressing the Login button at exactly the same moment. Another essential piece of such testing is server monitoring (checking logs and memory errors on the web server and the database).

The surprising news is that the same kind of testing can be done with very modest resources using a load testing tool (such as NeoLoad). Such a test with 100 users requires only one CPU core, can be configured for ramp-up and server monitoring, and offers many other load-testing features.

So a load test should be designed as a load test from the beginning, using the appropriate tools. Rapise has a dedicated NeoLoad integration to help derive a load test from a UI test, but that is only the first step; the load test then needs special treatment to give an adequate picture of the application's load capacity.

Tests and CI

Some tests are not for CI.

CI implies unattended test execution. The primary goal of running a test set within CI is fast and sufficient validation. This differs from exploratory testing, deep testing, performance testing, crash testing or limit checking.

When CI has 100 tests, this typically means 90 unit tests and 5–10 integration tests. The success of the testing depends on system maturity and test stability. Some tests are fragile because they have to be; others help cement stable functionality. So it is more a task of aligning the testing effort with application development than of maximizing the number of tests.

So if you have CI and existing tests, you still need to select which of them to execute as part of the CI process. You may have a group of tests worth executing always, some only for a major release, and others only nightly when there is enough time to wait. One way to group them is sketched below.
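
A minimal pytest-based sketch of such grouping; the marker names and test bodies are examples only (custom markers should be registered in pytest.ini to avoid warnings).

```python
import pytest

@pytest.mark.smoke
def test_login_page_loads():
    ...   # fast check worth running on every commit

@pytest.mark.nightly
def test_full_report_export():
    ...   # slow check worth running only when there is time to wait

# Per-commit CI run:  pytest -m smoke
# Nightly run:        pytest -m "smoke or nightly"
# Major release:      pytest          (everything)
```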

Summary. Test Framework Development

So far we have covered what to include in a particular test and what to avoid. But there is still the question of how to build a test set. How do we split scenarios, use cases, stories and requirements into proper test cases?

This area is typically obscure. Often the list of scenarios is made without automation in mind, or manual test scenarios are simply taken and automated one by one.

We recommend having some representation of all product features, prioritizing them, and then mapping these features to particular test scenarios.

Suppose we have a CRM system. Since it is modular and the modules have similar functionality, we can roughly arrange its features into a grid like this:

Uncolored cells mean the feature is absent from the given module (you may promote a lead and convert it into a contact, but you cannot promote a contact). There is also a single 'Login' feature shared by all modules, so this grid-like structure is not strictly a grid. In real life we represent functionality in many different ways, including requirements, diagrams and sketches; here we use the grid-like representation for simplicity and clarity of illustration.

We have manual test scenarios. One of them looks like this:

It mimics a basic contact engagement workflow: from catching a lead to converting it into a Contact and an Account. The 'Leads' module has its own functionality: search, create, promote, etc. Here 'v' marks which functionality we touch while promoting a lead: we check that the Leads module is available to the current user (security), search for a similar lead (lead search), create a new lead if none is found (lead create), promote it to a contact and account (lead promote), edit the new contact and account to fill in address and billing info (contact edit, account edit), check that the new information is displayed correctly (contact view, account view), and delete the old lead (lead delete).

This scenario is good for manual testing. It combines reasonable feature coverage with some time savings: we check both search and lead security without re-logging in.

So if we have an existing set of scenarios, what could be easier than taking the manual scenarios and automating them one by one? We think this is the wrong approach.

Since execution is automatic, we can split the tests on a per-feature basis, i.e. cover just one piece of functionality in a single module:

We use 'VVV' for an individual feature check to mark that we test it comprehensively (validating the feature and its side effects). 'v' means we merely use this functionality.

In other cases, when there is a well-established testing pattern, it is recommended to follow it. It is quite likely that this kind of automation will be more effective:

This scenario follows the "security test" pattern. Sometimes it is more effective to do it this way: we log in for each module and validate that the menu items and buttons specific to a given user role appear in each module. We log in as each role, but we don't validate the login logic itself; we use the minimum functionality required to enter the system as another user ('v'). The power of this approach lies in spreadsheets where roles and expected access controls are defined in one place, as in the sketch below.
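
Here is one possible shape of such a table-driven security test in pytest. The roles, module names and the app helpers are hypothetical; in practice the table could equally be loaded from a spreadsheet.

```python
import pytest

# Expected access per role, defined in one place.
EXPECTED_MENUS = {
    "sales":      {"Leads", "Contacts", "Accounts"},
    "accountant": {"Accounts", "Invoices"},
}

@pytest.mark.parametrize("role,menus", EXPECTED_MENUS.items())
def test_role_sees_expected_modules(app, role, menus):
    app.login_as(role)                        # minimal login, not a login test
    try:
        assert set(app.visible_menu_items()) == menus
    finally:
        app.logout()                          # return to base state for the next role
```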

The reason is maintainability: such a test is easier to fix and easier to understand when something fails. And we use the power of automation: logging in and out for each feature is not a problem.

One more sign of a maintainable test is a limited number of objects. Usually the first thing that breaks when the application changes is object recognition, so if an object is shared between many tests, all of those tests are affected when the object changes.

Rapise does allow a shared object repository for cases where you have to share objects between multiple test cases, but unless it is really needed, isolating small scenarios with few objects is a simpler approach to start with.

When building test scenarios the following information may also be important:

The grey color here marks unstable or changing features. For example, if search is done via a 3rd-party web service and we know it to be buggy, we need to isolate the tests involving it, i.e. avoid using these features in more complex test cases. If we do use such features, we try not to use more than one of them in the same test case. If we have to use them, we build commonly reused functions with retries to bypass known problems, so the problematic functionality does not become a blocker.

When we consider the development of a whole test framework, we assume it takes days, weeks or months to complete the test set, and we can choose which test scenarios to implement first. The good news is that during those days and weeks we can already execute the tests we have developed; they are valid from day one. So we recommend delaying implementation of features known to be grey until the other features are covered. That way we offload as much manual test execution as possible by the time we get stuck on the really tricky test cases.

This leads to another consequence: sometimes covering features module by module is also ineffective, i.e. first doing all tests for the 'Leads' module, then 'Contacts', and so on. Since 'Leads' contains a grey feature, we are better off keeping it manual for now; while 'white' features are still available, we should focus on them.

Otherwise we may end up with an automated test case that requires daily fixing and steals time from covering more promising features with more fruitful automated test cases (in other words, we may create a 'troublemaker').

Summary

For a test case there is the 3S rule: Simple, Short, Stupid (the Ford Model T).

For a test set: deep, fast, organized, balanced and reasonable, with a base state and clean-up (the Tesla).

While developing test cases: make a list of features, mark the grey zones, try to avoid them, and use known test patterns whenever they fit.

Test Automation Demystified Series

Part 1: From Manual to Automated Software Testing

Part 2: Is Application Ready for Test Automation?

Part 3: Choosing a Test Automation Tool: 8 Features That Matter

Part 4: Friends and Foes of Software Test Automation

Part 5: Codeless Test Automation

Part 6: Scenarios, or Why Some Automation Projects Fail

Part 7: AI in Test Automation
