Handle flaky TestNG tests using Harness Failure Strategies

Hemanth Sridhar

Published in Harness Engineering, Mar 6, 2023

PROBLEM

In the software industry, flaky tests are one of the most common problems. We usually call a test flaky when it sometimes (not every time) fails on the first run and then passes on a second or third attempt. These tests not only slow down progress but also impede releases and cost a fortune in the long run.

A survey published in 2022 shows that even big companies are affected: Google had 41% flaky tests and Microsoft had 26%. The survey also shows how these tests decrease the efficiency of CI: approximately 50% of failed jobs that were manually restarted succeeded on the second run.

In my experience, the most common reasons for test flakiness are:

Test design
Environment issues

SOLUTION

TestNG already records the tests that failed during an execution in a file named testng-failed.xml. Using this file together with Harness’s Retry failure strategy and the Mark As Failure post-retry action, we will build a pipeline that handles flaky tests.

PRE-CONDITIONS

TestNG: unit testing framework. Let’s assume that testng.xml is in src/test/resources.
Maven: build management. In our pom.xml, the Surefire plugin takes its suite file from the executionSuiteFiles property:

<plugin>
   <groupId>org.apache.maven.plugins</groupId>
   <artifactId>maven-surefire-plugin</artifactId>
   <version>3.0.0-M5</version>
   <configuration>
       <suiteXmlFiles>
           <suiteXmlFile>${executionSuiteFiles}</suiteXmlFile>
       </suiteXmlFiles>
   </configuration>
</plugin>

THE PIPELINE

Stage name: Sanity
Under the Sanity stage, we have two steps:
- Step 1: Run Sanity. This runs the first round of tests.
- Step 2: Run Failed Automation Sanity. This reruns only the tests that failed, which are our flaky candidates.

STEP 1 : RUN SANITY

Script: Run the mvn clean test command against our testng.xml.

mvn -ntp clean test -DexecutionSuiteFiles=src/test/resources/testng.xml

Add Reports Path: target/surefire-reports/junitreports/*.xml

Failure strategy: Ignore Failure, so that the pipeline continues to Step 2 even when some tests fail.

STEP 2 : RUN FAILED AUTOMATION SANITY

Script: Check whether testng-failed.xml exists in our target folder. If it does, we run mvn test against this XML file. Otherwise, we exit successfully, indicating that there are no failed tests to rerun.

# testng-failed.xml is present when the previous run left failed tests behind.
if [ -e target/surefire-reports/testng-failed.xml ]; then
    # Copy the failed suite out of target/ so it can be passed to Surefire as the suite file.
    cp target/surefire-reports/testng-failed.xml src/test/resources/failedTests_rerun.xml
    mvn -ntp test -DexecutionSuiteFiles=src/test/resources/failedTests_rerun.xml -DskipCDNGTest=false
else
    echo "No failed tests to run"
    exit 0
fi

Conditional Execution : Always execute this step

Failure strategy: Retry, with Mark As Failure as the post-retry action
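Before the rerun starts, it can be useful to log how many methods are queued in the failed suite. The snippet below is a minimal sketch of that idea, not part of the original pipeline. The helper name count_rerun_methods is hypothetical, and it assumes that each method in testng-failed.xml appears as an include element on its own line. TestNG may also list configuration methods there, so treat the number as an upper bound rather than an exact test count.

```sh
# Hypothetical helper (not from the original pipeline): count the
# <include name="..."/> lines in a TestNG suite file.
count_rerun_methods() {
    suite_file=$1
    if [ ! -e "$suite_file" ]; then
        # No failed suite on disk means nothing is queued for rerun.
        echo 0
        return 0
    fi
    # grep -c prints the number of matching lines; it exits with status 1
    # when that number is 0, so "|| true" keeps the helper from failing.
    grep -c '<include name=' "$suite_file" || true
}

echo "Methods queued for rerun: $(count_rerun_methods target/surefire-reports/testng-failed.xml)"
```

Calling this at the top of the Step 2 script makes each retry's log show how the failed set shrinks from one attempt to the next.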

The Retry failure strategy provides three parameters:

retry count: the number of retries to perform against the failed subset suite
retry intervals: the cooldown period between retries
post retry action: what happens if the step still fails after all retries have been used
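To make the three parameters concrete, here is a rough local sketch of the same behaviour in plain shell. It is only an illustration of the semantics described above, not how Harness implements the strategy. The function name retry_with_interval and its argument order are assumptions made for this example.

```sh
# Hypothetical helper that mimics a Retry failure strategy locally.
# Usage: retry_with_interval RETRY_COUNT INTERVAL_SECONDS COMMAND [ARGS...]
retry_with_interval() {
    retry_count=$1
    interval=$2
    shift 2
    attempt=0
    while :; do
        if "$@"; then
            return 0              # the command passed, so no more retries
        fi
        if [ "$attempt" -ge "$retry_count" ]; then
            # All retries are used up: this plays the role of the
            # Mark As Failure post retry action.
            echo "Marking as failure after $attempt retries" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$interval"         # cooldown before the next retry
    done
}
```

For example, retry_with_interval 2 30 sh rerun_failed.sh (where rerun_failed.sh is a placeholder for the Step 2 script) would run the script once and then retry it up to two more times with a 30 second pause before each retry.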

EXAMPLES

Example 1:

The Retry failure strategy has a retry count of 2, with retry intervals of 30s.

Let’s assume there are 100 tests.

100 tests will run
- Out of these 100 tests, 50 fail

30s pause

In the first retry attempt 50 tests are executed.
- Out of these 50 tests, 25 fail

30s pause

In the second retry attempt 25 tests will be executed.
- Out of these 25 tests, 10 fail

30s pause

In the third retry attempt (the last one allowed, see the IMPORTANT NOTE) 10 tests will be executed.
- Out of these 10 tests, 2 fail

The overall pipeline status will be marked as ‘Fail’

Example 2:

The Retry failure strategy has a retry count of 2, with retry intervals of 30s.

Let’s assume there are 100 tests.

100 tests will run
- Out of these 100 tests, 50 fail

30s pause

In the first retry attempt 50 tests are executed.
- Out of these 50 tests, 25 fail

30s pause

In the second retry attempt 25 tests will be executed.
- Out of these 25 tests, all of them pass

The overall pipeline status will be marked as ‘Success’
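The two examples can be replayed with a small simulation. This is a sketch under the assumptions of this article: Step 1 runs once, and the rerun step can execute at most retry count + 1 times. The function simulate_pipeline is hypothetical. It takes the retry count followed by the number of failures left after each run, starting with Step 1.

```sh
# Hypothetical simulation of the examples above.
# Usage: simulate_pipeline RETRY_COUNT FAILURES_AFTER_EACH_RUN...
# The first failure count belongs to Step 1, each later one to a rerun.
simulate_pipeline() {
    retry_count=$1
    shift
    step1_failures=$1
    shift
    if [ "$step1_failures" -eq 0 ]; then
        echo "Success"            # nothing failed, so there is nothing to rerun
        return 0
    fi
    max_reruns=$((retry_count + 1))
    run=0
    for failures in "$@"; do
        run=$((run + 1))
        [ "$run" -gt "$max_reruns" ] && break   # no rerun attempts left
        if [ "$failures" -eq 0 ]; then
            echo "Success"        # a rerun passed completely
            return 0
        fi
    done
    echo "Fail"                   # failures remained after the last allowed rerun
}

simulate_pipeline 2 50 25 10 2    # Example 1
simulate_pipeline 2 50 25 0       # Example 2
```

The first call prints Fail because 2 tests are still failing after the last allowed rerun, and the second prints Success because a rerun finished with no failures.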

IMPORTANT NOTE

Every retry runs only the failed tests captured in the previous run.
The maximum number of retries = retry count + 1, because the rerun step first executes once on its own and is then retried up to retry count times.
