Handle flaky TestNG tests using Harness Failure Strategies

Hemanth Sridhar
Harness Engineering
Mar 6, 2023

PROBLEM

In the software industry, flaky tests are one of the most common problems. We usually call a test flaky when it sometimes, but not always, fails on the first run and then passes on a second or third attempt. These tests not only slow down progress but also impede releases and cost a fortune in the long run.

A survey published in 2022 reports flaky-test rates of 41% at Google and 26% at Microsoft. The survey also shows how these tests reduce the efficiency of CI: approximately 50% of failed jobs that were manually restarted succeeded on the second run.

In my experience, the most common reasons for test flakiness are:

  1. Test design
  2. Environment issues

SOLUTION

After each execution, TestNG already writes the tests that failed to a file named testng-failed.xml. Using this file and Harness’s Retry failure strategy with the Mark As Failure post-retry action, we will build a pipeline to handle flaky tests.

PRE-CONDITIONS

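  • TestNG: unit testing framework. Let’s assume that testng.xml is in src/test/resources.
  • Maven: build management. Our pom.xml configures the Surefire plugin so that the suite file can be passed in through the executionSuiteFiles property: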
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <version>3.0.0-M5</version>
  <configuration>
    <suiteXmlFiles>
      <suiteXmlFile>${executionSuiteFiles}</suiteXmlFile>
    </suiteXmlFiles>
  </configuration>
</plugin>

THE PIPELINE

  • Stage Name : Sanity
    Under the Sanity stage, we have two steps
    - Step 1 : Run Sanity, which runs the first round of tests
    - Step 2 : Run Failed Automation Sanity, which reruns only the tests that failed

STEP 1 : RUN SANITY

Script : Run the mvn clean test command against our testng.xml

mvn -ntp clean test -DexecutionSuiteFiles=src/test/resources/testng.xml

Add Reports Path: target/surefire-reports/junitreports/*.xml

Failure strategy : Ignore Failure

STEP 2 : RUN FAILED AUTOMATION SANITY

Script : Check whether testng-failed.xml exists in target/surefire-reports. If it does, we run mvn test against that file. Otherwise, we exit successfully, indicating that there are no failed tests to rerun.

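# Rerun only the tests that TestNG recorded as failed in the previous run.
if [ -e target/surefire-reports/testng-failed.xml ]; then
  # Copy the failed-suite file out of target/ before running Maven against it.
  cp target/surefire-reports/testng-failed.xml src/test/resources/failedTests_rerun.xml
  # mvn exits non-zero if any test fails again, which fails this step and triggers the Retry strategy.
  mvn -ntp test -DexecutionSuiteFiles=src/test/resources/failedTests_rerun.xml -DskipCDNGTest=false
else
  echo "No failed tests to run"
  exit 0
fi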
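
When debugging a rerun, it can help to log roughly how many tests are about to be retried. The sketch below is one way to do that, not part of the original pipeline: count_failed_includes is a hypothetical helper name, and it assumes that testng-failed.xml lists each failed method as an <include .../> element. The number is approximate, because TestNG may also list configuration methods in the same way.

```shell
#!/bin/sh
# Hypothetical helper (not part of TestNG or Harness): count the <include .../>
# entries in a TestNG failed-suite file. Prints 0 if the file does not exist.
count_failed_includes() {
  suite_file="$1"
  if [ ! -e "$suite_file" ]; then
    echo 0
    return 0
  fi
  # grep -o prints every match on its own line, so wc -l counts matches
  # even when several <include> elements share one line.
  grep -o '<include ' "$suite_file" | wc -l | tr -d ' '
}

# Usage: log the approximate size of the rerun before starting Maven.
echo "Entries to rerun: $(count_failed_includes target/surefire-reports/testng-failed.xml)"
```</suite>\n' > "$tmp"
[ "$(count_failed_includes "$tmp")" = "2" ] || exit 1
[ "$(count_failed_includes /nonexistent/dir/file.xml)" = "0" ] || exit 1
rm -f "$tmp"
</test>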

Conditional Execution : Always execute this step

Failure strategy : Retry with post retry action as Mark As Failure

The Retry failure strategy provides three parameters:

  • retry count: the number of retries to perform against the failed subset suite
  • retry intervals: the cooldown period between retries
  • post retry action: what happens if the step still fails after all retries have been used
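
To make the three parameters concrete, here is a minimal local emulation of the same semantics in shell. It is only a sketch for reasoning about the behaviour, not how Harness implements the strategy: retry_step is a hypothetical function, the interval is given in plain seconds, and the post retry action is hard-coded to "mark as failure" (a non-zero return).

```shell
#!/bin/sh
# Hypothetical local emulation of a Retry failure strategy.
# Usage: retry_step RETRY_COUNT INTERVAL_SECONDS COMMAND [ARGS...]
retry_step() {
  retry_count="$1"   # retries allowed after the first execution
  interval="$2"      # cooldown between executions, in seconds
  shift 2
  attempt=0          # number of retries used so far
  while :; do
    if "$@"; then
      echo "execution $((attempt + 1)) passed"
      return 0
    fi
    if [ "$attempt" -ge "$retry_count" ]; then
      # Post retry action: mark as failure.
      echo "retries exhausted: marking as failure"
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
}

# Example (commented out because it needs Maven; rerun_failed.sh is a
# hypothetical file holding the rerun script):
# retry_step 2 30 sh rerun_failed.sh
```

With a retry count of 2, the command runs at most three times: once initially and twice more as retries.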

EXAMPLES

Example 1:

Retry failure strategy retry count is 2, with retry intervals of 30s

Let’s assume there are 100 tests.

  • 100 tests will run
    - Out of these 100 tests, 50 fail

30s pause

  • In the first retry attempt 50 tests are executed.
    - Out of these 50 tests, 25 fail

30s pause

  • In the second retry attempt 25 tests will be executed.
    - Out of these 25 tests, 10 fail

30s pause

  • In the third retry attempt 10 tests will be executed.
    - Out of these 10 tests, 2 fail

The overall pipeline status will be marked as ‘Fail’.

Example 2:

Retry failure strategy retry count is 2, with retry intervals of 30s

Let’s assume there are 100 tests.

  • 100 tests will run
    - Out of these 100 tests, 50 fail

30s pause

  • In the first retry attempt 50 tests are executed.
    - Out of these 50 tests, 25 fail

30s pause

  • In the second retry attempt 25 tests will be executed.
    - Out of these 25 tests all of them pass

The overall pipeline status will be marked as ‘Success’.

IMPORTANT NOTE

  • Every retry runs only the failed tests captured in the previous run.
  • The maximum number of rerun attempts = retry count + 1, that is, the first execution of the rerun step plus its retries.
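
The arithmetic in the two examples can be replayed with a small shell sketch. simulate_reruns is a hypothetical helper written for this illustration. It takes the retry count followed by the number of tests that failed after each run (the first number is the failures from Run Sanity) and assumes, as in the notes above, that the rerun step executes at most retry count + 1 times.

```shell
#!/bin/sh
# Hypothetical helper: replay the rerun arithmetic from the examples.
# Usage: simulate_reruns RETRY_COUNT FAILED_AFTER_STEP1 [FAILED_AFTER_RERUN...]
simulate_reruns() {
  retry_count="$1"; shift
  max_reruns=$((retry_count + 1))  # first execution of the rerun step plus its retries
  to_run="$1"; shift               # tests that failed in Run Sanity
  rerun=0
  while [ "$to_run" -gt 0 ] && [ "$rerun" -lt "$max_reruns" ] && [ "$#" -gt 0 ]; do
    rerun=$((rerun + 1))
    echo "rerun $rerun: $to_run tests executed, $1 failed"
    to_run="$1"; shift             # only the new failures carry over
  done
  # Any tests still failing at the end mean the step is marked as failed.
  if [ "$to_run" -eq 0 ]; then echo "Success"; else echo "Fail"; fi
}

simulate_reruns 2 50 25 10 2   # Example 1: last line is "Fail"
simulate_reruns 2 50 25 0      # Example 2: last line is "Success"
```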
