Handle flaky TestNG tests using Harness Failure Strategies
PROBLEM
Flaky tests are among the most common problems in the software industry. We usually call a test flaky when it sometimes, but not always, fails on the first run and then passes on a second or third attempt. These tests not only slow down progress but also impede releases and cost a fortune in the long run.
A survey published in 2022 reports flaky-test rates of 41% at Google and 26% at Microsoft. The survey also shows how these tests reduce CI efficiency: approximately 50% of the failed jobs that were manually restarted succeeded on the second run.
In my experience, the most common reasons for test flakiness are:
- Test design
- Environment issues
SOLUTION
TestNG already records the tests that failed during an execution in a file named testng-failed.xml. Using this file together with Harness’s Retry failure strategy, with Mark As Failure as the post retry action, we will build a pipeline that handles flaky tests.
PRE-CONDITIONS
- TestNG: unit testing framework. Let’s assume that testng.xml is in src/test/resources
- Maven: build management. In our pom.xml, the Surefire plugin takes the suite file from the executionSuiteFiles property:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <version>3.0.0-M5</version>
    <configuration>
        <suiteXmlFiles>
            <suiteXmlFile>${executionSuiteFiles}</suiteXmlFile>
        </suiteXmlFiles>
    </configuration>
</plugin>
THE PIPELINE
- Stage Name : Sanity
Under the Sanity stage, we have two steps
- step 1 : Run Sanity. This runs the first round of tests
- step 2 : Run Failed Automation Sanity. This reruns only the tests that failed in step 1, which are our flaky candidates
STEP 1 : RUN SANITY
Script : Run the mvn clean test command against our testng.xml
mvn -ntp clean test -DexecutionSuiteFiles=src/test/resources/testng.xml
Add Reports Path: target/surefire-reports/junitreports/*.xml
Failure strategy : Ignore Failure
STEP 2 : RUN FAILED AUTOMATION SANITY
Script : Check whether testng-failed.xml exists in our target folder. If it does, we run mvn test against this XML. Otherwise, we exit successfully, indicating that there are no failed tests.
# Rerun only the tests that the previous run recorded as failed
if [ -e target/surefire-reports/testng-failed.xml ]; then
    # Copy the failed suite out of target/ so the rerun reads its own copy (cp creates the file)
    cp target/surefire-reports/testng-failed.xml src/test/resources/failedTests_rerun.xml
    mvn -ntp test -DexecutionSuiteFiles=src/test/resources/failedTests_rerun.xml -DskipCDNGTest=false
else
    echo "No failed tests to run"
    exit 0
fi
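The suite-selection part of this script can also be factored into a small function, which makes it easy to try out locally without Maven. This is only a sketch and not part of the original pipeline: the function name prepare_rerun_suite and its two arguments are hypothetical.

```shell
#!/bin/sh
# prepare_rerun_suite REPORT_DIR DEST_FILE   (hypothetical helper)
# Copies REPORT_DIR/testng-failed.xml to DEST_FILE and prints DEST_FILE.
# Returns 1, without creating anything, when there is no failed-suite file.
prepare_rerun_suite() {
    report_dir=$1
    dest=$2
    failed_suite="$report_dir/testng-failed.xml"

    if [ ! -f "$failed_suite" ]; then
        return 1
    fi

    mkdir -p "$(dirname "$dest")"
    cp "$failed_suite" "$dest"
    printf '%s\n' "$dest"
}

# Possible use inside the step script (commented out so the sketch runs without Maven):
# if suite=$(prepare_rerun_suite target/surefire-reports src/test/resources/failedTests_rerun.xml); then
#     mvn -ntp test -DexecutionSuiteFiles="$suite"
# else
#     echo "No failed tests to run"
# fi
```

Returning a non-zero status when nothing failed keeps the "nothing to rerun" case explicit, so the caller decides whether that should end the step successfully.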
Conditional Execution : Always execute this step
Failure strategy : Retry with post retry action as Mark As Failure
The Retry failure strategy provides three parameters :
- retry count : the number of retries to perform against the failed subset suite
- retry intervals : the cooldown period before each retry
- post retry action : what happens if the step is still failing after all the retries have been used
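To get a feel for how these three parameters interact, the retry behaviour can be emulated locally with a small shell function. This is only a sketch of the semantics described above, not how Harness implements the strategy; the name retry_step and its argument order are hypothetical.

```shell
#!/bin/sh
# retry_step RETRY_COUNT INTERVAL_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND once, then up to RETRY_COUNT more times while it keeps failing,
# sleeping INTERVAL_SECONDS before each retry.
# Returns 0 on the first success, or 1 once every attempt has failed,
# which corresponds to the "Mark As Failure" post retry action.
retry_step() {
    retry_count=$1
    interval=$2
    shift 2

    attempt=0
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$retry_count" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$interval"
    done
}
```

For instance, retry_step 2 30 sh rerun_failed.sh would run a (hypothetical) rerun_failed.sh script up to three times in total, pausing 30 seconds before each retry.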
EXAMPLES
Example 1:
Retry failure strategy retry count is 2, with retry intervals of 30s
Let’s assume there are 100 tests.
- 100 tests will run
- Out of these 100 tests, 50 fail
30s pause
- In the first retry attempt 50 tests are executed.
- Out of these 50 tests, 25 fail
30s pause
- In the second retry attempt 25 tests will be executed.
- Out of these 25 tests, 10 fail
30s pause
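- In the third retry attempt (the last one allowed, since retry count + 1 = 3) 10 tests will be executed.
- Out of these 10 tests, 2 fail
The overall pipeline status will be marked as ‘Fail’, because tests are still failing after the last retry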
Example 2:
Retry failure strategy retry count is 2, with retry intervals of 30s
Let’s assume there are 100 tests.
- 100 tests will run
- Out of these 100 tests, 50 fail
30s pause
- In the first retry attempt 50 tests are executed.
- Out of these 50 tests, 25 fail
30s pause
- In the second retry attempt 25 tests will be executed.
- All of these 25 tests pass
The overall pipeline status will be marked as ‘Success’
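The arithmetic of both examples can be replayed with a short script. This is a sketch under stated assumptions: final_status is a hypothetical helper, and the failure counts are simply the numbers from the two examples, passed in by hand rather than read from any report.

```shell
#!/bin/sh
# final_status RETRY_COUNT FAILURES...   (hypothetical helper)
# FAILURES lists how many tests fail in step 1, then in each run of step 2.
# Step 2 may run at most RETRY_COUNT + 1 times (its first run plus the retries).
final_status() {
    max_runs=$(( $1 + 1 ))
    shift
    failed=$1            # failures left over from step 1
    shift

    runs=0
    while [ "$failed" -gt 0 ] && [ "$runs" -lt "$max_runs" ] && [ "$#" -gt 0 ]; do
        runs=$((runs + 1))
        echo "step 2 run $runs: executing $failed tests, $1 fail"
        failed=$1
        shift
    done

    if [ "$failed" -eq 0 ]; then
        echo "Success"
    else
        echo "Fail"
    fi
}

final_status 2 50 25 10 2   # Example 1
final_status 2 50 25 0      # Example 2
```

The last line printed for each call is the overall status: Fail for Example 1, where 2 tests are still failing after the third run of step 2, and Success for Example 2, where the second run leaves nothing to retry.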
IMPORTANT NOTE
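- Every retry runs only the failed tests captured in the previous run.
- The maximum number of retries = retry count + 1, because the first execution of step 2 is itself a rerun of the failed tests, and Harness then retries that step up to retry count times.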