Root cause analysis of a flaky test and its resolution

In the past few weeks I have been on a mission to get a 💚 master build. We run the whole suite of ~270 Espresso tests and they have been flaky for a looong time. This is the dirty secret of many a test automation engineer, including me. The flakiness of these tests is elusive: no test fails a second time at exactly the same spot, especially when you debug them locally.

So I started pruning away tests that I know to be flaky, but there was one test which was flakier than the others. While I could have taken the easy route of removing the test from the suite, this time I dug deep for the resolution. And I am so glad I did.

Principles

My hunch was that the problem lay _outside_ the scope of the test, because the failures had started only in the last two or three weeks.

There is an adage to treat test automation code as production code. Then how come we are so accustomed to removing a flaky test rather than figuring out the root cause?

I have been guilty of this many times. If a third of our users were hitting an intermittent crash, wouldn’t we get to the bottom of it?

Overview of the problem

This particular test creates a contact with an image and then verifies that our app renders this image correctly. The test code goes something like this:

    public class ContactsWithPhotoThreadViewTest {

        @BeforeClass
        public static void setUpPhoneContacts() {
            ContactsTestHelpers.deleteTestContacts();
            Drawable testContactPhoto1 = getInstrumentation().getTargetContext().getDrawable(R.drawable.gdv_pic);
            ContactsTestHelpers.createTestContactWithDrawable(testMockName1, new String[]{testMockNumber1}, testContactPhoto1);
            Drawable testContactPhoto2 = getInstrumentation().getTargetContext().getDrawable(R.drawable.gdv_contact_pic);
            ContactsTestHelpers.createTestContactWithDrawable(testMockName2, new String[]{testMockNumber21, testMockNumber22}, testContactPhoto2);
        }

        @AfterClass
        public static void clearPhoneContacts() {
            ContactsTestHelpers.deleteTestContacts();
        }

        /*
        Picture appears if the thread is a contact and the contact has a picture
        */
        @Test
        public void testContactPictureAppears() {
            TimelineSummaryScreenItem summaryItem = new TimelineSummaryScreenItem(threadsArea.getSummaryList(), 0);
            onView(summaryItem.contactLetter).check(matches(not(isDisplayed())));
            onView(summaryItem.contactCircleNoImage).check(matches(not(isDisplayed())));
        }
    }

Setup -> Create contacts

Act -> Open timeline

Assert -> Check that the contact has an image

Data points collected in loose sequential order

  1. This test never failed locally, only on Google's Firebase Test Lab (FTL)
  2. Even on the test lab it fails only occasionally, about 1 in every 3 runs.
  3. There are no infrastructure failures, and the video and logcat seem alright: contacts are being inserted and I can see the logcat output
  4. Found an IOException in logcat during loading of the contact image; this never showed up locally and shows up consistently every time this test fails in FTL
  5. Found that -1 is being used for contactID instead of the real one. This is the initialization value.
  6. There are no code paths which would let contactID have -1 value.
  7. Found that there is another huge stack trace in the logcat of all these failing tests
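Data points 4 and 7 both came from noticing something in logcat that was present in failing runs and absent otherwise. That comparison can be made mechanical. What follows is a minimal sketch in plain Java, not code from our suite: the class `LogcatDiff` and its method names are hypothetical, and it assumes the logcat of each run is already available as a list of lines. It collects the exception and error class names mentioned in each run and keeps only those that show up in every failing run and in no passing run.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original test suite.
class LogcatDiff {
    // Matches class names (optionally package-qualified) ending in Exception or Error.
    private static final Pattern THROWABLE =
            Pattern.compile("\\b([A-Za-z_][\\w.$]*(?:Exception|Error))\\b");

    // Collects every exception/error class name mentioned in one run's logcat lines.
    static Set<String> throwableNames(List<String> logcatLines) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : logcatLines) {
            Matcher m = THROWABLE.matcher(line);
            while (m.find()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    // Returns names that appear in every failing run and in no passing run.
    static Set<String> suspects(List<List<String>> failingRuns,
                                List<List<String>> passingRuns) {
        Set<String> result = new LinkedHashSet<>();
        if (failingRuns.isEmpty()) {
            return result;
        }
        result.addAll(throwableNames(failingRuns.get(0)));
        for (List<String> run : failingRuns) {
            result.retainAll(throwableNames(run));
        }
        for (List<String> run : passingRuns) {
            result.removeAll(throwableNames(run));
        }
        return result;
    }
}
```

Calling `LogcatDiff.suspects(failingRuns, passingRuns)` with a handful of downloaded logs gives a short list of names to search for, which is roughly what I ended up doing by eye.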

Workarounds pursued

  1. Changed the app code to show the contact image using a library method; that still gave the same flakiness
  2. Added a pre-step to assert that no exception was thrown when creating a contact, but strangely that didn’t help either.
  3. Tried running on API levels 25 and 27 instead of 26, to no avail.

Final solution

The yahoo moment was when I looked into the other error in logcat (data point 7) and found that this particular error was showing up for others too: https://stackoverflow.com/questions/51782548/androidxappcompat-iart-error-android-view-viewonunhandledkeyeventlistener

This is not a crash but a huge, long stack trace originating from a base view in our app.

According to that Stack Overflow question, this error sometimes happens on devices below API level 28 because of AndroidX. And we had just migrated to AndroidX at about the same time this test started to flake out.

Also, my local emulators are all API 28, so they never had a chance to reproduce this error.

So I changed our automation to run only on API 28 devices, and voila, this flaky test isn’t flaky anymore.
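The change itself was a configuration tweak, but the idea can be sketched in a few lines of plain Java. Everything here is illustrative: `DeviceMatrix`, `Device` and `MIN_API_LEVEL` are hypothetical names, and treating API 28 as a lower bound (rather than the only allowed level) is my assumption based on the Stack Overflow discussion. Inside an instrumentation suite, a similar gate is often expressed per test with the `@SdkSuppress(minSdkVersion = 28)` annotation from AndroidX Test.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model of a test-lab device matrix; all names are illustrative.
class DeviceMatrix {
    // Assumption: the logcat error was only reported on devices below API 28.
    static final int MIN_API_LEVEL = 28;

    static final class Device {
        final String model;
        final int apiLevel;

        Device(String model, int apiLevel) {
            this.model = model;
            this.apiLevel = apiLevel;
        }
    }

    // Keeps only the devices at or above the minimum API level.
    static List<Device> eligible(List<Device> candidates) {
        return candidates.stream()
                .filter(d -> d.apiLevel >= MIN_API_LEVEL)
                .collect(Collectors.toList());
    }
}
```

Keeping the threshold in one named constant makes it easy to revisit later, since pinning the matrix to one API level trades coverage of older devices for a stable build.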

Conclusion

I do realize that I was lucky in this instance: somebody else had hit the same problem, and I found that Stack Overflow post with an answer hinting at AndroidX and API 28. But that’s how many of these problems are solved. You need to know what to look for.

While this particular test uses the Android, Java and Espresso stack, the same principles can be applied to any stack. This post is my way of forcing myself to look deeper into the root cause rather than take the easy way out and remove the test, causing a drop in test coverage and losing the benefits of test automation.