Covering A/B tests with UI tests:

How to avoid getting tangled up in your own code

Vitaliy Kotov
Bumble Tech
Feb 22, 2019


Hi there!

My name is Vitaliy Kotov and I spend most of my time working on automated testing at Badoo. In this article, I’ll tell you all about how we organised UI testing for our numerous A/B tests, the problems we had to deal with and the workflow we eventually arrived at. Read on!

Before we begin…

You’re going to see the word “test” a lot in this article. That’s because we will be talking both about UI tests and A/B tests, at the same time. I try wherever possible to distinguish between these two concepts and to structure my ideas so that the text reads easily. However, if at any point I leave out the first part of a term and simply use “test”, then you can assume I am referring to UI testing.

Happy reading!

What exactly are A/B tests?

First of all, let’s define A/B testing. According to Wikipedia:

A/B testing (bucket tests or split-run testing) is a randomized experiment with two variants, A and B. It includes application of statistical hypothesis testing or “two-sample hypothesis testing” as used in the field of statistics. A/B testing is a way to compare two versions of a single variable, typically by testing a subject’s response to variant A against variant B, and determining which of the two variants is more effective.

For our project, an A/B test is when a feature works a bit differently for different users. Here are some examples:

  • a feature is available to one group of users and not another;
  • a feature is available to all users, but works in different ways;
  • a feature is available to all users, works in the same way but looks different;
  • a combination of all three options.

In our company, to make all this logic work we use a tool called UserSplit. Now let’s discuss what using A/B tests means for a testing department and for automators in particular.

UI test coverage

When we speak of UI test coverage, we are not referring to the number of lines of code that we have tested. This should be fairly obvious since just opening a page can involve many components, even before testing begins.

Over my years in testing automation, I have come across multiple ways of measuring UI test coverage. I won’t list them all here but let me simply say that we prefer to base this figure on the number of features that the UI tests cover. It isn’t a perfect method (and personally I have never come across a perfect one), but it works for us.

So now let’s get back to the main topic of this article. How can you measure and maintain a good level of UI test coverage, when each feature can behave differently depending on the user?

How features were initially covered by UI tests

Before the UserSplit tool became available to the company, and before we had a really significant number of A/B tests, our strategy was to cover with UI tests only those features that were already in production and well established.

That’s because in earlier stages, when a feature had just gone into production, there would still be a “fine-tuning” period, and its behaviour and appearance could still be modified. However, it could also fail to become an established feature and might never get to be seen by users at all. Writing UI tests for unstable features is costly and was not something we went in for.

When we introduced the development of A/B tests into the process, at first nothing much changed. Every A/B test has what is called a “control group”: a group that simply sees the default behaviour of a feature. UI tests were written specifically for this group. All we had to do when writing UI tests for such a feature was to remember to enable the default behaviour for the test user. We call this process a “force” of an A/B group.

First, allow me to explain in more detail what a “force” is, as it still has a role to play in this story.

Force for A/B tests and QaAPI

We regularly mention QaAPI in our articles and conference talks. However, we haven’t yet dedicated a full article to this in-house tool; I believe that day is coming soon.

To cut a long story short, QaAPI allows a test to make requests to the application server via a special backdoor in order to manipulate data. Using this tool, we can prepare users for specific test cases, send them messages, upload photos and so on.

Using this same QaAPI, we can force an A/B test group: all you need to do is specify the test name and the desired group. The call in a test might look something like this:

QaApi::forceSplitTest("Test name", "Test group name", {USER_ID or DEVICE_ID});

The last parameter is the user_id or device_id for which this force should take effect. We specify a device_id whenever the user is unauthorised, because in such cases there is no user_id. That’s right: we also have A/B tests on unauthorised pages.

Once this QaAPI method has been called, the authorised user or device owner is guaranteed to see the version of the feature that we have forced. We wrote precisely this kind of call into the UI tests covering features that fall under A/B testing.
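For example, a UI test covering the control group of a hypothetical “Chat redesign” A/B test might apply the force during its setup step. This is just a sketch: the base class, helper methods, test name and group name below are invented for illustration.

class ChatRedesignControlTest extends UiTestCase   // UiTestCase is a made-up base class
{
    protected function setUp(): void
    {
        parent::setUp();

        // Pin the test user to the control group so the test always
        // sees the default version of the feature
        $user = $this->createTestUser();            // hypothetical QaAPI-backed helper
        QaApi::forceSplitTest("Chat redesign", "Control", $user->getId());
    }

    public function testChatOpensWithDefaultLayout(): void
    {
        // ...regular UI steps checking the control behaviour...
    }
}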

We carried on like this for quite a long time. Our UI tests only covered the A/B test control groups. At the time, there weren’t very many of them, and this worked well. But over time, the number of A/B tests began to increase and almost all new features started to be launched under A/B tests. Our approach of covering them just with the control versions of features was no longer feasible. Here’s why…

Why cover A/B tests?

The first problem is coverage.

As I mentioned above, over time almost all new features have come to be released under A/B tests. Apart from the control, each such feature has one, two or three additional variants. If we only cover the control, coverage for a feature like this never exceeds 50% even in the best-case scenario (one extra variant), and in the worst case (three extra variants) it drops to around 25%. Previously, when there were only a few of these features, this didn’t have much impact on the overall coverage figure, but now it has begun to.

The second issue is long A/B tests.

Currently, some A/B tests run for quite a long time, while we continue to put out releases twice a day.

A long-running A/B test therefore spans dozens of releases, and if only the control variant is covered by UI tests, the probability of breaking one of the other variants somewhere along the way is extremely high. This would, of course, affect the user experience and defeat the whole point of A/B testing the feature: a variant can show poor results not because users don’t like it, but because it doesn’t work as intended.

If we want to be confident in A/B testing results, we must ensure that all versions of the feature function as expected.

The third problem is the validity of UI tests.

An A/B test is released once it has gathered enough statistical data and the product manager is ready to open the successful variant up to all users. The release of an A/B test happens out of sync with code releases, since it depends on configuration settings rather than on the code itself.

Suppose for a moment that the control variant was unsuccessful and so wasn’t released. What happens to the UI tests that covered it? You’ve got it: they break. And what happens if they break an hour before the release of a build? Can we conduct regression testing on this same build? No, we can’t. As we all know, you won’t get very far with broken tests.

This means that we have to be prepared to conclude any A/B test early, so as not to hinder the efficiency of UI tests, and consequently the next release of a build.

Conclusion

The obvious conclusion we can draw here is that we must cover A/B tests, in all their variants, entirely with UI tests. Does that make sense? Great! That’s all, folks! Thanks for reading!

… just kidding! It’s not quite that simple…

An interface for A/B tests

The first thing that struck us as awkward was the difficulty of checking which A/B tests and feature variants had already been covered and which had not. In the past, naming UI tests according to the following principle had worked for us:

  • name of the feature or page;
  • case description;
  • test.

For example, ChatBlockedUserTest, RegistrationViaFacebookTest and so on. At this point it was becoming clear that naming UI tests after split tests was no longer fit for purpose. Firstly, the names were becoming extremely long. Secondly, when an A/B test was completed we would have to rename the tests, which would hurt the accumulated statistics keyed to each UI test’s name.

Constantly grepping the code for QaAPI calls is another joy.

So we decided to remove all forceSplitTest() QaAPI calls from the UI test code and move the data about where and which forces were needed into a MySQL table. To manage it, we built a UI in Selenium Manager.

It looks something like this:

In this table we specify which UI tests the force of a particular A/B test should apply to, and which group to force. We can specify the name of an individual UI test, a test class, or All.

We can also restrict a force to authorised or unauthorised users.

Moreover, our UI tests read the data from this table when they are launched and apply the forces that relate either to the specific test being run or to all tests.
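A simplified sketch of that lookup might look like the code below. The table name, column names and database helper are assumptions for illustration, not our real schema.

// Run once during test start-up: fetch the forces relevant to this launch
$rows = $db->fetchAll(
    "SELECT split_test_name, split_test_group, user_type
       FROM qa_split_test_forces
      WHERE ui_test IN (?, ?, 'All')",
    [$testName, $testClass]
);

foreach ($rows as $row) {
    // Unauthorised pages have no user_id, so the force is bound to the device instead
    $target = ($row['user_type'] === 'unauthorised') ? $deviceId : $userId;
    QaApi::forceSplitTest($row['split_test_name'], $row['split_test_group'], $target);
}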

Using this method, we have managed to gather all the A/B test manipulations in one place, and the list of covered A/B tests is now easy to view.

Similarly, we have created a form for adding new A/B tests:

All this allows us to add and remove redundant forces easily and quickly, without creating a commit or waiting for it to propagate to all the clouds where the UI tests run, and so on.

UI test architecture

The second thing that we decided to concentrate on was a review of our approach to writing UI tests for A/B tests.

Allow me to briefly explain how we usually write UI tests. The architecture is fairly simple and familiar:

  • Test classes − this is where the business logic of the feature being covered is described (basically, these are our test scenarios: I did something, and I observed something);
  • PageObject classes − this is where all the interaction with the UI and all the locators are described;
  • TestCase classes − this is where you’ll find general methods that do not relate directly to the UI but can be useful to a number of classes (for example, interactions with QaAPI);
  • Core classes − this is where the session start-up logic lives, as well as logging and other things that don’t need to be touched when writing a regular test.
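To give a rough idea of how the first two layers fit together, here is a minimal sketch, assuming PHPUnit-style tests and the php-webdriver client; every class, method, URL and locator name in it is illustrative rather than our real code.

use Facebook\WebDriver\WebDriver;
use Facebook\WebDriver\WebDriverBy;

// Test class: only the business-logic scenario lives here
class ChatBlockedUserTest extends UiTestCase
{
    public function testBlockedUserCannotSendMessage(): void
    {
        $user = $this->createBlockedUser();          // QaAPI-backed helper from the TestCase layer
        $chatPage = new ChatPage($this->driver);
        $chatPage->openChatWith($user);
        $this->assertTrue($chatPage->isMessageInputDisabled());
    }
}

// PageObject class: all interaction with the UI and all locators live here
class ChatPage
{
    private const MESSAGE_INPUT = '//textarea[@data-qa="chat-input"]';

    public function __construct(private WebDriver $driver)
    {
    }

    public function openChatWith($user): void
    {
        $this->driver->get('https://example.com/chat/' . $user->getId());  // illustrative URL
    }

    public function isMessageInputDisabled(): bool
    {
        $input = $this->driver->findElement(WebDriverBy::xpath(self::MESSAGE_INPUT));

        return !$input->isEnabled();
    }
}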

Overall, we are very happy with how this architecture works for us. We know that, if the UI has changed, we only need to alter the PageObject classes (in this case the tests themselves should not be touched). If the business logic of a feature has changed, we can alter the scenarios.

As I described in one of my previous articles, everyone at our company works with UI tests: both the guys in the manual testing department and the developers. The simpler and more understandable this process is, the more often people who aren’t directly involved with the tests will run them.

As I explained above, however, in contrast to established features, A/B tests come and go. If we write UI tests for them in the same format as regular ones, we end up having to constantly remove code from many different places once the A/B tests are completed. And you already know that we don’t always manage to find time for refactoring, especially when everything is working well without it.

Nonetheless, we don’t want to allow our classes to become burdened with unused methods and locators, as this will make the PageObjects complex to use. So, how can we make life easier for ourselves?

At this point, PhpStorm comes to our rescue (thanks to everyone at JetBrains for the convenient IDE), and specifically its custom folding regions feature.

In short, it allows us to divide code into “regions” using special comment tags. We tried it − and we liked it! We began writing temporary UI tests for active A/B tests in a single file, dividing the code into regions labelled with the class where that code should eventually live.

In the end the test code looked something like this:
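(Everything in this sketch, from the class name to the locators and URLs, is invented for illustration.)

class PromoBannerSplitTest extends UiTestCase
{
    //region PromoPage: PageObject methods and locators to move into PromoPage later
    private const NEW_BANNER = '//div[@data-qa="promo-banner-new"]';

    private function openPromoPage(): void
    {
        $this->driver->get('https://example.com/promo');   // illustrative URL
    }

    private function isNewBannerShown(): bool
    {
        return $this->driver->findElement(WebDriverBy::xpath(self::NEW_BANNER))->isDisplayed();
    }
    //endregion

    //region UiTestCase: generic helpers, e.g. the QaAPI interaction
    private function forcePromoBannerGroup(string $group): void
    {
        QaApi::forceSplitTest("Promo banner", $group, $this->getTestUser()->getId());
    }
    //endregion

    //region Scenarios: the test methods themselves stay in this class
    public function testVariantBShowsNewBanner(): void
    {
        $this->forcePromoBannerGroup("B");
        $this->openPromoPage();
        $this->assertTrue($this->isNewBannerShown());
    }
    //endregion
}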

In each region there is code that relates to one class or another. No doubt there is something similar in other IDEs.

By doing this, we covered all the variants of an A/B test with a single test class, placing both the PageObject methods and the locators there. Once the A/B test was completed, we first removed the unsuccessful variants from the class, and it was then quite easy to move the remaining code into the required classes according to their region labels.

How we cover A/B tests now

You can’t simply take all the A/B tests and cover them with UI tests in one go. Nor is that really the task at hand: from an automation point of view, the task is to quickly cover only the important and long-running A/B tests.

Nonetheless, before the release of any A/B test, however small, we want to be able to launch all the UI tests against the winning variant and be sure that everything works as required, and that we are rolling out high-quality, working functionality to all users.

The MySQL-table solution I mentioned is not suitable for this purpose on its own. The problem is that as soon as we add a force, it is applied to all UI tests. This affects not only staging (our pre-production environment, where we launch the full range of tests) but also the UI tests launched against the branches of separate tasks. My colleagues from the manual testing department have to work with the results of those runs, and if the forced A/B test has a bug in it, the tests for their tasks will also fail, even though they can only be expected to spot a problem with their own task, not with the A/B test. As a result, a lot of time ends up being spent on testing and investigation − not something many people are happy about.

So far, we have made do with minimal changes, by adding the option of specifying the target environment in the table:

The environment can be changed instantly on an existing entry. This means we can add a force for staging only, without affecting the test results for separate tasks.
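In terms of the force-loading sketch shown earlier, this amounts to one extra condition in the query (the environment column and its values are, again, assumptions):

$rows = $db->fetchAll(
    "SELECT split_test_name, split_test_group, user_type
       FROM qa_split_test_forces
      WHERE ui_test IN (?, ?, 'All')
        AND environment IN (?, 'All')",      // e.g. 'staging' vs. a task-branch run
    [$testName, $testClass, $currentEnvironment]
);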

Conclusion

At the start of this story, our UI tests covered only the basic (control) groups of A/B tests. We realised, however, that we wanted more, and came to the conclusion that covering other versions of A/B tests was also going to be necessary.

In summary:

  • we created an interface for conveniently checking A/B test coverage, which means we now have all the information we need about how UI tests work with A/B tests;
  • we produced a way of writing temporary UI tests with a simple and effective flow for removing them later or promoting them to the permanent set;
  • we learned how to test A/B test releases easily and painlessly, without interfering with other UI test runs and without unnecessary commits in Git.

All of this has allowed us to adapt our test automation to continually changing features, to check and increase the level of coverage with ease, and to avoid accumulating legacy code.

Do you have experience of bringing order to what seemed a chaotic situation, and have you managed to simplify life for yourself and your colleagues? Tell us about it in the comments! :)

Thanks for reading!
