E2E iOS UI Tests: Lots, Green, on PR

Published in AvitoTech, Jun 5, 2019

It’s been a year since we set out to automate our regression test suite. And we have achieved a lot. We’ve managed to reduce the testing time of our app from days to hours.

Proof of profit

I’ll start by showing some graphs based on actual data.

The number of automated tests:

Here’s how we reduced the testing time of our app:

Each blue line represents one release and shows how much time people spent testing manually. The graphs use different scales, so pay attention only to the correlation.

Success rate statistic over the past year:

The success rate statistic shown here follows the development of test automation at Avito. You can see events that hurt our test suite and events where we fixed something. On the right, you can see that the metric has stabilized.

I hope these graphs and numbers have convinced you that this post is worth reading. The second part of this post describes methods and algorithms.

How did we achieve this?

In late 2018, we at Avito faced a problem: we were spending too much time on manual regression testing. And we were planning to cut the release cycle from 1 month to 1 week in the future.

Avito is a large app. We have thousands of manual test cases, and manual regression testing could take us a week for a single platform (iOS or Android).

We thought that we could automate our test suite, so we researched our testing needs, what kind of tools we might want to use, etc.

And in December 2018, we invested a few weeks in a proof of concept of automated regression testing. We shortlisted roughly 30 tests and integrated them into CI and our business processes. We were still not sure whether it was going to work, but at least we had hope. When our MVP (minimum viable product) was ready, other teams started experimenting with writing tests.

All subsequent development was a struggle. UI testing for iOS is still in its infancy. We encountered multiple problems when simply writing tests for an app. And running them in CI was another big challenge.

I think everyone who tries to automate a large test suite for iOS notices how hard it is. There are lots of limitations, and testing frameworks never work as expected (https://github.com/artyom-razinov/XCUILimitations). There are no officially supported ways to handle certain testing tasks (like configuring camera/photos/etc. permissions). Some functions are not supported at all (such as simulating push notifications when the app is not running). And xcodebuild crashes and hangs when running tests.

In this post, I want to briefly describe what we did, what should be done, what testing is, how tests are coded and run, how testing can be made profitable and, once you have broken even, how to maximize the profit.

More technical posts will follow for those interested.

The tools

We use Apple’s XCUI tools to run and test UI. In fact, we have extended XCUI’s functionality to make it more usable. XCUI is just not good enough for us off the shelf.

We’ve written a kit supporting various features, including hacks to set up the state before the testcase. Check it out at https://github.com/avito-tech/Mixbox. A readme file with a list of supported features is included.

There are test runners available that can run iOS UI tests. Bluepill is a really good one, but it was not powerful enough for our needs: it lacks support for parallelizing tests across multiple machines. So we’ve made a test runner based on fbxctest. It doesn’t require Mixbox and can run any test based on XCTest. Currently, our Mac Mini cluster is capable of running 50 hours’ worth of tests within just 1.5 hours. Check it out here: https://github.com/avito-tech/Emcee.

We have also built some in-house tools for storing test reports and manual test cases, as well as multiple test services. We are currently using TeamCity as our CI.

The tests

We write E2E tests. This means that our app acts as if it was downloaded from the App Store. It runs on the actual backend (a clone of what we use in production). Almost nothing inside the app is modified for testing purposes (and if it is, we take care to not change the app’s behavior).

We are also developing a way to write low-level tests, gray box tests, tests that support mocking of classes, etc., but I am not going into details here.

Black box tests run in a separate process. Tests launch the app, send touches to the simulator, and “see” pixels to make sure that the respective elements are displayed. There is one shortcut when it comes to text: we read texts as strings, because we don’t use computer vision. But that is okay. An almost black box is also okay.

How to start

First: simply start. Review all your test cases, or whatever you want to test. Collect the requirements for your tools. For example, you may want to:

Tap buttons
Check text
Switch the simulator language
Take photos

After that, find your tools. Try them. Tools may not behave as expected. For example, XCUI has limitations when it comes to testing whether an element is visible, and it has bugs when getting the UIButton text, etc.

If something appears to be impracticable, it is not necessarily a problem. If you can automate only half of your test cases, it still can be fine. And it actually may be doable. For example, you can always try mocking something inside the app.

If you want to take photos in your tests, don’t assume that you need a device farm. We did not try running tests for 30–40 PRs daily on devices, but we know some who did. I suggest that you search the web for the drawbacks of this approach: how much effort it takes to implement and support, performance, cost, battery swelling, etc.

Do a lot of research. Remember, it may take years (if you add up all the time your team spends on writing tests and testing tools). Make the right choices. And if no perfect option is available, choose the best one you can find.

Tips on how to make and keep UI tests green

I think the only way to maintain a 100% test success rate is to run the tests as a blocking build on Pull Request.

This is how you don’t merge changes that break everything

However, it is very hard to make tests stable enough to run on PR. Performance is another story. It shouldn’t take forever to run tests on Pull Request. I think 30–60 minutes is okay.

You cannot expect to have many 100% stable E2E tests right after you have started writing tests. You need a stable test suite, stable CI, and a way to run tests fast. Be prepared for this to take time.

We started running a good number (100–300) of tests as a blocking build on PR after nearly a year of developing our testing tools and infrastructure. Before that, we ran a fixed number of tests on a pull request, and then we also started running modified tests. You should try to run tests on pull request as early as possible; one test will suffice. Then focus on increasing the number of tests while keeping PR checks stable and fast.

There are some techniques that we use to improve the performance and success rate of our tests. They increase the number of tests we can run on PR and, in turn, the stability of our full regression suite.

Stable tools

We use tools that are fault-tolerant. For example, if the simulator stops responding, it is created again and the test restarts. If some of the machines in our farm become unresponsive, they are blocked by the test runner and their tests restart.

We use backend-driven UI a lot, along with a lot of A/B tests, so the UI can be different at different times. We don’t want to fix the tests every time someone changes something in the backend. We don’t want to add “if”s to the code. So we find UI elements by their IDs (and use the same IDs for the same elements, even if a feature has multiple implementations), always scroll to them automatically, and make tests independent of each other (we clean every possible state). Almost every action or check has a lot of retries and fallbacks, etc.
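The "retries and fallbacks" idea can be illustrated with a minimal sketch. This is not Mixbox code: `with_retries`, its parameters, and the idea of using a scroll as the fallback are hypothetical names chosen for illustration only.

```python
import time
from typing import Callable, Optional


def with_retries(
    check: Callable[[], bool],
    fallback: Optional[Callable[[], None]] = None,
    attempts: int = 5,
    delay_seconds: float = 0.0,
) -> bool:
    """Run `check` up to `attempts` times; after each failed try, run `fallback`."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            if fallback is not None:
                fallback()  # hypothetical fallback, e.g. scroll toward the element
            time.sleep(delay_seconds)
    return False
```

A check such as "element with this ID is visible" would be passed as `check`, and a scroll step as `fallback`, so the test body itself stays free of "if"s.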

Stable test backend

We don’t have one. It is very expensive to deploy everything at every pull request, even at every run of the full regression test suite. We have a few instances of the test backends that are shared by the product teams for their tests (web, mobile web, mobile apps, etc).

Because the test backend is shared, it is not as stable as production. One could say that we should make it more stable, but there are some workarounds that allow us to ignore the problems, and the problems aren’t that critical. We are not proud of this situation, but we don’t care much about it.

Retries

We retry every test many times if it is running on a pull request. We don’t want to bother developers by failing their builds unless they have really broken something.

We don’t want to have flaky tests either, and retries allow tests to be flaky. To counter this, our colleagues on the Android team implemented an improvement for the Pull Request check: every new or modified test is required to pass 5 times out of 5.

Profit: +10% success rate in our case. Retries hide flakiness rather than remove it, but they are useful on PR.

Example: 10% here means that 1 out of every 10 successful tests was retried at least once. We still have some problems with the stability of tests, and we’ll be solving them.
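The two rules above (retry existing tests, but require new or modified tests to pass every time) can be sketched as follows. This is an illustration under assumptions, not our CI code: `passes_on_pr`, `max_retries`, and `required_streak` are hypothetical names, and the retry count of 3 is an arbitrary placeholder.

```python
from typing import Callable


def passes_on_pr(
    run_test: Callable[[], bool],
    is_new_or_modified: bool,
    max_retries: int = 3,      # placeholder value, not the real setting
    required_streak: int = 5,  # "5 times out of 5" from the text
) -> bool:
    """Decide whether a single test passes the PR check."""
    if is_new_or_modified:
        # New or modified tests must pass on every run; the first failure stops the loop.
        return all(run_test() for _ in range(required_streak))
    # Existing tests pass if the first run or any retry succeeds.
    return any(run_test() for _ in range(1 + max_retries))
```

With this split, retries protect developers from infrastructure noise, while a newly written flaky test is unlikely to get through the 5-out-of-5 requirement.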

Restarts

Sometimes some services are unavailable. Sometimes they remain so for a long time. We want to avoid this kind of problem. The night before regression testing, we start several builds at 3-hour intervals. Every subsequent build reuses the previous results and restarts only the failed tests. It turned out to be useful. Once, it saved 12% of the tests.

Profit: +2% to +12% to the success rate at release (up to 95% in our case, where 5% are long-forgotten tests that don’t work).

Trusted tests

A VERY, VERY important thing for us. It allows us to enter the world of “running tests on PR.” It makes PR checks stable and fast.

In short: we analyze the run history and select tests that neither failed nor were flaky in the past N full runs. We call them trusted and run them on pull request. These tests are extremely stable, so developers’ PRs are rarely blocked for no reason.

It is rather straightforward. You only need to store the run history somewhere, query it, and pass the selected tests to the test runner. That’s all. We were lucky: we had developed the reporting tool and the test runner with something like that in mind, so it was very easy to make everything work.

This mechanism also lets you automatically ignore outdated and broken tests and, just as automatically, stop ignoring them once they are fixed. So you can do just fine with a partially green suite.

One of the possible variants of this algorithm
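A minimal sketch of such a selection, not taken from our reporting tool: it assumes the history is a list of full runs, oldest first, where each run maps a test name to one of the statuses "passed", "failed" or "flaky". The function name and the status strings are hypothetical.

```python
from typing import Dict, List, Set


def trusted_tests(history: List[Dict[str, str]], last_n: int) -> Set[str]:
    """Tests whose status was "passed" in every one of the last `last_n` full runs."""
    recent = history[-last_n:]
    if len(recent) < last_n:
        return set()  # not enough history to trust anything yet
    return {
        name
        for name in recent[0]
        # A test missing from any recent run is not trusted either.
        if all(run.get(name) == "passed" for run in recent)
    }
```

Because the set is recomputed from history, a broken test drops out on its own and comes back after it has passed N full runs in a row.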

Comparing to the target branch

The aforementioned “restarts” technique does not work with Pull Requests, because it requires a long waiting period. We don’t want to, and can’t, add another hour to our 15-minute UI test check on Pull Request. But we can eliminate infrastructure problems.

The algorithm we currently use is as follows (assume we create a pull request from the “source” to the “target” branch):

If the test passes on the source branch, it passes
If the test fails on the source branch, restart it on the target branch
If the test passes on the target branch, it was broken by the current PR changes
If the test fails on the target branch, the code (probably) didn’t affect it, do not block PR
If the test is changed in the PR, we cannot compare it to the target branch.
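The rules above can be sketched as a single decision function. This is an illustration, not our CI code: `Verdict` and `compare_to_target` are hypothetical names, and the choice to block on a failing test that was changed in the PR is an assumption based on the requirement, described under impact analysis, that new or modified tests must pass.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    PASSED = "passed"
    BROKEN_BY_PR = "fails on source, passes on target: block the PR"
    NOT_BLOCKING = "fails on both branches: do not block the PR"
    CHANGED_TEST_FAILED = "changed in the PR, cannot compare: block the PR"


def compare_to_target(
    passed_on_source: bool,
    changed_in_pr: bool,
    run_on_target: Callable[[], bool],
) -> Verdict:
    """Classify one test; the target branch run happens only when it is needed."""
    if passed_on_source:
        return Verdict.PASSED
    if changed_in_pr:
        # Assumption: a modified test has no baseline on the target branch.
        return Verdict.CHANGED_TEST_FAILED
    if run_on_target():
        return Verdict.BROKEN_BY_PR
    return Verdict.NOT_BLOCKING
```

Passing the target run as a callable keeps the expensive rerun lazy: it is triggered only for tests that failed on the source branch.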

Here is what we got after implementing this algorithm. The chart shows how many tests were skipped after they failed on the source branch. We were able to unblock about 40% of the pull requests that didn’t really break tests.

How many tests were ignored, counting only builds with tests that failed on the source branch

We have not integrated this feature into Emcee yet; so far it is purely external. And we ran into a problem with it. We started all tests and only then restarted the failed ones, so there was a long interval between running tests on the source and target branches. A service can be down during the run on the source branch but up and running by the time we compare to the target branch. This sometimes resulted in a PR being blocked for no reason. Example: a service was down and 37 tests failed with the same error (e.g. a 500 status code from an integration API); then 36 tests also failed on the target branch, but one test passed. Ideally, a test should be run on the source and target branches simultaneously after a failure, and more than once, to rule out the situation described above.

We solved this issue simply by moving some runs on the source branch after the run on the target branch. This is what changed in the algorithm:

We assumed that most of the tests would pass at the first stage, the longest one. Then some tests would be compared to the target branch. Then a very small number of tests would be checked again on the source branch. The interval between running tests on the target branch and the source branch would be shorter, so the probability of the issue described above would be lower.
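One possible reading of the three stages, sketched per test. This is an assumption about the flow described in the text, not the exact algorithm we run; `three_stage_check` and the returned strings are hypothetical.

```python
from typing import Callable


def three_stage_check(
    run_on_source: Callable[[], bool],
    run_on_target: Callable[[], bool],
) -> str:
    """Source run, then target run, then a second source run right after it."""
    if run_on_source():
        return "passed"            # stage 1: most tests are expected to end here
    if not run_on_target():
        return "do not block"      # stage 2: fails on target too, probably not the PR
    # Stage 3: the target run passed, so recheck the source branch shortly after it.
    if run_on_source():
        return "passed"            # the first failure was likely an infrastructure issue
    return "broken by PR"
```

In the outage example above, the one test that passed on the target branch would get a second source run while the service is back up, instead of blocking the PR.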

It seems that our assumption was correct. Here is a chart after we updated the algorithm:

This is a very good way to stop blocking people’s PRs for no reason. Obviously, it can’t be used in regression testing, because there is no such thing as a “target branch” when we run tests on one specific branch.

Impact analysis

This method allows you to not run every test at pull request, but still get a lot of things checked.

In short: impact analysis is a process of detecting tests that could be affected by changes in code.

Our Android team has an impressive system for impact analysis, with great tools for it. They determine which tests should be run from changes in the app code. It is really a big deal, because black box tests are separate from the app.

On iOS we use a very low-tech, very simple, but very helpful solution to detect changes: we just take the “git diff” and run new or modified tests. We require that those tests pass; they can’t be ignored. So red tests cannot be added to the repo, and tests cannot be broken when someone modifies them.
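A minimal sketch of the “git diff” approach at file granularity. It parses the text printed by `git diff --name-status` (one status letter and one or two tab-separated paths per line). The function name and the `Tests.swift` naming convention in the usage example are assumptions for illustration; mapping changed files to individual test methods is left out.

```python
from typing import Callable, List


def changed_test_files(
    name_status_output: str,
    is_test_file: Callable[[str], bool],
) -> List[str]:
    """Pick added, modified or renamed test files from `git diff --name-status` output."""
    selected = []
    for line in name_status_output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status, paths = fields[0], fields[1:]
        if status.startswith("D") or not paths:
            continue  # deleted files have nothing to run
        path = paths[-1]  # for renames and copies, the last field is the new path
        if is_test_file(path):
            selected.append(path)
    return selected


# Usage with sample output; the file names are made up.
sample = (
    "M\tUITests/LoginTests.swift\n"
    "A\tUITests/SearchTests.swift\n"
    "D\tUITests/OldTests.swift\n"
    "R090\tUITests/Basket.swift\tUITests/CartTests.swift\n"
    "M\tApp/Login.swift\n"
)
to_run = changed_test_files(sample, lambda p: p.endswith("Tests.swift"))
```

The selected files would then be handed to the test runner as the "must pass" part of the PR check.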

That’s all for now

We’ll keep you updated and we’ll go deeper into details in our future posts. CI, Reports, Processes, Release Chain, Mixbox, Emcee, etc. Stay tuned.