Hourglass into Pyramid: how you can improve the structure of your tests
My name is Uladzislau Ramanenka and I’m a Senior iOS QA Engineer at Bumble Inc, the parent company that operates Bumble and Badoo, two of the world’s highest-grossing dating apps with millions of users worldwide.
We strive to deliver the best possible customer experience. To this end, we are constantly improving our apps and regularly deliver new, high-quality features. One of the ways we prevent bugs from escaping into the wild and affecting customer experience is automated testing: the earlier a bug is caught, the less expensive it is to resolve. Test automation also supports our ability to change. Whether we are adding new features or doing a large redesign, automated tests can quickly pinpoint errors. They enable us to change the software with confidence.
It comes as no surprise that we have a long history of test automation at Bumble. In our blog, there are a few stories on this topic. In fact, nowadays test automation is integral to all parts of our applications: backend, services, frontend and mobile clients.
In this article, I’d like to talk about iOS test automation. This is because throughout my career at Bumble, I have been closely involved in the testing of our iOS applications (they are native and written in Objective-C & Swift). Even though some of the paragraphs might feature iOS-specific tools and terms (e.g. XCTest), general principles and approaches can be applied ubiquitously. So even if your project uses a completely different technological stack, I’m sure you’ll find this article useful.
What is a Test Pyramid?
The term “Test Pyramid” is a metaphor that reflects the grouping of software tests into buckets of different granularity. It also gives an idea of the number of tests we should have in each of these groups.
This is the original Test Pyramid by Mike Cohn. It consists of 3 levels: Unit Tests, Service Tests and User Interface (UI) Tests. Regardless of the specific names or granularity you choose for each level, it illustrates 2 main points:
- Tests are required for each of its different levels;
- The higher up the pyramid you go, the fewer tests you should have.
In other words:
- Write lots of small and fast unit tests
- Write some middle-level tests
- Write very few high-level end-to-end tests
The number of levels and their names may differ. The most common model uses the following four categories of tests (from top to bottom): Manual tests, End-to-end tests, Integration tests and Unit tests. Consequently, the Test Pyramid should look like this:
We all strive for perfection and would like to have an ideal Test Pyramid. However, as software projects grow, the shape of the Test Pyramid often becomes the reverse of what was intended: a poorly maintainable Ice-Cream Cone.
This happens when there is not enough low-level testing (unit, integration and component), too many tests that run through the UI and an even larger number of manual tests. A slightly better variation sees manual tests transformed into the end-to-end suite.
How do you end up with this ‘ice-cream cone’? Oftentimes, it’s when you have different teams, perhaps working in silos, who add tests at different levels. For instance, developers write unit and integration tests independently from QA engineers who write end-to-end tests. Not only does this lead to the wrong distribution of tests by levels but also, because some of the test scenarios are automated at several different levels, the effort is duplicated.
iOS Test Levels at Bumble explained
Here are the levels that we currently have for iOS applications, from top to bottom:
- Manual Tests: checking the functionality without the help of autotests. Currently, these activities are limited to a handful of cases and we try to automate the most repetitive and tedious parts of manual testing as much as we can. However, we still have them. For instance, when a new feature is introduced (covered by autotests, of course), developers and QAs hold a so-called “1-hour QA Session”. The goal here is to ensure the feature is built according to the agreed-upon standards and is ready to be shipped. Moreover, since our apps are released every week, we perform manual release testing prior to shipping them. We check the most important functionality that can’t be automated or that requires additional attention. Finally, we have various activities, like Testing Dojos, that involve collective testing of the apps by numerous people from various teams.
- End-To-End Tests: these are black-box tests written by our QA engineers that focus on the client’s integration with other Bumble services and infrastructure. We use Calabash, a cross-platform automation framework. Tests are written in Ruby and use Cucumber with Gherkin syntax. In my previous article, I lifted the curtain on some of the process patterns we follow during the development of end-to-end tests. What’s more, we have published a sample project to showcase our framework, which you can find here. We also have a subset of tests that go through the app and take screenshots of every screen, across various device configurations and languages. We call these tests Liveshots, and you can learn more about them here.
- Component tests and Component Integration tests: these are particular types of black-box testing with test cases based on the specifications of the feature components, and their integration both with each other and with iOS services. They are meant to be isolated from the rest of Bumble systems. They are written in Swift (like the other tests in the lower levels of the pyramid) and rely on the built-in XCTest framework. We will return to them later. For this article, I’ll combine Component tests and Component Integration tests into one category named Component tests. This is partially for simplicity because currently, we have them both defined on the application level. In future, we intend to move Component tests to specific feature modules.
- Visual regression Tests (VRT): these are tests that check that the UI design matches expectations and that UI components integrate correctly with the OS UI subsystem. Often these types of tests are called “Snapshot tests”.
- Unit Tests: these tests ensure that a unit meets its design and behaves as intended. A unit is often an entire interface, such as a class, but it could also be an individual method. They should ensure that all non-trivial code paths are tested (including happy path and edge cases).
We’ve already mentioned the ‘top-heavy’ variant that resembles an ice-cream cone. This type is characterised by a lack of tests at the lowest levels (unit tests). If, instead, the medium-level (integration) tests are missing, the pyramid will resemble an hourglass. This is the current structure of our tests.
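These shapes can be told apart simply by comparing how many tests sit at each level. Here is a minimal sketch, assuming just three levels and plain ordering rules; the type and function names are hypothetical, and the counts in the usage lines are placeholders rather than real figures:

```swift
// Hypothetical helper: three levels only, compared by raw test counts.
enum SuiteShape: String {
    case pyramid, hourglass, iceCreamCone, other
}

struct TestCounts {
    let unit: Int
    let middle: Int     // integration / component tests
    let endToEnd: Int
}

func shape(of counts: TestCounts) -> SuiteShape {
    let (u, m, e) = (counts.unit, counts.middle, counts.endToEnd)
    if u >= m && m >= e { return .pyramid }       // widest at the bottom
    if m < u && m < e { return .hourglass }       // thin middle layer
    if u <= m && m <= e { return .iceCreamCone }  // widest at the top
    return .other
}

// Placeholder counts, for illustration only.
print(shape(of: TestCounts(unit: 5000, middle: 300, endToEnd: 900)).rawValue)  // hourglass
print(shape(of: TestCounts(unit: 5000, middle: 1500, endToEnd: 300)).rawValue) // pyramid
```

Real suites rarely fit such clean rules, so a check like this is better suited to tracking a trend over time than to setting a hard target.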
The hourglass comprises many end-to-end tests and many unit tests, but few integration tests. It isn’t quite as bad as the ice-cream cone, but it still results in too many end-to-end test failures that could have been caught more quickly and more easily using a suite of medium-scope tests. Moreover, end-to-end suites tend to be slow, unreliable, and difficult to work with. Unfortunately, the complexity and non-determinism (flakiness) of end-to-end tests is something one can’t escape from. According to Martin Fowler, the main reasons for non-determinism are:
- Lack of isolation
- Asynchronous behaviour
- Remote services
- Resource leaks
Building the Test Pyramid
Now we need to turn our Testing Hourglass into the Test Pyramid. We acknowledge that tests are required on all levels yet encourage tests of smaller size and narrower scope, as they make failure diagnosis quick and painless.
Our desired mix of tests is determined by our two primary goals: engineering productivity and product confidence. Favouring lower-level tests will give us increased confidence quickly, and early in the development process. Larger tests, on the other hand, will act as sanity checks as the product develops. Ultimately, they should not be considered a primary method for catching bugs.
When considering your mix, you might want a different balance. If you focus mainly on integration testing, you might discover that your test suites take longer to run but catch more issues between components. Focus more on unit tests and your test suites will likely complete very quickly, and you will catch many common logic bugs. Having said that, unit tests are unable to verify the interactions between components, such as a contract between two systems developed by different teams. A good test suite, therefore, contains a blend of different test sizes and scopes that are appropriate to the local architectural and organisational realities.
Obtaining the desired Test Pyramid structure is not something that can be achieved overnight; it’s a lengthy process. Here are the principles that we follow to change the pyramid:
- Test coverage and levels should be discussed at the planning stage. As soon as work on a new feature is about to start, a QA is assigned to it. They actively participate in the feature’s kick-off together with developers and all other parties involved. After the kick-off, and having reviewed all the documentation in detail, the QA and developer start working on the various test scenarios. This is the time when they discuss and agree on the proper test coverage for the feature. Only when this has happened can coding of both the feature and its tests start.
- Tests may be of different granularities, but they should always add value. Remember: the higher the level of the tests, the fewer tests there should be. Therefore, we are pushing the tests as far down the test pyramid as we can.
Note: when a higher-level test gives more confidence in the application working correctly, it should be added. Otherwise, it is better to stick to the lower levels.
- Treat the code of the autotests like the code of the application under test. Autotest code should get the same level of attention. Otherwise, we risk ending up with an unmaintainable test codebase. That won’t be much help to us and will require a great deal of effort to work with.
Writing tests on a lower level
Being a QA Engineer, I more often work with the tests on higher levels: manual and end-to-end. They occupy a major part of my working day: whether it’s supporting a new feature with tests or doing a manual re-check of a tricky bug report.
At the same time, during the feature development, we discuss and review the tests on all levels. A compulsory step for any feature is to create a Test Plan. It precedes the active development phase and involves sketching the prospective tests. This is also the time when QAs can prompt developers to ensure all newly added elements have identifiers to be used in end-to-end tests or to request necessary helper tools (we call them QAAPI methods, read about them here).
Even though I’ve taken part in such discussions, I never felt fully involved in the ideation or development of lower-level tests, nor, more importantly, in advocating for moving tests from the top levels down (to keep the pyramid healthy). For me, seeking more involvement in the development of the tests on the lower levels was not only due to natural curiosity. Like many of us, I knew that by sticking to an extensive end-to-end suite we could be missing out on shorter test times, faster feedback and more efficient results, i.e. all the benefits that we get from tests on the lower levels.
To put this into perspective, on the Bumble iOS application we currently have around 900 end-to-end scenarios in different test suites. The vast majority are run on simulators, in parallel as far as possible, and are relatively fast. The latest measurements show that the average end-to-end test on the iOS simulator takes from 30 to 90 seconds to run (including setup and teardown). Thanks to parallel execution, a full test suite run takes about 20–30 minutes to finish. To complete the picture, a test run on real iOS devices (a different subset of tests that can’t be migrated to the simulator, e.g. because they need a real Camera or Permissions) takes ~12 mins with an average test duration of 2 min. Luckily, we don’t run all these tests on every branch: we’ve introduced smart logic that picks only the tests relevant to the changed modules and features. However, if we go down the pyramid, we get not only faster runs but also more frequent runs and less manual supervision. Speaking of execution time, jumping ahead I can reveal that, across different configurations, 12–15s is the average time for a full Component test scenario run.
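The idea of running only the tests that are relevant to a change can be sketched as a lookup from changed modules to the suites that exercise them. This is a simplified illustration, not a description of our actual selection logic; the module names, suite names and the run-everything fallback are all assumptions:

```swift
// Hypothetical mapping from a module to the test suites that exercise it.
let suitesByModule: [String: Set<String>] = [
    "Chat": ["ChatE2E", "ChatComponent"],
    "Profile": ["ProfileE2E"],
    "Payments": ["PaymentsE2E", "PaymentsComponent"],
]

/// Returns the suites to run for a set of changed modules.
/// If any changed module is unknown, falls back to the full list to stay on the safe side.
func suitesToRun(changedModules: Set<String>,
                 mapping: [String: Set<String>],
                 fallback: Set<String>) -> Set<String> {
    var selected = Set<String>()
    for module in changedModules {
        guard let suites = mapping[module] else {
            return fallback  // unmapped module: run everything
        }
        selected.formUnion(suites)
    }
    return selected
}

// The full list is simply the union of every mapped suite.
let allSuites = suitesByModule.values.reduce(into: Set<String>()) { $0.formUnion($1) }

print(suitesToRun(changedModules: ["Chat"], mapping: suitesByModule, fallback: allSuites).sorted())
// ["ChatComponent", "ChatE2E"]
```

Falling back to the full suite for unmapped modules trades speed for safety: a gap in the mapping then costs time rather than letting a change go untested.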
Earlier this year we introduced an amazing initiative, our so-called “Focus Fridays”. This programme is intended to provide relief from the seemingly endless cycle of calls, emails and messages. It allows everyone to take two Fridays per month to have more time to think, reflect, relax, and work without interruption. I decided to dedicate this time to digging deeper into Component tests.
Moving end-to-end tests to Component level
Component tests are acceptance tests that test user experience through UI interactions. They are based on the Apple XCTest framework’s UI tests functionality. The application is tested as a black box and all external interactions (i.e. network access, push notifications) are mocked or simulated.
Our current solution works as follows:
- During development, the tests are run against the real server and interactions with the server are recorded and stored.
- After the feature is merged, all the tests are run in playback mode only.
- Later, if the test needs amending, it needs to be re-recorded.
Before selecting this approach we carefully considered all its pros and cons.
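The record-then-playback flow can be illustrated with a small stand-in for the network layer. This is a sketch of the general technique under simplifying assumptions, not our actual implementation: requests and responses are plain strings, the tape lives in memory rather than in stored files, and every type and function name is hypothetical:

```swift
// Hypothetical names throughout; strings stand in for real requests and responses.
enum Mode { case record, playback }

struct Exchange: Equatable {
    let request: String
    let response: String
}

enum TransportError: Error, Equatable {
    case noRecording(String)
    case liveCallInPlayback(String)
}

final class RecordingTransport {
    private let mode: Mode
    private let live: (String) throws -> String
    private(set) var tape: [Exchange]

    init(mode: Mode, tape: [Exchange] = [], live: @escaping (String) throws -> String) {
        self.mode = mode
        self.tape = tape
        self.live = live
    }

    func send(_ request: String) throws -> String {
        switch mode {
        case .record:
            // Development: call the real backend and store what came back.
            let response = try live(request)
            tape.append(Exchange(request: request, response: response))
            return response
        case .playback:
            // After merge: answer from the tape only, consuming the first matching entry.
            guard let index = tape.firstIndex(where: { $0.request == request }) else {
                throw TransportError.noRecording(request)
            }
            return tape.remove(at: index).response
        }
    }
}

// Record once against a stand-in for the real server...
let recorder = RecordingTransport(mode: .record) { request in "response to \(request)" }
_ = try recorder.send("GET /profile")

// ...then replay with no live dependency at all.
let player = RecordingTransport(mode: .playback, tape: recorder.tape) { request in
    throw TransportError.liveCallInPlayback(request)
}
print(try player.send("GET /profile"))  // response to GET /profile
```

In playback mode a request with no recording fails loudly instead of silently reaching a server, which is what keeps a merged test isolated and is also why an amended test has to be re-recorded.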
Our current challenges in Component tests, our approaches to resolving the possible non-determinism and tackling the flakiness, are all topics for a separate article. All I can tell you so far is that we have a zero-flakiness policy for Component tests and to date, we have around 300 iOS Component tests.
It’s worth noting that our Component tests do not cover the following:
- Components external to the iOS system, such as network, server and Apple Push Notification service (APNs). They are checked in end-to-end tests.
- User interface design, layout and appearance. These are checked in Visual regression Tests.
- Logic defined in a single class, or in a set of classes without a user interface. This is checked in Unit Tests.
For new features, we add Component tests straight away. However, we still have quite a few existing end-to-end tests that can be moved down the pyramid. To that end, we undertook an investigation and discovered various candidates. I wanted to try it myself to see first-hand the differences between end-to-end and Component tests, and the benefits of the latter.
Firstly, here are some of the examples of end-to-end scenarios that can be migrated to a lower level and don’t require any external dependencies or server interaction.
They will mostly retain the same idea, structure and steps. Only implementation specifics, language and framework will vary.
Now, let’s take the last example and see how such transitions impact the test running time. For the initial end-to-end test on our continuous integration build agents, the average time was around 1 minute:
For the migrated scenario, the average time is less than 15s:
Apart from a noticeable increase in the speed of the tests, we also saw improvements in stability. This was thanks to better isolation: the server and network interactions are mocked, as I described earlier.
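To get a feel for what this per-test difference could mean at suite level, here is a back-of-the-envelope sketch. It assumes tests are spread evenly over a fixed pool of parallel simulators and ignores build, boot and queueing time. The function name, the 60-second midpoint of the 30–90 second range and the pool size of 36 are illustrative assumptions, not figures from our CI; the pool size was picked only so that the end-to-end estimate lands inside the 20–30 minute range mentioned earlier.

```swift
/// Idealised wall-clock time, in minutes, for a suite spread evenly over parallel workers.
/// Ignores build time, simulator boot and scheduling overhead.
func wallClockMinutes(tests: Int, averageSeconds: Double, workers: Int) -> Double {
    precondition(tests >= 0 && workers > 0, "need a non-negative test count and at least one worker")
    // Each worker runs its share sequentially; round up for the last partial batch.
    let batches = (Double(tests) / Double(workers)).rounded(.up)
    return batches * averageSeconds / 60
}

// 900 scenarios at an assumed 60 s average on a hypothetical pool of 36 simulators.
let endToEnd = wallClockMinutes(tests: 900, averageSeconds: 60, workers: 36)
// The same number of scenarios at 15 s each, purely as a what-if.
let component = wallClockMinutes(tests: 900, averageSeconds: 15, workers: 36)
print(endToEnd, component)  // 25.0 6.25
```

The second figure is a thought experiment only: not every end-to-end scenario can be moved down a level, so the realistic gain depends on how many candidates actually migrate.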
Our teams write different types of tests to ensure the application works as expected. No matter what application you are working on, it’s always a good idea to keep the Test Pyramid in mind when you’re aiming to optimise the test coverage. Its basic principles are always valid and can be applied ubiquitously. To achieve that, everyone needs to understand them and share them. Taking the Test Pyramid into account, you should consider where you are focusing your testing efforts: even though UI end-to-end tests are important, they are also the most expensive ones to write and maintain and the slowest to run. It’s vital that the whole team discusses the Test Plan prior to feature development and decides together on the test coverage.
We are still actively working on improving the test coverage and adding tests on all levels of the pyramid. What is paramount to us is having the right tests on the right levels. For the existing scenarios, we consider migration two or even three levels down. It may well happen that a scenario that was once end-to-end will, after re-work, end up in the Visual Regression Tests suite. Or a Component test can in fact be substituted with a Unit one.
Having tried Component tests myself, I now understand their specifics better. Additionally, I have learned much more about the XCTest framework features, internals and limitations. Now, not only do I have a better understanding of our Component tests ecosystem, but I’m eager to spread this knowledge across the QA team.
My hope is that this will help us be more effective when planning coverage for new tests, as well as in our work related to test maintenance, support and migration.