Mobile App Testing Strategy at Very Large Online Platforms (VLOP)

Hakkim Alavudeen
Head of App Engineering @ Zalando
12 min read · May 22, 2024

Table of Contents

  1. Defining App Testing Strategy
  2. The Elements of our Guiding Policies
  3. Coherent Actions
  4. Success Measures
  5. Results

Defining App Testing Strategy — Context

No strategy exists in a vacuum. The context (the problem space and its forces) will deeply shape your strategy. Here are the challenge parameters that frame our discussion:

  • Capacity constraints
    ~100 native iOS and Android engineers
  • Mounting compliance regulations
  • Increasing pressure to deliver more customer journeys, faster
  • Large testing-related tech debt across all test types

What does delivery entail in the mobile apps world?

  • Submit a single binary (the compiled native app) to a third-party app store.
  • Once approved by the app store, the release needs to be downloaded and installed on customers’ phones.
  • Ideally this shouldn’t happen too frequently (to minimise mobile network data usage). Four-week or even three-week release cycles are considered quite good in the industry.

Shorter Release Cycles and DORA

The first DORA report.

Interestingly, it turns out that increasing deployment frequency (i.e. shortening the release cycle) directly correlates with gaining competitive business advantage. So where does that come from? You are probably familiar with the DORA reports — but if you’re not:

As early as 2013, the DevOps Research and Assessment (DORA) team, now part of Google Cloud, conducted large-scale surveys to understand statistical links between performance factors. In 2014, they published the first State of DevOps report, and they have continued to publish these reports annually ever since.

The original DORA metrics.

“DORA’s research has consistently found that a team’s software delivery capability reliably predicts the value that the team provides to their organization. Survey respondents who achieve high levels of software delivery performance report that their organizations perform better on business objectives. Performance can be assessed according to four software delivery metrics” - DORA Website

The Challenge Summarised: In order to gain competitive business advantage, we want to deliver delightful customer experiences faster, but with less engineering capacity. More precisely: we want to shorten our release cycle from 3 weeks (the status quo back then) to 2 weeks, thereby increasing our Deployment Frequency.

Now, the full Scope of this involves improvements in the maturity of:

  • App Architecture, esp. towards App Modularisation
  • Sprint and Release Planning
  • Operational Excellence, e.g. Telemetry, production incident response capabilities
  • Quality Assurance (Testing, esp. Automation)

I’ll focus on Improving Testing Maturity in this article, and follow up on the others in later articles.

Defining Strategy at Apps — Approach

Book: Good Strategy Bad Strategy, by Richard Rumelt.

We wanted a well-established approach to defining strategy, based on sound principles and evidence, and we really liked Richard Rumelt’s concept of structuring strategy around the Strategy Kernel, from his book Good Strategy Bad Strategy.

Elements of Bad Strategy.

The premise of this book is that you have to be able to recognize the elements of bad strategy in order to understand what good strategy looks like. Examples of bad strategy he gives in the book include aspirational goals or objectives, an inability to make choices, and fluff, i.e. lofty language that masks the lack of meaningful substance underneath.

He postulates that sound strategy contains three elements at its Kernel or core:

The Elements of the Good Strategy Kernel.
  • Diagnosis corresponds to the Challenges.
  • Guiding Policy corresponds to the Choices made.
  • Coherent Actions correspond to a concrete roadmap, where activities incrementally build on top of each other.

The Strategy Kernel Applied

  • Diagnosis:
    →​ How do we improve Testing Outcomes with minimal capacity?
  • Guiding Policies:
    →​ Focus efforts on Improving Customer Facing UI tests first
    →​ What further policies/decisions does this imply?
  • A Set of Coherent Actions:
    →​ Create and Execute Roadmap based on decisions made.

The Elements of our Guiding Policies — Challenges

Let’s dive deeper into the decisions and their implications. The following questions immediately come to mind:

… Why Higher and not Lower in the Test Pyramid (Unit tests)?

… Does that mean unit tests are less important?

… Wait, what does “Customer Facing UI Tests” even mean?

So let’s start by clarifying these. The Test Pyramid concept and its taxonomy suffer from some fundamental problems:

1. The Test Pyramid — Updated

“Don’t become too attached to the names of the individual layers in Cohn’s test pyramid. In fact they can be quite misleading: service test is a term that is hard to grasp (Cohn himself talks about the observation that a lot of developers completely ignore this layer). In the days of single page application frameworks like react, angular, ember.js and others it becomes apparent that UI tests don’t have to be on the highest level of your pyramid — you’re perfectly able to unit test your UI in all of these frameworks.” — “The Practical Test Pyramid”, Ham Vocke, published on martinfowler.com.

2. Fuzzy Definitions

“What do you call a test that tests your application through its UI?

  • An end-to-end test?
  • A functional test?
  • A system test?
  • A selenium test?
  • Tests running against less of the stack?

I’ve heard all of them, and more. I reckon you have too. The same equally frustrating inconsistency applies further down. Just what, exactly, is

  • an integration test?
  • A unit test?

How do we name these things? Gah!

The problem with naming test types is that the names tend to rely on a shared understanding of what a particular phrase means. That leaves plenty of room for fuzzy definitions and confusion.” — from ‘Test Size’, Google Testing Blog.

3. Atomic Design & Test Types

The Atomic Design categories illustrated with examples.

So which is the correct “unit”? That probably depends on who you ask! There is no one universally correct answer (at least, in our opinion).

Developing our Ubiquitous Language

For us, Customer Facing UI Tests mean tests that are intrinsically:

  • Hermetic (isolated, deterministic, repeatable)
  • High-fidelity (as real and as close to the actual customer experience as possible)
  • Agile and maintainable, on a best-effort basis
  • Of any test size (Big, Medium or Small)

“The more your tests resemble the way your software is used, the more confidence they can give you” — React Testing Library Principle.

NB: This does NOT mean unit tests are less important!

Elements of our Guiding Policies — The Key Choices

There is a plethora of mobile devices in all kinds of form factors, often running an operating system customized by the device manufacturer, like Samsung’s One UI, which is a layer on top of the Android OS.

Constraint: It is impossible to automate testing of every customer-facing feature of a mobile app across every device configuration!

Solution: Narrow down the scope through segmentation/bucketing.

The MECE Principle

Developed by Barbara Minto at McKinsey (also the author of The Pyramid Principle: Logic in Writing and Thinking), MECE is a categorisation framework that makes it easier to group items and data in complex systems, in order to analyse them better and derive useful conclusions. The two categorisation rules of MECE are:

  • Mutually Exclusive — An item or data record can only be in one category at a time
  • Collectively Exhaustive — All items or data records must be included in one category (i.e., your categorisation structure cannot exclude any item or data record)

The Key Choices

→​ Focus efforts on Improving Customer Facing UI tests first

Q1: What tiers of Customer facing UI tests should we write?

Q2: Which type of testing tool(s) should we employ to write them?

Q3: Do we run our tests on real or simulated/emulated devices?

Q4: Should we employ local or cloud device farms to run UI tests?

Q5: Which device farm provider should we adopt to run UI tests?

Q6: Which UI test framework should we employ?

Next, I’ll dive deep into our rationale for answering these questions, towards creating the guiding policy.

Q1: What tiers of Customer facing UI tests should we write?

We proposed the following MECE Categories:

  1. Tiers of automated UI Tests
  2. Types of test scenarios
  3. Scope of screen interaction

Let’s decompose each of these categories further.

Category 1: Tiers of automated UI Tests — Our Taxonomy / Ubiquitous Language

To illustrate using the example of an Order History feature:

  1. Functional tests: Validate that a given user interaction works correctly, as set out in the original specification of the feature (typically produced by the Product team).
    Example: an automated UI test to ensure that all past orders are shown in the Order History view when the user opens it (see the code sketch at the end of this category).
  2. Non-functional tests: Validate that a correct user interaction doesn’t degrade under certain conditions.
    Examples:
    i) Automated UI test to ensure that the Order History view loads fully within a maximum latency threshold.
    ii) Automated UI test to ensure that the Order History view is responsive across devices with a minimal CPU/Memory spec threshold.

Further, Non-functional tests can be classified as:

  1. Security and Authorization tests e.g. ensure that a malicious user cannot access the Order History view of another user.
  2. Performance tests (customer-specific parameters) e.g. verify that Order History view loads within 1s even if the client has thousands of past orders.
  3. Load Tests (positive) e.g. verify that Order History view loads within 5s under server-side load (when order history service is under significant load) or under client-side load (simulated high memory or background resources consumption).
  4. Stress Tests (negative) i.e. Increase server-side or client-side load parameters systematically in order to find capacity limit points of the Order History view and ensure that the system performs as expected under load, and fails gracefully under stress.
  5. Usability Tests i.e. these test accessibility, intuitiveness and appeal of user interactions in the Order History view.
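
To ground the functional tier in code, here is a minimal sketch of the Order History example above, written with Espresso (one of the native frameworks discussed later in this article). The activity name, view IDs and order text (OrderHistoryActivity, order_history_list, “Order #1042”) are hypothetical placeholders, not our actual code, and the sketch assumes the order data has been stubbed beforehand.

```kotlin
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.assertion.ViewAssertions.matches
import androidx.test.espresso.matcher.ViewMatchers.isDisplayed
import androidx.test.espresso.matcher.ViewMatchers.withId
import androidx.test.espresso.matcher.ViewMatchers.withText
import androidx.test.ext.junit.rules.ActivityScenarioRule
import androidx.test.ext.junit.runners.AndroidJUnit4
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class OrderHistoryFunctionalTest {

    // Hypothetical activity; in a real suite this would be your app's Order History screen.
    @get:Rule
    val activityRule = ActivityScenarioRule(OrderHistoryActivity::class.java)

    @Test
    fun pastOrdersAreShownWhenOrderHistoryOpens() {
        // The order list container is visible...
        onView(withId(R.id.order_history_list)).check(matches(isDisplayed()))

        // ...and a known (stubbed) past order is rendered in it.
        onView(withText("Order #1042")).check(matches(isDisplayed()))
    }
}
```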

Category 2: Types of test scenarios

To ground theoretical concepts in practice, let’s again illustrate the test scenario types, each with an example, this time from the “Shopping Cart Checkout — Edit Delivery Address” use case:

  1. Positive tests or Happy Paths: Verify correctness of a user interaction with valid, required parameters.
    e.g. ensure that valid delivery addresses can be submitted by the user.
  2. Positive tests + optional parameters: Verify correctness of a user interaction with valid, required parameters AND valid optional parameters.
    e.g. case 1, plus a valid input for the optional “Additional delivery info” input.
  3. Negative testing — Invalid input: Ensures that your application can gracefully handle unexpected user behavior and invalid input (see the sketch after this list).
    Examples:
    i) ensure that user cannot update the address with a non-German city when Germany is set as the delivery country, or
    ii) ensure that user cannot input a numerical value for the “Town / City” text field.
  4. Destructive Testing — Edge cases: Intentionally attempting to crash the view.
    Examples:
    i) by aborting the submit operation while it is in progress by clicking the back button, or
    ii) by issuing concurrent update calls so as to simulate the same user updating the delivery address from multiple clients.
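
As an illustration of the negative-testing scenario above (rejecting a numeric “Town / City” value), here is a hedged Espresso sketch. The activity, view IDs and error copy are hypothetical placeholders.

```kotlin
import androidx.test.espresso.Espresso.closeSoftKeyboard
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.action.ViewActions.replaceText
import androidx.test.espresso.assertion.ViewAssertions.matches
import androidx.test.espresso.matcher.ViewMatchers.isDisplayed
import androidx.test.espresso.matcher.ViewMatchers.withId
import androidx.test.espresso.matcher.ViewMatchers.withText
import androidx.test.ext.junit.rules.ActivityScenarioRule
import androidx.test.ext.junit.runners.AndroidJUnit4
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class EditDeliveryAddressNegativeTest {

    // Hypothetical activity representing the "Edit Delivery Address" screen.
    @get:Rule
    val activityRule = ActivityScenarioRule(EditDeliveryAddressActivity::class.java)

    @Test
    fun numericCityIsRejectedWithValidationError() {
        // Type an invalid (numeric) value into the "Town / City" field.
        onView(withId(R.id.town_city_input)).perform(replaceText("12345"))
        closeSoftKeyboard()

        // Attempt to submit the address.
        onView(withId(R.id.submit_address_button)).perform(click())

        // The screen should surface a validation error instead of accepting the input.
        onView(withText("Please enter a valid city")).check(matches(isDisplayed()))
    }
}
```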

Category 3: Scope of screen interaction

  1. Screen UI tests: Checks critical user interactions within a single screen, by performing actions such as clicking buttons, typing in forms, and checking visible states.
    One test class per screen is a good starting point.
    Models consistent state constraints within a single screen.
  2. User flow tests/Navigation tests: Checks the customer journey across screens.
    Corresponds to a logical business use case, e.g. the ‘Shopping Cart Checkout’ customer journey.
    Ensures consistent state transfers across many screens.
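
A user flow test then chains several screens together. The sketch below walks a simplified ‘Shopping Cart Checkout’ journey with Espresso; all activity names, view IDs and input values are hypothetical placeholders.

```kotlin
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.action.ViewActions.replaceText
import androidx.test.espresso.assertion.ViewAssertions.matches
import androidx.test.espresso.matcher.ViewMatchers.isDisplayed
import androidx.test.espresso.matcher.ViewMatchers.withId
import androidx.test.ext.junit.rules.ActivityScenarioRule
import androidx.test.ext.junit.runners.AndroidJUnit4
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class CheckoutFlowTest {

    // Hypothetical entry point of the checkout journey.
    @get:Rule
    val activityRule = ActivityScenarioRule(CartActivity::class.java)

    @Test
    fun userCanEditDeliveryAddressDuringCheckout() {
        // Screen 1: Cart -> proceed to checkout.
        onView(withId(R.id.checkout_button)).perform(click())

        // Screen 2: Checkout summary -> open address editing.
        onView(withId(R.id.edit_address_button)).perform(click())

        // Screen 3: Edit address -> change the street and save.
        onView(withId(R.id.street_input)).perform(replaceText("Teststrasse 1"))
        onView(withId(R.id.save_address_button)).perform(click())

        // Back on the summary screen: verify the state carried across screens.
        onView(withId(R.id.delivery_address_summary)).check(matches(isDisplayed()))
    }
}
```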

Q2: Which type of testing tool(s) should we employ?

Comparison of testing tool types against evaluation criteria.

As discussed earlier, some of the factors specific to mobile apps are:
Multiple screen resolutions; multiple languages i.e. localization; multiple form factors, orientations and dynamic screen content.

1. Capture and Replay
These tools record actions such as clicking, scrolling, swiping, or typing into a script. Then, through the replay function, they execute the exact same actions over and over again.

2. Coordinate-Based Recognition
Rely on predefined x and y axis coordinates to access and interact with the UI elements of the app.

3. OCR/Text Recognition
These tools obtain the text of the control elements that are visible on the screen of the mobile device. To determine if the text is visible on the screen, OCR technology is used.
Example: Eggplant

4. Image Recognition aka Snapshot Testing
These tools compare images in order to validate the user interface of an app. They take screenshots, for example of buttons, labels, or even an entire view of a given screen, which are stored as baselines. When the test is executed, the image recognition tool compares the current screen with the stored baseline image.
Example: Shot, Testify, Dropshots, Paparazzi

5. Native Object Recognition
These tools detect UI objects via a UI element tree, accessed through XPath (XML Path Language) locators, CSS (Cascading Style Sheets) locators, or the native object ID of the element. This enables access to native elements such as buttons, labels, views, lists, and other kinds of UI elements. Examples: Espresso, UI Automator, XCUITest.
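
The Espresso sketches earlier in this article are examples of native object recognition (type 5): elements are located through the native view tree by ID or text. For snapshot testing (type 4), here is a minimal sketch using Paparazzi, one of the tools listed above; the layout resource view_order_history is a hypothetical placeholder. Paparazzi renders layouts on the JVM (no device or emulator needed) and diffs them against recorded baseline images.

```kotlin
import android.view.View
import app.cash.paparazzi.DeviceConfig
import app.cash.paparazzi.Paparazzi
import org.junit.Rule
import org.junit.Test

class OrderHistorySnapshotTest {

    // Renders Android layouts on the JVM and diffs the result against a recorded baseline.
    @get:Rule
    val paparazzi = Paparazzi(
        deviceConfig = DeviceConfig.PIXEL_5,
        theme = "android:Theme.Material.Light.NoActionBar"
    )

    @Test
    fun orderHistoryLooksAsExpected() {
        // Hypothetical layout resource; replace with a real layout from your module.
        val view = paparazzi.inflate<View>(R.layout.view_order_history)
        paparazzi.snapshot(view)
    }
}
```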

Q3: Run tests on real or simulated/emulated devices?

Evaluation of Emulators / Simulators.
Evaluation of Real Devices.

Q4: Should we employ Local or Cloud Device Farms to run UI tests?

Local Device farm

Pros

  • Full control over devices.
  • Cheaper when testing on a small number of devices.

Cons

  • New devices released each year, representing continuous investment into devices.
  • Costly, in terms of number of devices needed to reach decent coverage.
  • Costly, in terms of maintenance efforts.

Cloud Device Farm

Pros

  • Wide array of devices available. Cumulative coverage across a wide range of devices possible when swapping out non-core devices with each new run.
  • Cost effective “Pay as you go” costing model.
  • No device maintenance required.

Cons

  • More expensive when testing on a small number of devices.
  • Test failures on remote devices are sometimes not reproducible locally.

Q5: Which device farm provider should we adopt to run UI tests?

Candidates

  • BrowserStack.
  • Firebase Test Lab.
  • AWS Device Farm.

Parameters

  • Vendor lock-in.
  • Cost.
  • Available phone & tablet devices.
  • Support for testing frameworks.
  • Test reporting functionality incl. Test flakiness, performance profiling.
  • Parallelisation, incl Sharding configurability.
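
On the parallelisation and sharding point: one building block that most device farms expose in some form is instrumentation-level sharding via AndroidJUnitRunner arguments. Below is a minimal Gradle Kotlin DSL sketch, assuming the standard AndroidJUnitRunner; device farms typically layer their own sharding configuration on top, so treat this as an illustration rather than a provider-specific setup.

```kotlin
// build.gradle.kts (module): a sketch, not our actual configuration.
android {
    defaultConfig {
        testInstrumentationRunner = "androidx.test.runner.AndroidJUnitRunner"

        // Split the instrumented test suite into 4 shards and run shard 0 here;
        // a CI job or device farm runs the remaining shards in parallel.
        testInstrumentationRunnerArguments += mapOf(
            "numShards" to "4",
            "shardIndex" to "0"
        )
    }
}
```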

Q6: Which Test Framework should we employ?

Today’s automated testing tooling ecosystem is huge, catering to a wide range of testing types and scenarios. In order to narrow down our choices, we put forward the following selection criteria:

  1. Must facilitate writing hermetic UI tests, i.e. tests fully isolated from non-deterministic state and interactions, e.g. through mocking of dependencies that introduce variability (see the sketch after this list).
  2. Must enable engineers to minimize test flakiness.
  3. Must be compatible with programming languages already familiar to engineers.
  4. Must be performant and cost-effective.
  5. Must be compatible with internal CI/CD infra and AppCenter (for creating seamless internal builds).
  6. Must be compatible with all of our tooling and testing recommendations.
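
To illustrate criterion 1, here is a hedged sketch of one common way to make a UI test hermetic: start a local MockWebServer and point the app at it, so the test never touches a real backend. TestAppConfig.baseUrl is a hypothetical injection hook; the real mechanism depends on how your app wires up its networking.

```kotlin
import okhttp3.mockwebserver.MockResponse
import okhttp3.mockwebserver.MockWebServer
import org.junit.After
import org.junit.Before
import org.junit.Test

class HermeticOrderHistoryTest {

    private val server = MockWebServer()

    @Before
    fun setUp() {
        // A canned response replaces the real backend, so the test no longer depends
        // on network conditions or server-side state.
        server.enqueue(
            MockResponse()
                .setResponseCode(200)
                .setBody("""[{"orderId": "1042", "status": "DELIVERED"}]""")
        )
        server.start()

        // Hypothetical injection hook: point the app's API client at the local server.
        TestAppConfig.baseUrl = server.url("/").toString()
    }

    @Test
    fun orderHistoryRendersStubbedOrders() {
        // Launch the Order History screen and assert on the stubbed order,
        // e.g. with the Espresso assertions shown earlier in this article.
    }

    @After
    fun tearDown() {
        server.shutdown()
    }
}
```

Being able to inject the base URL, rather than hard-coding it, is the prerequisite that makes this kind of hermetic setup possible.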

Compatible UI Test Automation Ecosystem for Android:

  • Appium
  • Calabash for Android
  • Espresso
  • UI Automator
  • Robolectric
  • Spoon

Compatible UI Test Automation Ecosystem for iOS:

  • XCTest / XCUITest
  • Appium

Coherent Actions at Enterprise Scale

Elements of Good Strategy.

As part of a pre-kick-off phase, i.e. Milestone 0 (M0), we could run a period of enablement in the form of:

  1. In-person trainings on how to write tests using best practices with the selected tooling.
  2. Integration of the tooling into our CI/CD infra and development workflows, so that adoption is seamless, and support for onboarding client teams.
  3. Setting up code coverage dashboards, particularly for coverage from UI tests, and sending out weekly trends on coverage data.
  4. Implementing missing capabilities in the platform/foundation for the testing tools.
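
For item 3, the build first has to produce coverage data before any dashboard can chart it. A minimal Gradle Kotlin DSL sketch, assuming a recent Android Gradle Plugin (these property names have changed across AGP versions):

```kotlin
// build.gradle.kts (app or feature module): a sketch, not our actual configuration.
android {
    buildTypes {
        getByName("debug") {
            // Collect coverage from instrumented (UI) tests and from local unit tests,
            // so per-module dashboards can track both.
            enableAndroidTestCoverage = true
            enableUnitTestCoverage = true
        }
    }
}
```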

Measuring Success — KPIs

  • Deployment Frequency (North Star)
  • Code Coverage (per module)
  • Change Failure Rate
  • Probabilistic Flakiness Score (PFS, for test flakiness)
    Per test, fixture, test set, module, etc.
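
PFS is a statistical estimate of how likely a test is to produce inconsistent results on the same code. As a rough intuition only (this is not the actual PFS model), the hypothetical sketch below counts a revision as flaky evidence for a test when the test both passed and failed on that revision, and averages this over revisions:

```kotlin
// A deliberately simplified flakiness estimate (not the actual PFS model):
// a revision counts as "flaky evidence" for a test when that test both passed
// and failed on the same code revision.
data class TestRun(val testId: String, val revision: String, val passed: Boolean)

fun flakinessScores(runs: List<TestRun>): Map<String, Double> =
    runs.groupBy { it.testId }
        .mapValues { (_, testRuns) ->
            val runsByRevision = testRuns.groupBy { it.revision }.values
            val flakyRevisions = runsByRevision.count { revisionRuns ->
                revisionRuns.any { it.passed } && revisionRuns.any { !it.passed }
            }
            flakyRevisions.toDouble() / runsByRevision.size
        }

fun main() {
    val runs = listOf(
        TestRun("checkoutFlow", "abc123", passed = true),
        TestRun("checkoutFlow", "abc123", passed = false), // flaky on abc123
        TestRun("checkoutFlow", "def456", passed = true),
        TestRun("orderHistory", "abc123", passed = true),
    )
    println(flakinessScores(runs)) // {checkoutFlow=0.5, orderHistory=0.0}
}
```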
