Pre-Submit UI Tests at Pinterest

Mansfield Mark | Mobile Test Tools Lead, Metrics Quality and Test Tools

Summary

As part of our effort to shift left (performing testing earlier, i.e. further left on the project timeline), this post covers how we began running a large end-to-end UI test suite before every commit to our Android and iOS repositories. The project required careful coordination across UI testing, test infrastructure, and developer productivity.

After shipping, we were able to:

  • Decrease investigation and failure resolution time
  • Raise our test suite’s pass rate from <50% to >90%
  • Enable our UI test platform to support more tests from more teams

Code Change Lifecycle

Let’s break up the change lifecycle into 4 steps:

  1. Code Review: Share a pull request, address comments
  2. Pre-Submit Checks: Checks such as unit tests and linters, run on your pull request applied to some base commit
  3. Submit*: The diff is committed onto the main branch
  4. Post-Submit Checks: Checks such as integration tests and release builds

(*) Some organizations have checks within the “Submit” step beyond simple merge conflict detection, such as Uber’s SubmitQueue. While this is out of scope for this post, these checks are worth considering as a place to run tests, albeit with their own complexities and tradeoffs.

Motivation

As our UI test suite grew, we found it prohibitively difficult to maintain the tests. Failures on a non-critical path were often not addressed in a timely manner, with unsustainable consequences. For example:

  • The person fixing the tests was often not the one who had worked on the failing code
  • The person fixing the tests was often coming to that work several days later and after many context switches
  • Failures could stack onto each other, complicating fixes and obfuscating issues

A test suite with many failing tests quickly loses its value. Test cases would be either manually verified by a release team, or simply ignored. As the test suite deteriorated, fewer regressions were caught, and any future investment in UI tests appeared less worthwhile.

UI tests are slow, expensive, and unstable. As a result, they are typically run in the Post-Submit step, which keeps them off the developer’s critical path and avoids delays due to their speed or instability.

This post documents how we moved our UI test suite to run in Pre-Submit instead of Post-Submit. This change dramatically reduced our test maintenance costs and better protected our releases.

Challenge 1 — Ownership

We require every test to have exactly one team responsible for maintaining it.

Owners will:

  • Monitor an alert channel to respond to issues with their tests
  • Decide how to fix a test, when to update it, when it can be deleted, etc.
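
To make ownership machine-enforceable, a simple manifest can map each test to exactly one owning team and its alert channel. The sketch below is illustrative only; the test names, teams, and channels are hypothetical, not Pinterest’s actual configuration.

```python
# Hypothetical ownership manifest: every test maps to exactly one owning team.
TEST_OWNERS = {
    "HomeFeedScrollTest": {"team": "home-feed", "alert_channel": "#home-feed-test-alerts"},
    "PinCloseupSaveTest": {"team": "closeup", "alert_channel": "#closeup-test-alerts"},
}

def owner_for(test_name: str) -> dict:
    """Look up the single team responsible for maintaining a test."""
    try:
        return TEST_OWNERS[test_name]
    except KeyError:
        # An unowned test should fail validation rather than run without an owner.
        raise ValueError(f"{test_name} has no registered owner")
```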

To ensure accountability, we leverage the ability to silence tests, meaning we ignore their failures. Every silenced test must be resolved within two weeks by the owning team.

Silencing a test means someone decided its failures can be ignored. This decision is often unambiguous, such as for an obviously outdated test. Otherwise, the test owners should have final say on whether to silence.

Test silencing creates a lifecycle that enables:

  • Rapid mitigation of failures on the main branch without fix-forward or rollback
  • A deterministic SLA for tests to be restored in the test suite

Lifecycle of a problematic test: a healthy test that starts failing or flaking is silenced after a light investigation on an urgent SLA, then fixed, stabilized, or deleted within a strict two-week SLA, returning it to a healthy status.
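
As a rough sketch of what a silencing record with this lifecycle could look like, the fields and deadline check below are hypothetical rather than our production schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RESOLUTION_SLA = timedelta(weeks=2)  # every silenced test must be resolved within two weeks

@dataclass
class SilencedTest:
    test_name: str
    owning_team: str
    silenced_at: datetime
    reason: str  # e.g. "flaky on main" or "outdated after redesign"

    @property
    def resolve_by(self) -> datetime:
        return self.silenced_at + RESOLUTION_SLA

    def is_overdue(self, now: datetime) -> bool:
        # Overdue silences are escalated to the owning team's alert channel.
        return now > self.resolve_by
```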

Challenge 2 — Balancing Speed And Cost

We run the tests on every published diff. For us, this means roughly 2.5 runs per diff. We also run the tests hourly from the main branch to act as a health check. At Pinterest, this translates to roughly 700 builds per week and 300 tests per build.

Tests on the critical path of code submission mean speed directly impacts developer velocity. We want to keep the test suite running under a reasonable time (such as 30 minutes) without incurring excessive costs.

Low Hanging Fruit

We found our speed could be improved with a number of simple test-level fixes.

  • Low timeout limits on tests (5 minutes)
  • Fast failures (e.g. no indefinite waiting or infinite scrolling if something goes wrong)
  • Simplify tests (e.g. deep link directly to the page being tested instead of navigating to it)
  • Skip retries on silenced tests (it’s being ignored, we don’t need to run it again)
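
As one example of the last point, the retry step in a test runner can consult the silence list before re-running a failure. This is a hedged sketch of that idea; run_test and is_silenced are hypothetical stand-ins for the real runner hooks.

```python
# Sketch: only retry failures that are not already silenced.
MAX_ATTEMPTS = 2

def run_with_retries(test_name: str, run_test, is_silenced) -> str:
    for attempt in range(MAX_ATTEMPTS):
        result = run_test(test_name)  # each attempt is capped by a short per-test timeout
        if result == "passed":
            return "passed"
        if is_silenced(test_name):
            return "silenced"  # the failure is being ignored, so don't spend time retrying
    return "failed"
```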

Parallelization

Running tests in parallel is critical to improving speed. However, running 300 tests in 300 shards is often prohibitively expensive. Within parallelization, there are many tools available to balance test speed and costs.

Android

We run our Android tests on Firebase Test Lab through a sharding tool called Flank. Flank supports a feature called “Smart Flank” that will intelligently shard tests based on historical run data to ensure all shards are expected to finish at the same time.
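
The core idea behind timing-based sharding can be sketched as a greedy assignment that places the longest tests first onto the currently lightest shard. This is a simplified illustration of the technique, not Flank’s actual implementation.

```python
import heapq

def shard_by_history(durations: dict, num_shards: int) -> list:
    """Greedily assign the longest tests first to the shard with the least total time."""
    shards = [[] for _ in range(num_shards)]
    heap = [(0.0, i) for i in range(num_shards)]  # (total seconds, shard index)
    heapq.heapify(heap)
    for test, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        shards[idx].append(test)
        heapq.heappush(heap, (total + seconds, idx))
    return shards

# Example: historical runtimes (seconds) balanced across 3 shards.
print(shard_by_history({"login": 300, "feed": 240, "search": 180, "profile": 120}, 3))
```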

iOS

We run our iOS tests on macOS instances in AWS EC2 through a tool called bluepill. High parallelization decreases stability as the simulators compete for system resources, causing unpredictable network instability and frame drops. To compensate for our capped parallelization, we built a custom scheduler, pinpill, on top of bluepill to more efficiently utilize available simulators as tests completed.
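
The scheduling idea (feed the next test to whichever simulator frees up first, rather than pre-assigning fixed shards) can be sketched roughly as below. This is not pinpill’s actual code, and run_on_simulator is a hypothetical placeholder.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue

def run_on_simulator(simulator_id: int, test_name: str) -> None:
    """Hypothetical placeholder for launching one test on one iOS simulator."""
    print(f"simulator {simulator_id}: running {test_name}")

def schedule(tests: list, num_simulators: int) -> None:
    work = Queue()
    for test in tests:
        work.put(test)

    def worker(simulator_id: int) -> None:
        # Each simulator pulls its next test as soon as it finishes the previous one,
        # so none sits idle while another grinds through a long pre-assigned shard.
        while True:
            try:
                test = work.get_nowait()
            except Empty:
                return
            run_on_simulator(simulator_id, test)

    with ThreadPoolExecutor(max_workers=num_simulators) as pool:
        for sim in range(num_simulators):
            pool.submit(worker, sim)
```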

Challenge 3 — Developer Experience

Running UI tests in post-submit means many developers have little experience interacting with UI tests and their results. Test flakiness also means developers may hit failures in tests completely unrelated to their domain. We invested in the developer experience of working with UI tests to minimize churn and frustration.

Debuggability

Every test result should have logs and video recordings available. Video recordings are critical to making UI tests more accessible.

Finding the test failure should be only a few clicks away. At Pinterest, we have a tool that collects a build’s test results and displays them all in one place. Test results are displayed along with:

  • Links to test artifacts
  • Test history
  • Test ownership information (Slack channel, team)
  • Button to silence the test

Screenshot of a test result summary page from Pinterest’s internal tool, showing test details, outcome, device, duration, video and log files, and overall stats such as flakiness, failure rate, and test history.

Support and Mitigation

Knowledge

Our team staffs an on-call rotation to help engineers understand UI test failures that are blocking their submissions. We host office hours and budget time for ad hoc questions and issues.

We document all critical scenarios engineers will encounter, such as:

  • When to silence a test
  • How to investigate a failure, and common failures
  • How to handle false negatives, determine next steps, and recycle your previous run

Result Caching

Inevitably, an engineer will be wrongfully blocked at some point. The aforementioned “recycle” mechanism lets them skip unnecessary reruns: if a test fails on a build and is determined to be a false negative, that test can and should be silenced. On re-run, every previous failure is now silenced, so we can skip the entire test run and recycle the earlier result.
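
A minimal sketch of the recycle check, assuming we can compare the previous run’s failures against the current silence list (the data shapes here are assumptions, not our exact implementation):

```python
def can_recycle_previous_run(previous_failures: set, silenced_tests: set) -> bool:
    """Skip the rerun if every previously failing test has since been silenced."""
    return previous_failures <= silenced_tests  # subset check

# Example: the one flaky failure was silenced after investigation, so the rerun is skipped.
print(can_recycle_previous_run({"HomeFeedScrollTest"}, {"HomeFeedScrollTest", "OldSearchTest"}))
```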

This serves a second purpose of keeping teams accountable for their tests. If a team’s tests suddenly spike in flakiness and block engineers, those tests will be silenced, forcing the team to investigate the cause.

Challenge 4 — Insulating the Main Branch from Failures

Experiment Ramps

Our client-side A/B test experiments are snapshotted every 30 minutes and merged into the application code. This creates a hermetic experiment state in source control.

By running the tests before these snapshot updates land, we effectively test “pre-submit” as the experiments ramp up and down. We recommend blocking an experiment ramp if the tests fail.

This pattern should be applied to any automated commit that risks breakages.
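
In practice, this can look like a small gate in front of any automated commit: run the UI suite against the candidate snapshot and refuse to land it on failure. The sketch below is illustrative, and every function name in it is hypothetical.

```python
def gate_automated_commit(candidate_diff, run_ui_suite, land_diff, notify_owner) -> bool:
    """Block an automated commit (e.g. an experiment snapshot) when the UI suite fails."""
    result = run_ui_suite(candidate_diff)  # the same suite developers run pre-submit
    if result.passed:
        land_diff(candidate_diff)
        return True
    notify_owner(candidate_diff, result.failures)  # the ramp is held until failures are resolved
    return False
```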

Stability Enforcer

Stability Enforcer is a tool that automatically silences tests when they show signs of instability. For example, if a test becomes more than 20% flaky over its last 20 runs, it will be silenced. If it fails consistently, it will not be silenced, since that points to a real breakage rather than flakiness.
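
The silencing rule can be sketched as a simple check over a test’s recent run history. The thresholds below mirror the example above, but the helper itself is a hypothetical sketch, not Stability Enforcer’s actual code.

```python
WINDOW = 20             # look at the last 20 runs
FLAKY_THRESHOLD = 0.20  # silence when more than 20% of recent runs were flaky

def should_auto_silence(recent_results: list) -> bool:
    """recent_results holds outcomes such as "passed", "failed", or "flaky" (newest last)."""
    window = recent_results[-WINDOW:]
    if not window:
        return False
    if all(r == "failed" for r in window):
        return False  # a consistent failure signals a real breakage, not flakiness
    flaky_rate = sum(r == "flaky" for r in window) / len(window)
    return flaky_rate > FLAKY_THRESHOLD
```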

This helps to lighten the load of monitoring issues, and prevent excessive flakiness from reaching the developers.

Monitoring

Our on-call rotation monitors all failures on the main branch during business hours. These failures aren’t always tied to specific commits, so pre-submit tests can’t always prevent them. Common culprits are API outages, server response changes, and tool failures. Our team helps to triage or resolve these issues to minimize the risk of blocking developers.

Metrics

To make sure the test suite stays healthy as tests change and the app evolves, a few key metrics must be tracked and maintained.

Main Branch Pass Rate %

This is the core metric that indicates how often you will wrongfully block a developer’s change.

We recommend counting only test runs during business hours, to avoid over-indexing on overnight or weekend failures that would distort your success rate.

In addition to test stability, this measures how effectively your process responds to failures.
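
A hedged sketch of how this metric could be computed, counting only runs started during business hours (the run records and the hour boundaries here are assumptions, not our exact definition):

```python
from datetime import datetime

def main_branch_pass_rate(runs: list) -> float:
    """Pass rate over main-branch runs, counting only business hours (Mon-Fri, 9:00-18:00)."""
    def in_business_hours(started_at: datetime) -> bool:
        return started_at.weekday() < 5 and 9 <= started_at.hour < 18

    counted = [r for r in runs if in_business_hours(r["started_at"])]
    if not counted:
        return 1.0  # no business-hours runs in the window
    return sum(r["passed"] for r in counted) / len(counted)
```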

Test Speed

We recommend using a P90 time to ensure developers have a consistent experience that doesn’t leave them waiting around for a build to finish. Our initial goal was 30 minutes, end-to-end. We think a lower time is possible with more investment.

Silenced Tests

With an unstable test suite, many tests can end up silenced. You should never have a large percentage of your tests silenced. A high volume of silenced tests can imply:

  • Your engineering org is not equipped with the right tools or guidelines to write stable tests
  • Teams are not meeting their SLA to resolve failures
  • Too few people own too many tests and they can’t keep up with them all

Number of Tests

The test suite will grow as the complexity of your application grows. Tracking the test suite size will help you get ahead of issues with scaling and maintainability.

Having too many UI tests is generally not a good sign. UI tests are effective as a last line of defense to catch obvious failures, but unlike other forms of testing, they are too slow and expensive to build for every permutation of user actions.

Ramping Up

Flipping the switch from post-submit to pre-submit is a big risk to productivity and a potential source of churn, so it should be approached carefully. We recommend the following phases:

Polish your tests and your process

Spend a few weeks being extra diligent about test quality and reliability. Find and resolve every issue you see with your test runs.

Dry-run your process for catching and resolving failures ASAP. After the switch to pre-submit, the load will be noticeably lighter and distributed across other teams, but initially you should do the legwork to make sure the process runs smoothly.

Communicate to ensure teams that own tests are aware of the stricter SLA after the switch to pre-submit. This gives them a chance to raise concerns, delete tests, etc.

Opt-In

Work with a small number of teams and run the pre-submit test suite on diffs submitted by those team members.

Write notifications or documentation to inform engineers their diffs are running the UI test suite, and link them to relevant resources.

We ramped up the opt-in phase to 10–15% of all diffs for 2 weeks before moving on.

Opt-Out

Similar to force submitting, build a mechanism to bypass the tests if something goes wrong. Make sure engineers are aware of this mechanism and when it’s appropriate to use it. Now, you can flip the switch: Run pre-submit tests on all changes, and turn off your post-submit runs.

Now that failures surface on staged diffs, you lose some visibility into whether the main branch itself is stable. Failures on the main branch (for example, from the hourly health-check runs) are addressed urgently, as they indicate flakiness or breakages.

Conclusion

Before this change, we struggled to keep a pass rate above 50% for our test suite. It was a constant game of catch-up, finding issues, identifying owners, and debugging. Now, we’re around a 90% pass rate, with a significantly lighter on-call burden.

Moving forward, there’s still a lot of work to be done. False positives and slow tests are a big risk to developer velocity. Some things we’re considering for next steps:

  • Run UI tests selectively based on changed code
  • Mock or control API responses to limit test variance
  • Process test failure logs to detect infrastructure-level trends and issues

Acknowledgements

  • Alice Yang, Doruk Korkmaz, Freddy Montano, Jennifer Uvina, Joseph Smalls-Mantey, Matt Mo, and Ryan Cooke: For help building Pinterest testing infrastructure to support this project, and for the countless discussions on how to design our process.
  • Sha Sha Chu and Garrett Moon: For their support introducing this change to our Android and iOS teams, and for keeping us accountable to a good developer experience.
  • Maintainers of bluepill and Flank: For building the tools that made this possible, and for their support with feature requests and bugs
  • The Firebase community Slack: For help investigating and designing around infrastructure issues

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog, and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.
