Design your automated tests to scale for keeps

Paving the way for a shared culture of technical excellence

--

Early on in a person’s QA career, they’ll quickly discover a truth: E2E tests are like a Swiss army knife. Easy to write, and mostly free of dependencies on developer time, they’re a very convenient way of ensuring release health.

They start developing a set of automated E2E tests. From there, test coverage grows and the team gets used to continuous feedback. Eventually they normalize the practice and convert manual test plans into automated E2E flows.

It’s tempting to cover features this way. This is especially true when a team decides to prioritize commitments to feature development over the continuous consideration of technical debt and testing overheads. It’s sometimes easiest for a team to independently increase E2E coverage as a way of creating missing safety measures.

The problems start with maintenance, especially when failing tests block deployments. Keeping a growing E2E suite healthy is tricky on a budget, and with just a few people who have the skills to keep things going.

A traditional workaround is to hire. I believe that’s not the solution we should seek. To me, the key to scaling quality sustainably is to create a lived culture of technical excellence.

I’m writing this article as a sequel to Recoding Engineering Culture, which is all about the early stages of the Quality Engineering roll-out at AMBOSS.

In this second part, I’m detailing why we decided to change course, and how we’re moving E2E tests down the Test Pyramid.

A little more than a year into scaling the QE roll-out to AMBOSS teams, we were already in a better position. In Q2 of 2021, we had some interesting numbers to show.

We had rounded up about 280 Cypress E2E tests, covering mostly frontend and backend flows and services. The E2E stage duration on CI was ~18 minutes; because Cypress retries failed tests automatically, a failure would push run time to 45 minutes or more. After integrating Cypress Dashboard for both frontend and backend pipelines to help with parallelization and load balancing, we eventually peaked at 140k monthly test runs.

An average Jenkins job in our frontend pipeline could take 1.5 hours; backend pipelines could take up to 2 hours. Unit test coverage in these domains was at about 30%. We had just started figuring out integration tests.

In terms of support requests in Slack, the team often couldn’t catch a break: build and test failures, fixing flaky tests, helping developers figure out what went wrong, catching actual bugs. The high level of ad hoc maintenance work proved detrimental to team morale and mental health.

Failing tests were often just commented out to unblock people. Even creating a one-stop-shop quality knowledge base didn’t ease maintenance requests much. AMBOSS currently employs about 100 engineers, after all.

While this metaphor has been used before, I’ve come to understand just how directly E2E testing efforts mirror the life cycle of a Xenomorph, the main monster of the 1979 film Alien, a creature originally created by Swiss painter, sculptor, and designer H.R. Giger.

In the beginning they’re relatively easy to handle. Then they start to evolve, and with their increasing size and complexity, your effort, frustration, and anxiety grow too.

It gets harder to keep on top of things, but you can mostly manage. Until you can’t and you lose control — culminating in you hanging onto the step of a metal ladder by the strength of your elbow joint, tested against the sheer force of the vacuum of space.

In the end you eject the Xenomorph Queen, aka you take care of the monstrous E2E maintenance overhead. But victory comes at a cost: burnt-out developers, blocked teams, resentment, and a general feeling of dread permeating the atmosphere.

It was obvious this wasn’t sustainable, but what caused the situation hadn’t yet changed.

Addressing our inverted test pyramid was not a new idea for us, but we were lacking the disposition, resources, and skills to make it happen. Eventually the stars aligned. With a few new key hires on board, we kicked off an improvement project that would eventually turn out to be a game changer.

We started small with just two teams. The initial goal was to assess and improve our E2E repository to help out our frontend teams.

To do that, we needed information. Here’s what we looked into when assessing the end-to-end repository:

  • Why Docker image build times on CI were so high
  • Why running linters on CI took so long
  • The pipeline configuration and its individual build steps
  • Our DataDog dashboards
  • Build and test metrics for trunk branches, which we first had to create (a sketch of how such metrics could be pulled from Jenkins follows below)
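
On that last point, here’s a rough sketch of how trunk build durations could be pulled from the Jenkins JSON API. The host, job name, and credential handling are placeholders rather than our actual setup, and the script assumes Node 18+ for global fetch:

    // collect-build-metrics.ts: a rough sketch (host, job name, and auth are placeholders).
    // Pulls recent build durations for a trunk pipeline from the Jenkins JSON API
    // so they can be charted or pushed to a dashboard.

    const JENKINS_URL = 'https://jenkins.example.com';
    const JOB = 'frontend-trunk';
    // Base64-encoded "user:apiToken", injected via the environment.
    const AUTH = process.env.JENKINS_BASIC_AUTH ?? '';

    interface JenkinsBuild {
      number: number;
      result: string | null; // null while the build is still running
      duration: number;      // milliseconds
    }

    async function fetchTrunkBuilds(): Promise<JenkinsBuild[]> {
      const res = await fetch(
        `${JENKINS_URL}/job/${JOB}/api/json?tree=builds[number,result,duration]`,
        { headers: { Authorization: `Basic ${AUTH}` } },
      );
      const body = (await res.json()) as { builds: JenkinsBuild[] };
      return body.builds;
    }

    fetchTrunkBuilds().then((builds) => {
      const finished = builds.filter((b) => b.result !== null);
      const avgMinutes =
        finished.reduce((sum, b) => sum + b.duration, 0) / finished.length / 60000;
      console.log(`Average duration over ${finished.length} builds: ${avgMinutes.toFixed(1)} min`);
    });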

Some findings we could easily address, like upgrading our AWS machines and cutting down on automatic retries. More troublingly, we found high median flakiness rates and slow run times. We also found big chunks of untestable code, making E2E tests the only option short of refactoring.
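
For reference, Cypress retries are configured per run mode, so cutting them down is a small config change. A minimal sketch, assuming the Cypress 10+ config format (at the time the same option lived in cypress.json), with illustrative values:

    // cypress.config.ts: a minimal sketch, not our exact configuration.
    import { defineConfig } from 'cypress';

    export default defineConfig({
      retries: {
        runMode: 1,  // a single retry on CI, so flakiness surfaces instead of being masked
        openMode: 0, // no automatic retries during local, interactive runs
      },
    });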

Once we had our data, we set up a semi-regular check-in with stakeholders: Engineering Managers of frontend-facing product teams, two QE team members, and our CTO. We discussed the current situation, looked at our troubling findings, and spent some time investigating further.

Finally we confidently committed to a new approach: Shift Left Testing.

Specifically, we started out with Incremental Shift Left Testing, meant to correct our overemphasis on E2E tests.

We created a framework for how we would address the issues, and a rationale for why we decided to work on the different tasks involved (a sketch of how the outcome of this assessment could be tracked follows the list):

  • Map existing tests against AMBOSS features
  • Identify feature ownership by team
  • Define a global critical path for AMBOSS
  • Define p0 and p1 issues by team
  • Assess tests (do they actually test the feature; do they adhere to our best practices; are they flaky or slow; what are their dependencies)
  • Decide what to do (keep as is; rewrite; deprecate; make part of a smoke suite)
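
To make the assessment concrete, here’s a sketch of the shape such a test inventory could take. The field names are illustrative, not our actual tooling:

    // One record per E2E test, capturing the assessment criteria above.
    type Verdict = 'keep' | 'rewrite' | 'deprecate' | 'smoke-suite';

    interface E2eTestAssessment {
      spec: string;                  // e.g. a Cypress spec file path
      feature: string;               // the AMBOSS feature it maps to
      owningTeam: string;            // team responsible for that feature
      onCriticalPath: boolean;       // part of the global critical path?
      priority: 'p0' | 'p1' | null;  // per-team issue priority, if any
      coversFeature: boolean;        // does it actually test the feature?
      followsBestPractices: boolean;
      flakyOrSlow: boolean;
      dependencies: string[];        // services, fixtures, other teams
      verdict: Verdict;              // what we decided to do with it
    }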

Since we were missing some skills, we made sure to offer training. We got our Frontenders access to Testing JavaScript with Kent C. Dodds, which enabled them to write unit, component, and integration tests.

To increase Cypress knowledge, we hosted a workshop series that taught developers how to write, run, and debug tests. One of the major enablers of this project also hosted a much-requested workshop teaching developers what unit, component, integration, and contract tests are, when they’re used, and how they’re written.
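
To give a flavor of those lower levels, here’s a minimal sketch of a unit test and a component test, assuming a React frontend with Jest, React Testing Library, and @testing-library/jest-dom. The function and component are made up for illustration:

    import React, { useState } from 'react';
    import { render, screen } from '@testing-library/react';
    import userEvent from '@testing-library/user-event';
    import '@testing-library/jest-dom';

    // Unit level: a pure function tested in isolation.
    export const formatScore = (correct: number, total: number): string =>
      total === 0 ? 'n/a' : `${Math.round((correct / total) * 100)}%`;

    test('formats a quiz score as a percentage', () => {
      expect(formatScore(3, 4)).toBe('75%');
      expect(formatScore(0, 0)).toBe('n/a');
    });

    // Component level: render a small component and interact with it the way a
    // user would, without spinning up the whole app or a real browser session.
    const AnswerCard = ({ answer }: { answer: string }) => {
      const [revealed, setRevealed] = useState(false);
      return (
        <div>
          <button onClick={() => setRevealed(true)}>Show answer</button>
          {revealed && <p>{answer}</p>}
        </div>
      );
    };

    test('reveals the answer on click', async () => {
      render(<AnswerCard answer="The mitral valve" />);
      await userEvent.click(screen.getByRole('button', { name: /show answer/i }));
      expect(screen.getByText('The mitral valve')).toBeInTheDocument();
    });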

The guilds involved created and committed to testing best practices for all levels of tests in their domains. Suddenly, we had momentum to keep pushing quality and technical excellence topics on multiple levels within Engineering. Guided by those best practices, the teams’ Definition of Ready (DoR) and Definition of Done (DoD) were adjusted accordingly, and habits slowly started to reform.

The QE team got busy addressing the flakiness and slowness of existing tests. We created stubs and mocks where feasible, and only where they didn’t undermine the validity of test results. Large test suites were split up and parallelized. With the exception of commits to our trunk branches, tests were made skippable to avoid unnecessary test runs.
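
As an example of that stubbing, here’s a sketch of a Cypress test that replaces a network call with a fixture via cy.intercept. The route, fixture, and page used here are placeholders:

    describe('search results page', () => {
      it('renders results from a stubbed search API', () => {
        // Serve a fixture instead of hitting the real backend.
        cy.intercept('GET', '**/api/search*', { fixture: 'search-results.json' }).as('search');

        cy.visit('/search?q=anatomy');
        cy.wait('@search');

        // The UI is exercised for real, but the slow or flaky backend call is not.
        cy.contains('h1', /results/i).should('be.visible');
      });
    });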

As we went about untangling feature ownership, our mission began to grow. Wherever we looked, we found dependencies on other people or teams. As a result, more stakeholders became involved in our check-ins, and over the course of 2021 we scaled our new approach to almost all of the product teams at AMBOSS.

It was when we started assessing existing tests and looking into what could be moved down the pyramid that it got complicated.

Remember that I said we found big chunks of untestable code?

While some tests could be moved easily, others came with potentially significant commitments, such as refactoring large parts of our codebases. This problem doesn’t have an acceptable quick fix. When teams have lacked the direction and the means to commit to writing testable code over long stretches of time, things just accumulate.
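
To illustrate what “moving a test down” can require, here’s a hypothetical before-and-after: a rule buried in a UI handler is only reachable through an E2E flow, so it gets extracted into a pure function that a fast unit test can cover instead. None of this is our actual code:

    // Before: the decision is tangled into the handler, so only an E2E test can reach it.
    // async function onOpenArticle() {
    //   const user = await fetchCurrentUser();
    //   if (user.plan === 'trial' && user.articlesReadToday >= 3) showPaywall();
    //   else openArticle();
    // }

    // After: the rule is a pure, testable function; the handler just wires it up.
    export interface ReaderState {
      plan: 'trial' | 'premium';
      articlesReadToday: number;
    }

    export const shouldShowPaywall = (state: ReaderState): boolean =>
      state.plan === 'trial' && state.articlesReadToday >= 3;

    // A unit test now covers the rule in milliseconds.
    test('trial users hit the paywall after three articles', () => {
      expect(shouldShowPaywall({ plan: 'trial', articlesReadToday: 3 })).toBe(true);
      expect(shouldShowPaywall({ plan: 'premium', articlesReadToday: 10 })).toBe(false);
    });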

We still maintain legacy code, though we are certainly working on it. Any fellow software developer might now start shrieking or ducking for cover. That’s a natural reaction, but without choosing to refactor, we will eventually hit a roadblock.

This is also where the conversation shifts from a technical topic to technical excellence.

It’s a vague term, so here’s my definition. Technical excellence is a lived expression of a culture based on industry best practice, for example The Test Pyramid, Extreme Programming, Clean Code, Clean Architecture, Accelerate, or Team Topologies. It’s also about mindset, one where staying current and learning new skills are valued.

A big pillar of technical excellence is relational and social. After all, culture is the sum of the ideas, customs, and social behaviors of a particular people or society. The goal is thus always to establish discourse between actors within the society.

When code is symptomatic of culture, and Engineering is seen as a function of Product, realignment becomes necessary. Cross-functional processes might need to be redefined, roles and responsibilities cleared up, shared roadmaps and commitments created. These things take time.

They say to end things on a positive note, so I’m proudly concluding this post by showing off some of our achievements!

Overall, we’ve reduced our E2E test suite size by a good third, deprecating 66 tests while increasing coverage on other levels. We’ve sped up our deploy pipelines by an average of about 40%.

We’ve stabilized and deflaked our remaining E2E tests, reducing our median failure rate to 2%. Tests now consistently run within 7 minutes.

One major change was increasing the resources of a database, which turned out to be the bottleneck and the single biggest source of flakiness in our E2E tests. Tests with a dependency on that database now run successfully nearly 100% of the time, unless they actually catch a bug somewhere.

Going forward we’re continuing to correct our overemphasis on E2E tests, improving pipelines and infrastructure, and investing in covering the critical parts of our service layer with contract tests using pact.io.
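
For a taste of what that looks like, here’s a rough sketch of a consumer-side contract test with pact-js, assuming @pact-foundation/pact v10+. The service names, provider state, and endpoint are made up:

    import { PactV3, MatchersV3 } from '@pact-foundation/pact';

    const provider = new PactV3({
      consumer: 'web-frontend',
      provider: 'user-service',
    });

    test('fetches a user profile according to the contract', async () => {
      provider
        .given('a user with id 42 exists')
        .uponReceiving('a request for user 42')
        .withRequest({ method: 'GET', path: '/users/42' })
        .willRespondWith({
          status: 200,
          headers: { 'Content-Type': 'application/json' },
          body: MatchersV3.like({ id: 42, name: 'Jane Doe' }),
        });

      // The consumer code runs against a Pact mock server; the resulting pact file
      // is later verified against the real provider.
      await provider.executeTest(async (mockServer) => {
        const res = await fetch(`${mockServer.url}/users/42`);
        const user = await res.json();
        expect(user.id).toBe(42);
      });
    });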

Last but not least, we’re looking to implement a better static analysis tool and other solutions centered around technical and operational excellence, as well as DORA metrics, and maybe even fitness functions. Stay tuned!
