Building the right Quality Strategy at a startup with limited resources

Adi Ben Dayan · Published in Remitly Israel (formerly Rewire) · 9 min read · Mar 10, 2022

Rewire’s Quality Strategy

Written by: Adi Ben Dayan, VP R&D and Co-Founder @ Rewire (ex-Microsoft, ex-8200)

Photo by ThisisEngineering RAEng on Unsplash

**Having a clear Quality Strategy is key for achieving a high-quality product**

Your customers expect nothing less than a smooth, error-free experience. To achieve that, it is critical for you to have a high-quality product. Resources are limited and you need to make sure you spend them (and your organization’s attention) on the right things. This is especially true in areas where there are so many ways to go and so many different approaches for achieving the same goal.

However, Quality Assurance, automation maintenance, and monitoring efforts can require lots of resources.

So, how do you invest the right amount of effort for your needs?
You build a Quality Strategy.

Take into account your product’s needs and the company’s capabilities to achieve an actionable solution

Like everything in life, you will need to prioritize your efforts, budget, and quality goals. In this post you’ll find guiding questions that can help you decide where to focus.

First of all, make sure to understand the unique needs of your product and the capabilities of your company.

Product’s unique needs:

Which areas will have the most impact (user frustration, lost revenue, etc.) if they have bugs or fail?
Are there flows that endanger your company? (For example, relating to license, compliance, reputation, or fraud.)
In which flows will a failure result in completely losing the customer, and in which ones is there more grace?
Do you have flows that are geographically oriented? Are specific browsers or devices more critical than others?
Is it more important to cover the backend processes or the frontend flows? Why?

Company’s capabilities:

Can you afford a large-enough manual testing group? Will it provide quicker and better results than investing in automation?
Do you have enough engineering resources and capacity to build and maintain automation infrastructure (mocks, sandboxes, test environments, devices orchestration, CI, etc.)?
Is it currently more important to deliver fast and (possibly) lower quality results or do you have to improve your quality immediately / have super high quality from the start?
Do you have a strong BI (Business Intelligence) team that you can rely on for alerting and anomaly detection?
Is it possible, and can you afford, to test some or all of the flows in production? Would that be done by manual testers or by automation?

At Rewire, we call the most impactful flows critical flows — they get the most attention.

This list will help you prioritize your work and decide which flows should be tested on every commit, which will be on the nightly suite, and which ones will be tested (only) once every few weeks. This also helps decide which flows get ‘full test coverage’ and which flows can get less attention.
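As a rough illustration of turning such a prioritized list into test cadences, here is a minimal sketch. The flow names, the 1 to 5 impact scores, and the thresholds are all hypothetical and are not Rewire’s actual values; the point is only that the mapping from impact to cadence can be made explicit and reviewed.

```python
from enum import Enum


class Cadence(Enum):
    EVERY_COMMIT = "every commit"
    NIGHTLY = "nightly"
    PERIODIC = "every few weeks"


# Hypothetical impact scores (1 = low, 5 = high), taken from your own
# answers to the guiding questions above.
FLOW_IMPACT = {
    "send_money": 5,
    "signup": 4,
    "edit_profile": 2,
    "export_statement": 1,
}


def cadence_for(impact: int) -> Cadence:
    """Map an impact score to a test cadence (thresholds are assumptions)."""
    if impact >= 5:
        return Cadence.EVERY_COMMIT
    if impact >= 3:
        return Cadence.NIGHTLY
    return Cadence.PERIODIC


def build_test_plan(flow_impact: dict) -> dict:
    """Group flow names by the cadence at which they should be tested."""
    plan = {cadence: [] for cadence in Cadence}
    for flow, impact in sorted(flow_impact.items()):
        plan[cadence_for(impact)].append(flow)
    return plan


if __name__ == "__main__":
    for cadence, flows in build_test_plan(FLOW_IMPACT).items():
        print(f"{cadence.value}: {', '.join(flows)}")
```

Keeping the thresholds in one function makes it cheap to revisit them as the company’s capabilities change.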

Also, on new (and major changes to) flows with higher impact you will want a gradual rollout, among other defensive measures. At Rewire, every critical flow we release is controlled by a feature flag, which allows us to gradually roll it out while closely monitoring it and checking the feature’s monitoring and alerting.
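One common way to implement a gradual rollout behind a feature flag is to hash each user into a stable bucket and enable the flag only for buckets below the current rollout percentage. The sketch below assumes that approach; the function names and the flag name are hypothetical, and it is not a description of Rewire’s actual feature-flag system.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if the flag is on for this user at the given percentage.

    A user's bucket never changes, so raising rollout_percent only ever
    adds users; nobody who already has the feature loses it.
    """
    return rollout_bucket(flag_name, user_id) < rollout_percent


if __name__ == "__main__":
    # Hypothetical flag name and user IDs, for illustration only.
    users = [f"user-{i}" for i in range(1000)]
    for percent in (1, 10, 50, 100):
        enabled = sum(is_enabled("new_transfer_flow", u, percent) for u in users)
        print(f"{percent}% rollout -> {enabled} of {len(users)} users")
```

Including the flag name in the hash means different flags expose different subsets of users, so the same early adopters are not hit by every new release.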

Detection and Prevention — two sides of the same coin

OK, so you’ve identified and prioritized the critical flows that you want to cover. Now it is time to decide how you measure and cover them.

At this point you want to define clear procedures and testing requirements for every type of flow and at every stage of a bug’s life cycle.

At Rewire we split the bug’s life cycle into two categories:

Detection: detecting an escaped bug (a bug that was deployed to production and affects real users).
Prevention: detecting a bug in the dev environment, either on a feature branch (before it reaches the master branch) or on master (before it reaches production). This is also known as “shift left” testing.

To reduce the impact of a bug to zero, we clearly want to find it before it reaches production. But we know that truly achieving zero bugs in production is either impossible or takes too much effort (especially if you release on a daily basis). Instead, we create safety nets (automation, alerts, and procedures) to detect bugs in production as soon as possible and to make sure these escaped bugs affect only a fraction of users. Of course, before we deploy a new flow, we make sure we have addressed both the detection and the prevention aspects of that flow (based on priority and impact).

Examples of procedures for the critical flows at Rewire:

Every critical flow will be covered by automation tests + End-to-End/E2E tests (including mocks / sandboxes) that run, at least, on a nightly basis
Every critical flow will have a clear monitoring plan + BI dashboards and alerts to cover anomalies
Every critical flow which is geolocation-specific will be tested by a local manual tester at X locations in the world

These are examples of the heaviest procedures that we have, and we will not be able to support them for every side flow of the product. This is a good example of why we assessed the company’s capabilities. At bigger companies you can define and maintain these processes for more scenarios.

We believe that breaking down prevention and detection separately for bugs and following deeper processes for critical flows is a very pragmatic approach that allows us to move fast, deploy frequently, and still enjoy low impact of bugs.

Implementation with two teams

We decided to create two different teams who collaborate very closely:

1. The QA team, responsible for prevention: Automations, Quality Plans, and Test Plans.

2. The Tech Support team, responsible for detection: Monitoring Plans, Alerts, Triaging, and Retrospectives.

This allowed the leads of the teams to develop deep proficiencies in their specific scope and build teams that can effectively move the needle.

Define and measure your key quality metrics and bug priority definitions

So far you have identified the priority of the flows and decided how to detect and prevent issues in every flow (or at least the critical flows). How can you tell if you are on the right path?

Measure and tag, the sooner the better.

Every step you make takes resources, time, and organizational focus, so you want to be sure you are heading in the right direction. My recommendation is to begin by measuring three key metrics:

Number of (different) escaped bugs
Number of impacted customers
Bug’s priority (tagging)

These three metrics can create a good baseline for most of the actions you will take. If you can, try to track these before making any organizational or procedural changes, so you will have that clear point of comparison.

Prevention (QA) Key Metrics — escaped bugs and their impact

Now for the most important part — the ongoing tracking of our work. We need to remember that Quality work is something that you could spend infinite resources on, but does not directly generate revenue. Every time you add a procedure or requirement, you are slowing down your organization. Thus, the quality strategy should always be challenged and you need to continuously reaffirm that you are not creating ineffective procedures.

The most important key metrics are (1) measuring how many unique bugs escape to production and (2) how many customers were impacted by every bug. These two clear metrics can help you track whether your strategy is effective, and whether specific changes you applied move the needle or ought to be scrapped.

In case you do not have enough manpower to tag and measure every bug, focus on the bugs which are defined as ‘blockers’, or bugs that are on important flows only.
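To make these metrics concrete, here is a minimal sketch of computing them from tagged bug reports. The record shape (`bug_id`, `priority`, `customer_ids`) and the priority labels are assumptions for illustration; in practice this data would come from your own bug tracker and BI system.

```python
from dataclasses import dataclass, field


@dataclass
class EscapedBugReport:
    """One report of a bug seen in production (hypothetical record shape)."""
    bug_id: str
    priority: str  # e.g. "blocker", "major", "minor"
    customer_ids: set = field(default_factory=set)


def summarize_escaped_bugs(reports, only_priorities=None):
    """Count unique escaped bugs and the customers each one impacted.

    Pass only_priorities={"blocker"} to track blockers only, for example
    when there is not enough capacity to tag every bug.
    """
    per_bug = {}
    for report in reports:
        if only_priorities is not None and report.priority not in only_priorities:
            continue
        # Several reports can describe the same bug, so merge their customers.
        per_bug.setdefault(report.bug_id, set()).update(report.customer_ids)
    all_customers = set().union(*per_bug.values())
    return {
        "unique_escaped_bugs": len(per_bug),
        "impacted_customers_per_bug": {b: len(c) for b, c in per_bug.items()},
        "impacted_customers_total": len(all_customers),
    }


if __name__ == "__main__":
    reports = [
        EscapedBugReport("BUG-1", "blocker", {"alice", "bob"}),
        EscapedBugReport("BUG-1", "blocker", {"bob", "carol"}),
        EscapedBugReport("BUG-2", "minor", {"carol"}),
    ]
    print(summarize_escaped_bugs(reports))
    print(summarize_escaped_bugs(reports, only_priorities={"blocker"}))
```

Computing the same summary before and after a procedural change gives you the point of comparison described above.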

Detection (Tech Support) Key Metrics: time-to-detect, time-to-triage, time-to-resolve

Bugs escape — this is the harsh reality of software engineering and even more so for SaaS (Software as a Service) solutions that change every day.

You need your strategy to include handling escaped bugs. The idea is to put into place processes and tools to reduce the number of customers impacted by a bug in the wild.

In order to reduce the impacted users you want to measure and improve on these 3 key metrics:

Time to detect — the time between when a bug was released to production, and when it was first identified by your team.
Time to triage — the time between the first identification that there is a bug and when you know how to classify it and understand whether you need to take immediate action.
Time to resolve — the time that it took you to fix the bug and deploy the fix to production.
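The three metrics above can be derived from four timestamps per bug. The sketch below is one possible way to compute them; the field names are hypothetical, and it assumes that time to resolve is measured from the end of triage until the fix is deployed, which the definitions above do not pin down.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BugTimeline:
    """Key timestamps in the life of one escaped bug (hypothetical fields)."""
    released_at: datetime   # the buggy change reached production
    detected_at: datetime   # the team first identified the bug
    triaged_at: datetime    # the bug was classified and urgency decided
    resolved_at: datetime   # the fix was deployed to production

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.released_at

    @property
    def time_to_triage(self) -> timedelta:
        return self.triaged_at - self.detected_at

    @property
    def time_to_resolve(self) -> timedelta:
        # Assumption: measured from the end of triage to the deployed fix.
        return self.resolved_at - self.triaged_at


def average(deltas) -> timedelta:
    """Average a non-empty collection of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


if __name__ == "__main__":
    bug = BugTimeline(
        released_at=datetime(2022, 3, 10, 9, 0),
        detected_at=datetime(2022, 3, 10, 9, 30),
        triaged_at=datetime(2022, 3, 10, 10, 0),
        resolved_at=datetime(2022, 3, 10, 13, 0),
    )
    print(bug.time_to_detect, bug.time_to_triage, bug.time_to_resolve)
```

Tracking the averages over time, rather than individual incidents, shows whether changes to monitoring or procedures are actually shortening the window in which customers are exposed.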

At Rewire, we use two specific tools to reduce the time to detect:

Massive monitoring and alerting, using both specialized tools like Grafana and Heap Analytics and our main BI system (Looker). We use Looker by defining (as part of the Quality plan) a set of axioms: conditions that should never occur in production. We set an alert for each of these (relatively trivial) axioms, and this helps us capture a large variety of issues.
For every important flow we maintain a dashboard that helps us quickly identify trends (and irregularities) once we suspect something is wrong, or once we get a general alert.
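The axiom idea can be illustrated outside of any specific BI tool. The sketch below is plain Python rather than Looker, and the two axioms and the record fields are invented for illustration; it only shows the shape of the technique: each axiom is a simple predicate, and any record that matches one should raise an alert.

```python
from typing import Callable, Dict, Iterable, List

# An axiom check returns True when a record VIOLATES the axiom.
AxiomCheck = Callable[[dict], bool]

# Hypothetical axioms: conditions that should never occur in production.
AXIOMS: Dict[str, AxiomCheck] = {
    "transfer_amount_is_never_negative": lambda t: t["amount"] < 0,
    "completed_transfer_always_has_payout_ref": (
        lambda t: t["status"] == "completed" and not t.get("payout_ref")
    ),
}


def find_violations(
    records: Iterable[dict], axioms: Dict[str, AxiomCheck] = AXIOMS
) -> Dict[str, List[str]]:
    """Return, per violated axiom, the IDs of the offending records."""
    records = list(records)
    violations: Dict[str, List[str]] = {}
    for name, is_violated in axioms.items():
        bad_ids = [r["id"] for r in records if is_violated(r)]
        if bad_ids:
            violations[name] = bad_ids
    return violations


if __name__ == "__main__":
    sample = [
        {"id": "t1", "amount": 100, "status": "completed", "payout_ref": "p-1"},
        {"id": "t2", "amount": -5, "status": "pending"},
        {"id": "t3", "amount": 40, "status": "completed"},
    ]
    for axiom, ids in find_violations(sample).items():
        print(f"ALERT {axiom}: {ids}")
```

Because each axiom is trivial on its own, adding a new one for a new flow costs little, which is what lets a large set of them catch a wide variety of issues.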

Improving Quality: Automation — Key for scaling

Automate. The key to scale here is to define the exact amount of effort, and the timelines to deliver the automation. There are many aspects to it and many tools to consider. Defining your automation roadmap can help you understand when you will be able to invest less in manual testing.

The guidelines here for Rewire were:

a. Decide which flows we test on the web and which on the mobile app

b. Decide which flows will be covered in a development environment and which on production

c. Decide for which flows it will be cost-effective to cover the error cases with E2E tests and in which scenarios we should focus only on the happy paths

Eventually, we decided to invest a lot in automation. Once we had the basic product running, we understood we would not be able to sustain the costs of manual testing for all our flows and sub-flows. This is especially true when you have lots of external integrations and many different flows that require significant setup before you start the test. We were not able to do it manually in an effective way. Both the web and the mobile apps are covered by E2E tests, as well as unit tests and integration tests. On the other hand, we identified flows that are specific to certain geographic areas, and we use local testers to help us test them.

Improving Quality: Region specific testing (Applause, Beta testers program)

Another important tool we decided to invest in is real users who test our beta products. We have a closed circle of users and testers who help us detect issues in both the pre-production and production environments. As these are the most expensive tests to run, you need to decide in advance which flows should be tested in what way (and why). You can also utilize this approach as an interim solution that allows you to ship flows before investing the full effort into automation. This can help decrease time-to-deliver and can also be useful in testing experimental flows.

Drawing your organizational Quality Map:

Summary and key takeaways

Setting and maintaining a high Quality bar is difficult, and can sometimes unnecessarily slow you down.

So how do you invest the right amount of effort for your needs?
You build a Quality Strategy.

Make sure you are clear about your goals and intentions and that you have a clear strategy that will help you and your team navigate the sea of options that exist for testing. Once you do, make sure you track the results carefully.

At Rewire, as we grew we modified and changed our strategy along the way. What suited us at the beginning does not necessarily fit us now.

Oh, if you got this far, it means that you care about your organizational quality, or that you are intrigued by Rewire’s passion for Quality :).

If it is the latter — we are hiring a Quality Group Manager who will help us take Rewire’s Quality Strategy to the next level.