Measuring meaningful availability / uptime of Wise

Dániel Élő
Wise Engineering
Aug 30, 2024

At Wise we’re building the best way to move and manage the world’s money — making it faster, cheaper, more convenient and transparent. Product uptime and availability contribute to three parts of our mission:

  • An honest and meaningful uptime report is key to drive transparency of our systems.
  • High availability is one of the greatest factors in making our product convenient. It is very hard to be convenient if the product is not working as designed.
  • High availability means fewer retries and less waiting, driving us closer to instant too.

In this post we will talk about the reporting aspect of uptime and availability, focusing on transparency. We would like to share our journey and various technical considerations towards measuring meaningful availability/uptime for Wise.

Wise has a wide range of product offerings, so condensing the availability or uptime of our business flows is not trivial. These flows consist of multiple business operations with various branches depending on user intent or group.

Providing a single number as business flow availability or uptime is always going to be an approximation, but we can and should be transparent about how we aggregate and what that represents.

Our goal is to have a meaningful measurement we can get behind as representative of user-experienced uptime/availability.

Definitions

First things first, let’s define some terms we’ll use:

  • Business flow: A high level, complex use-case of the Wise product, e.g. “send money” (we aim to measure the availability of this, as this is the most meaningful for the end-users)
  • Business operation: an individual operation a user can perform as part of one use case or business flow. A high level piece of functionality e.g. get a quote, set up a transfer, pay in for the transfer
  • API endpoint: a technical term; the HTTP(S) API invoked in order to perform a business operation. Not a service or a series of calls, but a single HTTP(S) endpoint that performs part or all of a business operation.
  • Availability: A ratio of successful operations (i.e. not server errors) to total operations for a given time frame, e.g. 9943/10000 = 0.9943
  • Downtime: Number of seconds the application produces errors, behaves erroneously, or is not reachable.
  • Uptime: The ratio of the number of seconds the application did not produce errors or behave erroneously to the total number of seconds of the observed period.

Uptime vs. Availability

People often use these two terms interchangeably. However, they mean different things. Let’s build from the definition:

Availability and uptime are both a ratio (%) of “good to total”.

However it is very important to recognize that their viewpoint is totally different:

Availability is representative of the share of successful operations (regardless of their distribution in time).

Uptime is representative of the share of time where we could successfully complete operations (regardless of how many operations were actually performed).

Naturally if we have a perfectly even traffic pattern (e.g. exactly one operation attempted every minute all the time forever) uptime and availability will be the same nominal number.

In real-life application traffic is almost never perfectly even. There are daily/weekly/monthly (and so on) seasonalities, and spikes and valleys in traffic. If we measure the real traffic of a real application, its uptime and availability will never be the same nominal number! They will represent the same reality, but from different perspectives.

In general, if outages happen during high-traffic times then uptime will be a higher number, whereas if they happen during low-traffic times availability will be higher (for the same outage).

For example given the following operations/errors distribution:

  • Downtime: 100 seconds
  • Total time: 420 seconds
  • Successful operations: 30
  • Total operations: 44

This gives an availability of 30/44 = 68.2% and an uptime of (420–100)/420 = 76.2%.

But if the same downtime happened over a different traffic distribution:

  • Downtime: 100 seconds
  • Total time: 420 seconds
  • Successful operations: 33
  • Total operations: 40

This gives an availability of 33/40 = 82.5% and the same uptime of (420–100)/420 = 76.2%.
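To make the distinction concrete, here is a minimal sketch in Python that reproduces both ratios from the example numbers above (the figures are illustrative, not real Wise traffic):

```python
# Minimal sketch: the same downtime, scored from two viewpoints.

def availability(successful_ops: int, total_ops: int) -> float:
    """Share of successful operations, regardless of when they happened."""
    return successful_ops / total_ops

def uptime(downtime_seconds: float, total_seconds: float) -> float:
    """Share of time the system was usable, regardless of traffic volume."""
    return (total_seconds - downtime_seconds) / total_seconds

# First traffic distribution
print(f"{availability(30, 44):.1%}")  # 68.2%
print(f"{uptime(100, 420):.1%}")      # 76.2%

# Same downtime, different traffic distribution
print(f"{availability(33, 40):.1%}")  # 82.5%
print(f"{uptime(100, 420):.1%}")      # 76.2%
```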

There is no right or wrong choice, but given that people often turn ratios into “downtime seconds” in their heads, we believe it is better to report uptime to support this subconscious translation.

Simplifications

Rigorously pairing all operations to users performing them, ordering them and painting the various user journeys (which are different for each user), and deciding success or failure of these journeys is a complex and resource intensive endeavour — especially for a large user base, and a large set of product offerings.

Instead, approximating this value based on general user behaviour is much more feasible and can still produce meaningful results. In general, this is what the industry does. Not everyone is going for this much detail, but we believe in transparency and honesty even if it means we report “less nines”. We measure key business operations performed by our full user-base and use that information as a proxy to our overall business flow availability.

Ways to measure business flow uptime

Synthetic tests

A common practice to measure a flow is to set up an automation that behaves like a real production user, performs a predefined series of business operations (the user journey), and logs the outcome of the journey as a whole.

This approach is very easy to set up, and the frequency of these test runs is completely in our control. We can set up a perfectly even test schedule (e.g. once a minute), so it can report both uptime and availability (as with perfectly even traffic they are the same nominal number). All in all, it is easy to set up and report on. Wise has been using such a black-box approach for years.
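For illustration, a minimal synthetic probe could look something like the sketch below. The base URL, endpoint paths, payloads and response fields are hypothetical, not our actual API or tooling:

```python
# Hypothetical synthetic probe: one scripted "send money" journey per minute.
# The endpoints, payloads and the response field "id" are illustrative only.
import time
import requests

BASE_URL = "https://api.example.com"  # placeholder, not a real Wise endpoint

def run_journey() -> bool:
    """Perform one scripted journey and report whether it succeeded end to end."""
    try:
        quote = requests.post(f"{BASE_URL}/quotes",
                              json={"source": "EUR", "target": "USD", "amount": 100},
                              timeout=10)
        quote.raise_for_status()
        transfer = requests.post(f"{BASE_URL}/transfers",
                                 json={"quoteId": quote.json()["id"]},
                                 timeout=10)
        transfer.raise_for_status()
        return True
    except requests.RequestException:
        return False

while True:
    ok = run_journey()
    # With a perfectly even schedule, the success ratio of these runs
    # doubles as both availability and uptime.
    print(f"{time.strftime('%H:%M:%S')} journey {'ok' if ok else 'failed'}")
    time.sleep(60)
```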

The black-box test served us quite well when we had a simpler product offering (one basic journey of sending money).

However convenient it was, the black-box testing approach has shortcomings:

  1. It does not scale well for multiple business flows (or multiple branches of a flow): if we ramp up the variations (and leave the frequency intact), the “test” alone would generate so much load on our system that our actual users would become a negligible share of the traffic. This is one of the reasons why some other companies (e.g. Google) are also moving away from synthetic monitoring and gravitating towards real user monitoring.
  2. When the product improves, APIs change, so the test needs an immediate update. The test can also break when APIs are retired or moved around. This is manageable so long as there are only a few of these black-box tests (which ties back to the scaling issue).
  3. When the test tool itself is down, we cannot calculate or backfill data for the lost time period (data that we need for accurate reporting).

This has happened a few times at Wise. Some internal rules of an internal API changed, and suddenly an extra step was needed in the user journey to fulfil strong customer authentication. The test itself broke. Real users did not experience downtime, but our tooling suggested they did. In these cases we had to choose from the following options:

  • Assume that a few minutes (or hours) never happened as we have no reliable data
  • Assume we were “up” based on user contacts and logs of real user interactions
  • Report outrageously low numbers even though the system was up, because the measurement tool was down.

Usually we opted for the first option and had a small asterisk in the report warning about resolution issues. Then we would run our incident process to make sure we catch these earlier next time without user impact. Then, next time, something completely different broke the test without user impact. As we got more and more processes around our tests, these got less and less frequent but still happened. New errors always find a way.

Test users are often handled differently than real users by backend applications.

In our example we had a test user (in the live production database) that was creating 1000+ transfers — every single day. Some of our KYC (Know Your Customer) and AML (anti-money laundering) systems regularly look for suspicious transactions and behavioural patterns to help prevent malicious actors misusing or taking advantage of our systems. The test user had to be exempted, as it would’ve tripped all the alarms. So in this sense our measurement (in order to keep the company running) was no longer doing the exact same thing users do… It did most of it, but had to pass fewer checks. What if those checks we skipped had bugs? We’d miss reporting their downtime. So for these reasons we turned away from synthetic availability/uptime measurements.

White Box testing: track business operations of real users

These shortcomings made us turn away from synthetic load and black-box testing and measure the organic traffic of real users instead. This approach is more representative of user-experienced availability and uptime, and scales much better for our vast product offering. However, it comes with its own trade-offs.

Traffic is no longer perfectly even, as people tend to move money during their business hours, so we have quite a lot of seasonality depending on which part of the world is in office hours at any given time. Uptime and availability are therefore no longer the same nominal number.

Measuring uptime with real world traffic is not as straightforward as availability.

As discussed above, given that people often turn ratios into “downtime seconds”, we believe it is better to report uptime to support this subconscious translation. In the next sections we will focus on getting uptime/downtime from real user actions.

Most real users interact with our API through our clients, which have aggressive retries and fallbacks to make their experience smooth even if the backend is having issues. Transient API errors are not reflected in our website/mobile app experience. However this is very hard to factor in. From our backend’s perspective, retries are independent requests; one fails, the other succeeds. We also have a large customer base who are directly consuming our production API (Wise Platform is used by many partners across the globe), so we cannot rely on their retries.

All in all, counting even the transient errors is accurate for direct API customers, and even though it yields a lower nominal uptime than the one our web and mobile users actually experience, it is a sufficient “at least this good, probably better” number.

Downtime vs. approximate downtime of operations

Calculating real downtime from real (non-generated) traffic requires us to enumerate all requests and measure the time elapsed between each first failed request and the next successful one. This is only easy if you have a single queue of requests. Distributed systems designed to be redundant and reliable are just not that simple — they are never “completely down”: some parts might work while others don’t at the same time.

A simple way of approximating downtime is to aggregate away the traffic pattern and derive downtime from availability. This will not produce the exact same result, but it is much easier to calculate and reasonably similar.

Given a short time window (in Wise, 1 minute) we can consider the traffic “constant”. With that assumption we can represent the uptime of an API endpoint for a short period by its availability. To get its downtime, we multiply the window length by one minus that availability (in other words, we subtract the “up” seconds from the window length).

The smaller this time window is, the better the approximation. A 5-minute window is commonly used. Wise used 2-minute windows for a few months, and we have since improved to 1-minute windows.

From there, to get the downtime of a meaningful period we simply sum up the downtime of the small windows the period consists of.
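As a sketch of this approximation (the helper names and the choice of what to do with empty windows are our illustration, not a prescribed implementation):

```python
# Approximate downtime from per-window availability, as described above.
WINDOW_SECONDS = 60  # Wise uses 1-minute windows

def window_downtime(successful: int, total: int,
                    window_seconds: int = WINDOW_SECONDS) -> float:
    """Treat traffic as constant within the window:
    downtime = window length * (1 - availability)."""
    if total == 0:
        return 0.0  # no traffic means no evidence of downtime (a policy choice)
    availability = successful / total
    return window_seconds * (1.0 - availability)

def period_downtime(windows: list[tuple[int, int]]) -> float:
    """Downtime of a longer period is the sum of its windows' downtimes."""
    return sum(window_downtime(ok, total) for ok, total in windows)

# Example: three 1-minute windows: fully up, half failing, fully down.
print(period_downtime([(100, 100), (40, 80), (0, 50)]))  # 0 + 30 + 60 = 90.0 seconds
```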

Aggregate downtime of operations to flows

As described above, we have a good way of approximating the downtime of each of our important business operations (each of which has a single API endpoint whose availability we can measure). The general public (customers, auditors, stakeholders) want a single number representing the full business flow uptime — not 8–10 uptimes for the operations it consists of. To get that reportable number we need to make a trade-off:

First of all, let’s declare what we mean by “down” for a business flow: the business flow is down if at least one of the operations it consists of is down.

Option one: “the sum of downtimes of all operations in a business flow is the overall downtime of the flow”.

This assumes that the operations were down at strictly separate times, i.e. at any point during the outage exactly one operation was down.

This is a good “worst case” scenario. It is the equivalent of “at least this good, but probably better”.

Option two: “the highest of downtimes of all operations in a business flow is the overall downtime (i.e. minimum uptime represents all)”.

This assumes that all operations were down at exactly the same time. No exceptions: either everything worked or nothing did.

This is a kind of “best case” scenario, and for simpler business flows it might be the closest to the truth. Engineers don’t really like it, as it is very rare that everything is down at once. Mostly this is too kind to the business. It is the equivalent of “we were down for at least this amount of time, and maybe some more, but this is the biggest contributor”.

Option three: “the average of downtimes of all operations in a business flow is the overall downtime”.

This aggregation tries to make up for outliers by averaging them out. It is used by some key players; for example, Revolut reports that “Daily uptime is defined as the percentage of minutes in the day where the average technical error rate was below 5%.”

They do provide an extremely detailed breakdown, though, and they are transparent about this averaging.

At Wise we believe it is always too kind to the business to do averages of business operation uptime/availability for operations that are part of the same business flow, as it will always be more “up” than option one, which we know is a good minimum.

Option four: “a middle point of the known good minimum and known good maximum”

I.e. we are sure the downtime of a business flow always lies between the sum of its operations’ downtimes (option one) and the maximum of them (option two). So we take the middle point (average) of these two numbers, and highlight our confidence interval.

Example:

Uptime of business operations for 2024 Q2:

  • Minimum uptime of any single operation: 99.958%
  • Sum of all operations’ downtimes: 7514.5 seconds (there were 7862399 seconds in 2024 Q2), making the uptime 99.904%

So we know that our API customers were able to complete the full flow between 99.904% and 99.958% of the time in 2024 Q2.

The middle point is 99.931% (with a ±0.027% confidence band).
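Putting that aggregation into code, a minimal sketch could look like the one below. The per-operation downtime figures are hypothetical, chosen only so that they sum to the reported 7514.5 seconds:

```python
# Aggregate per-operation downtimes into a flow-level uptime (option four).
SECONDS_IN_PERIOD = 7_862_399  # seconds in the observed quarter (2024 Q2)

# Hypothetical per-operation downtimes; only their sum and maximum matter here.
operation_downtimes = [3300.0, 2500.0, 1214.5, 500.0]

worst_case_downtime = sum(operation_downtimes)  # option one: outages never overlapped
best_case_downtime = max(operation_downtimes)   # option two: outages fully overlapped

def to_uptime(downtime_seconds: float) -> float:
    return (SECONDS_IN_PERIOD - downtime_seconds) / SECONDS_IN_PERIOD

lower, upper = to_uptime(worst_case_downtime), to_uptime(best_case_downtime)
midpoint = (lower + upper) / 2
band = (upper - lower) / 2
print(f"flow uptime: {midpoint:.3%} (+/- {band:.3%})")  # ~99.931% (+/- 0.027%)
```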

How Wise will report business flow uptime from now on:

  • We will shift, where we can, to reporting uptime rather than availability
  • We will report our API uptime regardless of our client-side web/mobile retries, as we have a large direct API customer base, and deciding on a threshold for transient issues would just cause more confusion for a slightly greater nominal uptime value that only applies to some of our customers.
  • We track our uptime per business operation
  • We aggregate that number for business flows:
      • We add up the downtimes of all operations of a flow to get the worst possible value (assuming exactly one of them was down at any given time)
      • We take the maximum downtime of any single operation to get the best possible value (assuming all of them were down at the same time)
      • Then, since in reality neither exactly one nor all of our operations are down at the same time, we average these two numbers to get a middle point, and report the difference between the min/max and the average for transparency.

Onwards.

Dániel Élő
Wise Engineering

Basic barefoot SRE, working at Wise.com on reliability measurements and improvements.