
Implementing Availability SLOs in Typeform

Cesar Lugo
Typeform's Engineering Blog
8 min read · Apr 4, 2022


If the product you provide is not available, it doesn’t matter how well designed it is, what flashy features it has, or the value your customers derive from it. This is true if you are selling physical products. Imagine selling lamps: if they are out of stock, all your hard work designing, manufacturing, testing, and delivering lamps is worth nothing to your potential customers.

This availability issue is even more critical when you’re providing software as a service. If your product is unavailable, not only are your potential customers affected, but also your current users. It would be as if all the lamps you had ever sold suddenly stopped working at the same time.

When we look at the complexities of a software product, it can be unavailable in many different ways: some parts may be working while others are not. Sometimes the platform works perfectly for most users but not for others. It can also briefly go down and come back up before you have had time to understand what went wrong in the first place. These brief, non-exhaustive examples show why software availability is tricky to define, measure, and improve.

William Thomson, Lord Kelvin, a British physicist and mathematician, once said: “What is not defined cannot be measured. What is not measured cannot be improved. What is not improved is always degraded”.

So, how do we define availability at Typeform?

I wish we had a quick answer for that, but we don’t. So instead, I’ll do my best to walk you through the — ongoing — process we developed to make sure our users, clients, and partners can rely on Typeform for their critical missions.

Why our previous solution didn’t measure up

We were not flying blind before implementing our current solution. Back in 2018 (ancient history in software), we built an internal tool called Deep-purple to serve as a synthetic monitoring tool for our user-facing features. This tool is still alive today: every minute it runs through more than 20 critical user flows in production (login, sign-up, submitting a response, etc.) and alerts on-call engineers if any of our features are not working as expected.

It was a good choice at the time, when we needed some initial visibility into our availability. We could collect the results for each user flow and represent them in a dashboard, which gave us a sense of how available those features were.

Availability chart of the most used features, measured by our synthetic tool

We also had individual monitors that would count the number of errors a service had, based on real requests, and calculate its availability. But they lacked actionable visualizations, automation to add new services, and a meaningful way of aggregating the results across the platform.

Our quick shopping list of requirements

  • Precision: As the quality standards of our organization increased, one of our objectives was to raise availability to 99.9% for all of our backend services, with the aim of eventually reaching 99.99% for our critical platform services. At 99.99%, a critical service would only have an allowance of about 13 minutes of downtime over a whole quarter (see the quick error-budget calculation after this list). Our synthetic tool was not precise enough to give us that kind of granularity: a single request per minute was not a big enough sample.
  • Actionable metrics: Since our goal was to assist teams on their journey of continuous improvement, they needed to be able to identify issues accurately and with enough context to solve them.
  • Good visualizations: It was important to understand what was going on in the whole platform at a glance, but also be able to go into detail when we spotted something wrong.
  • Automation: The solution would need to keep the maintenance toil to a minimum.
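
A quick back-of-the-envelope check of that 13-minute figure, as a minimal sketch; the only assumption is that a quarter is approximated as 90 days:

```python
# Downtime allowance (error budget) per quarter for a given availability target.
MINUTES_PER_QUARTER = 90 * 24 * 60  # ~129,600 minutes in a 90-day quarter

def downtime_budget_minutes(target: float) -> float:
    """Minutes of allowed downtime per quarter for an availability target (0-1)."""
    return MINUTES_PER_QUARTER * (1 - target)

print(f"99.9%  -> {downtime_budget_minutes(0.999):.0f} min/quarter")   # ~130 minutes
print(f"99.99% -> {downtime_budget_minutes(0.9999):.0f} min/quarter")  # ~13 minutes
```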

Calculating Service Level Indicators

A note on SLIs/SLOs: if you find these terms confusing, I would recommend taking a quick look at Google’s SRE handbook section on Service Level Objectives.

We realized that our way of calculating availability did not serve the objectives we were setting for ourselves. We started looking for resources and stories from other teams in the industry sharing experiences measuring SLIs for large systems. Eventually, after some research, internal discussions, and testing, we settled on some initial criteria:

Where to acquire the data: at each individual service. If our metrics were to be actionable, it made sense to initially group them by service, since each of our services is owned by a single team.

Frequency of measurement: every minute. If our goal was going to be expressed in minutes of downtime per quarter, this was the coarsest granularity we could use. So, every minute we would evaluate each service and log a minute of uptime or downtime depending on the result. This differed from the previous solution because we would now rely on real user traffic instead of a synthetic tool: it’s more accurate to evaluate with tens of thousands of requests per minute than with a single one.

Error threshold: we settled on 5%. There was no magical formula here; we observed how the numbers behaved for some of our services and came to the initial conclusion that crossing 5% was a significant enough signal that something was wrong.
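
Putting those three criteria together, the evaluation logic looks conceptually like the sketch below (a simplified illustration, not our production code; in practice Datadog does this bookkeeping for us, as described in the next section). A minute is logged as downtime when the error rate across real requests in that minute exceeds the 5% threshold, and availability is the share of good minutes:

```python
from dataclasses import dataclass

ERROR_THRESHOLD = 0.05  # 5% of requests failing marks the minute as "down"

@dataclass
class MinuteSample:
    total_requests: int
    error_requests: int  # e.g. 5xx responses observed at the service

def minute_is_up(sample: MinuteSample) -> bool:
    """A minute counts as uptime if the error rate stays at or below 5%."""
    if sample.total_requests == 0:
        return True  # simplification: a minute with no traffic counts as up
    return sample.error_requests / sample.total_requests <= ERROR_THRESHOLD

def availability(samples: list[MinuteSample]) -> float:
    """Fraction of minutes in which the service met the error threshold."""
    good = sum(minute_is_up(s) for s in samples)
    return good / len(samples)

# Example: one noisy minute out of three.
window = [MinuteSample(12000, 30), MinuteSample(11500, 900), MinuteSample(13000, 0)]
print(f"{availability(window):.2%}")  # 66.67%
```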

With that in mind, we set out to work on implementation…

From theory to practice

To keep our implementation as simple as possible, we chose Datadog’s SLO monitors to handle the heavy lifting of keeping score of our availability. We could feed our parameters into their system, and our services already integrate with Datadog through the Istio service mesh in our Kubernetes clusters.
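
For illustration, wiring a single service into that setup could look roughly like the sketch below, using Datadog’s public monitor and SLO APIs. This is only a sketch under assumptions: the Istio metric names, the monitor query syntax, the notification handle, and the response shapes are simplified stand-ins, not our actual configuration; Datadog’s API docs are the source of truth.

```python
import os

import requests

DD_API = "https://api.datadoghq.com/api/v1"
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

def create_availability_slo(service: str) -> dict:
    # 1) A metric monitor that fires when the per-minute error rate goes over 5%.
    #    Metric names and query syntax here are illustrative, not our real config.
    monitor = requests.post(f"{DD_API}/monitor", headers=HEADERS, json={
        "name": f"[SLO] {service} error rate > 5%",
        "type": "metric alert",
        "query": (
            "sum(last_1m):"
            f"sum:istio.mesh.request.count{{destination_service:{service},response_code:5*}}.as_count() / "
            f"sum:istio.mesh.request.count{{destination_service:{service}}}.as_count() > 0.05"
        ),
        "message": f"@slack-on-call {service} is burning its availability error budget",
    }).json()

    # 2) A monitor-based SLO that tracks how long that monitor stays healthy.
    #    (Response shapes are simplified; check Datadog's API docs.)
    return requests.post(f"{DD_API}/slo", headers=HEADERS, json={
        "name": f"{service} availability",
        "type": "monitor",
        "monitor_ids": [monitor["id"]],
        "thresholds": [{"timeframe": "90d", "target": 99.9}],
        "tags": [f"service:{service}"],
    }).json()
```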

We then created a dashboard in Datadog and built an automation that updates it daily. With that in place, nothing changes when a team wants to add a new service: they just follow their regular flow and create the service in our central repository of Kubernetes definitions, and the automation takes it from there.
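
The service-discovery side of that automation can be sketched roughly like this. Everything here is an assumption for illustration: the repository layout, the idea of keying off kind: Service, and the ensure_slo_and_dashboard_row helper are hypothetical, not our actual tooling.

```python
import glob

import yaml  # PyYAML

def ensure_slo_and_dashboard_row(service: str) -> None:
    # Placeholder for the interesting part: in the real automation this would
    # call the Datadog API (e.g. something like create_availability_slo from
    # the previous sketch) and add or refresh the service's three-tile row.
    print(f"would sync SLO + dashboard row for {service}")

def discover_services(k8s_repo_path: str) -> set[str]:
    """Collect service names from the central repository of Kubernetes definitions."""
    services = set()
    for path in glob.glob(f"{k8s_repo_path}/**/*.y*ml", recursive=True):
        with open(path) as f:
            for doc in yaml.safe_load_all(f):
                if doc and doc.get("kind") == "Service":
                    services.add(doc["metadata"]["name"])
    return services

def sync_dashboard(k8s_repo_path: str) -> None:
    # Run daily: every service defined in the repo gets its SLO and dashboard row.
    for service in sorted(discover_services(k8s_repo_path)):
        ensure_slo_and_dashboard_row(service)
```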

In the dashboard each of our back-end services would get a 3-tile row like these:

Example visualization of a service that complies with the availability SLO.
Example visualization of a service that doesn’t comply with the availability SLO.

The visualization we got was great for engineers: they could click on a time interval and see the query results in detail, with the minute-by-minute precision they needed. But it was not that insightful for engineering managers, who need to understand at a glance what is happening on the platform at any given time. A dashboard with dozens of widgets made it difficult to see beyond the details, so we had to find a way to aggregate these results into something meaningful.

We also found that Datadog had some other limitations: it only showed 7-, 30-, and 90-day windows for availability, and it only kept the evaluation results for a maximum of 90 days. If we wanted to see the evolution over time, we needed to keep the results for longer.

Different customers, different solutions: a high-level view

Now that we had our service-by-service dashboard in Datadog, it was time to make this information consumable from a higher perspective, to understand availability at the platform level.

Using Datadog’s SLO API, we found that we could extract all the information we needed on a daily basis. We could then store it and use it to create a new visualization that satisfied this use case; we built that visualization in Looker.
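
As an illustration of that extraction step, a daily job along these lines could pull each SLO’s recent history from the API and append it to whatever store feeds Looker. The endpoints are Datadog’s public SLO API, but the response fields and the CSV destination below are simplified assumptions, not our actual pipeline:

```python
import csv
import os
import time

import requests

DD_API = "https://api.datadoghq.com/api/v1"
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

def export_daily_sli(outfile: str = "slo_history.csv") -> None:
    now = int(time.time())
    yesterday = now - 24 * 60 * 60

    # List every SLO, then fetch the last day of history for each one.
    slos = requests.get(f"{DD_API}/slo", headers=HEADERS).json().get("data", [])
    with open(outfile, "a", newline="") as f:
        writer = csv.writer(f)
        for slo in slos:
            history = requests.get(
                f"{DD_API}/slo/{slo['id']}/history",
                headers=HEADERS,
                params={"from_ts": yesterday, "to_ts": now},
            ).json()
            # Field names below are indicative; the real response is richer.
            sli_value = history.get("data", {}).get("overall", {}).get("sli_value")
            writer.writerow([time.strftime("%Y-%m-%d"), slo["name"], sli_value])
```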

At this stage, we decided to aggregate the results by calculating the percentage of back-end services that comply with the 99.9% availability SLO. One drawback is that this does not differentiate between critical and non-critical services, and it does not map our services’ SLIs to user-facing features. On the other hand, it encouraged us to raise all services to the availability SLO, even non-critical ones with low traffic, which would improve the user-facing features down the line. We also kept our previous solution around, so we still had some visibility from the user’s perspective.
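
The aggregation itself is straightforward once the per-service results are stored. Conceptually it looks like the toy sketch below, which works over an assumed list of (team, service, availability) records rather than our real Looker model; the team and service names are invented for the example:

```python
from collections import defaultdict

SLO_TARGET = 99.9  # percent

# Assumed shape: (team, service, measured availability in % over the window)
records = [
    ("forms", "form-renderer", 99.97),
    ("forms", "form-builder", 99.92),
    ("responses", "submission-api", 99.81),
    ("responses", "webhooks", 99.95),
]

# Platform-level view: share of services meeting the SLO.
compliant = [r for r in records if r[2] >= SLO_TARGET]
print(f"Platform: {len(compliant) / len(records):.0%} of services meet the {SLO_TARGET}% SLO")

# Per-team view: same calculation, grouped by owning team.
per_team: dict[str, list[bool]] = defaultdict(list)
for team, _, sli in records:
    per_team[team].append(sli >= SLO_TARGET)
for team, results in per_team.items():
    print(f"{team}: {sum(results) / len(results):.0%} of services above the SLO")
```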

This gave us the following views:

Screenshot of Looker dashboard showcasing Service’s SLOs

We can see that 96% of our services are above the 99.9% SLO based on their individual results for the last 14 days.

Chart that aggregates the # of services that comply with the availability SLO over time.

We also have a trend line view, so we can easily see if we are improving or degrading over time.

Chart that shows the % of services that are above the SLO per team

We have a view by team ownership, so we can easily identify where we are below our target, and teams can plan for capacity to make improvements.

Chart showing the # of services that did not meet the SLO on each day over a period.

Lastly, we can track the number of services that did not meet the SLO on any given day, or click on a day to see a list of services. This helps us to understand the technical impact of an incident, and look for patterns between services.

What did we learn?

  • Treating this solution as a product helped us a lot: we used data to measure how the visualizations were being consumed, acted on the feedback we got from users, and kept iterating.
  • It’s normal not to get everything right from the start. Our query needed to be updated a couple of times to improve accuracy — and it took some time and help to identify and solve the issues.
  • From the beginning, this project was a joint effort between different teams that offered their knowledge and contributions to make it happen. The collaboration between Infrastructure, Tools, Architecture, Product, and Intelligence showed us how powerful it can be to break silos.

Solid results so far

We are very happy with the results that we have seen with this new system:

  • Teams monitor this information regularly to identify problems and plan for improvements.
  • Engineering Managers have a quick way to understand how we are doing and have given good feedback on the system.
  • Our overall availability numbers have been consistently improving since we introduced this solution almost a year ago, and now all of our services are above the 99.9% goal.
  • We have had very few issues with the accuracy and reliability of the solution itself.
