Why SLAs are Critical for Reliable Data Organizations and Data Products

Seckin Dinc
Published in Data And Beyond
8 min read · Mar 8, 2023
Photo by Scott Graham on Unsplash

In today’s world, reliability is the foundation of our lives. We trust our partners, neighbors, colleagues, and friends in order to live smooth and healthy lives. To build reliability we make promises and commitments, and most of the time we do our best to keep them. When we can’t keep them, we face the consequences and try to find ways not to repeat our mistakes. Sometimes the consequences are emotional, sometimes monetary, depending on the person or organization you committed to. For example:

  • When you make an appointment with your dentist, you commit to cancel your appointment 24 hours before the treatment, otherwise, you agree to be charged a certain amount.
  • When you sign a contract with your employer, you commit to working a minimum of 40 hours a week at your company, otherwise, you agree to get paid less.
  • When you own a dog, you commit to taking your dog for a walk at least twice a day, otherwise, you agree to clean the poop from the carpet.

Today I am going to deep dive into this topic, explain technical terminology like SLA, SLO, and SLI, and show how to use them to create and nurture reliable data organizations and data products.

Three musketeers: SLA, SLO, and SLI

First, we need to understand the technical definitions of the three musketeers:

Service Level Agreement (SLA)

SLA stands for “Service Level Agreement.” It is a contract between a service provider and a customer that outlines the level of service that the provider will deliver, as well as the customer’s responsibilities and expectations.

Service Level Objective (SLO)

SLOs are internal performance goals that a provider sets for themselves. SLOs help providers to focus on delivering reliable and consistent service to their customers. SLOs are typically defined in terms of specific metrics, such as availability, latency, or throughput.

Service Level Indicator (SLI)

SLIs are used to track and report on the actual performance of a service and are often defined in terms of key performance indicators (KPIs) such as response time, throughput, or error rate.
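To make these definitions concrete, here is a minimal sketch of how SLIs could be computed from raw request logs. The request data and the 500 ms latency threshold are invented for illustration:

```python
# Hypothetical request log: (latency in milliseconds, succeeded or not).
requests = [
    (120, True), (340, True), (95, True), (2100, False),
    (180, True), (260, True), (75, True), (1900, True),
]

total = len(requests)
errors = sum(1 for _, ok in requests if not ok)

# Error rate SLI: share of failed requests.
error_rate = errors / total

# Availability SLI: share of successful requests.
availability = 1 - error_rate

# Latency SLI: share of requests served under a 500 ms threshold.
fast_enough = sum(1 for ms, _ in requests if ms < 500) / total

print(f"availability: {availability:.1%}")  # availability: 87.5%
print(f"error rate:   {error_rate:.1%}")    # error rate:   12.5%
print(f"latency SLI:  {fast_enough:.1%}")   # latency SLI:  75.0%
```

The point is that an SLI is just a measurement; the SLO is the target you set for it, and the SLA is the contract built on top of both.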

Picture credit by Atlassian https://www.atlassian.com/incident-management/kpis/sla-vs-slo-vs-sli

A real-life example: Google BigQuery SLA

If you are a person like me who struggles to understand concepts without examples, then you are in the right place. To see how SLA, SLO, and SLI are used in real life, I am going to use the Google BigQuery Service Level Agreement, which is publicly available.

In the SLA, Google commits to two main SLOs:

Picture credit by Google https://cloud.google.com/bigquery/sla

To calculate and evaluate the SLOs, Google provides proper and detailed metric definitions:

Picture credit by Google https://cloud.google.com/bigquery/sla

If the commitments are not met, Google defines the penalty conditions as follows:

Picture credit by Google https://cloud.google.com/bigquery/sla

As SLAs are living documents that need to be updated as product conditions change, I recommend you check the latest SLA on the main Google BigQuery web page.

SLAs are not for punishment, they are for alignment

SLAs enable service providers and consumers to understand each other and create alignment. The consumers share their prioritized requirements that should be respected and supported by the providers. In the same manner, the providers share what is possible with the current infrastructure and teams to cover the requirements and create growth opportunities for their services.

As we use SLAs for alignment, they can cover both internal and external circumstances:

  • Internal SLAs are created by teams in the same organization that work together and depend on each other's products and services; e.g. the fraud team’s real-time payment transaction labeling service sends fraudulent transactions to the payment team via REST APIs to block potential fraudulent transactions.
  • External SLAs are created by organizations that are willing to work together; e.g. your company wants to use Google BigQuery and extend the open SLA with additional terms and conditions.

The main problem of data organizations: Reliability

Today most big organizations see data teams as a burden and as unreliable allies. There are various fair and unfair reasons behind this perception:

  • Inconsistent information across data products; e.g. the X dashboard and the Y dashboard show different results for the Z metric.
  • Unpredictable problems in ETL jobs delay business operations; e.g. two days ago the daily ETL finished at 7 am, and today, with nothing changed, it finished at 11 am and blocked certain operations.
  • Data engineering and analytics teams lack proper alarming mechanisms; e.g. the ETL jobs from the data warehouse to a domain product didn’t run for the last 3 days without anyone noticing.

These and many other examples create distrust toward data teams. On top of that, if the data teams become protective about their approaches and seek scapegoats, their reliability diminishes further.

To break this vicious cycle, data organizations should be more predictable, transparent, and committed to the data products they build. To do that, they should adopt a best practice they can learn from software engineering teams: writing SLAs.

How to create Data SLAs

Creating, committing to, and managing data SLAs requires a dramatic mindset shift in the data organization. It can feel like being audited and evaluated all the time, but the real goal is to be more predictable and committed to the data consumers you are working with. There are various steps to achieve success:

1- Know your current capabilities and limits

SLA creation is an alignment process. During this process, the data provider and the data consumer need to know what is achievable today and what is not. As the data provider, you need to know your team, your infrastructure, and your resources to make realistic commitments and promises.

2- Identify the data consumer

The data consumer can be an internal or external user. Understanding who is going to consume your data and for which use cases will help you optimize your strategy.

3- Collect data on consumer requirements

Making assumptions about the data consumer may be the biggest mistake. The data provider needs to ask as many questions as they can to understand what the data consumer actually needs, rather than what would be great in an imaginary world. Typical requirements look like:

  • Tables and dashboards should be updated before 7 am.
  • Dashboards should respond in less than 2 minutes.
  • Customer information should be unique, etc.
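Requirements like these translate naturally into automated checks. A minimal sketch, where the measurements and table contents are invented for illustration:

```python
from datetime import datetime, time

# Hypothetical measurements collected for one day.
last_refresh = datetime(2023, 3, 8, 6, 45)   # when the tables finished updating
dashboard_response_seconds = 38              # slowest dashboard query today
customer_ids = [101, 102, 103, 104, 102]     # ids loaded into the customer table

checks = {
    # Tables and dashboards should be updated before 7 am.
    "freshness": last_refresh.time() < time(7, 0),
    # Dashboards should respond in less than 2 minutes.
    "latency": dashboard_response_seconds < 120,
    # Customer information should be unique.
    "uniqueness": len(customer_ids) == len(set(customer_ids)),
}

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
# freshness: PASS
# latency: PASS
# uniqueness: FAIL  (customer id 102 appears twice)
```

Writing the requirements down as checks like these also forces both sides to agree on precise definitions before anything is committed.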

4- Define SLAs

Based on the requirements, the data provider should evaluate past performance and come up with baseline metrics; e.g. 99% freshness, 99.99% availability, 99.9% uniqueness, etc. The data provider and the data consumer should align on these numbers before moving forward.
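A quick sketch of how such a baseline could be derived from past performance; the run history below is invented for illustration:

```python
# Hypothetical record of the last 30 daily ETL runs: True if the run
# finished before the agreed 7 am deadline, False otherwise.
runs_on_time = [True] * 29 + [False]

# Baseline freshness: share of runs that met the deadline.
baseline_freshness = sum(runs_on_time) / len(runs_on_time)
print(f"baseline freshness: {baseline_freshness:.1%}")  # baseline freshness: 96.7%
```

Committing at or slightly below what history shows you can actually deliver keeps the SLA realistic.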

5- Define metrics and SLIs

After the SLAs are defined, the data provider should define the corresponding SLIs to monitor and measure their status. As we saw in the Google BigQuery example, the definitions of the KPIs should be clear and aligned with the data consumers.

6- Define SLOs

As we discussed in the first step, the data provider should be aware of their limits before making commitments. With SLOs, you define an acceptable range of downtime and failures per KPI. Never aim for 100% success on any KPI!
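Avoiding 100% has a concrete counterpart: every SLO target implies an error budget. A quick sketch over a hypothetical 30-day window:

```python
# Error budget: how much downtime a given SLO target allows
# over a 30-day window.
window_minutes = 30 * 24 * 60  # 43,200 minutes

for target in (0.99, 0.999, 0.9999):
    budget = window_minutes * (1 - target)
    print(f"{target:.2%} -> {budget:.0f} minutes of downtime allowed")
# 99.00% -> 432 minutes
# 99.90% -> 43 minutes
# 99.99% -> 4 minutes
```

Each extra nine shrinks the budget by a factor of ten, which is exactly why the jump from 99.9% to 99.99% can take months of engineering effort.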

7- Share your SLA with the data consumer

At this stage, everything on paper should be aligned between the data provider and the data consumer. Before starting the next stages, make sure you have shared your SLA with the data consumer.

8- Monitor your progress

When the data provider has the SLOs in place, the next step is to start monitoring. Grafana dashboards are pretty handy for these operations. These dashboards should help the data provider team members easily visualize, evaluate, and understand the ongoing status of the indicators.

9- Set up alarms

It is impossible to monitor the metrics all the time. Most of the time, the data provider will have multiple SLAs with multiple data consumers. The best way to keep the balance is to set up alarm mechanisms.

Alarm mechanisms help the data provider identify issues before the data consumers come to the team with the problems. These alarms can be shared over email, Slack messages, etc., with their respective severity levels.
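A minimal sketch of severity-based routing; the thresholds and channel names are invented for illustration:

```python
# Route an alert to a channel based on how far the SLI has fallen
# below the SLO target. Thresholds and channels are illustrative.
def route_alert(sli: float, slo: float) -> str:
    gap = slo - sli
    if gap <= 0:
        return "none"   # SLO still met, no alert
    if gap < 0.001:
        return "email"  # minor breach: daily digest email
    if gap < 0.01:
        return "slack"  # noticeable breach: Slack channel
    return "pager"      # severe breach: page the on-call engineer

print(route_alert(sli=0.9995, slo=0.999))  # none
print(route_alert(sli=0.9985, slo=0.999))  # email
print(route_alert(sli=0.995, slo=0.999))   # slack
print(route_alert(sli=0.95, slo=0.999))    # pager
```

Tiering the alerts this way keeps the loud channels reserved for breaches that genuinely threaten the SLA.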

10- Set up an incident management plan

When an alarm is generated, how is the data provider going to respond? Does every alarm have the same severity? Without a proper incident management plan, the data provider team can find themselves trying to solve issues 24/7, which will soon burn them out. Collaborating with on-call teams and splitting the issues will keep the team more focused and motivated!

11- Communicate issues to the data consumers

Reliability is built on transparency. No system has 100% success on every metric. There will be failures, and that is normal. What matters is how the data provider found out about the failure, how fast and accurately they responded, which actions they took, and how openly they communicated with the data consumers.

12- Balance technical debt, maintenance, and new features

SLAs help the data provider team identify the areas they should pay attention to. Sometimes improving from 99.9% to 99.99% may take months, and maybe it is not needed. Understanding what is necessary and what is a luxury is the key to balancing technical debt and new features.

13- Regularly meet with data consumers

SLAs are live items; they don’t stop evolving once defined. The data provider and the data consumer should regularly meet to evaluate the requirements and potential improvement areas. As an example, the Google BigQuery SLA was updated 3 times in the last 2 years.

14- Know your future requirements

As SLAs evolve and have the potential to become stricter over time, the data provider should prepare a strategy to support future requirements. These can range from hiring to infrastructure changes.

Conclusion

Creating reliable data organizations and data products is not an easy process. Migrating to product thinking in data teams is a great first step, but it should be followed by the best practices applied by technical product teams.

Today there is not a single commercial software product that doesn’t have an SLA for its services. SLAs create trust among the parties and show potential growth areas for the future. Adopting this best practice in data organizations will not only create trust among stakeholders and clients but also highlight improvement areas for the future.

Thanks a lot for reading 🙏

If you liked the article, check out my other articles.

If you want to get in touch, you can find me on Linkedin and Mentoring Club!
