SLA in Data

Published in

AbeaData

4 min read · May 20, 2024

When I talk to people about service levels, and specifically service level agreements (SLA), they focus on a system's availability, expressed as a number of nines: 99.5% to 99.99999% (more or less).

SLAs go well beyond availability. Let's first look at SLA, SLO, and SLI, then jump to twelve common service level indicators (SLI).
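To make the nines concrete, here is a minimal Python sketch that converts an availability percentage into the downtime it allows. The function name `allowed_downtime` and the 365-day year are assumptions made for illustration.

```python
from datetime import timedelta

def allowed_downtime(availability_pct: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime budget implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    # The unavailable share of the period is the downtime budget.
    return period * (1 - availability_pct / 100)

for pct in (99.5, 99.9, 99.99, 99.99999):
    print(f"{pct}% -> {allowed_downtime(pct)} of downtime per year")
```

Under these assumptions, 99.5% leaves a little under two days of downtime per year, while seven nines leaves about three seconds.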

This is particularly important in Modern Data Engineering as you think of Data Contracts and Data Products.

SLA, SLO, and SLI

Service levels have various attributes around them. Let's quickly look at them.

When you have service levels, you have indicators (SLI), objectives (SLO), and agreements (SLA). Let's look at an example.

Getting a coffee at a coffee shop has implied SLI, SLO, and SLA.

Imagine you enter a coffee shop (outside Amsterdam) and ask for a black coffee. You expect that it will come hot (SLA). The barista could have an indicator (SLI), which is the temperature. As an objective (SLO), they could set that the temperature should be between 85ºC and 95ºC.
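The coffee example can be sketched in a few lines of Python: the indicator is a measured value, and the objective is an acceptable range for it. The `Objective` class and its field names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Objective:
    """An SLO expressed as an acceptable range for one indicator (SLI)."""
    indicator: str
    low: float
    high: float

    def is_met(self, measured: float) -> bool:
        # The objective is met when the measurement falls inside the range.
        return self.low <= measured <= self.high

# SLO from the coffee example: serve between 85ºC and 95ºC.
coffee_slo = Objective(indicator="temperature_celsius", low=85.0, high=95.0)

print(coffee_slo.is_met(90.0))  # a 90ºC coffee is inside the range
print(coffee_slo.is_met(70.0))  # a 70ºC coffee is outside the range
```

The agreement (SLA) is what you promise the customer on top of this: the coffee comes hot, and perhaps what happens when it does not.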

What does it mean for data?

Service-Level Indicators for Data

As much as data quality describes the condition of the data, service levels will give you precious information (indicators) on the expectations around availability, the lifecycle, and more.

Here is a list of SLIs that can be applied to your data and its delivery. You will have to set some objectives (SLO) for your production systems and agree on them with your users, based on their expectations (SLA).

Service Level Indicators

Availability (Av)

In simple terms, the question is: Is my database accessible? A data source may become inaccessible for various reasons, such as server issues or network interruptions. The fundamental requirement is for the database to respond affirmatively when you use the JDBC’s connect() method.
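The text above mentions JDBC; as a rough Python equivalent, here is a sketch that probes a data source and turns the probe results into an availability percentage. The `probe` and `availability_pct` helpers are hypothetical, and `connect` stands for any callable that opens a connection (for example, a database driver's connect function).

```python
from typing import Callable, Iterable

def probe(connect: Callable[[], object]) -> bool:
    """Return True if the connection attempt succeeds, False if it raises."""
    try:
        connection = connect()
    except Exception:
        return False
    # Close the connection if the driver exposes a close() method.
    close = getattr(connection, "close", None)
    if callable(close):
        close()
    return True

def availability_pct(probe_results: Iterable[bool]) -> float:
    """Share of successful probes, as a percentage."""
    results = list(probe_results)
    if not results:
        raise ValueError("no probes recorded")
    return 100 * sum(results) / len(results)

def failing_connect():
    # Stand-in for a data source that is down.
    raise ConnectionError("database unreachable")

# Four probes: three succeed, one fails.
results = [
    probe(lambda: object()),
    probe(failing_connect),
    probe(lambda: object()),
    probe(lambda: object()),
]
print(availability_pct(results))
```

In practice, you would run the probe on a schedule and compute the percentage over the period your SLO covers.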

Throughput (Th)

Throughput is about how fast data can be accessed. It can be measured in bytes or records per unit of time.
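As a minimal sketch, throughput is a quantity divided by the time it took to read it. The function name and the sample numbers below are made up for illustration.

```python
def throughput(units: int, elapsed_seconds: float) -> float:
    """Units (records or bytes) delivered per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return units / elapsed_seconds

# 1.2 million records read in one minute.
records_per_second = throughput(1_200_000, 60)

# 3 GiB read in one minute.
bytes_per_second = throughput(3 * 1024**3, 60)

print(records_per_second, bytes_per_second)
```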

Error rate (Er)

How often will your data have errors, and over what period? What is your tolerance for those errors?
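One way to put a number on this, sketched in Python: count the records that fail a validation rule and compare the ratio to your tolerance. The `error_rate` helper, the validation rule, the sample orders, and the 1% tolerance are all assumptions for the example.

```python
from typing import Callable, Iterable

def error_rate(records: Iterable[dict], is_valid: Callable[[dict], bool]) -> float:
    """Fraction of records that fail the validation rule."""
    total = 0
    errors = 0
    for record in records:
        total += 1
        if not is_valid(record):
            errors += 1
    if total == 0:
        raise ValueError("no records to evaluate")
    return errors / total

def has_valid_amount(order: dict) -> bool:
    # Hypothetical rule: the amount must be a non-negative number.
    amount = order.get("amount")
    return isinstance(amount, (int, float)) and amount >= 0

orders = [{"amount": 10.0}, {"amount": -5.0}, {"amount": 42.0}, {"amount": None}]

TOLERANCE = 0.01  # assumed objective: at most 1% of records in error
rate = error_rate(orders, has_valid_amount)
print(rate, rate <= TOLERANCE)
```

The period matters too: the same check can run per batch, per day, or per month, depending on what you agreed on.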

General availability (Ga)

In software and product management, general availability means the product is now ready for public use, fully functional, stable, and supported. Here, it applies to when the data will be available for consumption. If your consumers require it, it can be a date associated with a specific version (alpha, beta, v1.0.0, v1.2.0, v2…).

End of support (Es)

The date at which your product will not have support anymore.

For data, it means that the data may still be available after this date, but if you have an issue with it, you won’t be offered a fix. It also means that you, as a consumer, will expect a replacement version.

Fun fact: Windows 10 is supported until October 14, 2025.

End of life (El)

The date at which your product will not be available anymore. No support, no access. Rien. Nothing. Nada. Nichts.

For data, this means that the connection will fail or the file will not be available. It can also be that the contract with an external data provider has ended.

Fun fact: Google Plus was shut down in April 2019. You can’t access anything from Google’s social network after this date.
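General availability, end of support, and end of life are three dates on the same timeline. Here is a small Python sketch that tells you where a dataset version stands on a given day. The function, the status labels, and the dates are hypothetical.

```python
from datetime import date

def lifecycle_status(today: date, general_availability: date,
                     end_of_support: date, end_of_life: date) -> str:
    """Place a date on the GA / end of support / end of life timeline."""
    if not general_availability <= end_of_support <= end_of_life:
        raise ValueError("expected GA <= end of support <= end of life")
    if today < general_availability:
        return "not yet available"
    if today < end_of_support:
        return "supported"
    if today < end_of_life:
        return "available without support"
    return "retired"

# Made-up dates for version 1.0.0 of a dataset.
ga = date(2024, 1, 15)
es = date(2025, 1, 15)
el = date(2025, 7, 15)

for day in (date(2023, 12, 1), date(2024, 5, 20), date(2025, 3, 1), date(2025, 8, 1)):
    print(day, lifecycle_status(day, ga, es, el))
```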

Retention (Re)

How long are we keeping the records and documents? There is nothing extraordinary here: as with most service-level indicators, it can vary by use case and legal constraints.
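A retention check can be as simple as comparing each record's creation time with a cutoff. This is a sketch; the `created_at` field, the one-year retention, and the sample records are assumptions.

```python
from datetime import datetime, timedelta, timezone

def past_retention(records: list[dict], retention: timedelta,
                   now: datetime) -> list[dict]:
    """Return the records that are older than the retention period."""
    cutoff = now - retention
    return [record for record in records if record["created_at"] < cutoff]

now = datetime(2024, 5, 20, tzinfo=timezone.utc)
records = [
    {"id": "a", "created_at": datetime(2022, 1, 15, tzinfo=timezone.utc)},
    {"id": "b", "created_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]

to_purge = past_retention(records, retention=timedelta(days=365), now=now)
print([record["id"] for record in to_purge])
```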

Frequency (of update) (Fy)

How often is your data updated? Daily? Weekly? Monthly? An indicator linked to this frequency is the time of availability, which applies well to daily batch updates.
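To check a frequency objective, you can look at the gaps between consecutive updates. The sketch below flags gaps longer than the expected interval; the helper name, the one-hour tolerance, and the timestamps are made up.

```python
from datetime import datetime, timedelta

def missed_updates(update_times: list[datetime], expected_interval: timedelta,
                   tolerance: timedelta = timedelta(0)) -> list[tuple[datetime, datetime]]:
    """Return consecutive update pairs whose gap exceeds the expected interval."""
    ordered = sorted(update_times)
    return [
        (previous, current)
        for previous, current in zip(ordered, ordered[1:])
        if current - previous > expected_interval + tolerance
    ]

# Daily batch expected at 06:00; the May 3 run never happened.
updates = [
    datetime(2024, 5, 1, 6, 0),
    datetime(2024, 5, 2, 6, 0),
    datetime(2024, 5, 4, 6, 0),
]

gaps = missed_updates(updates, timedelta(days=1), tolerance=timedelta(hours=1))
print(gaps)
```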

Latency (Ly)

Measures the time between the production of the data and its availability for consumption.
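In code, latency is the difference between two timestamps. Here is a minimal sketch, assuming you record when the data was produced and when it became available; the eight-hour objective is an arbitrary example.

```python
from datetime import datetime, timedelta

def latency(produced_at: datetime, available_at: datetime) -> timedelta:
    """Time between the production of the data and its availability."""
    if available_at < produced_at:
        raise ValueError("data cannot be available before it is produced")
    return available_at - produced_at

# Data produced just before midnight, available after the morning batch.
observed = latency(
    produced_at=datetime(2024, 5, 19, 23, 59),
    available_at=datetime(2024, 5, 20, 6, 0),
)

LATENCY_SLO = timedelta(hours=8)  # assumed objective
print(observed, observed <= LATENCY_SLO)
```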

Time to detect (an issue) (Td)

How fast can you detect a problem? Sometimes, a problem can be breaking, like your car not starting on a cold morning, or slow, like the data feeding your SEC (U.S. Securities and Exchange Commission) reporting being wrong for several months. How fast do you guarantee the detection of the problem? You can also see this service-level indicator called “failure detection time.”

Fun fact: squirrels (or similar rodents) ate the gas line on my wife’s car. We detected the problem as the gauge went down quickly, after just a few miles. How could we know if the car could even make it to the mechanic?

Time to notify (Tn)

Once you see a problem, how much time do you need to notify your users? This is, of course, assuming you know your users.

Time to repair (Tr)

How long does it take to fix the issue once it is detected? This is a very common metric for network operators running backbone-level fiber networks.
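The last three indicators can all be derived from the timestamps of a single incident. The sketch below follows the definitions above (notification and repair are both counted from detection); the `Incident` class and the sample times are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Incident:
    started_at: datetime   # when the problem actually began
    detected_at: datetime  # when it was noticed
    notified_at: datetime  # when users were told
    repaired_at: datetime  # when the fix was in place

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_notify(self) -> timedelta:
        return self.notified_at - self.detected_at

    @property
    def time_to_repair(self) -> timedelta:
        return self.repaired_at - self.detected_at

incident = Incident(
    started_at=datetime(2024, 5, 20, 2, 0),
    detected_at=datetime(2024, 5, 20, 2, 20),
    notified_at=datetime(2024, 5, 20, 2, 35),
    repaired_at=datetime(2024, 5, 20, 5, 20),
)

print(incident.time_to_detect, incident.time_to_notify, incident.time_to_repair)
```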

Taking it home

Of course, there will be many more service-level indicators over time. Agreements follow indicators; agreements can include penalties. As you can imagine, the service description can become very complex.

In Modern Data Engineering, those values are critical to establishing robust Data Contracts and Data Products.

Feature photo by Andrea Piacquadio.
