Prioritizing Development Efforts with SLOs in Microservices
As technology continues to play an increasingly vital role in businesses of all sizes, it becomes essential for leaders to have a clear understanding of how well their infrastructure is performing. However, measuring the effectiveness or resiliency of a company’s infrastructure can be a complex and challenging task. That’s where Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) come in. SLOs can help CTOs and other technology leaders define, measure, and improve the reliability and performance of their systems.
This article will explore what SLOs are, why they are important, and how SLOs can help businesses achieve service-level excellence.
Service Level Terminology
SLO, SLI, and SLA are terms often used interchangeably but refer to different concepts.
- SLO is a target that defines the level of service that a business wants to provide to its customers.
- SLI is a metric used to measure a service’s performance.
- SLA is an agreement between a company and its customers outlining the service level the business will provide.
What are SLIs, and what SLIs do we use at Picsart?
Service Level Indicators (SLIs) are the key metrics for measuring the quality of service you provide to your users. They are typically expressed as the proportion of successful outcomes, as a percentage. SLIs are defined in relation to SLOs, and they provide real-time signals about system reliability. For example, an SLI can measure the proportion of requests served faster than a threshold, or the proportion of records entering a pipeline that produce the correct output. Any metric that you consider important for measuring the performance of your application or service can serve as an SLI.
Currently, at Picsart, we measure SLIs using the following system characteristics:
- Latency: Measures the time a service or application takes to respond to a request or complete a task.
- Availability: Measures the percentage of time that a service or application is available and functioning as expected.
- Error Rate: Measures the percentage of requests that result in errors, such as 5xx HTTP response codes.
- Throughput: Measures the rate at which requests are processed or data is transferred.
- Resource Utilization: Measures the percentage of resources (such as CPU, memory, and disk) used by a service or application.
The SLIs we use are written into the Service Level Objective (SLO) queries. In other words, the SLIs are the measured numbers, and the SLOs are the results evaluated from them.
In order to calculate our availability SLOs, here is an SLI example that we use:
Here is a breakdown of this query, divided into three parts:
1. We take the sum of the total hits of a service, specify its trace type and environment, and convert it to a count.
2. We take the sum of the total errors of the service, specify its trace type and environment, convert it to a count, and subtract it from the value obtained in step 1.
3. As a final step, we divide the result of the previous two steps by the total hits of the service, again specifying the trace type and environment and converting to a count.
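The arithmetic behind the three parts above can be sketched in a few lines of Python. This is only an illustration of the calculation, not the monitoring query itself; the function name, its parameters, and the sample numbers are assumptions made for the example.

```python
def availability_sli(total_hits: int, total_errors: int) -> float:
    """Return availability as a percentage: (hits - errors) / hits * 100."""
    if total_hits == 0:
        # Assumption: with no traffic in the window, report full availability.
        return 100.0
    good_hits = total_hits - total_errors  # parts 1 and 2: hits minus errors
    return good_hits * 100 / total_hits    # part 3: divide by total hits


# Hypothetical numbers: 10,000 hits, 25 of which ended in an error.
print(availability_sli(total_hits=10_000, total_errors=25))
```

With these sample numbers, 9,975 of 10,000 hits are good, so the SLI is 99.75%.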
In order to calculate our Latency SLOs, here is an SLI example that we use:
Here is a breakdown of this query:
1. We take the total requests of a service, specify its trace type and environment, convert it to a count, and keep only the requests that completed within a threshold of ≤ 0.3 seconds. These are the “good” events.
2. We take the total requests of the service, specify its trace type and environment, and convert it to a count. Dividing the good events from step 1 by this total gives the SLI.
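The latency SLI described above is a ratio of "good" requests to all requests. A minimal Python sketch, assuming request durations are available as a list of seconds (the function name and the sample data are hypothetical):

```python
def latency_sli(durations_s: list[float], threshold_s: float = 0.3) -> float:
    """Return the percentage of requests that finished within the threshold."""
    if not durations_s:
        # Assumption: with no requests in the window, report 100%.
        return 100.0
    good = sum(1 for d in durations_s if d <= threshold_s)  # "good" events
    return good * 100 / len(durations_s)                    # good / total


# Hypothetical request durations in seconds.
sample = [0.12, 0.25, 0.31, 0.08, 0.30, 1.40, 0.22, 0.19]
print(latency_sli(sample))
```

In this sample, six of the eight requests finish within 0.3 seconds, so the SLI is 75%.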
A lenient SLI threshold means a higher risk tolerance, which suits services with no direct impact on users. Other services need stricter thresholds, where a violation indicates that the service is disrupted.
In this graph, the failed requests (or traces) clearly increase whenever the latency of the service increases, which indicates that service latency and errors are correlated. Users are heavily affected whenever service latency rises, because the time it takes to use the service or application rises with it.
As previously mentioned, SLIs are set differently for each service. Some services, such as internal tools, do not need strict SLIs, whereas user-impacting services need close monitoring.
Choosing SLIs that are relevant to your specific use case and align with your business objectives is important. You may also need to define specific thresholds or targets for each SLI to ensure you meet your performance goals.
What are SLOs, and how can businesses define their SLOs?
Service Level Objectives (SLOs) are your goals for your SLIs. They define the level of service you want to provide your users, and they help you measure whether or not you are meeting those goals. The service level objectives help teams collaborate on a shared meaning of “availability” and “uptime.”
For example, an online retailer may have an SLO that requires its website to be available 99.9% of the time. The SLI for this SLO could be the website’s uptime percentage, and the SLA would outline the compensation that the retailer would provide to its customers if it fails to meet its SLO.
Defining SLOs is the critical first step in achieving service-level excellence. SLOs should be based on the needs and expectations of your users/customers and your business goals. To define your SLOs, you should consider the following factors:
- Service Level Objective Type: There are several types of SLOs. Of those, I want to highlight availability SLOs and latency SLOs. Availability SLOs measure the percentage of time that a service is available, while latency SLOs measure the time it takes for a service to respond to a request.
- Service Level Objective Value: The value of your SLO should be based on the needs and expectations of your customers. For example, if your customers expect your service to be available 99.9% of the time, your SLO should reflect this expectation.
- Service Level Objective Window: The window for your SLO should be based on your customers’ needs. For example, measuring during business hours could better serve customers in a specific region. Consider factors like location and time zone when setting the SLO window to ensure your service delivers the expected value.
- Service Level Objective Thresholds: You should set SLI thresholds based on your SLOs. If your SLIs fall below these thresholds, you should take corrective action to ensure that you meet your SLOs. SLO thresholds are the acceptable levels of performance for a given metric or service. SLIs, on the other hand, are the specific measurements used to determine if you are meeting your SLOs.
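The factors above can be captured in a small data structure. The sketch below is one possible shape, not a standard schema; the class name, field names, and the sample SLO are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str          # hypothetical identifier for the objective
    slo_type: str      # "availability" or "latency"
    target_pct: float  # SLO value, e.g. 99.9
    window_days: int   # SLO window, e.g. 30

    def is_met(self, sli_pct: float) -> bool:
        """True when the measured SLI is at or above the target."""
        return sli_pct >= self.target_pct


checkout = Slo("checkout-availability", "availability", 99.9, 30)
print(checkout.is_met(99.95))  # True
print(checkout.is_met(99.75))  # False
```

Keeping type, value, and window together makes it harder to quote a target such as 99.9% without saying what it measures and over which period.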
By defining SLOs, you can set goals for the level of service that you want to provide your customers/users and measure whether or not you are meeting those goals. You should choose SLOs that are realistic and achievable and that provide a good user experience.
If you are not meeting your SLOs, you can use the data from your SLIs to identify the areas that need improvement.
What are Service Level Agreements (SLAs)?
Service Level Agreements (SLAs) are the contracts you make with your users/customers regarding the level of service that you guarantee to provide. SLAs are typically defined as a percentage of time that your SLIs should meet a certain threshold, and they provide a way to ensure that your users/customers are getting the level of service they expect. SLAs typically include penalties or credits if you fail to meet your SLOs. For example, if you have an SLA that says your website should have a response time of less than 500 milliseconds for 99.9% of user requests, and you fail to meet that SLO, you might offer your users/customers a credit for their next purchase. An easy way to tell the difference between an SLO and an SLA is to ask, “What happens if the SLOs aren’t met?” If the answer is an explicit consequence, such as a penalty or a credit, you are looking at an SLA.
Why are SLO, SLI, and SLA important?
SLO, SLI, and SLA are important because they help businesses ensure the quality of their services. By setting SLOs and measuring their SLIs, companies can monitor the performance of their services and identify areas that need improvement. On the other hand, SLAs provide a contractual agreement between a business and its customers, assuring customers/users that they will receive the level of service they expect.
Generally, SLOs are important because they:
- Improve software quality. SLOs help teams define an acceptable level of downtime for a service or a particular issue. SLOs can highlight problems that fall short of a full-blown incident but don’t fully meet expectations. Achieving 100% reliability isn’t always realistic, so using SLOs can help you figure out the balance between innovating (which could result in downtime) and delivering (which ensures users are happy).
- Help with decision-making. SLOs can be an excellent way for DevOps and infrastructure teams to use data and performance expectations to decide whether to release and where engineers should focus their time.
- Promote automation. Stable, well-calibrated SLOs pave the way for teams to automate more processes and testing throughout the software delivery life cycle (SDLC). With reliable SLOs, you can set up automation to monitor and measure SLIs and set alerts if specific indicators are trending toward violation. This consistency enables teams to calibrate performance during development and detect issues before SLOs are violated.
- Avoid downtime. Software will inevitably break. SLOs allow DevOps teams to predict problems before they occur, and especially before they impact customers. By shifting production-level SLOs left into development, you can design apps to meet production SLOs and increase resilience and reliability well before any downtime occurs. This trains teams to maintain software quality proactively and saves money by avoiding downtime.
The visibility and control over the infrastructure will give peace of mind to the business.
Who defines SLOs?
Setting up Service Level Objectives (SLOs) is crucial for ensuring the reliability and availability of a service. However, the question arises about who should define the SLOs, and what factors should be considered while setting them up. While the technical team may have the expertise to define SLOs based on system performance, it is essential to take into account the business goals and metrics to ensure that the SLOs align with the overall business strategy.
Defining SLOs based on business goals and metrics involves a collaborative effort between the technical and business teams. Initially, realistic metrics can be set up based on the current status of the service. The technical team can provide insights into system performance, while the business team can provide insights into the impact of the service on the business. It is important to understand the impact of engineering improvements on the business, the return on investment, and how much to invest in these improvements.
Questions such as how secure, how reliable, and how fast the web app or mobile application should be need to be answered based on research and historical data. The business team can help clarify how many new users an improvement brings in, how much churn increases after each incident, and how much a loss of reputation costs.
SLOs based on business goals and metrics are crucial for aligning technical goals with business strategy. It involves a collaborative effort between the technical and business teams to ensure that the SLOs are realistic, achievable, and aligned with overall business goals.
In the context of Service Level Objectives (SLOs), it is important to explain the error budget concept clearly and concisely, as it is a fundamental concept underpinning the entire SLO framework.
What is an error budget, and how do we use it?
An error budget is the number of acceptable errors or downtime within a given time frame before breaching the SLO. It is a way of balancing reliability and innovation by allowing for some level of service disruption while maintaining an acceptable reliability level. Once the error budget has been established, it can guide decision-making and prioritize engineering efforts. If the error budget is being used up too quickly, it may be necessary to focus on improving reliability. On the other hand, if the error budget is not being used up, it may be possible to prioritize innovation and new feature development.
The error budget is calculated by subtracting the SLO target from 100%, which gives the percentage of acceptable errors or downtime. For example, if an SLO has a monthly target of 99.9% uptime, the error budget would be 0.1%.
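To make the calculation concrete, here is a short Python sketch that converts an SLO target into an error budget and then into allowed downtime. The function names and the 30-day window are assumptions chosen for the example.

```python
def error_budget_pct(slo_target_pct: float) -> float:
    """Error budget as a percentage: 100% minus the SLO target."""
    return 100.0 - slo_target_pct


def allowed_downtime_minutes(slo_target_pct: float, window_days: int) -> float:
    """Translate the error budget into minutes of downtime for the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_pct(slo_target_pct) / 100


# A 99.9% uptime target over an assumed 30-day month.
print(round(error_budget_pct(99.9), 3))              # 0.1
print(round(allowed_downtime_minutes(99.9, 30), 1))  # 43.2
```

A 0.1% budget over 30 days therefore allows roughly 43 minutes of downtime before the SLO is breached.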
Managing an error budget requires ongoing monitoring and analysis of system performance. This can be done using metrics and monitoring tools that track service uptime, latency, and other key performance indicators. By understanding how much error budget is available, teams can make informed decisions about allocating resources and prioritizing work.
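One simple way to turn this into a decision is to compare how much of the budget has been spent with how much of the window has elapsed. The rule below is an illustrative assumption rather than a prescribed policy, and the function names and sample numbers are hypothetical.

```python
def budget_consumed(downtime_minutes: float, budget_minutes: float) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    return downtime_minutes / budget_minutes


def suggest_focus(downtime_minutes: float, budget_minutes: float,
                  elapsed_days: int, window_days: int) -> str:
    """Suggest where to spend engineering time for the rest of the window."""
    consumed = budget_consumed(downtime_minutes, budget_minutes)
    elapsed = elapsed_days / window_days
    # Assumed rule: if the budget burns faster than time passes, favor reliability.
    return "reliability" if consumed > elapsed else "features"


# Halfway through a 30-day window with a 43.2-minute budget.
print(suggest_focus(30.0, 43.2, elapsed_days=15, window_days=30))  # reliability
print(suggest_focus(5.0, 43.2, elapsed_days=15, window_days=30))   # features
```

In the first case about 69% of the budget is gone with only half of the window elapsed, so reliability work takes priority. In the second case there is room for feature work.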
Now we have reached the most important part of this article. I want to talk about the importance of SLOs in microservices. SLOs play a crucial role in the success of microservices. Microservices are a software architecture pattern that involves breaking down large applications into smaller, independent services that can be developed, deployed, and scaled independently.
Why are SLOs crucial for microservices?
Microservices have become increasingly popular as organizations seek to create more flexible and scalable applications. However, with this approach comes an increased complexity in managing the many services that make up the application. Service Level Objectives (SLOs) are critical in ensuring that each microservice provides the expected level of service to the larger application.
In a microservices architecture, each service is responsible for a specific function and communicates with other services to provide the overall functionality of the application. With many services working together, it can be challenging to identify and isolate the root cause of issues. This is where SLOs come in.
SLOs help define the expected level of service that each microservice should provide. By setting SLOs for each service, you can identify and isolate issues more quickly, reducing the impact on the overall application. SLOs also provide a way to measure the performance of each service, allowing you to identify areas that need improvement. One of the key benefits of SLOs in microservices is that they provide a common language for communication between teams. Each team responsible for a microservice can define its own SLOs, but they must also work with other teams to ensure that the service they provide meets the expectations of the larger application.
SLOs also help to promote a culture of accountability and transparency. Each team is responsible for meeting the SLOs that they have defined, and if they fail to do so, they must take ownership of the issue and work to resolve it. SLOs provide a way to measure performance and hold teams accountable for their services.
Another benefit of SLOs in microservices is that they provide a way to prioritize development efforts. By measuring the performance of each service against its SLOs, you can identify which services are performing well and which ones need improvement. This allows you to prioritize development efforts and focus on improving the most critical services to the overall application.
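As a rough sketch of this prioritization, services can be ranked by how much of their error budget remains, so that the ones furthest from their SLO come first. The service names and numbers below are hypothetical, and ranking by remaining budget is one possible approach among several.

```python
def remaining_budget_pct(sli_pct: float, slo_target_pct: float) -> float:
    """Share of the error budget still unspent, in percent.

    A negative value means the SLO is already breached.
    """
    budget = 100 - slo_target_pct
    spent = 100 - sli_pct
    return (budget - spent) / budget * 100


# Hypothetical services: (measured SLI %, SLO target %).
services = {
    "checkout": (99.80, 99.9),
    "search": (99.95, 99.9),
    "internal-admin": (99.20, 99.0),
}

# Least remaining budget first: these services need attention soonest.
ranked = sorted(services, key=lambda name: remaining_budget_pct(*services[name]))
print(ranked)  # ['checkout', 'internal-admin', 'search']
```

Here "checkout" has overspent its budget and comes first, while "search" still has about half of its budget left and can wait.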
In summary, SLOs are crucial for microservices because they:
- Define the expected level of service for each microservice
- Identify and isolate issues more quickly
- Provide a common language for communication between teams
- Promote accountability and transparency
- Prioritize development efforts
To implement SLOs in microservices, you should start by identifying the SLIs relevant to each service. You can then define the SLOs for each SLI and use monitoring tools to measure performance. By continually monitoring and adjusting SLOs as needed, you can ensure that each microservice provides the expected level of service and contributes to the success of your overall application.
Measuring SLIs and SLOs is critical to ensuring that each microservice provides the expected level of service. This is typically done through monitoring tools that measure the relevant metrics and provide alerts when performance falls below the SLO thresholds.
How to implement monitoring with SLOs and microservices?
Monitoring and alerting for microservices are essential because a single failing microservice can impact the overall system’s performance. In a microservices architecture, an application is built as a collection of loosely coupled services that communicate with each other over a network. Each service typically has a specific responsibility and is designed to be independently deployable, scalable, and replaceable.
There are various monitoring tools available for microservices. At Picsart, we use Datadog for Application Performance Monitoring (APM) for each microservice. However, Prometheus can also be used as a data source to collect metrics from all microservices.
APM is a type of monitoring that collects all relevant metrics, such as HTTP status codes, to provide deep visibility into your applications. You can monitor requests, errors, and latency with out-of-the-box performance dashboards for web services, queues, and databases. Distributed traces can also seamlessly correlate to browser sessions, logs, profiles, synthetic checks, network, processes, and infrastructure metrics across hosts, containers, proxies, and serverless functions.
It’s important to note that microservices’ status codes should not be confused with status codes used for web servers, CDNs, or DNS providers. “Service level” means the status of the microservice running in our environment, which in our case is Kubernetes. The status code of a microservice refers to the current state or health of the microservice itself. This could include information such as whether the microservice is running, available, or experiencing any issues.
After instrumenting the code, we can collect all metrics and traces related to the microservice. Based on this, we can define Service Level Indicators (SLIs) that turn into Service Level Objectives (SLOs). There is an SLO section within Datadog for convenience, where we have created our SLOs based on the APM traces.
Once we have the metrics from APM monitoring and have defined SLOs, we need to set up our alerting based on that data. By analyzing the metrics, we can configure alerts to notify the responsible team if a microservice fails to meet its SLOs. This way, the team can proactively detect the issue and take corrective action before it becomes critical. The only problem that we need to tackle is alert flaps. Alert flaps occur when the monitor triggers and then quickly resolves because of a short error peak. To handle this, we added the min() function, which takes the minimum error rate during a given period, so a brief spike on its own no longer triggers the monitor. This way, alert flaps are avoided.
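The idea behind using a minimum to avoid flapping can be illustrated in plain Python. This is a sketch of the principle only and is not Datadog monitor syntax; the function name, the 5% threshold, and the sample error rates are assumptions.

```python
def should_alert(error_rates_pct: list[float], threshold_pct: float) -> bool:
    """Alert only if the error rate stayed above the threshold for the whole window.

    Taking the minimum over the window means one short spike cannot trigger
    the alert: every sample in the window has to exceed the threshold.
    """
    if not error_rates_pct:
        return False  # assumption: no data means no alert
    return min(error_rates_pct) > threshold_pct


# Hypothetical per-minute error rates (%) over a five-minute window.
spike = [0.2, 0.3, 7.5, 0.4, 0.2]      # one brief peak
sustained = [6.1, 7.5, 8.0, 6.8, 7.2]  # consistently high

print(should_alert(spike, threshold_pct=5.0))      # False
print(should_alert(sustained, threshold_pct=5.0))  # True
```

The trade-off is that a real problem is reported only after it has lasted for the full evaluation window, so the window length should match how quickly the team needs to know.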
Of course, we also need to highlight the role of incident management in this flow. Incident management plays a critical role in maintaining the reliability and availability of a service. It involves detecting, responding to, and resolving incidents that may impact the service. Effective incident management helps minimize downtime, reduce user impact, and maintain the reputation of the service.
Incident management is closely related to SLOs, SLIs, and monitoring. Monitoring helps track the SLIs and provides visibility into the service’s performance. In case of an incident, incident management helps restore the service to the acceptable level defined by the SLOs. Another advantage of incident management is the creation and strengthening of the “Service Ownership” concept, which denotes the responsible people of the service, and hence the incident.
Incident management is a vast and complex topic that warrants a comprehensive article of its own. There are various factors related to incident management processes, best practices, and tools that can facilitate incident management. Additionally, a significant aspect of incident management pertains to its close relationship with SLOs, SLIs, and monitoring. This relationship plays a crucial role in ensuring that the service meets the defined SLOs. I will dedicate my next article to it, so stay tuned.
Conclusion
Integrating APM, SLOs, and Incident Management is a powerful way for organizations to optimize their operations and provide better service to their users. By automating issue identification and resolution, teams can quickly escalate incidents, minimize downtime, and improve service ownership processes. This integration enables teams to proactively identify potential issues and improve the overall user experience by establishing clear performance expectations. With greater visibility and control over their infrastructure, management and CTOs can make informed decisions based on real-time data and insights, leading to increased operational efficiency and a better user experience. By having a clear and concise reporting system in place, teams can communicate effectively with management and other stakeholders about the performance of their infrastructure and the status of ongoing incidents. Ultimately, this integration provides numerous benefits to organizations, including reduced downtime, enhanced user experience, and increased operational efficiency. After implementing the correct metrics for SLO/SLI/SLA, we are pleased to report that customer experience and satisfaction have improved, our average uptime has reached 99.99%, and the incident count has decreased by 30%.