Managing Your Margins for Business Success via Error Budgets

Error Budgets Matter More Than Ever with Distributed Systems

Amit Sharma
Engineered @ Publicis Sapient
7 min read · Apr 14, 2023


Your organization has a business product made up of hundreds of microservices and jobs, and you have been running it in production for some time now. Great work!

But you have found that many of these services keep failing in production, either because of frequent updates from development teams or because of an underlying code issue that was not caught in test cycles and only surfaces occasionally in a specific production scenario. This makes you unhappy: your product is not only unreliable, but you are also burning extra effort and money to fix issues after the fact. On top of that, you are under pressure to deliver more customer features on this product.

This is when you realize that you need to balance innovation and stability. You can follow the likes of Google, Netflix, or Uber, which use error budgets to manage the reliability of their platforms, or you can end up like the airlines and retail stores whose massive outages damaged their reputations within minutes.

Error budgets can improve your release process, which in turn improves the reliability of your product and gives your customers a better experience. They also help you decide whether new features can be launched in production because the applications are currently running with few or no errors, or whether all launches need to be frozen until the number of errors falls back within the acceptable error budget.

The ROI of error budgets comes from helping businesses avoid costly downtime and lost revenue. By setting and tracking error budgets, companies can ensure that their systems are reliable and available to customers. This can lead to increased customer satisfaction and loyalty, which can ultimately drive revenue growth. Additionally, preventing downtime and outages helps businesses avoid the costs associated with fixing issues and responding to customer complaints.

The Key to Building Resilient Systems

Site Reliability Engineering (SRE) and Error Budgets are concepts pioneered by Google. By definition, Site Reliability describes the stability and quality of service that an application provides to the end user. SRE improves collaboration between development and operations teams. Developers often make rapid changes to an application to release new features or fix critical bugs. On the other hand, the operations team must ensure seamless service delivery. Hence, the operations team uses SRE practices to closely monitor every update and promptly respond to any issues that arise due to changes.

Having said that, no product or application can be 100% reliable. Errors and failures are inevitable, and Error Budgets are the basis for measuring the reliability, throughput, and performance of an application. An Error Budget is the amount of downtime that a system may incur without violating any contractual terms in the SLA.

For example, if your Service Level Agreement states that your system will be available 99% of the time, then the Error Budget for your system’s failure will be 1% of the time. If the application downtime exceeds the Error Budget, the development team devotes all resources and attention to stabilizing the application first.
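To make the 99% example concrete, here is a minimal sketch that converts an availability target into an allowed amount of downtime. The function name `allowed_downtime` and the 30-day period are assumptions chosen for illustration; your SLA may define a different measurement period.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Return the downtime that an availability target leaves as Error Budget.

    slo_percent is the availability target (for example 99.0) and
    period is the window the target is measured over (assumed, not
    taken from any specific SLA).
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    budget_fraction = 1 - slo_percent / 100  # 99% available -> 1% budget
    return period * budget_fraction


if __name__ == "__main__":
    budget = allowed_downtime(99.0, timedelta(days=30))
    print(f"Allowed downtime in 30 days: {budget.total_seconds() / 3600:.1f} hours")
```

Under these assumptions, a 99% target over 30 days leaves roughly 7.2 hours of downtime as the budget.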

Error Budgets 101: Understanding and Implementing

Let’s get into a little more detail.

Below are three key SRE terms that you need in order to calculate the Error Budget:

SLI: The Service Level Indicator is the measurement the service provider uses to track progress against the SLO. It is a quantitative measure of some aspect of the service level, such as latency, availability, or error rate.

SLO: The Service Level Objective is a goal that a service provider wants a component to reach. It is a target value, or a range of values, for a service level that is measured by an SLI.

SLA: The Service Level Agreement is a contract with customers that includes consequences for meeting or missing the SLOs defined in the agreement.

The formula for calculating SLI:

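SLI = (Good Events / Valid Events) * 100

When the SLI tracks failures instead, such as an error rate, Bad Events replaces Good Events in the numerator.

A 99% SLO corresponds to an Error Budget of 1%.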
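The formula above can be sketched in a few lines of Python. The function names and the event counts in the example are hypothetical, chosen only to illustrate the arithmetic.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI as the percentage of valid events that were good."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return 100 * good_events / valid_events


def error_budget_percent(slo_percent: float) -> float:
    """Error Budget as the share of events that the SLO allows to fail."""
    return 100 - slo_percent


if __name__ == "__main__":
    # Hypothetical counts: 9,950 good events out of 10,000 valid events.
    sli = sli_percent(good_events=9_950, valid_events=10_000)
    budget = error_budget_percent(slo_percent=99.0)
    print(f"SLI = {sli}%, Error Budget = {budget}%")
    print("SLO met" if sli >= 99.0 else "SLO missed")
```

Passing the count of bad events as the numerator gives the error-rate form of the same indicator.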

Here is an example:

Suppose you have a real-time application that processes customer orders and updates the backend system as they arrive. This job is critical to the business: reports are refreshed every minute to display the total number of orders, and the Store Manager reviews them every hour. You want to measure the availability of this job and calculate an Error Budget for it to make sure it is up and running 99% of the time. So, in this case:

SLO: 99%, i.e., this job needs to be available 99% of the time

SLI: Record-processing error rate below 1% in every 5-minute window, i.e., (total number of failed orders / total number of orders) * 100 should be less than 1% when measured every 5 minutes

Timeframe: The calculation covers the last 30 days
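As a rough illustration of the SLI above, the sketch below classifies a single 5-minute window as good or bad based on its order counts. The `Window` type, the function names, and the sample numbers are all hypothetical. Treating a window with no orders as good is an assumption you may want to revisit for your own job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Order counts observed in one 5-minute measurement window."""
    total_orders: int
    failed_orders: int


def error_rate_percent(window: Window) -> float:
    """Failed orders as a percentage of all orders in the window."""
    if window.total_orders == 0:
        return 0.0  # assumption: an idle window counts as error-free
    return 100 * window.failed_orders / window.total_orders


def is_good(window: Window, threshold_percent: float = 1.0) -> bool:
    """A window is good when its error rate is strictly below the threshold."""
    return error_rate_percent(window) < threshold_percent


if __name__ == "__main__":
    samples = [Window(500, 2), Window(500, 5), Window(0, 0)]
    for w in samples:
        print(w, f"error rate = {error_rate_percent(w)}%", "good" if is_good(w) else "bad")
```

Each bad window is one failure counted against the Error Budget for the 30-day timeframe.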

Based on these values, the Error Budget for this application can now be calculated.

Calculations

Number of measurements per day when the error rate is calculated every 5 minutes = 288 instances (Formula: 24 hrs * 60 min / 5 min)

Number of measurements in 30 days = 8,640 instances (Formula: 288 instances * 30 days)

SLO = 99%

Error Budget = 0.01 per instance (Formula: 1 - SLO/100), i.e., the Error Budget for an application measured every 5 minutes for 30 days = 86.4 instances (8,640 * 0.01)

In other words, of the 8,640 five-minute measurements taken in 30 days, only about 86 are allowed to fail. After 86 failures, the entire Error Budget is considered consumed, and the development team should devote all resources and attention to stabilizing the application. No new features for this application should be deployed to production while the budget is exhausted.

Measuring Success and Managing Priorities with Error Budgets

Release velocity is controlled by ensuring that the SLOs are met before a new release is planned. By measuring how much of the Error Budget has been consumed, you can control the rate of deployment, slowing down or speeding up according to the budget that remains. In the example above, the application has an availability SLO of 99%, so its Error Budget comes to about 86 failed measurements per 30 days. This means that new features for this application can keep being deployed as long as it does not fail more than 86 times in production during that period.
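The gating rule described above can be sketched as follows, assuming 5-minute measurement windows over 30 days and a 99% SLO. The function names and the failure count used in the example are hypothetical.

```python
def measurement_windows(days: int, interval_minutes: int) -> int:
    """Number of fixed-size measurement windows in the reporting period."""
    return days * 24 * 60 // interval_minutes


def error_budget_windows(slo_percent: float, total_windows: int) -> float:
    """Windows allowed to fail: the (100 - SLO)% share of all windows."""
    return total_windows * (100 - slo_percent) / 100


def remaining_budget(failed_windows: int, slo_percent: float, total_windows: int) -> float:
    """Budget left after subtracting the windows that have already failed."""
    return error_budget_windows(slo_percent, total_windows) - failed_windows


def can_release(failed_windows: int, slo_percent: float, total_windows: int) -> bool:
    """Allow new releases only while the Error Budget is not exhausted."""
    return remaining_budget(failed_windows, slo_percent, total_windows) >= 0


if __name__ == "__main__":
    total = measurement_windows(days=30, interval_minutes=5)
    budget = error_budget_windows(slo_percent=99.0, total_windows=total)
    failed = 40  # hypothetical number of failed windows so far
    print(f"windows = {total}, budget = {budget}, failed = {failed}")
    print("release allowed" if can_release(failed, 99.0, total) else "release frozen")
```

For a 30-day period measured every 5 minutes this gives 8,640 windows and a budget of 86.4 failed windows, so releases stay open through the 86th failure and freeze on the 87th.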

The SRE team needs to understand the product requirements and what the developers expect the SRE engagement to achieve. Both teams must strike the right balance, so that developers don’t just keep shipping new features or changes that make applications highly unreliable in production and exceed the Error Budget. Likewise, when engaging with a developer team, SREs should build a deep understanding of the product and business goals, so that new changes don’t keep piling up in the backlog.

Teams need to regularly talk with each other about business and production priorities. The SRE and developer leadership teams should ideally work as a unit, meeting regularly and exchanging views about technical and prioritization challenges.

Developers also need to commit to dedicating a reasonable percentage of engineering time to fixing and preventing the things that are breaking reliability: resolving ongoing service issues at the design and implementation level and including SREs in new feature development early, so that they can participate in design conversations.

Error Budget can also be adjusted based on the impact on service reliability, to make sure that a new feature or a change is released as quickly as possible, but within Error Budget limits.

Ways to monitor applications and report Error Budget

An Error Budget can be applied to any type of application, such as web, mobile, or data applications, whether it runs in real time, in batch, or exposes API interfaces. Many tools are available on the market to monitor your application and store the metrics data needed to calculate the Error Budget.

[Figure: Example architecture]

Conclusion

Monitoring effectively and defining an Error Budget for your application are just as important as making good technical design decisions for the application. Once defined, an Error Budget helps you to:

Eliminate risk: Working within Error Budgets can eliminate unnecessary risks and give your teams a threshold for reliability when pushing new features or changes to production

Build accountability: With SLOs in place, ownership is shared between the development, operations, and product teams

Prioritize: With an Error Budget, you can focus and prioritize your efforts in the right direction

Error Budgets provide a tangible way for business leaders to measure and manage system reliability. By prioritizing reliability, businesses can reduce costs, improve customer satisfaction, and ultimately drive growth.

Contributors:

Sandeep Bhanot: Senior Specialist Engineering

Amit Sharma: Head of Engineering

Amit Sharma
Engineered @ Publicis Sapient

Certified Cloud Architect and an Expert with 18 yrs. of experience in the design and delivery of cloud-native, cost-effective, high-performance DBT solutions.