Managing Site Reliability Engineering

Vince Sacks Chen
2 min read · Dec 26, 2023

--

As software adoption scales, infrastructure needs to scale with it, and so does the ability to monitor usage and mitigate a variety of issues quickly.

At Uber, where the user base is global and the marketplace is extremely competitive (unreliable functionality leads to a suboptimal user experience, which leads to churn to competitors), site reliability is taken seriously, and responsibility for it is delegated across product teams and platform teams alike to optimize for speed.

👋

To deliver a reliable user experience, most teams need to maintain an SLA (service level agreement) of a 99.9% server-side API call success rate (roughly equivalent to at most 1.44 minutes of downtime per day) across a team’s external (user-facing) and internal (consumed by other APIs) endpoints.
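As a back-of-the-envelope check, the error budget math is simple. A minimal sketch (the 99.9% figure comes from the SLA above; equating the 0.1% budget with daily downtime assumes failures are spread evenly):

```python
# Back-of-the-envelope error budget for a 99.9% SLA.
# Assumption: the 0.1% allowance is expressed as equivalent downtime per day.
SLA = 0.999
MINUTES_PER_DAY = 24 * 60  # 1440

error_budget = 1 - SLA  # 0.001, i.e. 0.1% of requests may fail
downtime_minutes_per_day = MINUTES_PER_DAY * error_budget

print(f"Allowed failure budget: {error_budget:.1%}")
print(f"Equivalent downtime: {downtime_minutes_per_day:.2f} minutes/day")  # 1.44
```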

In addition, teams need to periodically assess the user impact of out-of-SLA endpoints, mitigate the underlying issues, and decide whether to escalate an instance to a postmortem review, which requires a formal report and the participation of the entire engineering organization, often including senior leaders.

💪

Every week, as API statistics (HTTP response codes in the 2XX, 3XX, 4XX, and 5XX ranges) are collected, teams use an internal dashboard to tabulate server-side request success rates. Green is in-SLA; red is out-of-SLA.
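A minimal sketch of that kind of tabulation, under my own assumptions (only 5XX responses count against the SLA, and the endpoints and counts are made up for illustration):

```python
# Hypothetical weekly counts per endpoint, bucketed by HTTP response class.
# Assumption: only 5XX responses count as server-side failures against the SLA.
SLA = 0.999

weekly_counts = {
    "GET /v1/orders":  {"2XX": 998_900, "3XX": 200, "4XX": 400, "5XX": 500},
    "POST /v1/orders": {"2XX": 498_500, "3XX": 100, "4XX": 400, "5XX": 1_000},
}

for endpoint, counts in weekly_counts.items():
    total = sum(counts.values())
    success_rate = 1 - counts["5XX"] / total
    status = "green (in-SLA)" if success_rate >= SLA else "red (out-of-SLA)"
    print(f"{endpoint}: {success_rate:.4%} -> {status}")
```

Whether 4XX responses should also count against the SLA is a policy choice per endpoint; the sketch above treats them as client errors that don't.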

When an endpoint dips below the SLA, to assess user impact, we first need to understand what an impacted user experiences and then quantify the aggregate impact. For example, “Affected users see a slower page load; 5% of 10,000 inbound requests per minute are affected.”
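Using the example numbers above (and assuming roughly steady traffic, which is my simplification), the aggregate impact works out like this:

```python
# Rough user-impact estimate for the example above.
# Assumption: traffic is roughly steady at 10,000 requests/minute.
requests_per_minute = 10_000
affected_fraction = 0.05  # 5% of inbound requests hit the degraded path

affected_per_minute = requests_per_minute * affected_fraction
affected_per_hour = affected_per_minute * 60

print(f"~{affected_per_minute:,.0f} affected requests/minute")  # ~500
print(f"~{affected_per_hour:,.0f} affected requests/hour")      # ~30,000
```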

There’s a weekly cadence for managers across the entire engineering organization to review those violations in a collaborative document.

For analysis, we need tools to trace errors. At Uber, we use uMonitor to spot erroneous calls and drill into them with Jaeger to find where in the call stack an error is thrown. To see violations in the aggregate, we use Kibana to view the distribution of API request results by response code.
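uMonitor is an Uber-internal tool, but the Kibana view boils down to an aggregation over the underlying Elasticsearch index. Here’s a hedged sketch using the official Elasticsearch Python client; the index name ("api-logs-*") and field name ("http.status_code") are placeholders I’m assuming, not the actual schema:

```python
# Sketch: distribution of API request results by response code,
# similar to what the Kibana view described above shows.
# "api-logs-*" and "http.status_code" are assumed placeholder names.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="api-logs-*",
    size=0,  # we only want the aggregation, not individual documents
    aggs={
        "by_status_code": {
            "terms": {"field": "http.status_code", "size": 20}
        }
    },
)

for bucket in resp["aggregations"]["by_status_code"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```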

🙌

In general, there are several ways in which “errors” can escalate into an incident: the out-of-SLA (OOSLA) review, on-call reports (bugs caught by the on-call engineer), and QA bugs. Whereas on-call reports and QA bugs uncover individual occurrences of a problem, the OOSLA review uncovers a broader view of it. The rule of thumb is that when an “error” is nontrivial and a fix is not immediately attainable, a postmortem report is warranted.

These sources often correlate and can be cross-referenced. Through the OOSLA reviews, my team has uncovered and fixed a variety of unforeseen issues, e.g., infrastructure configuration, regional settings, third-party API issues, and internal platform issues.

--

Vince Sacks Chen

Software Engineering Manager, previously at Uber, Veeva, Fevo.