The Process for Engineering Excellence

Vince Sacks Chen
3 min read · Jan 1, 2024

In a large-scale, interconnected software system, millions of log lines are produced every day; many are signals of problems of varying severity, while others are noise. Increasing the signal-to-noise ratio gives engineers, first and foremost, a good lens through which to understand the system’s behavior and, secondly, the ability to quickly spot a problem, assess its severity, and come up with a mitigation or solution.

In a complex system, this seems like a Sisyphean task: new services are constantly created and new logs are added, which often overlap with and sometimes contradict existing logs. The endeavor of effectively controlling this entropy of information, and thereby maintaining and improving engineering observability, is what I call Engineering Excellence (EE henceforth).

EE cannot be replaced by product analytics like page views and conversion funnels, as those only present the tip of the iceberg, underneath which lies a much deeper and wider range of complexity. Therefore, every team (product and platform teams alike) should dedicate a sizable portion of its time to this practice, through which it can continually learn about and improve a system’s infrastructure, architecture, and implementation.

👋

As an engineering manager at Uber, I took both a bottom-up and a top-down approach to EE from a process standpoint. First, review all the signals collected from the on-call shift (with PagerDuty) and use them to triangulate potential problems. Second, review all the engineering analytics (with Grafana) to spot changing trends, whether driven by user behavior or by internal causes.

All insights gleaned from incident mitigation and observability tuning eventually lead to platform investments, which are internally focused engineering projects.

💪

The on-call review meeting is commonly led by the engineer who is on call for the week and attended by the whole team, with topics covering:

  1. Incident review follow-up: should an incident report be opened? Are outstanding incident-related tickets being worked on?
  2. Endpoint out-of-SLA mitigation: are any endpoints below a 99.9% reliability rate? If so, what are the root causes? (Refer to this article; see also the sketch after this list.)
  3. Regression testing follow-up: has QA or automation caught any regression bugs? Are they being worked on?
  4. On-call alerts follow-up: ensure every alert is annotated; does every alert have a written runbook for mitigation? Should an alert be eliminated, reduced, or ignored?
  5. Inbound support follow-up: are support questions being answered within the service-level agreement (SLA)? Are there questions that would be better opened as tickets?
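
To make item 2 concrete, here is a minimal Python sketch of how an on-call engineer might flag endpoints that fell below the 99.9% reliability target for the week. The names and numbers are hypothetical for illustration, not Uber’s actual tooling or data:

```python
from dataclasses import dataclass

SLA_TARGET = 0.999  # 99.9% reliability target used in the review


@dataclass
class EndpointStats:
    name: str
    total_requests: int
    failed_requests: int  # e.g., 5xx responses or unhandled exceptions

    @property
    def reliability(self) -> float:
        if self.total_requests == 0:
            return 1.0  # no traffic, nothing to flag
        return 1 - self.failed_requests / self.total_requests


def out_of_sla(endpoints: list[EndpointStats], target: float = SLA_TARGET) -> list[EndpointStats]:
    """Return endpoints whose weekly reliability fell below the target."""
    return [e for e in endpoints if e.reliability < target]


if __name__ == "__main__":
    weekly = [
        EndpointStats("getRiderProfile", 1_200_000, 800),  # 99.93% - within SLA
        EndpointStats("createOrder", 950_000, 1_400),      # 99.85% - below SLA
    ]
    for e in out_of_sla(weekly):
        print(f"{e.name}: {e.reliability:.4%} < {SLA_TARGET:.1%}, needs a root-cause review")
```

Any endpoint the script surfaces becomes an agenda item: find the root cause, open a ticket, and track the fix in the next review.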

The Eng Excellence review covers, but is not limited to, endpoint latencies (in milliseconds), endpoint traffic profiles (in requests per second), exception statistics, progression funnel health, and so on. It is standard practice to instrument user interaction points with metrics and have them emit signals to the analytics pipeline for data visualization. At an aggregate level, teams should review these metrics for abnormalities.
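
As an illustration of that instrumentation pattern, here is a minimal Python sketch of a handler decorator that emits latency and exception counters toward a metrics pipeline. The MetricsClient and metric names are placeholders I made up for this example; in practice you would use your team’s metrics library (a StatsD or Prometheus client, for instance):

```python
import time
from functools import wraps


class MetricsClient:
    """Stand-in for a real metrics client; prints instead of emitting."""

    def timing(self, name: str, ms: float) -> None:
        print(f"timing  {name}={ms:.1f}ms")

    def increment(self, name: str) -> None:
        print(f"counter {name}+1")


metrics = MetricsClient()


def instrumented(endpoint: str):
    """Record latency, request counts, and exception counts for a handler."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                metrics.increment(f"{endpoint}.requests")
                return result
            except Exception:
                metrics.increment(f"{endpoint}.exceptions")
                raise
            finally:
                metrics.timing(f"{endpoint}.latency", (time.monotonic() - start) * 1000)
        return wrapper
    return decorator


@instrumented("getRiderProfile")
def get_rider_profile(rider_id: str) -> dict:
    return {"id": rider_id, "name": "..."}


get_rider_profile("r-123")
```

Once every interaction point emits such signals, the aggregate dashboards the team reviews each week come almost for free.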

🙌

At Uber, teams are given quarterly budgets for operational excellence (10%-20%) as well as platform investment (10%-25%) to ensure continuous improvement in each product domain. In hyper-competitive spaces like ride-sharing and food delivery, where a bad user experience means churn, there is a clear incentive for engineering excellence. During my tenure, with the two-pronged process described above, I led my team to drastically reduce both the number of incidents and the response time in incident mitigation. Above all, the process instilled a culture of accountability and truth-seeking among the engineers: their job is not only to build great products but also to fully understand the products’ limitations and the ways to improve them.


Vince Sacks Chen

Software Engineering Manager, previously at Uber, Veeva, Fevo.