How SRE is reducing on-call fatigue at Condé Nast

For the last three years, we’ve been building a multi-tenant service that supports multilingual, multi-location content from a single platform, and we have onboarded over 20 sites onto our centralized platform. Today we serve over 300 million unique users a month, and that number is growing as we onboard more sites. DevOps practices and their tenets, such as “you build it, you own it”, have been ingrained since the start, and that may mean being on-call.

As an Engineer who goes on-call myself, I’m not going to pretend it is the highlight of my job, especially when I’m woken up at 3am and can’t tell whether there’s an actual issue, or whether there’s no effect on users and nothing I can do about the alert at that point in time (e.g. a spike in CPU usage on some service). Noisy alerts are problematic because they lead to pager fatigue and, understandably, to engineers avoiding being on-call. When planning incident management strategies, we need to think not only of the end user, but of the engineers too.

As a newly formed Site Reliability Engineering (SRE) practice, we therefore had our first challenges:

  • Reduce noisy alerts: we had too many alerts that were disturbing developers out of hours without providing our end users with any value.
  • Improve observability: we were also tasked with uncovering potentially hidden issues.

SLO-based monitoring

Instead of going through all our “noisy” alerts and fixing them, we decided to step back and map the service endpoints that needed to be monitored.

We then asked ourselves, Product, and other relevant teams: “What does success mean to us?” and “What targets should we be measuring?”

Given our first goals, we selected the indicators that would give us the most value to start with:

  • Availability: Is the website up?
  • Latency: Can users see content fast enough?
  • Correctness: Is the content correct?

Looking closely at Availability, our success target calculation would translate into:
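In its simplest form, this is the classic availability SLI, something like:

Availability = (successful requests / total valid requests) × 100%

where a “successful” request is, for example, one that does not result in a 5xx response.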

With this availability definition, we needed to find a data source to provide these indicators. We should always aim to take the “measurement of success” as close to the end user as possible, so we looked at the tools available to us that could provide visibility into HTTP performance closest to the user. Those were:

  • Google Analytics
  • Pingdom
  • CDN metrics (Fastly)

Google Analytics is the closest to the user we have available; however, the data proved insufficient, since the precision available to us only provides a small sample of all requests.

Pingdom truly represents a user-like request end to end; it probes your website from different geolocations just as a user would. However, since it only probes a sample once a minute, it could hide problems or give false negatives.

Finally, our CDN provider, Fastly, has an out-of-the-box integration with Datadog (our metrics collection tool) and exposes a real-time metrics API.

We thought this gave us real-time availability metrics over every single user request. Nice.

We decided to trust our CDN provider as the source of truth for our availability SLOs. We could still use Pingdom and Google Analytics to safeguard us and “monitor the monitoring source”, but that’s out of scope for this article.

As we implemented alerts based on those metrics, we didn’t see much improvement; in fact, our alerts became even noisier. It turned out that our CDN-to-service architecture had complexities that weren’t accounted for initially:

CDN-to-service architecture (simplified)

Fastly CDN allows multiple origins, which it calls backends, and we use that feature to route requests not only to our services but also to some legacy or non-supported services. Because our metrics were taken close to the edge, we couldn’t easily identify whether a request would ultimately reach a non-supported backend. So when an alert fired, we could find that the root cause was not in a service we could fix.

Remember that our focus was to reduce noisy alerts. We did not want our Engineers to wake up to deal with incidents that they could not resolve. That’s an important consideration here.

Another issue, especially with out-of-hours alerts, is that during periods of very low traffic an incident could occur that did not affect a significant number of real users, yet alerts still fired because bots running at the time made the number of users impacted seem significant. In that scenario, an on-call Engineer would not provide enough value to justify being woken up. There are cases when it’s perfectly acceptable for the Engineer to investigate the next morning (provided the issue doesn’t keep impacting users throughout the night). This shows respect to the Engineer, who is only woken up when it actually produces value for the business.

Finally, the built-in Fastly latency metrics were averages over 60-second periods, not percentiles or histograms, which didn’t work for us: averages hide the tail latency that a small but real share of users experience.
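To illustrate why that matters, here is a toy sketch (not anything we run in production) showing how an average over a window can look healthy while the 99th percentile reveals a painful tail:

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the value at the given percentile (0-100) of a
// sorted slice of latency samples, using the nearest-rank method.
func percentile(sorted []float64, p float64) float64 {
	idx := int(float64(len(sorted)-1) * p / 100.0)
	return sorted[idx]
}

func main() {
	// 100 requests in a 60s window: most are fast, a few are very slow.
	samples := make([]float64, 0, 100)
	for i := 0; i < 97; i++ {
		samples = append(samples, 50) // 50ms for most users
	}
	samples = append(samples, 4000, 4500, 5000) // a slow tail

	var sum float64
	for _, s := range samples {
		sum += s
	}
	sort.Float64s(samples)

	fmt.Printf("mean: %.0fms\n", sum/float64(len(samples))) // ~184ms, looks fine
	fmt.Printf("p99:  %.0fms\n", percentile(samples, 99))   // 4500ms, clearly not fine
}
```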

Unfortunately, the metrics we needed to filter requests by backend were not provided out of the box by the Fastly/Datadog integration, nor by the metrics API. We needed another source of data.

To improve the quality of our signal, we built an application that read directly from Fastly’s exposed API, which gave us real-time metrics plus latency histograms. But we still had the backend and bot-detection issues.
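For a flavour of what such a poller can look like, here is a rough, hypothetical sketch in Go, assuming Fastly’s real-time analytics endpoint at rt.fastly.com; the exact paths, response fields and semantics should be checked against Fastly’s API documentation, and this is not our application’s actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// rtResponse approximates the shape of Fastly's real-time analytics
// response; the field names here are illustrative, not exhaustive.
type rtResponse struct {
	Timestamp int64 `json:"Timestamp"`
	Data      []struct {
		Recorded   int64 `json:"recorded"`
		Aggregated struct {
			Requests  int `json:"requests"`
			Status5xx int `json:"status_5xx"`
		} `json:"aggregated"`
	} `json:"Data"`
}

func main() {
	serviceID := os.Getenv("FASTLY_SERVICE_ID")
	apiKey := os.Getenv("FASTLY_API_KEY")
	var from int64 // 0 asks for the latest data; later calls reuse the returned Timestamp

	for {
		url := fmt.Sprintf("https://rt.fastly.com/v1/channel/%s/ts/%d", serviceID, from)
		req, _ := http.NewRequest(http.MethodGet, url, nil)
		req.Header.Set("Fastly-Key", apiKey)

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			time.Sleep(time.Second)
			continue
		}

		var rt rtResponse
		if err := json.NewDecoder(resp.Body).Decode(&rt); err == nil {
			for _, d := range rt.Data {
				total, errs := d.Aggregated.Requests, d.Aggregated.Status5xx
				if total > 0 {
					fmt.Printf("t=%d availability=%.4f\n", d.Recorded, 1-float64(errs)/float64(total))
				}
			}
			from = rt.Timestamp // resume from the last timestamp we saw
		}
		resp.Body.Close()
		time.Sleep(time.Second)
	}
}
```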

Fortunately, Fastly provides real-time log streaming to a syslog endpoint, with a lot of flexibility over what can be logged (request/response headers, client and POP information, etc.). Those logs can contain all of the information we need to improve the quality of our signal.

This gave birth to “Project Phi”, a small but mighty custom Syslog server written in Golang to accept and process our Fastly logs.
(Why is it called Phi? Because naming is hard.)

Project Phi

Phi uses Fastly logs as our input data to:

  • Ship error logs to Kibana (any 4xx or 5xx response codes)
  • Understand which backend a request was served by
  • Calculate precise latency metrics as histograms or percentiles over sliding time windows
  • Distinguish between human and bot requests

(Phi is currently crunching through tens of thousands of log events per second to provide these for a number of our high-traffic properties)
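To make the idea concrete, here is a heavily simplified, hypothetical sketch in the spirit of Phi, not its actual code: a small UDP listener that parses a made-up pipe-delimited log format and tags each request with its backend and whether it looks like a bot. A real receiver would match whatever transport, framing and log format your Fastly syslog endpoint is configured with.

```go
package main

import (
	"log"
	"net"
	"strings"
)

// event is a parsed request record; the fields and log format below are
// illustrative, not Phi's real schema.
type event struct {
	Status    string
	Backend   string
	UserAgent string
	Bot       bool
}

// parse expects a hypothetical pipe-delimited payload such as
// "status|backend|latency_ms|user_agent" configured on the Fastly
// logging endpoint.
func parse(line string) (event, bool) {
	// The syslog header precedes the payload; for this toy example,
	// take everything after the last "] " as the message body.
	if i := strings.LastIndex(line, "] "); i >= 0 {
		line = line[i+2:]
	}
	parts := strings.SplitN(line, "|", 4)
	if len(parts) < 4 {
		return event{}, false
	}
	ua := parts[3]
	return event{
		Status:    parts[0],
		Backend:   parts[1],
		UserAgent: ua,
		Bot:       strings.Contains(strings.ToLower(ua), "bot"),
	}, true
}

func main() {
	conn, err := net.ListenPacket("udp", ":5140")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	buf := make([]byte, 64*1024)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			continue
		}
		if ev, ok := parse(string(buf[:n])); ok {
			// In Phi, events like this feed error shipping, latency
			// histograms and per-backend, bot-aware SLO counters.
			log.Printf("backend=%s status=%s bot=%v", ev.Backend, ev.Status, ev.Bot)
		}
	}
}
```

From events like these, it becomes straightforward to keep per-backend counters and latency histograms over sliding windows, which is exactly the signal the SLO calculation below needs.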

Looking again at the SLO calculation, we now have:
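In essence, the calculation is now restricted to what an on-call Engineer can actually act on, something like:

Availability = (successful human requests served by supported backends / total valid human requests to supported backends) × 100%

with bot traffic excluded and each request attributed to the backend that actually served it.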

After a few iterations, we were able to roll out the new alerts and move the existing alerts to working hours only. We are now working with Engineers to decide whether the old alerts should be decommissioned altogether, or to agree on their value for out-of-hours alerting.

This is just our first iteration, but we’ve seen a significant reduction in incidents and, more importantly, most out-of-hours alerts now reveal real, actionable problems, which lets us focus on solving real issues.

The lesson here is that, besides understanding our software architecture well, we must work with Product teams to learn what is important and what success means; then dive deep into the data to find the relevant information and, if it isn’t readily available, do the work to retrieve or produce the data we need.

A special thanks to Hassy Veldstra (who wrote the first version of Phi) and khanh nguyen for reading drafts of this article.

Jennifer Strejevitch
Site Reliability Engineer — Condé Nast
