Defining Service Level Objectives for Open Source Software

Adam Hevenor
4 min read · Oct 8, 2017


With the publication of Google’s Site Reliability Engineering book, many software teams are working to improve their ops practice and start measuring their reliability. At Pivotal, we were lucky enough to work with the Google CRE team (customer-facing SREs), who provided an Application Reliability Review (ARR) of Pivotal Cloud Foundry. As a Product Manager, I found this a tremendous gift; it helped chart the Loggregator team’s roadmap for the past year. The actual ARR document is private, but Matt Brown, who led the effort, did an outstanding job summarizing the process on the GCP blog.

The SRE process is designed for teams developing and operating their own software, and it encourages sharing of practices across engineering and ops teams. The foundation of the approach is the error budget, which allows teams to make informed decisions about new feature work versus reliability and performance. As part of Cloud Foundry, Loggregator is free open source software distributed as-is, with no operational guarantees or warranty. Outside of performing regular updates, most Cloud Foundry Operators are not adding new features, which makes the error budget hard to conceptualize. Additionally, most Operators are looking to fulfill specific business needs when it comes to availability and reliability.

That said, the Loggregator team has found that by embracing Google’s SRE practices we were able to improve learning and feedback from Operators about our product, develop tooling that Operators can easily build upon, and better assess priorities for making Loggregator easier to operate and scale. All of this started from a point of little trust or understanding among Operators about how Loggregator functions and what expectations to have for its reliability.

We did this by first working locally. Pivotal has a small SRE team that is responsible for operating Pivotal Web Services (PWS) and Pivotal Tracker, both hosted on Pivotal Cloud Foundry, of course. Our first order of business was to determine a specific measurement to use, and to instrument tooling to measure results objectively and autonomously.

This sounds simple, but picking a technique for monitoring an infrastructure product is nuanced and challenging. Loggregator offers several interfaces, some of which require appropriate downstream architecture. How could we really know things were working as expected? While there are several ways to interface with Loggregator (Nozzles, Syslog Drains, etc.), the experience most App Developers and Operators are familiar with is the CLI command `cf logs`. Additionally, when supporting customers, we regularly used `cf logs` as a smoke test to ensure things were up and streaming. This makes for a good user experience to test, but it does not tell us specifically what to measure.
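
For context, here is a minimal sketch of that kind of smoke test, assuming a hypothetical test app (`slo-smoke-test`) that logs a recognizable marker string. The real checks were often just `cf logs` run by hand; the tooling we ultimately built is linked at the end of this post.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Hypothetical values: a test app already pushed to the platform that
	// logs a recognizable marker string on a known endpoint or schedule.
	app := "slo-smoke-test"
	marker := "loggregator-smoke-test"

	// `cf logs --recent` dumps the recent log cache for the app; seeing the
	// marker in the output is a quick signal that logs are up and streaming.
	out, err := exec.Command("cf", "logs", app, "--recent").Output()
	if err != nil {
		fmt.Println("cf logs failed:", err)
		os.Exit(1)
	}

	if n := bytes.Count(out, []byte(marker)); n > 0 {
		fmt.Printf("logs are streaming: saw %d marked lines\n", n)
	} else {
		fmt.Println("no marked lines seen; logs may not be flowing")
		os.Exit(1)
	}
}
```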

Since Loggregator favors message loss over application back pressure (see Common Misconceptions about Distributed Logging System for why), the measurement most commonly questioned is reliability, i.e., how many logs were delivered versus how many were dropped. Considering this is the most common interface, with a metric most Operators care about, we used this Service Level Indicator to define our Service Level Objective. That said, this measurement relies on several layers of infrastructure and other operating components to succeed. Any objective we set would ultimately depend on the uptime of the underlying infrastructure and of the UAA and Router components of Cloud Foundry. It’s tempting to use probes that could measure reliability without these dependencies, but that is not a real-world user experience and hence a false indicator. Using a blackbox monitoring approach, we found we could apply the following steps to determine a Service Level Objective (a rough sketch in code follows the list).

  1. Provide tooling that can measure specific indicators.
  2. Measure that specific indicator for a period of time (e.g. 24 hours).
  3. Add an additional 9 for reliability over that time period.
  4. Expand time period and repeat.

The goal of Pivotal Web Services is to test new features of the platform, so we have the luxury of a larger error budget, but we still want to provide a roadmap toward higher reliability. Using this approach, our Service Level Objective for Pivotal Web Services is 99% reliability over the course of one year; we are currently achieving that goal over approximately six months of measurement (we don’t yet have a year’s worth of data). We are able to regularly achieve three and four 9s over a one-week period, but have not been able to do so over a 30-day period where more platform changes have occurred. This largely gives Operators an error budget for managing and scaling dependent infrastructure, but it has also given the Loggregator team an objective measurement to use when prioritizing future work.
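
To give a sense of what a 99% yearly objective means as an error budget, here is a small worked example using an assumed, purely hypothetical platform-wide log rate: at 99% reliability, roughly 1% of all messages emitted over the year may be dropped before the objective is missed.

```go
package main

import "fmt"

func main() {
	const (
		slo            = 0.99     // 99% reliability over a year
		logsPerSecond  = 10_000.0 // hypothetical platform-wide log rate
		secondsPerYear = 365.0 * 24 * 60 * 60
	)

	total := logsPerSecond * secondsPerYear
	budget := (1 - slo) * total // messages that may be dropped before the objective is missed

	fmt.Printf("yearly error budget: %.0f of %.0f messages\n", budget, total)
}
```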

Reliability measurements across 5 separate platforms. Recent product updates have improved reliability.

Ultimately, defining a Service Level Objective has been an iterative process that has helped us gain more empathy for the Operator experience. Working closely with Pivotal’s Site Reliability Engineers and focusing on a single metric has allowed us to effectively communicate a simple goal for a complex system. Of course, being open source means making your work available to the community, so if you’d like to learn more about Service Level Objectives and get access to the tooling the Loggregator team uses for measuring reliability, be sure to check out the Loggregator Operator Guidebook and the Loggregator Tools repo.
