Achieving Transparency into the Health of Your Services: Thoughts on the Use of the Prometheus Monitoring Tool

Published in Workday Technology · Dec 21, 2017

By Owen Sullivan, Software Development Manager, Workday

Ouch. During a family holiday to the uber-playground that is Disney World Florida in June 2015, we spent a day at the NASA facility in Cape Canaveral. SpaceX was launching an unmanned Falcon 9. The first two minutes of the flight went perfectly smoothly. Then the Falcon 9 exploded, raining down what I’m guessing was a good $100 million worth of dust particles. Later, root cause analysis would determine that insufficient quality monitoring by a SpaceX supplier resulted in a 2-foot-long steel strut certified to handle 10,000 pounds of force failing under a load of 2,000 pounds. That allowed a canister to float freely inside a liquid oxygen tank, a situation that wasn’t likely to result in a happy landing. Two years later, I was back at Cape Canaveral watching Falcon 9 launches on two separate days; both were aborted seconds before liftoff. Not great results, but I suppose it is rocket science, after all.

The monitoring failure that led to the explosion that day at Cape Canaveral came to mind as I sat down to write this blog post about our monitoring experiences over the past year. One of the things I like most about my job at Workday (apart from the rooftop garden where I can have breakfast overlooking the city) is that even though Workday is a world leader powering human capital management, financial management, and payroll for some of the biggest companies in the world, I still get the independence to choose the right open source technologies for the problems I’m solving. In this case, after various experiments, we chose to introduce Prometheus to improve metric collection by our DevOps teams.

I can’t overstate how important I think it is that you have accurate metrics that provide transparency into the health of your services. Actually, that’s not strictly true; it would be an overstatement, for example, to say I’d rather you have accurate metrics than win the lottery myself. But you should probably have them: the largest software companies run their businesses on metrics.

Take Amazon, for example. They started out selling books online. If there is a lower-margin, more competitive business out there than selling books I can’t think of it. No, wait, I can — it’s selling CDs. Which was Amazon’s second business offering. So if Amazon started out competing in commodity markets without uniquely differentiated products, how did they become the world’s most valuable retailer and the world’s most popular Infrastructure-as-a-Service provider? There are many factors, but one of them is a relentless focus on measurement at the upper percentiles of all the metrics that impact success, and then going through a continuous flywheel of measure–assess–improve. Jeff Bezos noted in his 1997 letter to shareholders that “We first measure ourselves in terms of the metrics most indicative of our market leadership.” That focus on metrics and corresponding data-driven decisions continues to underpin their success to this day.

Google is similar. Google’s Site Reliability Engineering (SRE) teams define four “golden signals” for services — latency, traffic, errors, and saturation. These signals are the minimum measured for all Google services, commonly using an in-house platform called Borgmon. SRE teams use these signals for various reasons: trend analysis, change evaluation, alerting, and graphing both for ad-hoc queries and in standard dashboards. Interestingly, they use this level of visibility not with a goal of achieving zero errors, but instead to make optimum product decisions. For example, they allocate error budgets.

These budgets are not zero, so if a team is delivering within the error budget, they are free to deliver features at a faster cadence. Teams who have spent their error budget, on the other hand, are expected to reduce their delivery cadence to get back on target. Teams that are over budget use Borgmon to provide visibility into their error rates and other metrics used to fine-tune their delivery cadence. Prometheus is the open source equivalent of Borgmon. The founders and lead contributors to Prometheus have come from Google and leveraged their Borgmon experience when creating it.
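
To put an illustrative number on those budgets: a service with a hypothetical availability target of 99.9 percent has an error budget of 0.1 percent, which over a 30-day month works out to roughly 43 minutes of downtime (30 × 24 × 60 × 0.001 ≈ 43 minutes). Spend less than that and you keep shipping; spend more and you slow down.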

Facebook also invests heavily in metrics. It built its own in-memory time series database called Gorilla (and its open source cousin Beringei), which achieved a 70-times reduction in read latency versus its previous on-disk time series database. Facebook achieved this by taking advantage of some particular properties of its monitoring time series data:

  • A key monitoring use case is understanding what is happening right now; accordingly, Facebook found that 85 percent of its monitoring queries were for data less than 26 hours old. So Gorilla was implemented as an in-memory write-through cache that holds only the most recent 26 hours of data. This eliminates disk latency for graphs rendered against Gorilla data. The data is still persisted to disk, but crucially that persistence doesn’t impact the read-path latency.
  • Users viewing a monitoring graph are typically viewing aggregated data rather than individual data points. That is, the time series database stores an aggregation of, say, one minute’s worth of data points, and the user sees this as a single number. Users can therefore tolerate small amounts of data loss: if an aggregation covers 99 data points instead of 100, the difference is likely to be immaterial. This is very different from, say, a banking application, where loss of data is highly impactful. Taking advantage of this, data is streamed to multiple Gorilla servers, and no attempt is made to keep them in sync.
  • Time series data arrives on a regular schedule; most applications report every 60 seconds. So instead of storing the full timestamp of each data point, Gorilla stores the delta of deltas, the change in the interval between successive points, which in practice compresses 96 percent of timestamps to a single bit (see the sketch after this list).
  • The value of most data points is similar to the previously received value, and typically an integer. So instead of storing each value in full, Gorilla stores an XOR of the current and previous values, in practice compressing 51 percent of values to a single bit.
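
The last two points are easier to see with a little code. Here is a minimal Ruby sketch of the two ideas; it is illustrative only, since Gorilla actually works on the bit patterns of 64-bit floats and packs the results into variable-length bit sequences rather than whole integers:

```ruby
# Illustrative sketch of Gorilla-style compression ideas (not the real bit-level format).

# Timestamps: store the delta of deltas. For data arriving every 60 seconds the
# delta of deltas is almost always 0, which Gorilla encodes as a single bit.
def delta_of_deltas(timestamps)
  deltas = timestamps.each_cons(2).map { |a, b| b - a }
  [timestamps.first, deltas.first, *deltas.each_cons(2).map { |a, b| b - a }]
end

# Values: XOR each value with the previous one. Identical or near-identical
# values produce zero (or mostly-zero) XORs, which compress very well.
def xor_deltas(values)
  values.each_cons(2).map { |a, b| a ^ b }
end

timestamps = [1_500_000_000, 1_500_000_060, 1_500_000_120, 1_500_000_180]
values     = [42, 42, 43, 43]

p delta_of_deltas(timestamps)  # => [1500000000, 60, 0, 0] (the zeros are what shrink to single bits)
p xor_deltas(values)           # => [0, 1, 0]
```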

Prometheus includes optimisations similar to Gorilla’s, including delta-of-deltas storage of timestamps, use of caching, and independence of servers. The four bullet points above also apply to Workday monitoring data, so these Prometheus optimisations are correspondingly useful to us.

Workday also values and leverages metrics to drive success. This includes business metrics — for example we publicly measure customer satisfaction (achieving 98 percent in 2017) — and metrics that drive employee happiness (which helped us achieve #1 Best Workplace in Ireland and #3 Best Workplace in Europe). We also continuously iterate to drive service improvements through our focus on key service metrics.

To that end we introduced the use of Prometheus this year for our DevOps teams to generate metrics. Prometheus is without doubt the finest open source monitoring solution available today. (Author’s note to self: This comment is both inflammatory and unjustifiable. There are many other solutions available that fit various use cases at least as well including Telegraf, CollectD, StatsD, etc. It’s like saying emacs is a better editor than vim — bound to annoy half the readers and start a flame war for no good reason. I must remove the comment before the final draft. And by the way vim is far better than emacs… :))

Prometheus includes an extensible set of “exporters” that export metrics from specific technologies; the MySQLd exporter, for example, captures MySQL-specific metrics (exporters are discussed further below). Prometheus also allows metrics to be labelled with metadata; Workday uses labels to tag metrics with information such as the region a host is in, making it easier to filter the metrics later when graphing them. Prometheus is designed around the server pulling metrics rather than clients pushing them. There are proponents on both sides of the pull vs. push debate, but pull has the advantage that the server is less likely to be swamped during traffic spikes, which matters because you need your monitoring solution to work best precisely when the system is under pressure.
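
As an illustration of both the labels and the pull model, here is a minimal scrape configuration sketch (job name, targets, and label values are invented for this example). The server pulls /metrics from each listed target on a schedule and attaches the static labels to every series it collects:

```yaml
# prometheus.yml (fragment): the server pulls metrics from the listed targets
# and attaches the given labels to everything it scrapes from them.
scrape_configs:
  - job_name: 'demo_app'            # illustrative job name
    scrape_interval: 15s
    static_configs:
      - targets: ['app-host-1:9292', 'app-host-2:9292']   # hosts are invented
        labels:
          region: 'eu-west-1'       # example of tagging by region
```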

Getting up and running with the Prometheus client library is straightforward. Here is a complete demo Ruby app built on Rack, using the client library’s Rack middleware, with application code that simply returns a string. To run it, type “rackup” in the folder containing the files, then browse to http://localhost:9292/metrics to see the resulting Prometheus-formatted metrics.
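
A minimal config.ru along those lines might look like the following (a sketch assuming the prometheus-client gem, whose Collector middleware records per-request metrics and whose Exporter middleware serves them at /metrics):

```ruby
# config.ru: run with `rackup`, then browse to http://localhost:9292/metrics.
# Requires the rack and prometheus-client gems.
require 'rack'
require 'prometheus/middleware/collector'
require 'prometheus/middleware/exporter'

use Prometheus::Middleware::Collector   # records request counts and latencies
use Prometheus::Middleware::Exporter    # serves the metric registry at /metrics

# The application itself just returns a simple string.
run ->(env) { [200, { 'Content-Type' => 'text/plain' }, ['Hello from the demo app']] }
```

Hitting any path generates request metrics, and /metrics shows what has been collected so far.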

Many of the Workday Scala developers use Akka to build concurrent solutions, so we want Akka-related metrics to give us visibility into this aspect of our applications. Kamon is a monitoring tool that can generate metrics from Akka apps, and it includes a Kamon-Prometheus bridge to deliver those metrics to Prometheus. We experimented with the Kamon-Prometheus bridge but decided to use the Prometheus Java client instead. We open-sourced the work we did to support Akka Actor Group (and some other changes) to make it easier to deliver Akka metrics into Prometheus; you can find the source code here.

Baby Bear, Polar Bear, Brown Bear, and Billy.

[Side note: I got to this point in the blog post last night when I was interrupted by my dog starting to give birth. So between then and now, I have delivered puppies for the first time. Not sure this adds any value to the blog, but I thought you should know.]

Network devices such as Juniper firewalls, F5s, and Cisco switches expose information about themselves through the SNMP protocol. Such devices maintain a tree of object IDs (OIDs); if you issue an SNMP request for a given OID, the device returns its current value. For example, the current length of the output packet queue on Juniper firewalls is available by querying the OID 1.3.6.1.2.1.2.2.1.21. The Prometheus SNMP Exporter exposes an HTTP endpoint that you can query to obtain such metrics: upon receiving a request, the SNMP Exporter issues the corresponding SNMP query to the device and returns the result as Prometheus-formatted metrics.
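
For example, a scrape of the SNMP Exporter boils down to an HTTP request like the one sketched below (the exporter host, module name, and target address are all invented; in practice Prometheus issues this request on a schedule rather than a script):

```ruby
require 'net/http'
require 'uri'

# Ask the SNMP Exporter (default port 9116) to query a device and translate the
# result into Prometheus-formatted metrics. Host, module, and target are made up.
uri = URI('http://snmp-exporter.example.com:9116/snmp')
uri.query = URI.encode_www_form(module: 'if_mib', target: '192.0.2.10')

puts Net::HTTP.get(uri)   # prints the Prometheus metrics for that device
```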

We use the SNMP Exporter to give us visibility into the current health of our network devices. One small difficulty we encountered here was that the SNMP Exporter relies on MIB files as codebooks for generating appropriately formatted configuration files from input OIDs, so some work was required to gather the correct matching MIB files and OIDs.
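
As a rough illustration, the generator’s input simply lists the objects to walk, and the generator combines that with the MIBs to produce the full snmp.yml the exporter uses. A minimal sketch (the module name is invented, and exact fields vary between SNMP Exporter versions):

```yaml
# generator.yml (sketch): feed this plus the vendor MIBs to the generator,
# which produces the snmp.yml that the SNMP Exporter actually reads.
modules:
  juniper_firewall:                  # invented module name
    walk:
      - 1.3.6.1.2.1.2.2.1.21         # ifOutQLen, the output queue length from the example above
```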

However, the more challenging piece of the SNMP Exporter puzzle is determining which OIDs are interesting for each device manufacturer, given that the supported OIDs vary from manufacturer to manufacturer. Additionally, the result of a query for one OID might be a simple string or gauge, while another OID might return a table of data that you must then process to extract the piece of data you are looking for. So you have to understand the format of the data returned for each OID.

Once you have obtained the correct data via the SNMP Exporter, you then need to determine the appropriate thresholds at which to alert on-calls. For example, if the temperature of a device has increased, it might mean a fan has stopped working, which could eventually affect the working of the device. That is something we might want to generate an alert for so that we can rectify the situation, which means deciding what the normal operating temperature range is and at what threshold we should alert. “Number of inbound packets that contained errors” is another example where we must decide on an appropriate alerting value for devices in our network. Set the threshold too high and we have noisy alerting; set it too low and we might be slow to respond to network problems.
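
Once a threshold is chosen, it ends up expressed as an alerting rule along these lines (a sketch: the metric name, threshold, and labels are invented for illustration, and the syntax shown is the Prometheus 2.x YAML rule format):

```yaml
groups:
  - name: network-device.rules
    rules:
      - alert: DeviceTemperatureHigh
        # Metric name and threshold are illustrative; pick the threshold from the
        # device's normal operating range so the alert is neither noisy nor late.
        expr: device_temperature_celsius > 70
        for: 10m                      # require it to persist before paging anyone
        labels:
          severity: page
        annotations:
          summary: "Temperature high on {{ $labels.instance }}"
```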

Another useful Prometheus exporter is the Node Exporter, which provides access to hardware and operating system metrics exposed by UNIX and Linux kernels. The Node Exporter supports the concept of “collectors” that each provide access to a set of metrics; the CPU collector, for example, provides access to CPU metrics. One minor limitation is that collectors are all or nothing: if you enable the CPU collector it will generate metrics for every CPU statistic it supports, which typically results in more metrics than I, at least, have needed.
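
One way to trim the excess on the Prometheus side, rather than in the Node Exporter itself, is to drop unwanted series at scrape time with metric_relabel_configs. The sketch below is illustrative; the target and the metric name pattern are invented:

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-host.example.com:9100']   # illustrative target
    metric_relabel_configs:
      # Drop series we do not need before they are stored; the regex is illustrative.
      - source_labels: [__name__]
        regex: 'node_cpu_guest_.*'
        action: drop
```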

The Prometheus Blackbox exporter provides the ability to probe endpoints over common protocols including HTTP, TCP, and ICMP. In its basic usage you can tell whether a given endpoint is up, and alert if not. It also lets you include parameters and HTTP headers of your choosing and gives you access to the response returned, which opens up deeper testing of the endpoint, such as parsing the response to check for expected headers or strings (a minimal configuration sketch follows below). There are other Prometheus exporters we’ve used too, but they follow the same high-level principles as the exporters above, so I won’t go into them here.
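
To give a flavour of that configuration, here is a minimal Blackbox exporter module (a sketch; the richer checks, such as validating the response body, are further options on the http prober):

```yaml
# blackbox.yml (sketch): defines a probe module. Prometheus then scrapes
# /probe?module=http_2xx&target=https://example.com on the exporter (default port 9115).
modules:
  http_2xx:
    prober: http
    timeout: 5s
```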

The most significant limitations of Prometheus for us are the things it doesn’t do. Out of the box it doesn’t provide long-term storage of metric data, so we have integrated a solution for that. It also doesn’t provide things such as event logging (storing and working with individual log entries) or alerting (the actual paging of on-calls to fix something). So Prometheus provides only part of our monitoring solution, but as a metric-gathering technology it works well.

Peter Drucker said, “If you can’t measure it, you can’t improve it.” Prometheus and the other monitoring changes we have introduced have made it easier for Workday teams to add new metrics to their services, and have allowed us to increase the granularity with which we collect metrics. This gives our service teams better visibility into the real-time health of their services, in turn enabling them to drive measurable improvements, including faster response times to critical alerts and easier troubleshooting of problems. That, in turn, helps our developers rapidly innovate to bring to market new solutions such as the Workday Cloud Platform and Workday Data-as-a-Service.

I’m going to wind up this blog post now, mainly because I’ve got to tend to the puppies. But contact me to chat further about any aspect of monitoring. Or puppies.
