How We’re Using Monitoring to Support Our Increased Availability SLA
By Owen Sullivan, Software Development Manager, Workday
At our recent Rising conference, Workday announced we are raising our availability SLA from an already industry-leading 99.5% to 99.7%. (In practice, we’ve been beating even that number, consistently delivering better than 99.9% availability.) We understand that maintaining and improving our availability is crucially important to customers. So I thought it would be a good time to talk about some of the steps being taken by our internal service teams to use efficient monitoring to help deliver on this availability SLA.
In my last blog post, I talked about monitoring infrastructure changes at Workday, including how we incorporated a technology called Prometheus. In Greek mythology, Prometheus enabled progress by giving fire to humanity. The Prometheus software also enables progress — not through fire but through improved visibility into service health. With it in place, Workday developers could easily add new metrics and alerts for their service, container, or software component in minutes.
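To give a flavor of how lightweight this is, here is a sketch of a Prometheus alerting rule in the standard rule-file format. The metric name, threshold, and labels are illustrative, not actual Workday configuration:

```yaml
groups:
  - name: example-service
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests return HTTP 500,
        # sustained for 10 minutes (avoids paging on brief blips).
        expr: rate(http_requests_total{status="500"}[5m]) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 10 minutes"
```

A few lines like these are all it takes to put a new alert into production, which is exactly why the volume grew so quickly.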
The thing is, our developers did exactly this. In spades. And like playing with fire, playing with this volume of new metrics can be tricky.
These new metrics and alerts added much-needed visibility and value to individuals and teams, but the combined effect was a firehose of data noise. The new metrics and alerts tended to stay around forever, even if they were added for a transient issue. At the same time, Workday keeps growing and advancing our product offering, including, most recently, advancing Performance Enablement through a new approach to enable deeper employee-manager relationships and equipping our customers to better plan through our Adaptive Insights-powered business-planning offering.
The combined effect made it difficult to know which consoles to look at in emergent situations. The first step we took to address this was to deploy a tool called BigPanda, which uses machine learning to correlate alerts into insight-rich incidents and present them in a single pane of glass. This makes it easier to respond to, and resolve, problems in our infrastructure.
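The core idea of alert correlation can be illustrated with a toy sketch. This is not BigPanda's actual algorithm; it simply groups alerts that fire on the same host within a short time window into one incident, so responders see a handful of incidents instead of a wall of alerts:

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    host: str
    check: str
    ts: float  # seconds since epoch

@dataclass
class Incident:
    host: str
    alerts: list = field(default_factory=list)

def correlate(alerts, window=300.0):
    """Group alerts by host; a gap longer than `window` starts a new incident."""
    incidents = []
    open_incidents = {}  # host -> (incident, timestamp of latest alert)
    for a in sorted(alerts, key=lambda a: a.ts):
        entry = open_incidents.get(a.host)
        if entry and a.ts - entry[1] <= window:
            entry[0].alerts.append(a)       # fold into the open incident
            open_incidents[a.host] = (entry[0], a.ts)
        else:
            inc = Incident(host=a.host, alerts=[a])
            incidents.append(inc)
            open_incidents[a.host] = (inc, a.ts)
    return incidents

alerts = [
    Alert("db-01", "disk_full", 0),
    Alert("db-01", "high_latency", 60),       # same host, within window
    Alert("web-03", "cpu", 90),
    Alert("db-01", "replication_lag", 1000),  # gap > 300s: new incident
]
print(len(correlate(alerts)))  # → 3 incidents from 4 alerts
```

Real correlation engines use far richer signals (topology, alert text, learned patterns), but even this simple time-and-host grouping shows how raw alert volume collapses into a manageable incident list.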
If you can measure a problem, you can manage it. So, to combat the underlying noise problem, we introduced and then reported on various monitoring-related artifacts on a per-team basis. This meant that we could now analyze the noise by team and tool, and present this information clearly using KPIs, metrics, and dashboards.
Now we’re able to measure ownership by service team, as well as by other parameters where applicable, such as Chef role and Chef cookbook name. That allows us to work with relevant teams to drive out noise that doesn’t add value. Armed with these insights, we started a program to review the monitoring used by our service teams. I broke the ice at the start of one of these reviews with a joke about how our customers are more than just a number to us — they are a sequence of many numbers that have a temporal and logical relationship to each other, stored in a time series database and rendered as a series of data points in a graphical UI. (Crickets). Jokes aside, the reviews cover all aspects of monitoring from a time-series data viewpoint — metrics, dashboards, checks, alerts, runbooks, etc.
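The per-team noise reporting can be sketched in a few lines. The field names (`team`, `actionable`) and sample data below are hypothetical, but the idea is the same: roll alerts up by owning team and surface the fraction that required no action, which is the noise we want to drive out:

```python
from collections import defaultdict

# Illustrative alert records; in practice these would come from the
# monitoring pipeline, tagged with ownership metadata such as Chef role.
alerts = [
    {"team": "payroll", "actionable": True},
    {"team": "payroll", "actionable": False},
    {"team": "payroll", "actionable": False},
    {"team": "integrations", "actionable": True},
]

def noise_by_team(alerts):
    """Per team: total alert count and the fraction that were noise."""
    totals = defaultdict(lambda: {"total": 0, "noise": 0})
    for a in alerts:
        t = totals[a["team"]]
        t["total"] += 1
        if not a["actionable"]:
            t["noise"] += 1
    return {
        team: {"total": v["total"], "noise_ratio": v["noise"] / v["total"]}
        for team, v in totals.items()
    }

print(noise_by_team(alerts))
```

A report like this makes the review conversations concrete: a team with a high noise ratio has specific alerts to tune or retire.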
Now, when service-impacting events occur, we can conduct down-to-the-minute root cause analysis reviews to see where we can shave off each minute of impact. Typical actions that come out of this are:
- Earlier alerting so that impact can be avoided.
- Better metrics that provide visibility into service health.
- Use of better dashboards during events to expedite troubleshooting.
- Where action could not be automated, using better runbooks so on-call engineers can handle issues faster.
Side note: As I was writing my previous blog post, I was interrupted by my dog giving birth. The puppies have since grown to be big and bold, and are enjoying life to the fullest! And they will fit right in here, given that Workday was ranked by People Magazine in the Top 12 Most Pet-Friendly Companies.
Moving forward, we continue to analyze our operational performance. This includes measuring mean time-to-action (MTTA) and mean time-to-resolution (MTTR) for critical incidents, incident volume, whether the issue was detected and remediated by the monitoring infrastructure, and what the impact of the issue was. We’re also measuring how effective we are at our goal of a “Zero Inbox” for alerts across the company so we avoid the broken window syndrome. The measurements are stored in our time series database, so we can do a mathematical analysis on them similar to how we analyze our other metrics — for example, looking at a particular percentile such as P95 for a given metric.
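The percentile analysis mentioned above can be sketched with the nearest-rank method. The MTTR samples here are made-up illustrative values in minutes, not real incident data:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

mttr_minutes = [12, 7, 45, 9, 30, 11, 8, 15, 22, 10]
print(percentile(mttr_minutes, 50))  # → 11 (typical incident)
print(percentile(mttr_minutes, 95))  # → 45 (the slow tail)
```

The gap between the median and P95 is the point: averages hide the slow incidents, while tail percentiles show exactly where minutes of customer impact are being lost.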
Last October, my colleague Cristina Goldt announced Workday’s machine learning-powered skills cloud offering as part of our vision for a frictionless marketplace. This will undoubtedly drive further growth in the use of the Workday service. In the next blog post, we’ll discuss how this impacts monitoring. And, of course, how the dogs are doing…