Improving On-call Engineering at SailPoint

Jordan Violet
SailPoint Engineering Blog
Jul 28, 2021 · 6 min read

Author: Caitlin Green

It’s been over a year since SailPoint moved away from having the majority of employees working in an office, settling further into our new fully-remote reality. In that year, our DevOps team has doubled in size, with remote employees added from all over the world. Alongside this growth, and in the spirit of continuous improvement, we realized we needed to revisit our on-call procedures and technology. This area had not received much attention in the past because our on-call simply worked, but we knew there was new technology available to us that could help us improve the lives of our engineers.

Having our on-call engineers be responsible for essentially the entire supported DevOps landscape was no longer working, nor was it scalable. On-call was becoming exhausting, and the breadth of our product and the amount of monitoring required were becoming too much for a single engineer to handle. Because our on-call process was managed entirely through PagerDuty, it seemed logical to look into whether we were using it effectively. We wanted to see where there was room for improvement, or whether there were capabilities offered by PagerDuty that we weren’t leveraging. The next challenge was to decide what exactly needed to change in our on-call process, and how.

Baby Steps with PagerDuty

Conveniently, PagerDuty held their annual conference, PagerDuty Virtual Summit, around this time, and we decided to attend. At this conference our team learned about their Digital Operations Tier, which presented new capabilities including intelligent alert grouping, global event rules, and dashboards full of analytics about our on-call engineers and which alerts had been firing. The Digital Operations Tier also includes machine learning capabilities that aim to recognize trends in which alerts fire and when. In addition to all of the above, we learned that PagerDuty had new Slack integration capabilities that our team could leverage. With the entire team (and company) working remotely, Slack is a huge part of our day-to-day operations, so integrating PagerDuty and Slack just made sense. After implementing the new integration, we were able to call up PagerDuty metrics directly within Slack. The integration was pretty seamless, and the uses were immediately apparent.

Feature Rich Proof of Concept, A Short Story

Thanks to the PagerDuty Virtual Summit, we had many new leads to follow up on, so we reached out to PagerDuty immediately after the conference to explore all the new features that had piqued our interest. PagerDuty worked with us to build a proof of concept and was willing to answer any questions we had as we explored these new features.

First, we enabled intelligent alert grouping on our RDS alerts. The reasoning was that when a database fails over, we are immediately overwhelmed by a flood of alerts from CloudWatch. With intelligent alert grouping, a scheduled database maintenance event that would typically fire 60+ alerts is now condensed into a single incident for the on-call engineer to acknowledge and resolve. This significantly improved the triage experience and the time it takes. Instead of acknowledging every single alert in a sea of alerts that all arrive at once, the on-call engineer saves a significant amount of time by, in most cases, only needing to respond to one. In actual downtime situations, this lets the on-call engineer focus on fixing the problem rather than repeatedly acknowledging multiple alerts that are ultimately related to the same incident. Redundancy removed, time saved!
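If you want to try this yourself, intelligent alert grouping is a per-service setting that can be enabled through the PagerDuty web UI or REST API. Below is a minimal, hypothetical sketch of the API route; the token, service ID, and exact request schema are placeholders to verify against PagerDuty’s current API docs, not our production configuration.

```python
# Minimal sketch: enable Intelligent Alert Grouping on a PagerDuty service
# via the REST API. PD_API_TOKEN and SERVICE_ID are placeholders, not real
# values; confirm the current request schema in PagerDuty's API reference.
import os
import requests

PD_API_TOKEN = os.environ["PD_API_TOKEN"]
SERVICE_ID = "PABC123"  # hypothetical ID of the service receiving RDS alerts

resp = requests.put(
    f"https://api.pagerduty.com/services/{SERVICE_ID}",
    headers={
        "Authorization": f"Token token={PD_API_TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
    },
    json={
        "service": {
            "type": "service",
            # Switch grouping from none/time-based to intelligent grouping.
            "alert_grouping_parameters": {"type": "intelligent"},
        }
    },
)
resp.raise_for_status()
print("Intelligent grouping enabled for", resp.json()["service"]["name"])
```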

The second feature we implemented was PagerDuty’s new built-in analytics dashboards. These dashboards gave us insights into our on-call that we never had before: how many alerts were firing, and how they trended up or down from previous weeks. Previously, we did on-call reviews in our team meetings every Friday, but after a long week of alerts it’s easy to let a few things slip the mind and forget to bring them to the team’s attention. Our new dashboards provide direct visibility into which alerts were firing the most, and which services they were tied to. At a macro level, these insights highlighted where the weaknesses in our own infrastructure and alerting existed, and showed where we could better focus our efforts. The dashboards also made it clear that we had been defining services within PagerDuty by source (Prometheus, CloudWatch, etc.). Because the dashboard reports how many alerts fire per service per week, and because our services were so broadly defined, we weren’t getting the value we felt we could. For example, one of our services was named “Prometheus High Priority” and encompassed all of our high-priority Prometheus alerts. Since the service definition was so broad, it told us nothing about what the firing alerts were actually about, only which monitoring tool they came from. With this knowledge, we set out to redefine our services to describe the actual piece of infrastructure they represent, for example Redis or Kafka.
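Re-slicing services by infrastructure component rather than by monitoring source is mostly an exercise in creating (or renaming) PagerDuty services. As a rough illustration, here is a hedged Python sketch that creates one service per component through the REST API; the component names, escalation policy ID, and token are hypothetical, not our real setup.

```python
# Minimal sketch: define services per infrastructure component (e.g. Redis,
# Kafka) instead of per monitoring source. IDs and names are hypothetical.
import os
import requests

PD_API_TOKEN = os.environ["PD_API_TOKEN"]
ESCALATION_POLICY_ID = "PXYZ789"  # placeholder escalation policy ID

headers = {
    "Authorization": f"Token token={PD_API_TOKEN}",
    "Accept": "application/vnd.pagerduty+json;version=2",
    "Content-Type": "application/json",
}

for name in ["Redis", "Kafka", "RDS"]:
    resp = requests.post(
        "https://api.pagerduty.com/services",
        headers=headers,
        json={
            "service": {
                "type": "service",
                "name": name,
                "escalation_policy": {
                    "id": ESCALATION_POLICY_ID,
                    "type": "escalation_policy_reference",
                },
            }
        },
    )
    resp.raise_for_status()
    print("Created service:", resp.json()["service"]["id"], name)
```

With services sliced this way, the per-service alert counts on the analytics dashboard point directly at the piece of infrastructure that is noisy, rather than at a monitoring tool.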

Encouraging Better Service Ownership

We knew that we needed to start encouraging more holistic ownership of services. As mentioned previously, our DevOps team has grown rapidly over the past year, and so have the products that we support. Between acquisitions and the rapid growth within SailPoint itself, the breadth of technology our DevOps team has to be familiar with keeps increasing. Therefore, to make our on-call process scale with this growing team, we needed to encourage better service ownership. This improvement in ownership lets alerts be routed to the relevant people rather than having one person try to be the catch-all for everything, removing the expectation that any single engineer be familiar with every area of our products. With this change, we now get higher-quality insights from our dashboards, because our more granular services better represent the part of the infrastructure they serve.

With all of these changes in the books, we took what we had learned and began the necessary steps to change and improve our on-call process within DevOps. Our PagerDuty representatives helped even more during this process, with people on hand to answer our questions whenever we needed them. Additionally, we’ve held weekly success meetings with them as we continue to implement improvements to our on-call process. We’ve since created new teams within PagerDuty to represent each of the new DevOps teams, each with its own schedules and escalation policies. We’re also working to get Engineering teams at SailPoint onboarded into PagerDuty with their own teams and alerts, for fuller and more comprehensive observability coverage.
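For anyone setting up a similar structure, a team and its escalation policy can also be created through the PagerDuty REST API. The sketch below is illustrative only; the team name, schedule ID, and escalation delay are made-up examples rather than our actual configuration.

```python
# Minimal sketch: create a DevOps sub-team and an escalation policy scoped
# to it. Names, IDs, and timings below are placeholders for illustration.
import os
import requests

PD_API_TOKEN = os.environ["PD_API_TOKEN"]
SCHEDULE_ID = "PSCHED1"  # hypothetical existing on-call schedule for the team

headers = {
    "Authorization": f"Token token={PD_API_TOKEN}",
    "Accept": "application/vnd.pagerduty+json;version=2",
    "Content-Type": "application/json",
}

# 1. Create the team.
team = requests.post(
    "https://api.pagerduty.com/teams",
    headers=headers,
    json={"team": {"type": "team", "name": "DevOps - Data Stores"}},
)
team.raise_for_status()
team_id = team.json()["team"]["id"]

# 2. Create an escalation policy for that team which pages its schedule.
policy = requests.post(
    "https://api.pagerduty.com/escalation_policies",
    headers=headers,
    json={
        "escalation_policy": {
            "type": "escalation_policy",
            "name": "Data Stores On-Call",
            "escalation_rules": [
                {
                    "escalation_delay_in_minutes": 15,
                    "targets": [
                        {"id": SCHEDULE_ID, "type": "schedule_reference"}
                    ],
                }
            ],
            "teams": [{"id": team_id, "type": "team_reference"}],
        }
    },
)
policy.raise_for_status()
print("Created escalation policy", policy.json()["escalation_policy"]["id"])
```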

Lastly, we implemented PagerDuty’s global event rule functionality, which we had not used before. Once configured, it allows us to route alerts to different teams based on criteria within the alerts themselves. To that end, we updated our alerts to contain unique labels, which the global event rules use to route each alert to the appropriate team, as sketched below.
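The alert-side change is small: each event just needs to carry a label the rule can match on. Here is a hedged example of firing an event through PagerDuty’s Events API v2 with a hypothetical team label in custom_details; the routing key, label names, and the matching rule configured in PagerDuty are assumptions for illustration, not our exact rules.

```python
# Minimal sketch: fire an alert whose payload carries a routing label that a
# global event rule can match on. The routing key and label values are
# hypothetical; the corresponding rule in PagerDuty would route events where
# custom_details.team == "data-stores" to that team's service.
import requests

ROUTING_KEY = "YOUR_GLOBAL_RULESET_ROUTING_KEY"  # placeholder

resp = requests.post(
    "https://events.pagerduty.com/v2/enqueue",
    json={
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": "Redis primary memory usage above 90%",
            "source": "prometheus",
            "severity": "critical",
            "custom_details": {"team": "data-stores", "component": "redis"},
        },
    },
)
resp.raise_for_status()
print(resp.json()["status"])
```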

Looking Forward to the Future

This leads us to today: we now have many different teams under the umbrella of DevOps, each with its own on-call rotation, dealing only with the alerts relevant to that team. We still have a “triage” on-call, which supervises incoming alerts and re-routes them to the appropriate teams as necessary. We are still improving how we label alerts and route them to the correct teams, but we have made a lot of headway and get closer to our goal every day. Since making these changes, we have seen significant improvements in the on-call experience, and they allow for a faster, more agile response. Our team is excited to see further improvements as we continue polishing these new procedures and leveraging all the new capabilities of PagerDuty to their fullest extent.

In the future, we plan to manage all of our global event rules and other PagerDuty integrations with Terraform, and to introduce event runbooks that automatically respond to certain types of incidents. We are also looking to expand our PagerDuty usage past the DevOps team and into the broader engineering organization. With Terraform letting us manage this configuration as easily scalable infrastructure, and with all of our data flowing into dashboards, we are excited about how we can further improve our engineers’ lives while maintaining a high level of responsiveness to incidents. There are still many great things to try that directly benefit our engineers, and we are excited about the road ahead. Stay tuned for updates about new innovations we develop in DevOps!
