Improving SLA and RTO in AWS using PagerDuty
Discussions of disaster recovery usually emphasize key terms such as Recovery Time Objective (RTO) and Recovery Point Objective (RPO). However, an equally crucial commitment is the Service Level Agreement (SLA). The SLA often specifies the maximum time allowed between when an issue is reported or acknowledged and when remediation begins. Faster awareness leads to quicker action and recovery.
A well-known TrackIt customer recently held an interactive media and music festival with a large amount of streaming video content. As with any live event, there often isn’t the luxury of time when it comes to addressing issues. Recognizing the importance of rapid resolution in these situations, event administrators operate with a heightened level of alertness. Even with regular monitoring, there can be response delays. The one area that can make a difference, especially with very short SLAs, is the alerting process.
AWS’s native alerting solution, Amazon CloudWatch, is a good option in these scenarios, but it is not always the best tool for the job. It can create alarms based on system metrics or custom logs but offers limited options for handling post-alarm actions.
As an alternative, TrackIt turned to PagerDuty to implement the alerting process for the event. PagerDuty, as the name suggests, is a paging system with a narrow use case, one that it handles exceptionally well. The goal with this customer was to ensure attention and a prompt response to issues within 15 minutes, a challenging task. Below is an explanation of how this system was designed.
Step 1 — Issue Ingestion
Issue ingestion into PagerDuty can be handled in many different ways: email, directly from forms, directly through integrations (AWS, Google, etc.), and Slack, to name a few. In this case, the chosen method was email, a system familiar to everyone.
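As a rough illustration of email ingestion, the sketch below builds the kind of message a reporter might send. The integration address, the sender address, and the function name are all hypothetical; PagerDuty provides the real address when an email integration is added to a service. The message is only constructed here, not sent.

```python
from email.message import EmailMessage

# Hypothetical address for illustration only. The real one comes from the
# email integration configured on the PagerDuty service.
PAGERDUTY_INTEGRATION_ADDRESS = "live-event@example.pagerduty.com"


def build_issue_email(reporter: str, summary: str, details: str) -> EmailMessage:
    """Build an issue report addressed to the (assumed) integration address."""
    msg = EmailMessage()
    msg["From"] = reporter
    msg["To"] = PAGERDUTY_INTEGRATION_ADDRESS
    # A short summary in the subject keeps the resulting page readable.
    msg["Subject"] = summary
    msg.set_content(details)
    return msg


report = build_issue_email(
    reporter="ops@example.com",
    summary="Stream 3 dropping frames",
    details="Encoder output on stream 3 has been dropping frames for 2 minutes.",
)
```

The resulting message could then be handed to any mail client or to `smtplib.SMTP.send_message` for delivery.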
Step 2 — Issue Acknowledgement
Once an issue has been received by PagerDuty, the next task is to alert the first responders (Tier 1). This is where PagerDuty excels, as alerts or “pages” can be sent in multiple ways:
- SMS
- Phone call
- Push notification (requires the PagerDuty mobile app)
- Chat systems (Slack, Teams, etc.)
Any combination of these methods can be utilized.
Step 3 — Issue Escalation
PagerDuty can handle two types of escalation: escalation over time and escalation to a different support or engineering group. In addition to these escalation types, simple forms of logic can be factored in.
For this event, two groups of engineers from different time zones were utilized. This approach provided better coverage by ensuring engineers were on call during their respective working hours.
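The time zone split can be pictured as a simple lookup from the current time to the group on shift. The sketch below is an assumption-laden illustration, not the PagerDuty configuration itself: the group names and the UTC shift windows are hypothetical, since the actual time zones and hours are not stated here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical shift windows as UTC hours [start, end). The real groups and
# working hours for the event are not given in this article.
SHIFTS = {
    "group_a": (0, 12),
    "group_b": (12, 24),
}


def on_call_group(now: datetime) -> str:
    """Return the group whose shift window contains the given aware datetime."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    hour = now.astimezone(timezone.utc).hour
    for group, (start, end) in SHIFTS.items():
        if start <= hour < end:
            return group
    raise LookupError(f"no group covers UTC hour {hour}")


# 09:00 at UTC-8 is 17:00 UTC, which falls in the second window.
pacific_morning = datetime(2024, 1, 1, 9, 0, tzinfo=timezone(timedelta(hours=-8)))
first_group_to_page = on_call_group(pacific_morning)
```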
The goal is to guarantee a response within 15 minutes. With time zone coverage alone, an alert could still be missed if someone is eating lunch, walking their dog, or taking a shower. Therefore, escalation over time was also included.
With this added redundancy, the appropriate group is paged first. If there is no response within 5 minutes, the group is paged again. After another 5 minutes, if there is still no response, both the initial group and the secondary group of engineers are paged. If there is still no response after another 5 minutes, both groups and management are paged, informing them that the 15-minute SLA has been breached.
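The timeline above can be modeled as a small table of steps, which makes it easy to check who has been paged at any point. This is a minimal sketch of the described schedule, not an export of the PagerDuty escalation policy; the class, function, and group names are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int        # minutes without acknowledgement before this step fires
    targets: Tuple[str, ...]  # groups paged at this step
    note: str = ""


# Mirrors the schedule described above: page, re-page at 5 minutes,
# add the secondary group at 10, add management at 15.
ESCALATION_POLICY = (
    EscalationStep(0, ("primary",)),
    EscalationStep(5, ("primary",)),
    EscalationStep(10, ("primary", "secondary")),
    EscalationStep(15, ("primary", "secondary", "management"), "15-minute SLA breached"),
)


def pages_sent(minutes_unacknowledged: float) -> List[EscalationStep]:
    """Return every step that has fired after the given unacknowledged time."""
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    return [s for s in ESCALATION_POLICY if s.after_minutes <= minutes_unacknowledged]


def current_targets(minutes_unacknowledged: float) -> Tuple[str, ...]:
    """Return the groups paged by the most recent step."""
    return pages_sent(minutes_unacknowledged)[-1].targets
```

For example, `current_targets(12)` returns `("primary", "secondary")`, and `pages_sent(15)` contains all four steps, the last of which carries the SLA breach note.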
Step 4 — Issue Resolution
The issue from the customer has now been ingested and the appropriate parties have been paged. The next step is to close the loop and report back to the customer with a resolution. Using the PagerDuty interface, a response was sent directly to the customer to notify them that the issue had been resolved.
Closing Thoughts
The major challenge of this project was the very short 15-minute response window. The PagerDuty alerting setup itself was relatively simple, and PagerDuty can handle more complex systems if required. As previously mentioned, PagerDuty excels at its specific function: for ensuring people are paged for issues or alerts, it serves as a comprehensive solution. Additionally, incorporating a ticketing system such as Zendesk or Zoho Desk can further enhance the robustness of the system.
About TrackIt
TrackIt is an international AWS cloud consulting, systems integration, and software development firm headquartered in Marina del Rey, CA.
We have built our reputation on helping media companies architect and implement cost-effective, reliable, and scalable Media & Entertainment workflows in the cloud. These include streaming and on-demand video solutions, media asset management, and archiving, incorporating the latest AI technology to build bespoke media solutions tailored to customer requirements.
Cloud-native software development is at the foundation of what we do. We specialize in Application Modernization, Containerization, Infrastructure as Code and event-driven serverless architectures by leveraging the latest AWS services. Along with our Managed Services offerings, which provide 24/7 cloud infrastructure maintenance and support, we are able to provide complete solutions for the media industry.
About Chris Koh
Chris is a DevOps Engineer at TrackIt. He has been in the technology industry for a decade, with the last 7 years working directly with AWS. He started his venture into DevOps as his desire to automate as many workflows as possible grew. From reporting, user management, and system management all the way through CI/CD: if it can be automated, it should be automated.
Chris holds 5 of the currently available 12 AWS certifications (Cloud Practitioner, Solutions Architect: Associate, SysOps Administrator: Associate, Security: Specialty and DevOps: Professional). He has his sights set on getting all 12 certifications to earn the little-known golden jacket. Outside of AWS certifications, he is also looking to attain his CCNA and CCNP.