Recently we attended KubeCon and AWS re:Invent and noticed that the biggest theme was the rise of the Site Reliability Engineer (SRE). According to Seth Vargo and Liz Fong-Jones of Google, an SRE “embodies the philosophies of DevOps with a greater focus on measuring and achieving reliability through engineering and operations work.” We’ve written about SREs previously, but tools have proliferated over the past few months, so we decided to categorize more than 40 solutions based on nine different aspects of an SRE’s incident life cycle.
We’ve broken down the suite of offerings into two main groups: “preventative” and “post-incident.” Preventative solutions help proactively reduce downtime and provide visibility into systems to detect issues. Post-incident solutions help teams work together to remediate incidents as quickly as possible and capture best practices in post-mortems to improve processes.
Within the preventative bucket there are three main categories: 1) a service catalog, 2) SLO/error budgeting, and 3) resiliency testing (chaos engineering). A service catalog identifies all the services along with their versions, languages, packages, dependencies, deployments, owners, statuses, and relationships to each other. Service catalogs help teams understand their production environment.
A Service Level Objective (SLO) is “a target value or range of values for a service level,” while “an error budget is 1 minus the SLO of the service.” SREs use error budgets to “balance service reliability with the pace of innovation.” An error budget can be spent on deploying new software to production that may have issues, or it can be consumed by poor performance or downtime from existing software. Tracking these KPIs helps teams make decisions about shipping software and setting policies. We combine service catalogs and SLO/error budgeting into one category because these functions are tightly coupled and many vendors offer both.
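The error budget definition above can be worked through with a small calculation. The sketch below is illustrative only: the function names, the 30-day window, and the choice to express the budget as minutes of downtime are our assumptions, not part of the quoted definition or any vendor’s product.

```python
def error_budget(slo: float) -> float:
    """Return the error budget as a fraction: 1 minus the SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Translate the error budget into minutes of downtime for a window.

    Assumption: the SLO is availability-based and measured over wall-clock time.
    """
    minutes_in_window = window_days * 24 * 60
    return error_budget(slo) * minutes_in_window


def budget_remaining_minutes(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Minutes of budget left after the downtime already incurred (may be negative)."""
    return allowed_downtime_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    slo = 0.999  # hypothetical 99.9% availability target
    total = allowed_downtime_minutes(slo, window_days=30)
    left = budget_remaining_minutes(slo, 30, downtime_minutes=13.2)
    print(f"Budget: {total:.1f} min, remaining: {left:.1f} min")
```

With a 99.9% SLO over 30 days, the budget comes to roughly 43 minutes. A team that has already used 13 of those minutes might still ship a risky release, while a team with a negative balance might pause releases until the window rolls over.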
Finally, resiliency testing is the process of injecting failure into systems to proactively find vulnerabilities and address them before they result in production downtime. Netflix pioneered resiliency testing in the early 2010s, and it has become a best practice for SRE teams. Today there are both open source (Chaos Monkey) and commercial (Gremlin) offerings.
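To make the idea of injecting failure concrete, here is a minimal sketch at the level of a single function call. It is not how Chaos Monkey or Gremlin work, since those tools operate on real infrastructure. The names `InjectedFailure`, `with_fault_injection`, and `call_with_retry` are hypothetical and exist only for this illustration.

```python
import random


class InjectedFailure(Exception):
    """Raised when the wrapper deliberately fails a call."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail on purpose.

    failure_rate is a probability in [0, 1]; rng is a random.Random instance
    so experiments can be seeded and repeated.
    """
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts=3):
    """The resilience mechanism under test: retry on injected failures."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFailure as error:
            last_error = error
    raise last_error


def fetch_price():
    """Stand-in for a call to a downstream dependency."""
    return 42


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so the experiment is repeatable
    flaky_fetch = with_fault_injection(fetch_price, failure_rate=0.3, rng=rng)
    successes = 0
    for _ in range(100):
        try:
            call_with_retry(flaky_fetch, attempts=3)
            successes += 1
        except InjectedFailure:
            pass
    print(f"{successes} of 100 requests survived injected failures")
```

The point of the experiment is the final count: if too many requests fail even with retries in place, the team has found a weakness before a real outage exposes it.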
Solutions that fit into the post-incident category include alerting, chat, video conferencing, issue tracking, status pages, incident response platforms, and documentation (post-mortems). Most of these categories are well-defined, but it is important to highlight incident response platforms, which are SaaS solutions that help automate remediation activities. These solutions assign incident response roles, identify impacted and implicated systems, execute incident runbooks, coordinate between teams, collect real-time incident history, and communicate status. Once remediation is complete, incident response platforms automatically bring forensic data and proper context together for post-mortems, which are used to identify system and process improvements plus follow-up actions. Many businesses have built incident response platforms internally, including Stripe (Big Red Button), Pinterest, and Twilio, among others.
Below are more than 40 solutions across nine SRE categories. Some solutions provide functionality across multiple categories. We believe many of the point solutions will converge over time, and that a best-of-breed SRE solution will emerge that provides both preventative and post-incident help.