Site Reliability Engineering (SRE): More Code, Less Toil

In 2003, Google created the Site Reliability Engineer (SRE) position. SRE teams have since spread throughout the industry, and there are now ~25K SREs according to LinkedIn. With the rise of the SRE role, we believe there are opportunities to build large businesses that support the SRE workflow.

Developers have worked with operations teams for decades. In the 1990s and early 2000s, most operations staff were called sysadmins and worked with physical infrastructure. Traditionally, developers threw code over the wall to operations, who were responsible for configuring, provisioning, and managing the systems. Virtualization and cloud computing made infrastructure programmable and remote. Over time, as software abstracted away more operational capabilities and intelligence, operations teams began writing more code. The DevOps movement mirrors this infrastructure-as-code trend.

Unlike DevOps, which originated in operations, SRE stems from programming. According to Benjamin Sloss, a VP of Engineering at Google and founder of Google SRE, SRE is “what happens when you ask a software engineer to design an operations team.” Sloss states that the SRE team is “responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).” SREs focus on writing software to automate processes and eliminate toil.

Like DevOps, SRE breaks down the silos between developers and operations so organizations can deliver applications and services at higher velocity. SRE can be viewed as a specific implementation of DevOps.

Importantly, SRE teams are responsible for establishing and maintaining service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). An SLI is a quantitative measure of some aspect of the level of service provided, such as request latency. An SLO is a target value or range of values for a service level as measured by an SLI, for example 99.999% service availability. An SLA is an explicit or implicit contract with users that includes consequences for missing SLOs. SRE teams must balance the cost of maintaining certain SLOs against business objectives like innovation and rapid deployment.
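To make the SLI/SLO relationship concrete, here is a minimal Python sketch of how an availability target translates into an error budget, the amount of unreliability a team can "spend" on releases and experiments. The names (`AvailabilitySLO`, `availability_sli`, `budget_remaining`) and the 30-day window are our own illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical availability objective over a rolling window."""
    target: float          # e.g. 0.99999 for "five nines"
    window_days: int = 30  # assumed window length, not a standard

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def error_budget_seconds(self) -> float:
        """Downtime allowed in the window before the SLO is missed."""
        window_seconds = self.window_days * 24 * 60 * 60
        return (1.0 - self.target) * window_seconds


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing failed
    return good_requests / total_requests


def budget_remaining(slo: AvailabilitySLO, good: int, total: int) -> float:
    """Fraction of the request-based error budget still unspent.

    A negative value means more requests failed than the SLO allows.
    """
    allowed_bad = (1.0 - slo.target) * total
    if allowed_bad == 0:
        return 1.0  # no traffic yet, so no budget has been spent
    return 1.0 - (total - good) / allowed_bad


if __name__ == "__main__":
    five_nines = AvailabilitySLO(target=0.99999)
    print(f"Allowed downtime: {five_nines.error_budget_seconds():.2f} s per 30 days")

    three_nines = AvailabilitySLO(target=0.999)
    print(f"SLI: {availability_sli(999_500, 1_000_000):.4%}")
    print(f"Budget left: {budget_remaining(three_nines, 999_500, 1_000_000):.0%}")
```

Under these assumptions, a 99.999% target leaves roughly 26 seconds of downtime per 30 days, which illustrates why each additional nine is costly and why SLOs have to be weighed against release velocity.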

Solution categories that support SREs’ work include monitoring, alerting/incident management, ticketing, logging, troubleshooting, configuration management, and reliability testing. As noted in the exhibit below, we’ve seen large businesses emerge in categories that help SREs be more effective at their work.

Solutions that decrease the Mean Time To Resolution (MTTR) and improve service uptime offer the highest value to SREs. After all, who wants to be on call dealing with fires at 2 AM?! Preventing sleepless nights and all-nighters alone suggests SREs have an appetite to pay for solutions that make their lives easier. We continue to believe there are exciting opportunities in this space, and history suggests these businesses can be huge.
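For readers less familiar with the metric, MTTR is simply the average time from detecting an incident to resolving it. A minimal sketch, assuming each incident is recorded as a hypothetical (detected, resolved) pair of timestamps:

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (resolved - detected) across incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Made-up sample data: a 30-minute and a 90-minute incident.
    sample = [
        (datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 2, 30)),
        (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
    ]
    print(mean_time_to_resolution(sample))
```

Teams differ on when the clock starts (first alert, first page, or first customer report), so the pairing used here is one convention among several.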