DevOps and Site Reliability Engineering — Better Together

Andrew Turner
back to the napkin
Mar 8, 2018

CIOs won’t get promoted if everything works. But they’ll get fired if anything doesn’t.

End-user expectations for application performance are at an all-time high, and security threats grow more sophisticated every day. As a result, CIOs are under more pressure than ever to maintain the reliability and stability of business systems.

This pressure has sparked greater adoption of site reliability engineering (SRE) as a formal practice in the developer community.

In many conversations, SRE is discussed as an evolution of DevOps. While they are unique practices, engineering teams shouldn’t be choosing between the two. Instead, the two should work together toward successful production deployments.

THE LINE BETWEEN DEVOPS AND SITE RELIABILITY ENGINEERING

There was a time when reliability concerns could halt development before a project even started. However, it’s no longer acceptable to forgo innovation to maintain reliability — you need both.

Because SRE and DevOps are so complementary, the two are often confused. They’re similar practices with different focuses.

  • DevOps: The focus is on streamlining the path from code being written to code running in production. It’s everything from automating the CI/CD pipeline and server configuration to the processes, orchestration, and tooling that get developers up to speed and able to push code on day one. The DevOps culture shift within engineering is a response to demands for agility, moving code through the pipeline as efficiently and effectively as possible.
  • Site Reliability Engineering: The focus is on effectively building, running, and growing systems in production. This includes ensuring the stability and resilience of a production system in addition to continually improving performance while building features. With SRE, you balance the need for site reliability with the need to ship new features. And it’s not just building with reliability in mind. It’s questioning how you can make a stable system run even better.

You don’t need to choose which side of the line your team falls on. Instead, focus on blurring the line between the two practices.

AREAS WHERE DEVOPS AND SITE RELIABILITY ENGINEERING OVERLAP

Adopting an SRE practice is all about having the right monitoring, tooling, and processes in place so that, when you deploy a release, you’re confident it will add value and also meet availability requirements.

But when you look at the actual deployment, you can see where DevOps and SRE start to overlap. You don’t want to deploy a large-scale system as if you’re just pushing a red launch button and sending the whole thing out into production at once. Instead, you want a rolling deployment that updates the system a portion at a time, so the release is applied cleanly and problems surface before they reach every user.

The orchestration planning within DevOps practices supports rolling deployments, which in turn lead to a more reliable production system. The two practices feed into one another.
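To make the idea concrete, here is a minimal Python sketch of a rolling deployment. It assumes a hypothetical `Instance` record and a caller-supplied `health_check` function; a real orchestrator would manage instances, traffic shifting, and health probes itself. The sketch upgrades a few instances at a time and, as soon as a health check fails, reverts the failing batch and halts the rollout.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    # Hypothetical stand-in for a server or container in the fleet.
    name: str
    version: str


def rolling_deploy(
    fleet: List[Instance],
    new_version: str,
    batch_size: int,
    health_check: Callable[[Instance], bool],
) -> bool:
    """Upgrade the fleet a few instances at a time instead of all at once.

    If any instance in a batch fails its health check, that batch is reverted
    to its previous version and the rollout halts, so most of the fleet keeps
    serving traffic. Returns True only if every batch passed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(fleet), batch_size):
        batch = fleet[start:start + batch_size]
        previous = [inst.version for inst in batch]
        for inst in batch:
            inst.version = new_version
        if not all(health_check(inst) for inst in batch):
            # Revert only the failing batch; earlier batches already passed.
            for inst, old in zip(batch, previous):
                inst.version = old
            return False
    return True


if __name__ == "__main__":
    fleet = [Instance(f"web-{i}", "1.0") for i in range(1, 7)]
    # Hypothetical health check: pretend web-5 is unhealthy after the upgrade.
    ok = rolling_deploy(fleet, "1.1", 2, lambda inst: inst.name != "web-5")
    print(ok, [inst.version for inst in fleet])
```

Batch size is the main design lever here: smaller batches expose fewer users to a bad release but make the rollout take longer. This sketch leaves earlier, healthy batches on the new version when it halts; whether to roll those back too is a policy decision.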

You see this overlap in quality agreements as well. A key aspect of SRE is establishing SLA budgets that you use to measure the success or failure of a release. Oftentimes these budgets are centered on the uptime percentage your system guarantees. For example, a four nines system guarantees 99.99% uptime, or at most about 4.3 minutes of downtime in a 30-day month. Site reliability engineers work with development engineers to build SLA contracts and ensure that everything from tooling and monitoring to capacity planning and load testing is in place to meet those contracts in production.
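The arithmetic behind a four nines target is simple enough to sketch. Assuming a 30-day month (43,200 minutes), 99.99% uptime leaves 0.01% of that period, or 4.32 minutes, as the downtime budget. The hypothetical Python helpers below compute that budget for any target and period, plus how much of it remains after some observed downtime; SRE literature usually calls this an error budget derived from a service level objective.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, period: timedelta) -> timedelta:
    """Downtime budget implied by an availability target over a period.

    availability_target is a fraction, e.g. 0.9999 for "four nines".
    """
    if not 0.0 < availability_target < 1.0:
        raise ValueError("availability_target must be between 0 and 1 (exclusive)")
    return period * (1.0 - availability_target)


def budget_remaining(
    availability_target: float, period: timedelta, observed_downtime: timedelta
) -> float:
    """Fraction of the downtime budget left; negative means the budget is blown."""
    budget = allowed_downtime(availability_target, period)
    return (budget - observed_downtime) / budget


if __name__ == "__main__":
    month = timedelta(days=30)  # assumption: a 30-day month
    budget = allowed_downtime(0.9999, month)
    print(f"four nines budget: {budget.total_seconds() / 60:.2f} minutes")
    left = budget_remaining(0.9999, month, timedelta(minutes=3))
    print(f"remaining after 3 minutes of downtime: {left:.0%}")
```

Tracking the remaining fraction, rather than only raw uptime, is what lets a team judge whether there is room to ship riskier changes or whether to slow down and focus on stability.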

A parallel can be drawn between these SRE metrics and the metrics used in DevOps processes. For example, it’s important to set a code coverage threshold in the DevOps CI/CD pipeline. Engineers and developers have already bought into quality metrics like code coverage, which creates a unifying point for SRE and DevOps: an engineering team holding itself accountable for meeting SRE metrics is, at its core, doing the same thing.
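Because both kinds of metric reduce to “measured value versus agreed threshold,” a single pipeline step can enforce them together. The Python sketch below is a hypothetical release gate: the `QualityContract` fields, the 80% coverage floor, and the sample measurements are illustrative assumptions, and a real pipeline would pull coverage from its test report and availability from its monitoring system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QualityContract:
    min_coverage_pct: float      # DevOps-side threshold, e.g. line coverage in CI
    min_availability_pct: float  # SRE-side threshold, e.g. 99.99 for four nines


def release_gate(
    contract: QualityContract, coverage_pct: float, availability_pct: float
) -> List[str]:
    """Return the list of violated thresholds; an empty list means the gate passes."""
    failures = []
    if coverage_pct < contract.min_coverage_pct:
        failures.append(
            f"coverage {coverage_pct:.1f}% is below the agreed "
            f"{contract.min_coverage_pct:.1f}%"
        )
    if availability_pct < contract.min_availability_pct:
        failures.append(
            f"availability {availability_pct:.3f}% is below the agreed "
            f"{contract.min_availability_pct:.3f}%"
        )
    return failures


if __name__ == "__main__":
    # Hypothetical contract and measurements, for illustration only.
    contract = QualityContract(min_coverage_pct=80.0, min_availability_pct=99.99)
    problems = release_gate(contract, coverage_pct=83.5, availability_pct=99.995)
    print("gate passed" if not problems else "\n".join(problems))
```

In a CI job, a non-empty result would typically be turned into a non-zero exit status so the pipeline stops the release.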

There’s enough overlap between SRE and DevOps that you can seamlessly connect the two and build reliable production systems without completely overhauling your engineering team.

NEW CONSIDERATIONS WHEN PARTNERING FOR A PROJECT

In plenty of cases, development teams spend a great deal of time and effort building a valuable new system, reach the release point, and hit a roadblock. During the risk and vulnerability assessment stage of DevOps, the team discovers an issue that takes extra effort to fix and delays the release.

The goal of SRE isn’t to block features from shipping. Rather, SRE aims to establish the right monitoring, tooling, and processes with development teams to build confidence that releases won’t negatively impact systems. You put reliability workflows into the engineering process from the beginning so you’re always building with security, resilience, and scalability in mind.

But this leads us to a unique challenge when you’re evaluating potential partners for your new project. You have to make sure the partner is focused on these points from a production perspective.

In many cases, you could partner with a services company that’s working on many new builds for a variety of clients. Their DevOps processes are tried and true. However, they’re solely focused on the actual build process. The system goes into production and may very well be unstable from the start. And then your internal team is left to deal with those problems alone.

The key is to find a partner that is thinking about how a system will run in production before it is built. A partner that properly blends SRE and DevOps to get systems into production efficiently and effectively will eliminate headaches for your internal team down the road.

Even though SRE has only recently been pushed into the spotlight, Dialexa has been blending the availability and reliability of production systems with DevOps for years. We do this by integrating the quality team into the end-to-end DevOps workflow. From the very beginning of your project, we have quality contracts in place to agree on metrics like code coverage and downtime thresholds.

Don’t get stuck in a debate between DevOps and SRE. The ultimate goal is to build a valuable system quickly and effectively without sacrificing production reliability, security, and scalability. And Dialexa would love to help you get there.

Get in touch with us today to learn more about how we can blend SRE and DevOps for your next project.

Originally published at https://by.dialexa.com/devops-and-site-reliability-engineering-better-together.

At Dialexa we start by asking “Do you know what your business will look like tomorrow?” Whether you have a plan, a problem or no idea, connect with us to explore the right answers for you.
