Site Reliability Engineering in Enterprise-Scale Organizations

Published in Capital One Tech · 8 min read · Apr 5, 2017

As companies focus more on their digital operations, one question that needs a clear answer is role definition: “Who is doing what?” Once sales and servicing channels transition to digital, everyone becomes part of the technology team. What implications does this have for your original and expanded team? Terms like “DevOps” and “NoOps” are used to define strategy and to mark organizational boundaries for how software products are shipped and run. Unfortunately, the terminology is so catchy that it gets applied to roles like “DevOps Engineer”, which creates confusion because the title is vague and open to interpretation.

I prefer the term Site Reliability Engineer because it describes the daily activity performed by staff in modern Enterprises. I’ve seen great success among teams that use these concepts to set expectations for how associates develop in their careers. It also solidifies the notion that Operations isn’t going away; rather, it is being redefined in a digital world.

Here’s my insight into the unique nature of the Site Reliability Engineer role within Enterprises, given the size and diversity of the application portfolio.

Three Types of Applications in the Enterprise

Let’s start by setting context and defining the scope of software products.

Within a large enterprise, I believe software applications fall into three categories, and these categories drive the staffing needs of a technology org.

1 — Custom Ongoing Development

This is the traditional software engineering model, where two groups have historically split responsibility for an application: a development team and an operations team. This is where the DevOps concept comes in, and where some strategists focus.

In this category, minimizing handoffs is key to an efficient process and is a core organizational design principle. Establishing a feedback loop that connects production quality issues back to the software authors is another area of focus, and it defines the allocation of tasks between the two groups.

The majority of the work goes back to the developer, enabling much of the overall effort to be controlled by the product team. Here’s a potentially more accurate visual based on staffing levels, or workload.

Here are a few key observations from pushing towards this model.

1 — The majority of staffing is in the software engineering teams.

2 — Getting to the point where 100% of the work is with the software engineers requires really broad skill-sets within these teams.

3 — There are limits to how many “operations” tasks software engineers want to do.

When establishing the staffing strategy for the organization, there are two other application patterns that must be accounted for, as not all software applications fit the model above.

2 — One-Time Application Development

This is a variation on the first category, but with a key distinction — once the application goes live, the developers leave. Sometimes they leave the team/project and join different teams/projects, leveraging their expertise and talent elsewhere in the company. Sometimes the staff used to build the application was temporary, and once the product is delivered there’s no reason to retain them.

An average software application has a life expectancy of around seven years, and it’s not uncommon to find applications with a life expectancy of decades (see: mainframes). Over time, these applications accumulate. For a large enterprise, the volume can be considerable. It’s also not necessarily a bad thing for the company; rather, it is a decision driven by sound financial reasons.

From a practical standpoint, it’s nearly impossible to completely “freeze” an in-house software application. There remains plenty of other work around security patching, upgrades to the underlying hardware, ongoing system monitoring and metrics collection, disaster recovery exercises, etc. This means there’s a need for an ongoing technical staff to execute these tasks on a day-to-day basis; it just might not be the staff that will enable new feature development.

3 — Package Development

This is similar to the prior model. Software developers wrote the product, but they were never part of your company; instead, a pre-made solution was purchased and installed. There might be some in-house resources who see themselves as the software engineers for these applications, but they’re modifying configuration tables and workflows and are not experts in the underlying software. This means they have limited ability to support and troubleshoot.

Now, there is likely a contract in place that dictates the terms of support provided by the vendor after the sale, but few are worded such that the vendor handles any and all work. There are also practical limitations to be considered. For example, will the vendor have the ability to remote into the system for support, or will they rely on local technical staff to be “hands on keyboard”?

This drives the need for some in-house staff that can handle front-line troubleshooting and support after the fact. This may include applying minor patches to the software, or changing interfaces with other systems like single-sign-on or system monitoring and alerting.

Managing a Portfolio of Applications

Staffing within an Enterprise becomes an exercise in understanding how many software products are allocated across these three groups. What may be unique for Enterprises vs. Startups is the amount of technology a company that is 20–30 years old or more has built up, creating a surprising amount of work in the second and third categories.

Looking at technology over the past few decades, there are distinct “eras”: one where mainframes were popular, another for packaged software, and so on. These decisions and trends leave behind a support trail that needs to be accounted for in staffing.

Based on my experience, it’s not uncommon for a Fortune 500 company to have hundreds (or maybe thousands) of software applications, depending on how long the company has existed.

While new investment dollars may be targeted to applications that fall under the first category, the technology budget and staffing plan must still account for all three.

The Value of Specialization

Now, let’s dive deeper into the custom ongoing development scenario (category 1) to determine the division of labor. These applications could be the core processing and servicing systems through which companies process millions of invoices, claims, statements, and payments. There is permanent staff associated with them, and these teams may make up a sizable percentage of the technology staff.

The division of work should not be driven by trying to understand what appears to be “operations” type work. Rather it should be driven by finding economies of scale in specialization that drive both cost and quality benefits for the company.

For example, large systems need backups and replication to DR locations. Ongoing monitoring of these functions is critical and shouldn’t be left to guesswork. Assigning this work to each software engineer on the team is inefficient, and potentially dangerous to the environment. Why? They don’t do it every day, and there’s tremendous value in following a common approach that can improve adherence. Having a specialist do work like this creates ownership and an incentive for someone to automate the task for the entire group or department.
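
As a rough illustration of the kind of automation a specialist might build once and share across a department, here is a minimal Python sketch of a backup freshness check. Everything in it is an assumption made for illustration: the function name find_stale_backups, the system names, the sample dates, and the 24-hour threshold. A real check would pull timestamps from whatever backup tooling is actually in place.

```python
from datetime import datetime, timedelta


def find_stale_backups(last_success, now, max_age=timedelta(hours=24)):
    """Return the sorted names of systems whose last successful backup
    is missing or older than max_age.

    last_success maps a (hypothetical) system name to the datetime of its
    last successful backup, or None if no successful backup is on record.
    """
    return sorted(
        name
        for name, finished_at in last_success.items()
        if finished_at is None or now - finished_at > max_age
    )


if __name__ == "__main__":
    # Illustrative data only; these systems and dates are made up.
    now = datetime(2017, 4, 5, 12, 0)
    last_success = {
        "claims-db": datetime(2017, 4, 5, 2, 0),    # 10 hours old: fresh
        "invoices-db": datetime(2017, 4, 3, 2, 0),  # 58 hours old: stale
        "statements-db": None,                      # never backed up
    }
    print(find_stale_backups(last_success, now))
    # prints ['invoices-db', 'statements-db']
```

The point is less the code than the ownership: one person maintains the check and the threshold, and every team gets the same answer to “are our backups current?”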

Time Study — How to Find Specialist Tasks

To help establish the boundaries between roles, a methodical approach involves executing a time study on the stories required to execute a project through go-live. An example of what this might yield is an artifact like this.

It shouldn’t be that all specialist tasks go to an SRE organization by default, as some of the non-routine tasks reinforce the core engineering of the product. The value is in understanding all of the miscellaneous tasks that are required to deliver working code to production, and understanding ownership of those tasks. Is the software engineering team going to be able to own all of these tasks over time, far beyond the initial release?

If not, there’s a staffing gap that needs to be planned for.
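
To make the time study concrete, here is a small Python sketch of how the recorded stories could be tallied. It assumes each story is logged with a task name, the hours spent, and a flag for whether the task is routine. The field names, the sample stories, and the function specialist_candidates are all hypothetical, and the routine flag is a judgment call that the team would have to make.

```python
from collections import defaultdict


def specialist_candidates(stories):
    """Total the hours spent on each routine task, largest first.

    Each story is a dict with hypothetical keys:
      "task"    - short name of the task type
      "hours"   - time spent on the story
      "routine" - True if the work repeats in much the same form
    Non-routine work is left out, since it tends to reinforce the core
    engineering of the product and stays with the software engineers.
    """
    hours = defaultdict(float)
    for story in stories:
        if story["routine"]:
            hours[story["task"]] += story["hours"]
    return sorted(hours.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative stories only; not data from a real project.
    stories = [
        {"task": "feature work", "hours": 40, "routine": False},
        {"task": "backup setup", "hours": 6, "routine": True},
        {"task": "monitoring config", "hours": 4, "routine": True},
        {"task": "backup setup", "hours": 5, "routine": True},
        {"task": "dr exercise", "hours": 8, "routine": True},
        {"task": "performance tuning", "hours": 12, "routine": False},
    ]
    for task, total in specialist_candidates(stories):
        print(f"{task}: {total} hours")
```

A ranked list like this is only a starting point for the ownership conversation. It shows where routine effort is concentrated, not who should own it.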

Setting Expectations — the Move From Ops to SRE

Now after seeing the different models, hopefully it’s clear that there are needs for staffing technical talent within the Enterprise that aren’t focused on software product development. A good question to ask is how this differs from traditional technical operations roles that we have had for decades within Enterprises.

One word that sums it up — Evolution. As companies become more digital, the skill-set demand is increasing. Simple requests like password resets are now handled by self-service portals and workflows. Most users are comfortable with search engines, and can use them to solve routine problems that already have a known solution.

Ideally, any routine work gets automated, and that requires many of the same engineering skills that a software developer has, but with more of a systems engineering focus.

Establishing Site Reliability Engineering teams that embody these new capabilities for Enterprises is the future of staffing in a digitally focused organization. Is your Enterprise ready?

DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2017 Capital One.

For more on APIs, open source, community events, and developer culture at Capital One, visit DevExchange, our one-stop developer portal: developer.capitalone.com/