DevOps and Site Reliability Engineering (SRE)
As we all know, the Computer Age and the Internet Age have both profoundly impacted the world of commerce. As customer experience changes, led by internet giants, IT operations change accordingly to support new processes. Not so long ago, new product development could mostly be decoupled from operations. Of course, there were some connections, factories had to retool their machinery if changes were made. Yet the nature of physical products allowed for development operations to drift apart.
With the explosion of cyber property in the last few decades, though, the product mix has changed. Digital products represent a large and growing part of global offerings. An expectation from such a product is to be always-reliable, accessible from anywhere by anyone at any time. Recent offerings from major cloud providers advertise simplicity in supporting this notion. In reality, everything is still technically grounded (servers need to physically be somewhere). To meet market expectations development has to work closely with operations,.
For a simple example, consider a buyer-seller connection service. In the 1970s, perhaps there was a weekly publication of sellers in a relatively small geographic area. Buyers couldn’t directly compete, because the seller could only handle one caller at a time. Today, hundreds of remote buyers can compete directly and instantly, and the seller never has to negotiate with a single one if s/he doesn’t want to. For the retail equivalent (mail order catalogues), in today’s system, there might not be a human involved between the factory and the customer’s house at all.
Similar transformations abound in a plethora of industries. Cyber products are entirely new, and, because reliability and security are paramount, development simply cannot remain decoupled from operations. Moreover, simple yet powerful upgrades from development can be applied with minimal interference and downtime, so why wouldn’t operations departments cooperate with development teams to enhance the customer experience?
What is DevOps? What is SRE?
DevOps — an organizational model that encourages communication, empathy, and ownership throughout the company
SRE — an organizational model to reconcile the opposing incentives of operations and development teams within an organization
These two terms are widely used and broadly applied. Sometimes too broadly. The term Site Reliability Engineering was born at Google, the brain child of Ben Treynor. It, like DevOps, is a blend of operations and development. The most important aspects, similarly to DevOps, are automating operations processes and increasing collaboration. This is especially important in globally-scaled, always-on-demand services, because not all errors and issues can (or even should) be handled by humans. We humans have better things to do.
SRE aims to provide availability, performance, change management, emergency response, and capacity planning. Each of these factors is essential to global-grade services, because the software landscape sees intense competition. A couple days of downtime can mean customers flowing to competitors. This brave new world requires new techniques.
A One-Paragraph Primer on Reliability Terminology
Any operations student would know that there are two parts to reliability. Mean Time to Repair (MTTR) and Mean Time Between Failures (MTBF). The former is how long a system is in error before it is fixed, and the latter is how long the interval is between failures. These two concepts work together, and a balance between them is a golden gold.
Traditional Operations and Development Interaction
The traditional interaction between development teams and operations teams is bipolar. On one end, the development team is tasked with creating new features and attracting customers are much as possible. New features are an attractor, and hence more new features amplifies the attraction factor. Unfortunately, this sometimes leads to development teams to publish updates and features before they are thoroughly tested. It also leads to frustrated operations teams when the service goes down.
Conversely, the operations team is tasked with running the service once it has been approved and established. The ops team doesn’t want more work than is essential, so it encourages longer and more rigorous testing periods before release. This leads to long lead times and frustrated development teams who just want to push out the newest, coolest features.
Is there no middle ground?
The conflict between dev and ops can be palpable. Sometimes responsibility for code is even hidden from operations to limit fallout onto one person, which is known as information hiding. This is not an efficient or well-oiled system. How can we reconcile the seemingly opposite goals of development and of operations? In SRE, the term is “error budget”.
According to the creators of SRE, a 100% reliability rate is unlikely, and maybe not even desirable. 99.9% reliability is indistinguishable from 100% for the userbase. Maybe 99% is your target. It depends on the users and what level of reliability they are willing to accept. This level is defined multilaterally (see “Moral Authority”).
Whatever your target, the difference between your target and 100% is the “error budget”. The development team may produce code that has an error rate up to the budget. That means they can do less testing or roll out less stable features, as long as they don’t surpass the budgeted downtime. Once the downtime allowance is surpassed, all future launches must be blocked, including major ones, until the budget is “earned back” with performance that is better than the target reliability rate.
This small but brilliant change has interesting consequences. The dev team attempts to code for low native error rates, because they want to use their budget on more interesting and fun features, not the foundational code. Furthermore, the dev team starts to self-police, because they want to conserve the budget for worthwhile launches, not consume it on errors in basic features. Finally, there is no blame or info hiding, because everyone agreed to the budget in the first place. This leads to empathy and communication between teams, replacing the sometimes hostile environment of the traditional dev-ops relationship.
In an organization, especially in the tech world, it is imperative that employees believe in their leadership. A rogue team is disastrous, and sabotage is a real threat. Whence stems the moral authority for SRE? This lies in the budgeting process. Development, operations, product managers, and organizational management agree to Service Level Agreements (SLAs), which state the minimum uptime (which necessarily stipulates the maximum downtime) that is acceptable to customers.
This is the foundation for the budget. If customers are willing to accept 99.5% uptime, then the budget is 0.5%. And since the development team has agreed to this level, they have no authority to challenge SRE blocking their launches if the budget is spent. Everyone agrees beforehand, so there is no political jockeying once the system is live.
Monitoring, Reliability, and Service Degradation
A public-facing system will inevitably be down sometimes. Even if the MTTR is extremely short and unnoticeable by customers, the system has still failed. This is the reason monitoring (and preferably automated monitoring) is essential.
According to Treynor, there are three parts to monitoring. First is logging, which is mundane and mainly for diagnostic and research purposes later. This isn’t meant to be read continuously, only used as a tool for later review, if necessary. Then there are tickets, for which humans must take action, but maybe not immediately. Then there are alerts, such as when the service is offline for most customers — these require immediate human response, likely in the form of an emergency or crisis response team.
Most error handling should be automated, and this is an area where machines fix themselves. The more machines fix themselves, the better. This quick, automatic repair is related to reliability via Mean Time to Repair (MTTR). If service problems occur but the MTTR is a few milliseconds (because computers are fixing themselves), then the users will never notice. That means dev has more available budget, a good incentive to develop automated error-handling systems.
Now, what to do when the MTTR is longer than a few milliseconds. Many errors will be on back-end systems, and with replication, there may be no discernible issue for the front-end site or service. If, however, issues apparent to the consumer are inevitable, it is best to engineer for “graceful degradation”. This just means you don’t want your service blacking out completely, but maybe slowing down or lowering service quality. A complete blackout with a completely unreachable or unresponsive service will cause customer backlash. Degraded service will cause annoyance, but probably not drive them away. This can be accomplished via Microservice Architecture, as one service going down does not take down the entire service.
From the customer viewpoint, lots of short MTTR errors is probably better than long but infrequent errors, because short MTTR errors are often eliminated before customers even notice. On the other and, if a firm doesn’t implement a system for these errors, the exact opposite is desirable: one long outage means one long fix, not an endless stream. Hence, to reconcile this conflict, it is strongly suggested to create a system to handle issues. And when the company scales, it is all but imperative to automate, because problems will inevitably outstrip operation headcount.
Why is SRE important?
All organizations want to provide excellent service to users. All organizations have organizational structure, and sometimes that structure includes competing teams and incentives. SRE attempts to eliminate one major issue, especially in modern organizational structures. Chaos behind the scenes will eventually lead to chaos on the front-end, where customers can indirectly observe the Pyrrhic war between development and operations end in a spectacular implosion of the service (and the customer base).
How is SRE related to DevOps?
The first and most obvious way it is related is in using software techniques in operations. But that is trivial, especially in tech companies, because modern operations departments all rely on software to some degree. Both also foster inter-team communication.
However, DevOps encourages communication between teams across the organization, while SRE encourages communication between the development and operations teams. DevOps is concerned with broad empathy and ownership (even involving sales and marketing), while SRE tends to focus on only development and operations. Furthermore, in DevOps, the development team will feel responsible for the life of the product, while in SRE, dev might self-police, but the ultimate operations responsibility still lies with operations.
There are yet more similarities, though, such as the tendency to automate as much of the operations process as possible, including continuous delivery procedures: dev teams under an SRE model might roll out small updates to stay under the error budget, while dev teams under a DevOps model tend to make small updates for easier monitoring and bug identification. Both encourage scalability, such that products not only have solid foundations and native code, but that base product can expand with the business.
As with anything in organizational management, these terms are not mutually exclusive, and they do not have to be separated. Furthermore, each company has its own unique culture and needs, so applying aspects of DevOps and aspects of SRE simultaneously is not taboo. In fact, it is viewed positively. Innovative companies always look for the best aspect of something, extract that best aspect, and adapt and apply that aspect to their own needs. Don’t be afraid to be unique, and certainly don’t be afraid to stand on the shoulders of giants.