Is Site Reliability Engineering the next step of the mainframe modernization journey?

Guilherme Cartier · Published in Modern Mainframe · 4 min read · Feb 11, 2021
[Image: overall view of the Shuttle (White) Flight Control Room (WFCR) in Johnson Space Center’s Mission Control Center]

I’m sure that by now you, my dear reader, have had the time to realize that our planet is crowded with a stubborn little species called Homo sapiens, a.k.a. “the humans.” These so-called “humans” are very good at — among many other things — creating complex problems. As a consequence, modern life is brimming with complicated puzzles that need solving. Some of these problems are exciting to solve because they drive innovation; others, not so much. Regardless, any software system dedicated to handling these problems is inherently dynamic and unstable.

Social media, financial institutions, eCommerce, industry, health care, and even government: no matter where we look, we are surrounded by complex, data-intensive systems, and in most cases our society can only function when all the gears of these elaborate systems turn correctly. Sometimes, when one of these systems fails, we might merely be unable to tweet for a few hours. Other times, a critical failure in one of these systems can carry far more serious consequences.

So how do we avoid the failure of such systems?

As previously established, a software system can only be perfectly stable if it exists in a vacuum. One option, then, would be to simply stop changing the codebase and freeze the user base of existing systems, avoiding new bugs and the need to scale. On the other hand, we can only move forward through innovation, and innovation implies risk. So the question becomes: how do we balance innovation and stability?

This tension between the need to innovate and the need to keep existing systems running is the fuel that drives a never-ending war between engineering teams and operations teams. While one side needs to release new products and features fast, the other struggles to keep all the production systems running reliably, knowing that every change risks introducing a bug that could cause a service interruption.

Embracing development best practices by itself is not enough. Keeping these systems running is a non-trivial task that requires professionals from many different disciplines. Any approach to solving this challenge must come with new management techniques, methodologies, and tools.

Site Reliability Engineering

Amid this apparent chaos, a new job role emerged, a function focused on the reliability and maintainability of production systems. Site Reliability Engineering — or SRE for short — is a job role that was originally conceived at Google, and since then, it has been embraced by large tech companies like Netflix, Facebook, Amazon, and many others.

According to Ben Treynor, the founder of Google’s Site Reliability team, an SRE is “what happens when a software engineer is tasked with what used to be called operations.”

Although the SRE role was originally sustained by the simple premise that “infrastructure management is a software problem” — hence requiring engineering work — it has since become much more: a set of principles and practices whose ultimate goal is to run better production services. Many of these practices focus on automation and on the observability of large-scale applications.

With these practices in mind, an SRE has to be able to engineer creative solutions to problems, strike the right balance between reliability and feature velocity, and target appropriate levels of service quality. It’s a role that came to heal the divide between engineering and operations teams.
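
To make “appropriate levels of service quality” a bit more concrete, here is a minimal sketch of how an error budget could be derived from a service-level objective. It is written in TypeScript purely for illustration; the 99.9% target, the traffic figures, and the function names are invented assumptions, not numbers from any real team.

```typescript
// Illustrative sketch only: the SLO target and traffic figures are made up.

interface ServiceLevelObjective {
  name: string;
  target: number; // e.g. 0.999 means 99.9% of requests must succeed
}

interface TrafficWindow {
  totalRequests: number;
  failedRequests: number;
}

// The error budget is the number of requests that are *allowed* to fail
// within the window while still meeting the SLO.
function errorBudget(slo: ServiceLevelObjective, traffic: TrafficWindow) {
  const allowedFailures = Math.round(traffic.totalRequests * (1 - slo.target));
  const remaining = allowedFailures - traffic.failedRequests;
  return {
    allowedFailures,
    remaining,
    // A negative remainder means the budget is burned: slow down releases
    // and prioritize reliability work over new features.
    exhausted: remaining < 0,
  };
}

// Hypothetical month of traffic for a hypothetical payments API.
const slo: ServiceLevelObjective = { name: "payments-api availability", target: 0.999 };
const month: TrafficWindow = { totalRequests: 10_000_000, failedRequests: 7_500 };

console.log(errorBudget(slo, month));
// -> { allowedFailures: 10000, remaining: 2500, exhausted: false }
```

The point of the sketch is the trade-off it encodes: as long as the budget is not exhausted, the team keeps shipping features; once it is, reliability work takes priority.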

But isn’t this what DevOps is all about?

DevOps and SRE share many common principles. While we might see DevOps as a broader philosophy, SRE would be a concrete implementation of that philosophy, with some idiosyncratic extensions. If we think about it more abstractly, using an object-oriented programming reference, we could say that the “class SRE” implements the “DevOps interface.”
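
Taking that analogy literally, here is a tiny TypeScript sketch of what “class SRE implements DevOps” might look like. The method names loosely echo commonly cited DevOps pillars, but they are illustrative choices, not an official API of either discipline.

```typescript
// Purely illustrative: DevOps as an interface of shared principles,
// SRE as one concrete class that implements them.

interface DevOps {
  reduceOrganizationalSilos(): void;
  acceptFailureAsNormal(): void;
  implementGradualChange(): void;
  leverageToolingAndAutomation(): void;
  measureEverything(): void;
}

class SRE implements DevOps {
  reduceOrganizationalSilos(): void {
    // Share ownership of production with product development teams.
  }
  acceptFailureAsNormal(): void {
    // Error budgets and blameless postmortems instead of promising zero outages.
  }
  implementGradualChange(): void {
    // Small, frequent, easily rolled-back releases.
  }
  leverageToolingAndAutomation(): void {
    // Automate repetitive operational work ("toil") away.
  }
  measureEverything(): void {
    // Define SLIs and SLOs for the services you run.
  }
}
```

The interface only says what has to happen; the class is one opinionated answer to how.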

Mainframe and SRE

It’s clear that the SRE role makes sense for big tech companies that manage large, complicated distributed systems, but does it make sense for the mainframe reality? I’d argue that not only does it make sense, but it’s an essential part of the platform’s modernization journey.

Many of the world’s critical systems are still running on mainframes, and that’s not a coincidence; the platform is extremely reliable. If we think about it, mainframes are the ultimate data-intensive system. Every single component — from hardware to software — was carefully crafted to be able to deal with huge amounts of data securely and reliably at the utmost performance levels.

But no one — besides hobbyists — would want to develop and deploy a complex web application on top of an outdated development ecosystem or, even worse, have their application architecture limited to a monolithic approach by the platform’s constraints.

Modern development philosophy dictates that software architecture shouldn’t be dependent on a specific platform or framework. Instead, it should be flexible enough to allow independently deployable services and enable teams to choose the best technologies to solve business problems.

The platform’s ability to meet these expectations is part of the reason why mainframes are still out there. It’s a state-of-the-art technology that has been evolving since the late ’60s to adapt to the modern world, and that’s why, today, it is straightforward to deploy highly scalable applications that leverage the best the platform can offer.

For this reason, a professional dedicated to the reliability of a system like this sounds very much like the next logical step. SRE brings a mindset that represents a significant break from the industry’s traditional practices for managing large, complicated systems. Its adoption for the mainframe reality will most definitely not happen overnight, and when it does happen, it will look somewhat different from the implementations we see out there today. But one thing is for sure: the core principles will remain the same. The practices and tenets that guide how to run high-availability production systems are independent of the platform.
