More SRE Lessons for SOC: Simplicity Helps Security
As we discussed in our blogs “Achieving Autonomic Security Operations: Reducing toil”, “Achieving Autonomic Security Operations: Automation as a Force Multiplier”, “Achieving Autonomic Security Operations: Why metrics matter (but not how you think)”, and the latest, “More SRE Lessons for SOC: Release Engineering Ideas”, your Security Operations Center (SOC) can learn a lot from what IT ops discovered during the Site Reliability Engineering (SRE) and DevOps revolution.
Let’s dive into another fascinating area of SRE wisdom that is deceptively simple — the principle of … simplicity (SRE book, Chapter 9 “Simplicity”). Say what? This sounds abstract and philosophical, how can it help my SOC today? Well, let’s find out!
The first point they make is a reminder of what makes it all exciting: “Software systems are inherently dynamic and unstable.” But SREs have it easy. Here in our beloved realm of cyber, we don’t just have “systems that are inherently dynamic and unstable” but also attackers that affect them in dynamic, beautiful (sometimes ugly) and unpredictable ways. 10X fun assured! So, yes, this is a big part of why security is fun, but also tricky. This means we need simplicity even more.
But what is simplicity? Phil’s 8 megatrends blog reminds us about this by calling one of his cloud megatrends “Simplicity: Cloud as an abstraction machine.” Specifically:
“A common concern about moving to the cloud is that it’s too complex. Admittedly, starting from scratch and learning all the features the cloud offers may seem daunting. Yet even today’s feature-rich cloud offerings are much simpler than prior on-prem environments — which are far less robust. […]
Cloud is only going to get simpler because the market rewards the cloud providers for abstraction and autonomic operations. In turn, this permits more scale and more use, creating a relentless hunt for abstraction. […]
The increased simplicity and abstraction permit more explicit assertion of security policy in more precise and expressive ways applied in the right context. Simply put, simplicity removes more potential surprise — and security issues are often rooted in surprise.”
Thus, removing surprises and reducing unique/broken/snowflake systems and siloed processes will make security (and SOC in particular) easier. SREs have already figured much of this out.
To dive into the details, they say that their “job is to keep agility and stability in balance in the system.” For us in security, another (tricky) dimension gets added: the threat. Occasionally, compliance gets blended in too, and that may push the organization to old, ultimately more fragile ways of doing things (as a side note, compliance that reduces security by pushing for outdated approaches is a real thing). Fragile systems (and processes) + threats + regulations = a lot of complexity. And a complex system is always hard to monitor for threats, which are then also hard to investigate, making SOC life a pain.
Also, SREs say that “reliable processes tend to actually increase developer agility.” We wish we could always say that secure processes do the same. And, how about that, many actually do! Here at Google we have many examples of something that is secure and good for developers and good for business. Think of well-implemented zero trust, which helps users, simplifies IT and reduces risk. It also makes the job of a SOC easier. Anyhow, we digress a bit…
Let’s look at more manifestations of the SRE principle of simplicity. Now, this is really juicy: “Essential complexity is the complexity inherent in a given situation that cannot be removed from a problem definition, whereas accidental complexity is more fluid and can be resolved with engineering effort.” This line alone is magical for the SOC!
I’ve always said, for example, that SIEM is complex, largely because its mission is complex. Now, if your SIEM vendor makes SIEM as complex as the mission, but not more complex, you may have a winner. Ideally, it should make it simpler, but frankly it won’t make it simple. Because it is simply not! However, excessive and removable complexity is a dire enemy of security.
Put another way, detection is hard, but some tools make it harder. Don’t use those; use the ones that don’t add complexity. “Push back when accidental complexity is introduced,” as SREs say. We definitely need to fight this battle in the SOC.
Further, if SREs say that “every new line of code written is a liability”, then in the SOC every detection rule is. You deploy this detection, and then you create work, toil in some cases, for you or your colleagues who have to respond to the resulting alerts. Think about it! The way to make this work, rather than fail, is to have a solid lifecycle for all detection content, in my view. Then every line you add delivers more value than liability, even if liability is never 0.
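To make the "value versus liability" idea a bit more concrete, here is a minimal sketch of one step in a detection content lifecycle: a periodic review that flags rules whose alerts rarely turn out to be real. The `RuleStats` schema, the `review_rules` function and both thresholds are hypothetical and chosen only for illustration; your SIEM or case management tool will expose this data differently.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    """Alert outcomes for one detection rule over a review window (hypothetical schema)."""
    name: str
    alerts: int           # total alerts fired in the window
    true_positives: int   # alerts confirmed as real incidents


def review_rules(stats, min_precision=0.1, max_alerts=500):
    """Return (rule name, reason) pairs for rules that may be more liability than value.

    Both thresholds are illustrative assumptions, not recommendations.
    """
    flagged = []
    for s in stats:
        if s.alerts == 0:
            # A silent rule may be broken, or may cover something that no longer exists.
            flagged.append((s.name, "never fires: verify it still works or retire it"))
            continue
        precision = s.true_positives / s.alerts
        if precision < min_precision:
            flagged.append((s.name, f"low precision ({precision:.0%}): tune or retire"))
        elif s.alerts > max_alerts:
            # Useful but noisy: the response work is a candidate for automation.
            flagged.append((s.name, "high volume: consider automating the response"))
    return flagged


if __name__ == "__main__":
    window = [
        RuleStats("silent_rule", 0, 0),
        RuleStats("noisy_rule", 100, 2),
        RuleStats("busy_rule", 1000, 500),
        RuleStats("healthy_rule", 50, 25),
    ]
    for name, reason in review_rules(window):
        print(f"{name}: {reason}")
```

The point is not the specific thresholds but that every rule gets revisited: a rule that was worth its toil when deployed may not be worth it a year later.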
“The ability to make changes to parts of the system in isolation is essential to creating a supportable system.” So? To me, a bad “integrated” platform is worse than two good tools that can hook into each other via APIs. This is why I think best of breed ultimately won in security, while suites and broad over-promising platforms lost (although this point is frankly very contentious, so let me quickly shuffle away from this particular argument…).
“Simplicity is an important goal for SREs, as it strongly correlates with reliability: simple software breaks less often and is easier and faster to fix when it does break. Simple systems are easier to understand, easier to maintain, and easier to test.” And of course simple systems and processes are easier to secure and monitor for threats. Now, some readers may say “but wait, I am a bank with 300 years of history, every process I have is complex, not simple.” Sure, but you still get to “push back when accidental complexity is introduced.” If your IT is inherently complex, the fight for reducing “excessive and removable” complexity is needed more, not less.
Naturally, simpler systems in your SOC help even more. Do you really need this rule with 5 correlated states or a playbook with 30 decision boxes? And if you have to make a 30-box alert triage flowchart, then at least don’t make a 70-box one.
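One rough way to notice when a playbook is growing past what it needs is to count its decision boxes. The sketch below assumes a playbook can be exported as a mapping from each step to the steps that may follow it; that representation, the step names, and the `max_decisions` budget are all hypothetical.

```python
def count_decisions(playbook):
    """Count steps with more than one outgoing branch (the decision boxes)."""
    return sum(1 for next_steps in playbook.values() if len(next_steps) > 1)


def complexity_report(playbook, max_decisions=10):
    """Summarize playbook size; max_decisions is an illustrative budget, not a standard."""
    decisions = count_decisions(playbook)
    return {
        "steps": len(playbook),
        "decisions": decisions,
        "over_budget": decisions > max_decisions,
    }


if __name__ == "__main__":
    # Hypothetical alert triage playbook: step name -> possible next steps.
    triage = {
        "alert_received": ["is_known_false_positive"],
        "is_known_false_positive": ["close", "is_asset_critical"],
        "is_asset_critical": ["escalate", "enrich"],
        "enrich": ["close"],
        "escalate": [],
        "close": [],
    }
    print(complexity_report(triage))
```

A count like this does not say whether the complexity is essential or accidental, but it does tell you which playbooks deserve that question first.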
To summarize, they say “software simplicity is a prerequisite to reliability.” We can add: also for security and threat “detectability” and “investigability” (can we just say observability?).
So, your mission, should you choose to accept it, is to push unneeded complexity out of your SOC. Where does complexity hide in your SOC? In detection content? Playbooks? Escalation processes? Workflows that involve other teams? Metrics and associated data collection? This is where you go and look at reducing the complexity.
Finally, “For SREs, simplicity is an end-to-end goal: it should extend beyond the code itself to the system architecture and the tools and processes used to manage the software lifecycle.” I wish we had this for security and SOC in particular.
P.S. So I reread this post a few times (well, OK, more than a few times) and it still looks more conceptual than practical. So, perhaps one practical tip: when you encounter or create a SOC process, or a piece of technology in or around your SOC, think “does this add complexity?” and “is this complexity truly necessary?” If the answers are YES and NO, then think about how to do things differently. If they are YES and YES, then ask whether the second answer really is a YES…
Related blog posts:
- “Achieving Autonomic Security Operations: Why metrics matter (but not how you think)”
- “Achieving Autonomic Security Operations: Automation as a Force Multiplier”
- “Achieving Autonomic Security Operations: Reducing toil”
- “Taking an autonomic approach to security operations” video
- “New Paper: “Future Of The SOC: Process Consistency and Creativity: a Delicate Balance” (Paper 3 of 4)”
- “New Paper: “Autonomic Security Operations — 10X Transformation of the Security Operations Center””
- “EP75 How We Scale Detection and Response at Google: Automation, Metrics, Toil” podcast episode