Stealing More SRE Ideas for Your SOC

Anton Chuvakin
Published in Anton on Security
7 min read · Dec 21, 2021

As we discussed in “Achieving Autonomic Security Operations: Reducing toil” (or its early version, “Kill SOC Toil, Do SOC Eng”), your Security Operations Center (SOC) can learn a lot from what IT operations learned during the SRE revolution. In this post of the series, we plan to extract the lessons for your SOC centered on another SRE principle: evolving automation.

First, for many security operations teams, automation in a SOC is about saving time by automating routine tasks. To me, this constitutes the current conventional wisdom about SOC automation. However, in this post, we want to reveal a broader truth about automation in your security operations activities, drawing lessons from the field of site reliability engineering (SRE). Naturally, I am not an SRE, but I feel that my analyst experience qualifies me to translate or “port” the findings from their domain (SRE) to ours (SOC), and to make new discoveries in the process too.

Security operations materials often point out that automation is a force multiplier, not magic. The SRE book says the same: “For SRE, automation is a force multiplier, not a panacea.”

However, the book also adds that “multiplying force does not naturally change the accuracy of where that force is applied.” This reminds us that automating a broken process often makes it more broken, but also that automating something that isn’t game-changing or systemic for a SOC makes you only slightly better, if that.

By the way, this is why the most common starter SOAR playbook is about phishing, a major time-suck of many aspiring SOCs (I’ve heard one spent 40% of analyst time on phishing response and that was after the email security gateway did its work).

So people often point out that the value of automation is about saving time. However, both security operations center practitioners and SREs agree that consistency is also a big part of that value (“What exactly is the value of automation? Consistency!”). In fact, “automation provides more than just time saving, so it’s worth implementing in more cases than a simple time-expended versus time-saved calculation might suggest.” Think about it: automation is not only about saving time and scaling (“scale is an obvious motivation for automation”), but also about consistency of what gets done whenever it needs to be done. By the way, how do you make security processes consistent yet still allow for creativity, as in threat hunting? We will explore this in the next SOC paper in January.

Speed does come up a lot in SRE discussions of automation; after all, “humans don’t usually react as fast as machines.” In the past, I largely implied (even in 2009) that sub-second speed matters little in security, especially in the day and age of 200+ day response timelines. Guess what? With ransomware, speed does matter. If you detect it via a ransom note, it won’t matter how good your SOC was …

To summarize, the main lesson from SRE is that “the factors of consistency, quickness, and reliability dominate most conversations about the trade-offs of performing automation.” These lessons work well when starting to make your SOC scale faster than the threats.

Further, I picked up a particular new insight from the SRE book, namely that automation separates the operation from an operator (“Decoupling operator from operation is very powerful.”). Why is it good? Glad you asked: “once you have encapsulated some task in automation, anyone can execute the task.” What does this solve? Some of the talent shortage problems in your SOC! This again gives us a chance to scale faster than the growth of threats and assets.

Here is another very useful reminder for your SOC from the world of SRE: “automatic systems also provide a platform.” What does it mean? That script you wrote is not a platform, even if it automates something. The way I think about it, the platform is a programmable entity, a base to develop other cool things. This means you have a chance to go for a more systematic automation of your current and future SOC activities.

Also, the SRE world delivers a very fun, slightly paradoxical consequence: “A platform also centralizes mistakes. In other words, a bug fixed in the code will be fixed there once and forever.” Think about it for a second! This is not about SOC being a great place to come and make mistakes … this is about the fact that you go to ONE place to look for mistakes, rather than chase them over 50 tools and 200 regional offices. Centralizing mistakes is awesome, and a new thought for me (and, I am assuming, for many SOC practitioners as well).

Finally, “automation as a platform” leads us to metrics: “a platform can export metrics about its performance, or otherwise allow you to discover details about your process you didn’t know previously.” As you can guess, this delivers sizable — and positive! — implications for your SOC, given how hard security is to measure in general.
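To make the “platform exports metrics” point concrete, here is a minimal sketch, not a real SOAR API. The `PlaybookPlatform` class, its `register` and `run` methods, and the `phishing_triage` playbook name are all made up for illustration. The only idea it shows is that when every playbook runs through one entry point, run counts, failures, and timings are collected in one place without each playbook author doing anything.

```python
import time
from collections import defaultdict
from typing import Any, Callable, Dict


class PlaybookPlatform:
    """Hypothetical minimal runner: all playbooks go through run(), so metrics come for free."""

    def __init__(self) -> None:
        self._playbooks: Dict[str, Callable[[dict], Any]] = {}
        # Per-playbook counters, created on first use.
        self.metrics: Dict[str, Dict[str, float]] = defaultdict(
            lambda: {"runs": 0, "failures": 0, "total_seconds": 0.0}
        )

    def register(self, name: str, playbook: Callable[[dict], Any]) -> None:
        self._playbooks[name] = playbook

    def run(self, name: str, alert: dict) -> Any:
        stats = self.metrics[name]
        start = time.perf_counter()
        try:
            return self._playbooks[name](alert)
        except Exception:
            # Count the failure, then let the caller see the original error.
            stats["failures"] += 1
            raise
        finally:
            stats["runs"] += 1
            stats["total_seconds"] += time.perf_counter() - start


platform = PlaybookPlatform()
platform.register(
    "phishing_triage",  # hypothetical playbook name
    lambda alert: "benign" if alert.get("sender_known") else "suspicious",
)
```

Anyone on the team can now call `platform.run("phishing_triage", alert)`, and `platform.metrics` shows how often it ran and how often it broke.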

To my surprise, our SRE colleagues also pointed out a few negatives of automation. Everybody likes to point out that problems with automation stem from automatic systems causing damage. This can happen both in the operations realm and, of course, in our beloved domain of cyber. The Google SRE book describes beautifully horrible examples where many production systems at an, ahem, major tech company were deleted by automation, reimaged straight to demagnetized dust with enviable scale and effectiveness …

Now, what are the lessons? Here is a new idea as well: “Automation needs to be careful about relying on implicit ‘safety’ signals.” What does that mean in a SOC? Well, a classic example would be blocking access based on badness, without checking for business criticality. We imply that it is safe to block access, but do we have an explicit “this machine is OK to auto-block” list? “This is safe to shut down”? “This is safe to block access to”? Using explicit safety signals for automation is a useful insight for me.
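Here is a rough sketch of an explicit safety signal for auto-blocking. Everything in it is an assumption for illustration: the asset names, the `AUTO_BLOCK_APPROVED` set, the 0.9 threshold, and the `decide_response` function are not from any particular SOAR product. The point is only that “high badness” alone never triggers a block; the asset must also be on a list that someone explicitly signed off on.

```python
from dataclasses import dataclass

# Hypothetical explicit allowlist: assets a human has approved for automatic blocking.
AUTO_BLOCK_APPROVED = {"kiosk-017", "guest-wifi-laptop-3"}


@dataclass
class Verdict:
    action: str  # "block" or "escalate" (escalate = route to a human analyst)
    reason: str


def decide_response(asset_id: str, badness_score: float, threshold: float = 0.9) -> Verdict:
    """Block only when badness is high AND the asset is explicitly approved."""
    if badness_score < threshold:
        return Verdict("escalate", "score below auto-block threshold")
    if asset_id not in AUTO_BLOCK_APPROVED:
        # Absence from the list is not permission: an unknown asset goes to a human.
        return Verdict("escalate", "asset not on explicit auto-block list")
    return Verdict("block", "high score and explicitly approved asset")
```

With this shape, a very bad score on an unlisted payroll server goes to an analyst, while the same score on an approved kiosk is blocked automatically.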

I have learned about other challenges relevant to the world of security operations that, frankly, I hadn’t thought about. Many SOAR users complain that when security tools are swapped, EDR vendors change, APIs and log formats change, and other technologies evolve, their SOAR systems don’t always follow quickly enough. This is a well-known problem in the world of SRE: automation “being maintained separately from the core system therefore suffers from ‘bit rot,’ i.e., not changing when the underlying systems change.”
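One possible way to catch this kind of bit rot early is a small contract check: write down which fields a playbook reads from an upstream tool and regularly compare that list against a fresh sample payload. This is a sketch under assumptions, not a feature of any specific product; the `edr_alert` source name, the field names, and both helper functions are hypothetical.

```python
from typing import Dict, List

# Hypothetical: fields a playbook reads from an upstream tool's alert payload.
EXPECTED_FIELDS: Dict[str, List[str]] = {
    "edr_alert": ["host", "process.name", "severity"],
}


def has_path(payload: dict, dotted_path: str) -> bool:
    """Return True if a dotted path such as 'process.name' exists in a nested dict."""
    node = payload
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


def missing_fields(source: str, sample_payload: dict) -> List[str]:
    """List the expected fields that the sample payload no longer provides."""
    return [path for path in EXPECTED_FIELDS[source] if not has_path(sample_payload, path)]
```

If a vendor update renames `host` to `hostname`, `missing_fields("edr_alert", sample)` returns `["host"]` before the playbook silently stops working.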

Another lesson that we are starting to see in many security operations centers is that automations that run infrequently, such as playbooks triggered by rare attack indicators, are difficult to test. “Automation that is crucial but only executed at infrequent intervals and therefore difficult to test is often particularly fragile because of the extended feedback cycle.” It is easy to refine an efficient playbook that runs 10 times a day, but it’s much harder to run and refine a playbook that is supposed to help with a particular type of advanced attack and may run twice a year, if that. How do we fix that? With more automation: test automation and simulations, in this case.
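As a sketch of what such test automation might look like, the rarely triggered playbook below takes its response actions as a parameter, so a simulation can feed it a synthetic alert and record what it would have done without touching production. The playbook, the `rare_apt_indicator` value, and the action names are all invented for this example.

```python
from typing import List, Tuple


class RecordingActions:
    """Test double: records what the playbook would do instead of doing it."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def isolate_host(self, host: str) -> None:
        self.calls.append(("isolate_host", host))

    def open_ticket(self, summary: str) -> None:
        self.calls.append(("open_ticket", summary))


def rare_attack_playbook(alert: dict, actions) -> None:
    """Hypothetical playbook for an indicator that might fire twice a year."""
    if alert.get("indicator") == "rare_apt_indicator":
        actions.isolate_host(alert["host"])
        actions.open_ticket(f"Rare indicator seen on {alert['host']}")


def simulate(alert: dict) -> List[Tuple[str, str]]:
    """Run the playbook against a synthetic alert and return the recorded actions."""
    actions = RecordingActions()
    rare_attack_playbook(alert, actions)
    return actions.calls
```

Running `simulate(...)` on a schedule gives the playbook a daily feedback cycle even if the real attack shows up twice a year.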

Another great idea for your SOC is hiding deep inside the book. This has been characteristic of many leading security operations centers, and it has been discussed in many detection engineering articles, but it is definitely NOT common at many mainstream SOCs: “The most functional tools are usually written by those who use them.” This is why in our ASO workshops we explain that “SOC analysts” and “detection engineers” must go … and become one, or at least work together closely. “DevOps” your SOC!

We promised to discuss not just how to automate, but the evolution of automation. Here the news from the world of SRE is the most exciting. They chart a path of arriving at the autonomic system (that does not need extraneous automation) by starting from a manual approach and then evolving to automation. Here is what the book says:

  1. “Operator-triggered manual action (no automation)
  2. Operator-written, system-specific automation
  3. Externally maintained generic automation
  4. Internally maintained, system-specific automation
  5. Autonomous systems that need no human intervention”

While some of the above are not obviously related to SOC, let me try this:

[Figure: SRE to SOC translations]

That last step is interesting for sure. There is a lot of fun, thought-provoking stuff in SRE thinking related to “autonomous systems.” For example, they say that while “software-based automation is superior to manual operation in most circumstances, better than either option is a higher-level system design requiring neither of them — an autonomous system.” They further explain that non-autonomous systems are those “where automation replaces manual actions, and the manual actions are presumed to be always performable and available just as they were before.”

Finally, this is a security blog and so I was meaning to end it on a depressing note (compliance!). But then it turned out that SREs do “black humor” almost as well: “If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings.” How is that for SRE noir?

Next we plan to dive into the SLOs of SREs and see what we can learn to make the SOC better!

A fully cooked version of this will be published on a Google Cloud blog, but I’d love some feedback and comments from either the security or the SRE side…

Thanks to Iman Ghanizada for ideas and brainstorming.
