Detection Engineering the SOC: Designing an Incident Response Playbook

Ryan G. Cox - The Cybersec Café
7 min read · Feb 21, 2024


Hey, RCX Security here. My blog has moved over to Substack as of April 2024 under the Cybersec Cafe. If you are here to sign up or keep reading, go here instead.

Thank you for the support!

Welcome back to the second article in the mini-series, Detection Engineering the SOC! In Part II, we’ll be deep diving into the mind of a Security Engineer taking a detection through the Detection Lifecycle (DLC). This time, specifically through the creation of an Incident Response playbook.

If you want to jump around in the series, you can view the table of contents below. Make sure to save the post and subscribe, as I’ll post the final part in the coming weeks. Or, if you want to get ahead, you can also find the series already published on my blog, the Cybersec Cafe.

  1. Writing a Detection Rule
  2. Designing an Incident Response Playbook
  3. Building a SOAR workflow

Looking for daily cybersecurity content? Check me out on Twitter/X!

But, just in case you missed Part I, let me give you a quick recap:

We had a use case for a fictional company that needed an Alert to trigger whenever a login was made to their AWS instance without using Multi-Factor Authentication (MFA). This was because there were some highly-privileged developers that needed to access the console outside of MFA to access some administrator privileges. But, accessing the console without MFA could also be an Indicator of Compromise (IoC), so we wrote a detection by implementing the following pseudo-code:

  • Check if a ConsoleLogin Event was Successful
  • Check if the User logged in without using MFA
  • Check that SAML is not present in the log
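As a refresher, the three checks above can be sketched in Python. This is a minimal sketch, not the rule from Part I: the field names (`eventName`, `consoleLogin`, `mfaUsed`) are borrowed from the sample alert log later in this article, and the SAML check assumes that a federated login leaves the string "saml" somewhere in the event.

```python
import json


def is_non_mfa_console_login(event: dict) -> bool:
    """Return True when an event passes all three checks from the recap."""
    # 1. The event is a ConsoleLogin and it was successful.
    if event.get("eventName") != "ConsoleLogin":
        return False
    if event.get("consoleLogin") != "Success":
        return False
    # 2. The user logged in without using MFA.
    if event.get("mfaUsed") != "No":
        return False
    # 3. SAML is not present in the log.
    #    Assumption: a case-insensitive search of the serialized event is
    #    enough to spot a federated (SAML) login.
    return "saml" not in json.dumps(event).lower()
```

An event that fails any single check is ignored, so the alert only fires for successful, non-MFA, non-SAML console logins.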

If you want to see the step-by-step on how we got here, you can refer back to Part I and then come pick back up right here!

After deploying the detection to our production environment, the next step was to wait for our alert to fire off. And, just as suspected, it did! But now we’re left in a state of wondering, “What now?” We were able to detect on the activity we wanted, but now what do we do with the alert?

What Now?

This is where Incident Response (IR) playbooks come in! IR Playbooks are documented steps and processes to triage an incoming alert to ensure that no malicious action was taken. They also contain the steps needed to escalate an alert to a Security Incident in the case that malicious activity was detected. The thought process behind this is that there needs to be a Standard Operating Procedure (SOP) to follow for every alert. As a Security Operations Center (SOC) scales, so will the number of detections, which in all likelihood means an increase in the number of alerts as well. It would be a nightmare to keep the triage steps in your head, and it is also not ideal for onboarding new members and scaling the team. Not to mention the continuous threat of alert fatigue, but that’s a topic in itself.

Like I mentioned above, the IR Playbook generally has two routes:

  1. Steps to take in order to close the alert and mark as non-malicious activity.
  2. Steps to take in order to escalate the alert to an Incident if malicious activity is detected.

Remember, not all alerts are engineered the same. Just looking at our current scenario, we’ve created a detection that will fire on standard activity, so we need an SOP to verify that the activity was expected. However, the activity this detection is built around can also be considered an Indicator of Compromise (IoC), so we also need an SOP for escalating the alert to an Incident.

On the flip side of the coin, some alerts cover activity that should never occur during regular operations, and these Indicators of Compromise can be escalated immediately to an Incident. In those cases, an SOP to verify the activity may not be necessary.

Detection Engineering and IR Playbooks

So, for the Detection we crafted in Part I, we’ll need to create an SOP for Analysts to follow while triaging these alerts. But how do we craft an IR Playbook? Well, an IR playbook will generally consist of the following:

  1. Alert Title: A high-level description of what is happening.
  2. Summary: What kind of activity triggered this event? A more descriptive version of the title.
  3. Initial Steps for Triage: Where should an Analyst look to investigate the activity?
  4. Quick Links: Allow the Analyst to easily find related resources, IP lookups, or documentation.
  5. Saved Queries: These are premade queries crafted to investigate the activity in the SIEM. These should also be quick links. But it’s also important to document the query inline to explain how to use it. Make it so that it filters for only needed columns, and allows the Analyst to quickly fill in the necessary fields.
  6. Response Actions: Document what standard behavior looks like and how to verify it so the ticket can be closed. Also document the steps needed to escalate the alert to an Incident.
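If you keep playbooks as structured data rather than free-form documents, the six sections above map naturally onto a small data model. The following is only a sketch under that assumption; the `Playbook` and `SavedQuery` classes and the `missing_sections` helper are hypothetical names for illustration, not part of any SIEM or ticketing product.

```python
from dataclasses import dataclass, field


@dataclass
class SavedQuery:
    name: str          # e.g. "SAVED_QUERY_1"
    description: str   # what the query answers, documented inline
    query: str = ""    # query text in your SIEM's own query language


@dataclass
class Playbook:
    title: str
    summary: str
    triage_steps: list[str] = field(default_factory=list)
    quick_links: dict[str, str] = field(default_factory=dict)
    saved_queries: list[SavedQuery] = field(default_factory=list)
    close_steps: list[str] = field(default_factory=list)     # route 1
    escalate_steps: list[str] = field(default_factory=list)  # route 2

    def missing_sections(self) -> list[str]:
        """Names of the sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


draft = Playbook(
    title="Successful AWS Login without MFA",
    summary="A console login was made to AWS without using MFA.",
    triage_steps=["Check the user against the allow list"],
)
print(draft.missing_sections())
```

A check like `missing_sections` can serve as a simple review gate, so a playbook is not published until both response routes are documented.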

Crafting the IR Playbook

Now that we have an idea of what we need to create, let’s assume that the following alert has come through from the log, and craft an IR Playbook around it.

Alert Title: AWS Login without MFA successful by User RCXCybersecCafe

Alert Log:

{
  "accessKeyId": "fahsdjklnjkllnasd",
  "accountId": "asdf8ae3hlas",
  "awsRegion": "us-west-2",
  "consoleLogin": "Success",
  "eventCategory": "Management",
  "eventId": "adsf38-3n8d-3nd1-9d83-asd83nfla",
  "eventName": "ConsoleLogin",
  "eventSource": "signin.amazonaws.com",
  "eventTime": "2054-03-14T23:12:12Z",
  "eventType": "AwsConsoleSignIn",
  "eventVersion": "1.10",
  "ip": "8.12.54.2",
  "managementEvent": true,
  "mfaAuthenticated": false,
  "mfaUsed": "No",
  "readOnly": false,
  "recipientAccountId": 234789104,
  "type": "AssumedRole",
  "userIdentity": "RCXCybersecCafe"
}
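Before triage starts, it helps to pull out the handful of values an Analyst will paste into saved queries. Here is a small sketch of that step, using a trimmed copy of the alert log above; the `triage_fields` and `alert_title` helpers are hypothetical names used only for illustration.

```python
# Trimmed copy of the sample alert log, keeping only the fields used below.
alert = {
    "awsRegion": "us-west-2",
    "eventName": "ConsoleLogin",
    "eventTime": "2054-03-14T23:12:12Z",
    "ip": "8.12.54.2",
    "mfaUsed": "No",
    "userIdentity": "RCXCybersecCafe",
}


def triage_fields(event: dict) -> dict:
    """Pick out the values an Analyst needs to start investigating."""
    return {
        "user": event["userIdentity"],
        "ip": event["ip"],
        "time": event["eventTime"],
        "region": event["awsRegion"],
    }


def alert_title(event: dict) -> str:
    """Build the alert title in the same shape as the one shown above."""
    return f"AWS Login without MFA successful by User {event['userIdentity']}"


print(alert_title(alert))
print(triage_fields(alert))
```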

Incident Response Playbook for Successful AWS Login without MFA

  • Title: Successful AWS Login without MFA
  • Summary: A console login was made to AWS without using MFA.
  • Initial Steps for Triage
      • Check the user against the allow list
      • Check Out-of-Band Communication Channel for access requests
      • Investigate recent activity using SAVED_QUERY_1
      • Check the IP location against the MFA locations for the user using SAVED_QUERY_2
      • Investigate IP if needed
          • VirusTotal
          • Talos Intelligence
          • Internal Resources
      • Check last 10 non-MFA Successful Logins using SAVED_QUERY_5
      • Check history of previously triggered alerts in ALERT_HISTORY_SYSTEM
  • Response Actions
      • Behavior looks Standard
          • Ask user for confirmation publicly in channel.
          • Acknowledge behavior and close out the Alert.
      • Doubts about Behavior
          • Reach out to the Security Team using an internal, private channel. Ask if anyone is able to verify the access.
          • Verify use-case/business-case for the non-MFA access in a public channel with the user.
          • Review recent user logins, double check for suspicious activity or locations using SAVED_QUERY_3 and SAVED_QUERY_4.
          • Escalate to an official Incident with the Security Team. Lock down AWS console and account if deemed necessary.
  • Saved Queries (In this area, you would generally also write the Queries using the Query Language of your SIEM or logging system. Document what each piece means for easy understanding, and design the queries so an Analyst can easily remove values and input the values from the alert.)
      • SAVED_QUERY_1: Events from user OR ip address
      • SAVED_QUERY_2: Recent user Logins by IP
      • SAVED_QUERY_3: Recent IP logins by user
      • SAVED_QUERY_4: User actions last 48 hours
      • SAVED_QUERY_5: Last 10 non-MFA Successful Logins
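Real saved queries live in your SIEM's query language, which this article does not assume. To show the intent of two of them, here is a sketch that runs the same logic over a plain Python list of events shaped like the alert log above. The function names and the `COLUMNS` selection are hypothetical; the point is that each query takes the alert's values as parameters and returns only the columns an Analyst needs.

```python
# Only the columns an Analyst needs during triage (an illustrative choice).
COLUMNS = ("eventTime", "eventName", "userIdentity", "ip", "awsRegion")


def only_columns(event: dict) -> dict:
    """Drop every field that is not needed for triage."""
    return {column: event.get(column) for column in COLUMNS}


def events_from_user_or_ip(events: list, user: str, ip: str) -> list:
    """SAVED_QUERY_1: events tied to the user OR the IP address."""
    return [
        only_columns(e)
        for e in events
        if e.get("userIdentity") == user or e.get("ip") == ip
    ]


def last_non_mfa_logins(events: list, limit: int = 10) -> list:
    """SAVED_QUERY_5: the most recent successful non-MFA console logins."""
    matches = [
        e
        for e in events
        if e.get("eventName") == "ConsoleLogin"
        and e.get("consoleLogin") == "Success"
        and e.get("mfaUsed") == "No"
    ]
    # Timestamps in the same ISO 8601 UTC format sort correctly as strings.
    matches.sort(key=lambda e: e["eventTime"], reverse=True)
    return [only_columns(e) for e in matches[:limit]]
```

Because the user, IP address, and limit are parameters, an Analyst only has to swap in the values from the alert, which is the same property you want from the real queries.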

Congratulations! We just crafted the Incident Response playbook for our Detection. As you can tell, the process of crafting an IR Playbook is not necessarily a difficult one, but definitely requires some thought. It’s important to consider all avenues needed to investigate the activity, and put yourself in the mind of an attacker: What would a malicious actor be attempting to do in this situation? How could the attacker laterally move, or vertically escalate privileges? How could we stop them? Here are some other good questions to ask yourself while drafting IR Playbooks:

  • What do I need to know to feel comfortable about this activity being listed as normal?
  • How can I determine if this activity is normal?
  • Do we have documentation anywhere useful to this alert?
  • Which IoCs, if present, would point to malicious activity?
  • What actions need to be taken to mitigate the threat?

Crafting Incident Response Playbooks is a necessary process for any SOC, and heavily contributes to lowering triage times and scaling the team.

Closing the Ticket

Let’s take ourselves back to our scenario now. As the Analyst assigned to the incoming alert, we’re able to check on recent user activity using our saved queries and verify that this is common/expected activity for the user. It also seems like the IP address is consistent with their common login locations, so after verifying in a public channel that the access was intended, we can make note of the access and close out the ticket. Easy!
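The decision we just walked through can be written down as a small function. This is a deliberately simplified sketch of the two playbook routes: the `triage_outcome` name, its parameters, and the idea that three checks are enough to close a ticket are all assumptions for illustration, and a real Analyst would weigh far more context.

```python
def triage_outcome(
    user: str,
    ip: str,
    allow_list: set,
    known_ips: dict,
    user_confirmed: bool,
) -> str:
    """Return "close" when behavior looks standard, otherwise "doubts".

    "doubts" means: follow the "Doubts about Behavior" branch of the
    playbook, which may end in escalation to an Incident.
    """
    # The user must be one of the developers allowed to skip MFA.
    if user not in allow_list:
        return "doubts"
    # The IP must match one of the user's usual login locations.
    if ip not in known_ips.get(user, set()):
        return "doubts"
    # The user must have confirmed the access in the public channel.
    if not user_confirmed:
        return "doubts"
    return "close"


outcome = triage_outcome(
    user="RCXCybersecCafe",
    ip="8.12.54.2",
    allow_list={"RCXCybersecCafe"},
    known_ips={"RCXCybersecCafe": {"8.12.54.2"}},
    user_confirmed=True,
)
print(outcome)
```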

Moving Forward in the Detection Lifecycle

Now let’s move forward a bit, and assume we now have playbooks implemented for our detection suite, and alerts have been firing for some time now. Analysts are able to swiftly triage incoming tickets, but as the SOC continues to scale, alert fatigue is beginning to set in over the team. The number of tickets is slowly beginning to become overwhelming. You and other members of the SOC can’t seem to get any other work done due to the influx of these alerts!

As an organization, what do you do? Do you hire more Analysts to take on the incoming workload? You need to make sure the company is kept secure, but is taking on multiple new employees in the budget?

Don’t worry. That’s where Part III comes in.

Now that your SOC is firing on all cylinders, an influx of alerts is expected — that means you’re doing your job! But you can only go back to your alerts and tune your detection logic so much to fight off the alert fatigue. Now it’s time for the next phase, an extremely critical phase for scaling teams: Automation. Stay tuned for the third and final article in the Detection Engineering the SOC mini-series, where we’ll be continuing through the detection lifecycle and crafting automation workflows!

Enjoying the series? I’m covering everything Cybersecurity here on Medium, or on my personal blog, the Cybersec Cafe.

Looking for daily cybersecurity insights? Check out Pen Testing Walkthroughs, Bug Bounty Report deep dives, and more on Twitter/X!
