Elevating Security Alert Management Using Automation

Josh Liburdi · Published in Brex Tech Blog · Jan 26, 2023
Robots working on an assembly line, generated by AI.

This is an in-depth post that describes the Brex Detection and Response Team’s approach to managing and automating security alerts at scale, and we hope that it inspires other teams in the industry to take their security alert management to the next level using automation!

This post covers the following topics; feel free to jump around and reference them as needed:

  • What Is Alert Management?
  • Requirements for Scaling Alert Management
  • Blueprint for a Loosely Coupled Alert Management System
  • Contextual Suppression and Deduplication
  • Managing the System
  • Artisanal and Mass-Produced Automation
  • Building Automation with Tines
  • Easing into Distributed Alerting
  • Labeling, Metrics, and You
  • Choosing Impactful Metrics
  • Monitoring and SLOs
  • Why You Need Alert Management

What Is Alert Management?

Detection and Response teams manage and build systems that generate alerts and are responsible for triaging those alerts for malice — alert management is what happens between the generation and triage of an alert.

Most alert management systems are simple: they present alerts to analysts as quickly as possible, and very few give teams what they actually need, which is reduced toil through automation. Standalone alert consoles used to be common many years ago (Sguil is a great example), but today alert management is tightly integrated with SIEMs and has limited use outside of the SIEM. Alternate methods of alert management have been proposed over the years, such as distributed alerting, but these are usually presented in a “draw the rest of the owl” style blog post that doesn’t go into detail on the requirements and systems needed to elevate security alert management. This post is our attempt to help you draw the rest of the owl.

1. Read a blog post about distributed alerting 2. Scale distributed alerting across dozens of unique detection systems and 1,000s of diverse employees

Requirements for Scaling Alert Management

Our journey to elevating alert management started with facing our own inadequate practices: we used to simply dump every alert to a Slack channel. This solution may be fine when your team has a few alerts, but it breaks down quickly as the volume, diversity, and scope of alerts change. The Detection and Response Team at Brex manages 100s of custom detections, and it didn’t take long for us to run into this problem.

How do you assign alerts in Slack? Follow up on them? What happens when the channel receives duplicate alerts over and over? How do you temporarily stop alerts from appearing without modifying the detection system that creates them? What about triage automation, how would that work? As we continued to build custom detections, it became clear to us that Slack is not an “alert console”: it gives you no way to manage, organize, or collect metrics from alerts, all of which are requirements of an alert management system.

Before we designed our system, we spent a lot of time discussing our pain points and the problems the system needed to solve. We came up with many requirements, but these are the most important ones:

  • Must be able to ingest alerts from any detection system
  • Must contextually suppress and deduplicate alerts in the system
  • Must support triage automation, including distributed alerting
  • Must have robust labeling and metrics

It’s worth calling out that before we designed the system we evaluated the offering from our SIEM vendor — it addressed almost none of our requirements and, based on my time in the industry, I don’t believe that any other vendor would have fared very well either. Let’s look at each of these requirements more closely by describing the system we’ve built and how it helps us scale.

Blueprint for a Loosely Coupled Alert Management System

Like many of the systems we build, our alert management system is loosely coupled and gives us flexibility to extend it in ways even we can’t imagine today. This diagram describes the system’s high-level design:

Diagram of a loosely coupled alert management system.

The system has two points of loose coupling: ingest and automation. The loose coupling on ingest ensures that we can receive alerts from any detection source (we most commonly utilize webhooks), while the loose coupling on automation allows each alert to dynamically call automation, including distributed alerting, as needed. The system uses a normalized schema for alerts (JSON that contains an alert name and payload); if alerts don’t natively meet the schema, then they are pre-processed before being introduced into the system.
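As an illustration, pre-processing can be as small as remapping a detection source’s fields into the normalized shape. This is a minimal sketch; the vendor field names are invented:

# Minimal sketch: remap a hypothetical vendor alert into the normalized
# schema (alert name + payload) before it enters the system.
def normalize(vendor_alert: dict) -> dict:
    return {
        "alert_name": vendor_alert["rule_title"],         # invented field name
        "alert_payload": vendor_alert.get("fields", {}),  # invented field name
    }

# A webhook receiver would call normalize() and forward the result
# to the system's ingest point.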

It’s also important to mention that everything else is tightly coupled on purpose:

  • All alerts are assigned standardized metadata for tracking and metrics.
  • All alerts are checked for suppression and deduplication.
  • All messages sent to users have an identical shape (but not content).

At this point it’s worth mentioning that we’ve built this system using Tines, Jira (Cloud), and AWS DynamoDB. These components aren’t requirements for building a similar system — if you swap in your preferred SOAR / automation platform, ticketing system, and key-value database, then everything will work just the same.

Contextual Suppression and Deduplication

Context: every security analyst says they need it, but everyone seems to have a different definition for it. If you’ve ever worked an alert queue and thought to yourself, “I wish I could stop these alerts from appearing right now” or “Why am I looking at activity that someone else is already triaging,” then this section is for you. Within the first two weeks of deployment, this feature of the system reduced our alert volume by 25%, saving 3 to 4.5 hours of manual effort.

In our alert management system, “context” is information derived from the alert payload that is used as metadata for suppression¹, deduplication², and metrics. Reduction of toil in the system is primarily attributed to its ability to use context to stop wasteful alerts from getting to the team.

This creates the opportunity for the team to, for example, suppress alerts that we know require tuning by a detection engineer or ignore duplicate alerts for activity that is being investigated but may be on hold while we wait for additional information. These alerts are never dropped — they still flow through the rest of the system and generate a ticket — but they are not assigned to a person for triage.

All of this is possible because we generate context for every alert from two sets of information: the alert name and fields from the alert payload. For example, if the system receives an alert like this …

{
  "alert_name": "CornyRat C2",
  "alert_payload": {
    "client_ip": "1.2.3.4",
    "server_ip": "5.6.7.8"
  }
}

… then it generates contextual metadata as a SHA-256 hash by parsing the name and payload.

If the alert is configured to use the context from the server_ip field (the C2 server), then the name and the context field’s value are joined and hashed to create a “contextual identifier.” Any field from the payload can be used, including combinations of fields; if the context were client_ip and server_ip instead, then the contextual identifier would change. Refer to the table below for the three configurable permutations for this alert:

Deriving contextual identifiers from an alert (name and payload).
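As a rough sketch (not our exact implementation; the join delimiter and encoding are assumptions), deriving the identifier looks something like this:

import hashlib

def contextual_identifier(alert: dict, context_fields: list[str]) -> str:
    # Join the alert name with the values of the configured context fields,
    # then hash the result into a stable, opaque identifier.
    values = [alert["alert_payload"][field] for field in context_fields]
    material = "|".join([alert["alert_name"]] + values)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

alert = {
    "alert_name": "CornyRat C2",
    "alert_payload": {"client_ip": "1.2.3.4", "server_ip": "5.6.7.8"},
}

# Configuring context as server_ip vs. client_ip + server_ip produces
# two different identifiers for the same alert.
print(contextual_identifier(alert, ["server_ip"]))
print(contextual_identifier(alert, ["client_ip", "server_ip"]))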

Since these identifiers are created for every alert as it enters the system, it becomes trivial to check an alert’s contextual suppression and deduplication state. We use DynamoDB as a key-value database for managing system state — alerts enter, identifiers are generated, and DynamoDB is checked to see if the alert is either suppressed or if it should be deduplicated based on context.
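A simplified sketch of that state check, assuming a DynamoDB table keyed on the identifier (the table name and item attributes here are hypothetical):

import time
import boto3

# Hypothetical table keyed on the global or contextual identifier.
table = boto3.resource("dynamodb").Table("alert-state")

def should_skip_triage(identifier: str) -> bool:
    # If an unexpired suppression or active-duplicate record exists for
    # this identifier, the alert still generates a ticket but is not
    # assigned to a person for triage.
    item = table.get_item(Key={"identifier": identifier}).get("Item")
    if not item:
        return False
    return int(item.get("expires_at", 0)) > int(time.time())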

This feature excels as a solution because of its simplicity:

  • Generating identifiers is invisible to the end user — analysts speak in readable text (like CornyRat C2), but the system speaks in SHA-256 hashes.
  • Deduplication is fully automated — it just happens, like magic!
  • Suppression is a single-click action for analysts, and they can choose to temporarily suppress alerts globally (by alert name) or contextually (by contextual identifier).

All of this is given to the analyst in a plain, boring Jira task — unlike a fancy SIEM, this gives the analyst exactly what they need and nothing more:

Example alert generated by the system.

Managing the System

Managing a system like this is just as important as using it, and we manage ours in three places:

  • Detection build pipeline
  • Team Slack bot
  • Management console built in Tines as a webform

Our build pipeline is where we configure alert context and automation using code (we’ve adopted the TOML format used by organizations like Elastic), but everything is also configurable using our management console:

Webform built in Tines for updating alert metadata.
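For the code path, a hypothetical alert configuration might be declared in TOML and loaded like this (the keys are invented for illustration, not our exact schema):

import tomllib  # standard library in Python 3.11+

# Hypothetical configuration for a single alert; keys are illustrative only.
config_text = """
[alert]
name = "CornyRat C2"
context = ["server_ip"]
automation = ["summarize_user", "distributed_alert"]
"""

config = tomllib.loads(config_text)["alert"]
print(config["name"], config["context"], config["automation"])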

The same actions are available in our Slack bot. Here’s an example of retrieving alert metadata:

Retrieving alert metadata using the team Slack bot.

Artisanal and Mass-Produced Automation

Although the entire system is automated, when we say “automation” we mean anything that interprets and processes an alert before it is assigned to a responder.

This can be literally anything that the responder needs, but usually includes some combination of these actions:

  • Collecting and summarizing information (such as user information)
  • Providing automated response options (such as host isolation)
  • Reaching out to an impacted user for confirmation (distributed alerting)

My favorite example of this is our “tracer alert” that the system periodically generates as a wellness check to ensure everything is working correctly: it randomly selects a food and topping pairing, then the automation gives the responder a link to buy the topping on Amazon.

Now that you mention it, cajun spice on a hot dog would be pretty tasty 🤔

Building Automation with Tines

On the backend, the alert automation is loosely coupled in Tines and uses the Atlassian Document Format to create comments. Although the example above is silly, the loose coupling allows any alert to opt in to using any automation. We typically have two types of automation:

  • Automation that is specific to a single alert (like the example)
  • Automation that applies generically to several alerts (e.g., collecting information about a compromised host)

Below is our food pairing automation from Tines. Note that the automation always has an outcome, and if the outcome is not “triage,” then the alert isn’t assigned to a responder for review.

Loosely coupled alert automation built in Tines.

The overarching goal of the system is to avoid wasteful alerts, and automation is a major part of that.
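Conceptually, the outcome check at the end of every automation behaves like this sketch (the outcome names and helper functions are illustrative, not Tines configuration):

def assign_to_responder(alert: dict) -> None:
    print(f"assigning {alert['alert_name']} to a responder")

def resolve_ticket(alert: dict, resolution: str) -> None:
    print(f"resolving {alert['alert_name']} as {resolution}")

def route(alert: dict, outcome: str) -> None:
    # Every automation produces an outcome; only "triage" results in a
    # person being assigned. Any other outcome resolves the ticket
    # without human review.
    if outcome == "triage":
        assign_to_responder(alert)
    else:
        resolve_ticket(alert, resolution=outcome)

route({"alert_name": "CornyRat C2"}, outcome="benign_automation")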

Easing into Distributed Alerting

Automation is also where we introduce distributed alerting, although our approach differs slightly from others you may have read about. We’re easing our way into it because we want to make sure it’s a good experience for our team and the company, and we have a phased rollout strategy based on each alert’s focus and attribution difficulty:

  1. User-centric alerts
  2. Host-centric alerts
  3. Infrastructure-centric alerts

We chose this order because attribution, and therefore assigning the distributed alert to a specific person for review, becomes increasingly difficult as you move down the list. For example, it’s easy to attribute anomalous login activity to a specific user, but much more difficult to attribute anomalous commands executed by a cron job on a cloud resource to the team responsible for that resource.

This is an example of what someone sees when the system sends them an alert:

An example distributed alert sent by the system to a user.

There are several points to call out about how we present this to the user:

  • The message’s shape is always the same, but the details change for each alert — the goal is to train our users to recognize when a security alert needs their attention and to keep them familiar with our alerts even if the details change.
  • We make the details user friendly — this alert may actually be called “geo_infeasibility” in the backend, but we translate it to something more understandable for the user.
  • Users only ever have two options — confirm or deny, yes or no, pizza or hot dogs.
  • After the user clicks “Confirm,” multiple techniques can be used to validate their input, including push authentication and device attestation.

We use a single Tines story for all user interaction (if you’re a Tines user and would like to know more, then keep an eye on the Story Library — we’ll be publishing some of our work this year); everything you see in the screenshot above is customizable by the security engineer who is building the automation. Here’s an example I recently sent to a co-worker asking which cartoon cat they like more:

The answer is obvious, right?

Notice that the shape of this message is the same as the distributed alert, but all of the content is different! Creating messages like these only takes us a few minutes, so it’s very easy to create unique, easily understood distributed alerts that have company-wide reach.
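If you were building this kind of fixed-shape, variable-content message with Slack’s Block Kit, a minimal sketch could look like the following (the copy, details, and action IDs are placeholders, not our actual template):

def build_security_message(title: str, details: dict, action_id: str) -> list:
    # The shape never changes: a header, a block of details, and exactly
    # two buttons. Only the text content varies per alert.
    detail_lines = "\n".join(f"*{k}:* {v}" for k, v in details.items())
    return [
        {"type": "header", "text": {"type": "plain_text", "text": title}},
        {"type": "section", "text": {"type": "mrkdwn", "text": detail_lines}},
        {
            "type": "actions",
            "elements": [
                {"type": "button", "text": {"type": "plain_text", "text": "Confirm"},
                 "style": "primary", "action_id": f"{action_id}_confirm"},
                {"type": "button", "text": {"type": "plain_text", "text": "Deny"},
                 "style": "danger", "action_id": f"{action_id}_deny"},
            ],
        },
    ]

# Example: a sign-in confirmation with user-friendly details.
blocks = build_security_message(
    "Did you sign in from a new location?",
    {"Location": "Lisbon, Portugal", "Time": "2023-01-24 16:02 UTC"},
    action_id="geo_infeasibility",
)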

Labeling, Metrics, and You

It’s easy to point to lack of automation as the biggest failure of most commercial alert management systems (again, usually packaged with a SIEM), but another failure is labeling and metrics. Most systems classify alerts into one of two states, true positive or false positive, even though anyone who has ever worked an alert queue knows that reality is not that simple. We’ve adopted the SMAC methodology proposed by Rapid7 to address this problem.

SMAC (Status, Malice, Action, Context) correctly addresses the diversity of activity that is described in security alerts. For example, a single alert can simultaneously be a true positive (the alert correctly identifies the activity of interest), benign (the activity is not a risk), and require follow-up (the team has to take an action to inform others about the activity). We implement SMAC in Jira using a combination of custom fields and labels; status and malice are fields, while action and context are labels.
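A sketch of what recording a SMAC outcome could look like against the Jira Cloud REST API; the site, credentials, custom field IDs, and label names below are made up for illustration:

import requests

JIRA_SITE = "https://your-org.atlassian.net"       # placeholder site
AUTH = ("automation@example.com", "api-token")     # placeholder credentials

def record_smac(issue_key: str, status: str, malice: str, labels: list[str]) -> None:
    # Status and Malice live in (hypothetical) single-select custom fields;
    # Action and Context are recorded as labels on the same issue.
    body = {
        "fields": {
            "customfield_10101": {"value": status},  # e.g., "True Positive"
            "customfield_10102": {"value": malice},  # e.g., "Benign"
            "labels": labels,                        # e.g., ["action-notify", "context-travel"]
        }
    }
    resp = requests.put(f"{JIRA_SITE}/rest/api/3/issue/{issue_key}", json=body, auth=AUTH)
    resp.raise_for_status()

# Example: a true positive that turned out to be benign but needs follow-up.
record_smac("DR-1234", "True Positive", "Benign", ["action-notify", "context-travel"])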

Choosing Impactful Metrics

Metrics, on the other hand, can be even more complex — they can have a significant impact on the perceived performance and value of a team. Who are the stakeholders and what do they need to know? Here’s what we came up with for our initial operational metrics (OMs) and key performance indicators (KPIs):

  • Number of alerts generated, by alert name (OM)
  • Number of alerts suppressed (OM)
  • Number of alerts deduplicated (OM)
  • Time saved using automation (OM)
  • Mean Time to Detect³ (OM)
  • Mean Time to Acknowledge⁴ (KPI)
  • Mean Time to Resolve⁵ (KPI)
  • Percent of alerts that utilize automation (KPI)

The biggest takeaway from our metrics is that our team optimizes for speed and scale (indicated by our KPIs). Also consider what is not present in these metrics — there is no metric that explicitly indicates “quality of alert” (such as percentage of true positive alerts); instead, we measure alert volume to ensure that the team is properly staffed to address activity we observe, and if the volume becomes too great, then we know that changes need to be made (e.g., tuning alerts, refactoring alerts, building new automation).
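As a simple illustration of how the three timing metrics relate (using the definitions in the footnotes), given per-alert timestamps you could compute them like this:

from statistics import mean

# Hypothetical per-alert timestamps (epoch seconds): when the activity
# occurred and when the alert was generated, acknowledged, and resolved.
alerts = [
    {"activity": 1000, "generated": 1300, "acknowledged": 1900, "resolved": 5500},
    {"activity": 2000, "generated": 2120, "acknowledged": 2500, "resolved": 4100},
]

mttd = mean(a["generated"] - a["activity"] for a in alerts)      # Mean Time to Detect
mtta = mean(a["acknowledged"] - a["generated"] for a in alerts)  # Mean Time to Acknowledge
mttr = mean(a["resolved"] - a["acknowledged"] for a in alerts)   # Mean Time to Resolve
print(mttd, mtta, mttr)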

The team at Expel has an excellent series on SOC metrics that is a must-read for anyone who is considering building metrics for systems like this.

Monitoring and SLOs

By now you probably have the idea that this system is somewhat complex, and for that reason it has to be monitored like any other production system. We determined that systems like this can have their failure scenarios distilled into one easily identifiable characteristic: if something comes in, then something must come out.

With that in mind, our primary method of monitoring relies on a Service Level Objective (SLO) that is measured using this SLI:

Number of alerts processed / Number of alerts submitted

For example, if 100 alerts come in and 90 alerts come out, then the system is operating at 90% reliability — far below our standards. We generate the SLI by emitting metrics at critical points in the system, including:

  • When an alert enters the system
  • When an alert is suppressed (exit)
  • When an alert is deduplicated (exit)
  • When an alert is assigned to a responder (exit)

The metrics use a simple JSON schema and are sent directly to our SIEM for monitoring. Here’s an example that would be emitted when an alert exits the system via suppression:

{
  "alert_id": "374d84e48e320eb5f708891f95986baea0e0d6924da90ea601dc02f4e323794e",
  "state": "suppressed",
  "timestamp": 1674574173
}
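Given a batch of these emitted events, the SLI is exits divided by entries. Here is a minimal sketch; the state names other than “suppressed” are assumed for illustration:

# Compute the SLI: alerts that exited the system (suppressed, deduplicated,
# or assigned to a responder) divided by alerts that entered it.
EXIT_STATES = {"suppressed", "deduplicated", "assigned"}

def sli(events: list[dict]) -> float:
    entered = sum(1 for e in events if e["state"] == "entered")
    exited = sum(1 for e in events if e["state"] in EXIT_STATES)
    return exited / entered if entered else 1.0

events = [
    {"alert_id": "a1", "state": "entered", "timestamp": 1674574100},
    {"alert_id": "a1", "state": "suppressed", "timestamp": 1674574173},
    {"alert_id": "a2", "state": "entered", "timestamp": 1674574200},
]
print(sli(events))  # 0.5: one of two alerts has not exited yet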

The system stores event data in memory for several days (thanks, Tines!), so debugging and remediation is very straightforward.

Why You Need Alert Management

Continuous change is one of the few constants in security, and that includes real-time monitoring and threat detection. Detections that are perfectly crafted and produce nearly zero false positives today could completely break down tomorrow due to changes in the business environment or user behavior. If you don’t practice good alert management, then it is difficult — if not impossible — for your team to defend against these inevitable changes.

We experienced this firsthand at Brex — we watched our alert volume grow 10x over the past year with little input from our team. Knowing that the business would not stop growing or diversifying, we decided to tackle the problem holistically using automation.

Using data pulled from our system over the past two weeks, we’ve seen these measurable improvements:

  • 20% of alerts were handled purely using automation
  • 16 hours of time saved by automation
  • 5 alerts represented 56% of all alert activity, indicating high return-on-investment tuning opportunities

[1]: Suppression is an action taken by a person that temporarily overrides the system and prevents alert generation.

[2]: Deduplication is an action taken by the system to prevent alert generation if the same alert is already active within the system. For example, if an alert was assigned to a team member for triage and another alert comes in, then the second alert is deduplicated.

[3]: Mean Time to Detect, also known as MTTD, is the time between initial activity and alert generation.

[4]: Mean Time to Acknowledge, also known as MTTA, is the time between alert generation and acknowledgement.

[5]: Mean Time to Resolve, also known as MTTR, is the time between alert acknowledgement and resolution.
