Building the Threat Detection Ecosystem at Brex

Julie Agnes Sparks
Brex Tech Blog
Dec 7, 2022

Introduction

The Detection & Response Team at Brex recently open sourced Substation, our toolkit for achieving high-quality data through interconnected, serverless data pipelines. However, Substation is just one part of our threat detection ecosystem. This blog post describes our approach to building a threat detection platform in a vendor-agnostic way that aligns with our overall detection vision.

At Brex, we see the Threat Detection Platform as all systems that contribute to our ability to create and execute detections in our environment — from ingesting and enriching logs using Substation, to a centralized querying environment, to processes that execute detections and optionally trigger interactions in an alerting system and a SOAR solution.

A systems diagram showing the connections and flow of data between Substation, Enrichment Frameworks, Context Databases, Alert Processing Functions, automation workflows and our SIEM interface.

Core Engineering Principles

The ability to detect threats in our environment is constrained by the availability and quality of data, and the success of each detection relies on timely enrichment, intelligence data, and optional automation.

We build modular, easily configurable systems relying on APIs and serverless systems where possible.

The most important concepts we focused on engineering for our underlying systems are:

  • A Unified Data Model (UDM) that normalizes all log sources for detection and investigation.
  • Enriching logs using contextual databases and external enrichment functions.
  • Data consistency across all interacting systems.
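To make the Unified Data Model idea concrete, here is a minimal sketch in Python of normalizing a raw CloudTrail record onto common field names. The UDM field names (`event_dataset`, `event_action`, `user_id_type`, and so on) are illustrative assumptions for this post, not Brex's actual schema.

```python
# Hypothetical UDM normalization for CloudTrail records.
# Field names on the output side are illustrative, not Brex's real schema.

def normalize_cloudtrail(raw: dict) -> dict:
    """Map a raw CloudTrail record onto shared UDM field names."""
    identity = raw.get("userIdentity", {})
    return {
        "event_dataset": "cloudtrail",
        "event_action": raw.get("eventName"),
        "user_name": identity.get("userName"),
        "user_id_type": identity.get("type"),
        "source_ip": raw.get("sourceIPAddress"),
        "timestamp": raw.get("eventTime"),
    }

raw_event = {
    "eventName": "ConsoleLogin",
    "userIdentity": {"type": "IAMUser", "userName": "alice"},
    "sourceIPAddress": "203.0.113.7",
    "eventTime": "2022-12-07T12:00:00Z",
}
udm_event = normalize_cloudtrail(raw_event)
```

Because every log source is mapped into the same shape during ingestion, detection logic can reference `event_action` or `user_id_type` without caring which vendor produced the log.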

With this architectural approach, we can pivot to the best vendor or in-house solution based on pricing, capability, and/or business needs without having to change every part of our threat detection platform.

Our Approach to Threat Detections

At Brex, the Detection & Response Team is focused on building scalable detections that follow these overarching principles:

Detections should:

  • Be driven by a possible scenario of an external threat.
  • Focus on capturing attacker techniques rather than indicators.
  • Include a prioritization score from the risk of the system(s) and the likelihood of the attack scenario.
  • Produce low alert noise and be tuned quickly if they create alert fatigue for the on-call team.
  • Trigger automation workflows.
  • Be unit tested and regularly reviewed by the team.

These principles are enforceable both through how we’ve architected our systems and through decoupling how the detection itself functions.

We build our detections with a layered approach — they exist first as a collection of signals, each representing an individual event or action, which are then aggregated to demonstrate a technique or attacker behavior.

Since signals represent behaviors of interest, they are also useful outside of detection engineering and can be applied to incident response, threat hunting, and threat intelligence functions. Analysts can review signals to reveal key actions that were taken, surface event-level behavioral patterns, or assist in security investigations.

By building signals on one specific behavior and linking multiple signals together to model an attack, Detection & Response teams can create more flexible alerts.

For example, if an attacker is known to use various initial access methods but consistently takes the same persistence actions, then this behavior can be easily modeled through combinations of signals into different alerts without rewriting the underlying detection logic for each slight technique change.
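The layering described above can be sketched as two small pieces: signals that each match one behavior, and detections that are just combinations of signal names. The signal names and matching predicates below are hypothetical stand-ins, not production detection content.

```python
# Minimal sketch of the signal/detection layering.
# Each signal matches exactly one behavior; a detection is a set of
# required signals. Names and predicates here are hypothetical.

SIGNALS = {
    "initial_access_console_login_root": lambda e: (
        e["event_action"] == "ConsoleLogin" and e["user_id_type"] == "Root"
    ),
    "persistence_new_iam_user": lambda e: e["event_action"] == "CreateUser",
}

def match_signals(event: dict) -> set:
    """Return the names of all signals a normalized event matches."""
    return {name for name, match in SIGNALS.items() if match(event)}

def detection_fires(observed_signals: set, required: set) -> bool:
    """A detection fires when all of its required signals were observed."""
    return required <= observed_signals
```

Swapping the initial-access signal for a different one changes the attack model without touching the persistence signal or the detection machinery — which is the modularity benefit described above.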

If you rely on static detections rather than a model similar to our signals-based approach, then you’ll be missing out on the benefits of modular design principles — design reuse, rapid development, and time cost optimization.

The strengths of a modular and decoupled approach specific to threat detection are:

  • Detection content becomes reusable and requires minimal development work to deploy.
  • Adoption of a non-vendor specific detection as code approach allows the flexibility to switch vendors and incorporate open source research.
  • Detections can be executed on any ingested log source and capture behavior across all datasets by using a unified data model that is enforced in data pipelines.
  • Signals are easy to update when a vendor or system changes the way it logs an event. An engineer can update a single signal and it will propagate to any detection that uses it, which minimizes the overhead for small teams.

Life of a Threat Detection

Let’s say we want to detect when a user logs into the AWS Console from an unrecognized endpoint.

First, we build signals for each type of console login event that can occur and is logged through CloudTrail, such as where user_id_type=IAM_USER.

Next, we create a detection that matches when any of those signals occur and filter based on additional criteria. Let’s remove known corporate endpoints and then allowlist a testing environment.

_index=dart_signals event_dataset=cloudtrail
NOT cloud_account_name=test_account
| where signal IN (
    "initial_access_cloudtrail_console_login_feduser",
    "initial_access_cloudtrail_console_login_iamuser",
    "initial_access_cloudtrail_console_login_assumedrole",
    "initial_access_cloudtrail_console_login_root"
  )
| where !(event_tags contains "known_device")

Now that the signals exist for each login type, another engineer can write a detection for a root console login to AWS and add additional behaviors, such as creating a new AWS user. The chaining of two or more signals represents an attacker technique to detect on.
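A hedged sketch of that chaining: a root console login followed by IAM user creation within a time window. The `persistence_cloudtrail_create_user` signal name and the one-hour window are assumptions for illustration, not Brex's production values.

```python
# Illustrative chaining of two signals into one detection:
# root console login followed by user creation within a window.
# The persistence signal name and WINDOW value are assumptions.
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)

def root_login_then_new_user(signal_events: list) -> bool:
    """signal_events: [{'signal': str, 'ts': datetime}, ...], sorted by ts."""
    for i, event in enumerate(signal_events):
        if event["signal"] != "initial_access_cloudtrail_console_login_root":
            continue
        for later in signal_events[i + 1:]:
            if later["ts"] - event["ts"] > WINDOW:
                break  # events are sorted, so nothing later can qualify
            if later["signal"] == "persistence_cloudtrail_create_user":
                return True
    return False
```

Either signal on its own is reusable: the login signal already powers the unrecognized-endpoint alert above, and the user-creation signal could anchor other persistence detections.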

Let’s walk through an example flow on how our detection will be executed.

The flow of how a detection is executed from the normalizing of logs to the execution of signal logic/code and querying to build our alerts.

After any console login events occur in our CloudTrail logs, the matching signal events are stored in a Signals Database. When they match the logic outlined by an alert, an alert event is generated that kicks off a triage task in Jira. The majority of our data enrichment is completed on the logs themselves during ingestion; otherwise, temporary database lookup tables can be built when creating detections and referenced in the signal or alert to increase the detection logic’s granularity.
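The last step of that flow, from stored signal to triage task, might look roughly like the sketch below. The rule structure and the Jira call (stubbed here as a formatted summary) are hypothetical; in production this would go through the actual ticketing API.

```python
# Rough sketch of alert processing: a stored signal event that matches an
# alert rule produces an alert record, which would open a triage task.
# Rule fields and the task format are assumptions for illustration.

def process_signal(signal_event: dict, alert_rules: list) -> list:
    """Return an alert event for every rule the stored signal matches."""
    alerts = []
    for rule in alert_rules:
        if signal_event["signal"] in rule["signals"]:
            alerts.append({
                "alert_name": rule["name"],
                "signal": signal_event["signal"],
                "priority": rule["priority"],
            })
    return alerts

def open_triage_task(alert: dict) -> str:
    # In production this would call the ticketing (e.g. Jira) API;
    # here we just build the task summary string.
    return f"[{alert['priority']}] {alert['alert_name']}: {alert['signal']}"

rules = [{
    "name": "aws_console_login_unrecognized_endpoint",
    "signals": {"initial_access_cloudtrail_console_login_root"},
    "priority": "P2",
}]
alerts = process_signal(
    {"signal": "initial_access_cloudtrail_console_login_root"}, rules)
```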

Future Work

As we mature both our detection engineering practices and the supporting systems, there are key requirements for properly deploying detections as code that the team will focus on implementing going forward.

A few areas of active development include:

  • Ensuring Detection Quality Assurance — Detections should be evaluated and deployed against test data to ensure they don’t break anything before going to production.
  • Improving Mean Time to Detect — Detections run in near real time after log ingestion to reduce the time to detect for potential security incidents.

Key Takeaways

TL;DR: Design so you can replace and swap out pieces as needed without having to rebuild everything. Bend the tools to your will.

Our approach to building threat detection systems is to abstract the capabilities that allow for high-quality detections and then adapt to the best platforms that are available and appropriate for the team.

Regardless of your tech stack, the holistic concepts outlined above for decoupling your detection systems can be applied to make your detection engineering more flexible.

The technology we initially spec’d for our architecture has dramatically changed over the last three years; however, the capabilities have stayed the same.

Big shoutout to the Brex Detection & Response Team that has planned and built our current approach and related systems.
