Prioritizing Detection Engineering
Detection Engineering is a concept that has emerged in the detection space. It acknowledges the complexity of a detection stack and the workload it quickly creates for others. Detection Engineering sits at the nexus of threat intelligence, technical capabilities, platform opportunities, and management toil. I’m writing about the last part: management toil.
Detection Engineering is more than code and platforms; it’s an awareness of the flow of new work suddenly created by your efforts. I discuss those management risks at length in the previous essay, Lessons Learned in Detection Engineering, which was published seven years ago.
It brings me great joy to say that essay has influenced the detection engineering space. Let’s extend that essay with thoughts on how detection is prioritized within a security program.
This essay advocates for very intentional prioritization of Detection Engineering alongside all of your other endeavors in Security.
First, let me propose an order of implementation for detection projects.
1. 🚀 Get logging in order.
The long-term vision for centralized logs should be a detection strategy, but the short-term should mainly focus on query-ability and the inclusion of other teams’ logging needs. There should not be dedicated detection engineering roles yet.
First, we must be selective and collect the minimum viable logs needed to reliably respond to common incident scenarios. Collecting authentication events in corp and prod, along with infrastructure usage (IaaS logs), is usually prioritized here.
Additionally, ingest whatever application logs the rest of the org already uses to troubleshoot; what those are is mostly up to the product. For the time being, avoid a greedy attitude toward logs. We don’t need to store every syscall across every server (for now!). It will be quite a while before a greedy strategy has any ROI.
These conversations force the debates on budget, storage, querying, retention, and log-forwarding. However, pause the conversation around alerting. Don’t go too far down the road on sophisticated alerting strategies yet. At least, don’t hold strong opinions on this for a while, and don’t lock into commitments that wake everyone up, like a SOC or an on-call pager rotation.
This step supports the second factor in securing systems. It increases development velocity, creates multi-stakeholder support, won’t overwhelm you with detection tasks, and grants an important response capability while improving your understanding of the environment.
Now, at the very least, you can understand what happened in an incident. A natural pause happens here before we venture further into a detection strategy.
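As a concrete picture of what “query-ability” buys you, here is a minimal sketch of answering the basic incident question (“what did this actor do in this window?”) over centralized logs. The JSON-lines format and the "time" and "actor" field names are assumptions; real field names vary by log source.

```python
import json
from datetime import datetime, timezone

# Hypothetical: centralized logs exported as JSON lines with "time" and "actor"
# fields. The exact schema is an assumption; the point is that this question
# is cheap to answer once logs are centralized and queryable.
def activity_for_actor(log_path, actor, start, end):
    """Return every event an actor generated inside an incident window."""
    events = []
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            ts = datetime.fromisoformat(event["time"])  # assumes ISO timestamps with tz
            if event.get("actor") == actor and start <= ts <= end:
                events.append(event)
    return events

window_start = datetime(2024, 5, 1, tzinfo=timezone.utc)
window_end = datetime(2024, 5, 2, tzinfo=timezone.utc)
print(activity_for_actor("auth_logs.jsonl", "alice@example.com", window_start, window_end))
```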
2. ✋Spend time on hardening and plan to come back to detection.
Custom detections have a greater return in an environment with a certain amount of controls and segmentation. Small teams shouldn’t formalize their long-game detection and alerting too early. The controls you introduce shape what you will be able to detect. Instead, let your milestone be query-ability and collaboration before getting into alerting and custom detections.
We aren’t staffing a dedicated detection engineering team at this point. There is still plenty of supporting work remaining to make a predictable environment built around invariants. Advances in detection pay off when these invariants begin to form.
Invariants are hard-set expectations about how you expect your environment to behave. This can include reduced privileges, centralized forms of authentication, and intentionally designed employee identity and service credentials.
Some exceptions: Early security tools may come with built-in detection and alerting capability. These are fine. They’re usually manageable, so long as you’re not in the weeds trying to accelerate the amount of work they produce. Avoid the urge to create an aggressively SLA’d on-call as soon as detections become available.
View detection at this point skeptically. Think of a premature detection program like a gravity well. The hardening tasks that precede a detection program will make the end result very special. Until then, premature detection pulls time away from making that outcome possible.
3. ⏭️ Introduce strictly high-quality detections and alerts.
First, circulate lessons from the Alerting and Detection Strategy (ADS) framework. This is a high enough bar for detection and alerting quality. Someone who wakes up to an alert should expect the rigor that ADS demands behind the detection that caused it.
We want to prevent mediocre homegrown alerts from piling up. Start by documenting and collaborating on the very first one. The first alert is your reference alert: perfectly documented, low/no false positives, clear-cut response scenarios, and interpretable by any handler who receives it. Go overboard on the first alert and compare everything else to it. It’s the flagship alert for your program.
The ADS framework offers guidance on what should be documented before an alert enters a production detection system.
The first batch of alerts should focus on invariants: situations your infrastructure should never see, based on assumptions you support with controls and policies.
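A minimal sketch of the documentation an alert might carry before it reaches production, expressed as a Python structure. The section names are loosely adapted from the public ADS framework; treat the exact field list as an assumption and adjust it to your own template.

```python
from dataclasses import dataclass, field

# Hypothetical alert-documentation record; sections loosely follow the ADS framework.
@dataclass
class AlertStrategy:
    name: str
    goal: str                  # What the alert is meant to catch
    categorization: str        # e.g. an ATT&CK technique reference
    strategy_abstract: str     # How the detection works at a high level
    technical_context: str     # What a responder needs to interpret it
    blind_spots: str           # Known gaps and assumptions
    false_positives: str       # Expected benign triggers
    validation: str            # How to reproduce a true positive
    response: str              # Step-by-step handling guidance
    references: list[str] = field(default_factory=list)
```

If a detection can’t fill in every section, it probably isn’t ready to page anyone.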
👉 Example: The secrets management service should never unseal.
Production secrets may be deeply segmented with limited access. Yes, there are legitimate reasons to unseal a vault. But there may be an expectation that every unseal kicks off an incident, which makes it OK to alert on.
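A minimal sketch of this invariant, assuming HashiCorp Vault audit-device logs shipped as JSON lines. The "request"/"path"/"time" field names follow Vault’s audit log format, but verify them against your own pipeline before relying on this.

```python
import json

# Hypothetical: Vault audit logs collected as JSON lines in vault_audit.jsonl.
def unseal_events(audit_log_path):
    """Yield any audit entry that touches the unseal endpoint."""
    with open(audit_log_path) as f:
        for line in f:
            entry = json.loads(line)
            path = entry.get("request", {}).get("path", "")
            if path.startswith("sys/unseal"):
                yield entry  # every hit should open an incident, not just an alert

for event in unseal_events("vault_audit.jsonl"):
    print("Unseal activity observed:", event.get("time"))
```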
👉 Example: No remote IaaS API usage.
You (hopefully) require all credentialed activity to come from specific subnets. If your infrastructure supports this (IAM policy, network segmentation, bastion hosts), usage outside these networks strongly indicates a credential leak (why are keys being used from a random VPN?). Again, this is usually an OK thing to alert on if the underlying invariant is supported. Otherwise, we have fundamental work to do.
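A minimal sketch against AWS CloudTrail, which records the caller’s IP in a "sourceIPAddress" field. The allowed CIDRs and the input file are hypothetical placeholders; substitute your own bastion or VPN ranges.

```python
import json
from ipaddress import ip_address, ip_network

# Hypothetical allowed ranges; replace with your bastion/VPN subnets.
ALLOWED_NETWORKS = [ip_network("10.0.0.0/8"), ip_network("192.0.2.0/24")]

def remote_api_usage(events):
    """Return CloudTrail events whose caller IP falls outside the expected networks."""
    suspicious = []
    for event in events:
        source = event.get("sourceIPAddress", "")
        try:
            ip = ip_address(source)
        except ValueError:
            continue  # AWS service principals appear here as hostnames; skip them
        if not any(ip in net for net in ALLOWED_NETWORKS):
            suspicious.append(event)
    return suspicious

# Assumes a CloudTrail log file export, which wraps events in a "Records" list.
with open("cloudtrail_events.json") as f:
    print(remote_api_usage(json.load(f).get("Records", [])))
```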
👉 Example: No one should touch honeytokens.
Yeah, that’s obvious. Honeytokens shouldn’t be touched as they’re smartly placed outside the expected developer path.
Next, you want to have response capability from your platform.
Response capability exists when you can introduce an IOC (a malicious executable hash, domain, egress IP, etc.) or TTP (a particular technique) into your detection stack and have it immediately alert a response team when it’s observed. That’s it. Your detection platform keeps a watchful eye out for known indicators during incident response.
Lastly, any detections you want to avoid graduating to an alert can be considered hunt data. Some hunt data may never graduate to an alert, but it may still be handy.
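A minimal sketch of that loop: drop a new indicator into a watchlist and have matching events notify a responder. The watchlist values (aside from the widely published EICAR test-file MD5), the event field names, and the notify() hook are all hypothetical stand-ins for your own pipeline.

```python
# Hypothetical watchlist a responder can append to mid-incident.
IOC_WATCHLIST = {
    "hashes": {"44d88612fea8a8f36de82e1278abb02f"},   # MD5 of the EICAR test file
    "domains": {"malicious.example.com"},
    "egress_ips": {"203.0.113.7"},
}

def notify(event, indicator):
    # Stand-in for paging/ticketing; wire this to whatever your responders watch.
    print(f"IOC hit on {indicator}: {event}")

def check_event(event):
    """Compare a single normalized event against the current watchlist."""
    for indicator in (event.get("file_md5"), event.get("domain"), event.get("dest_ip")):
        if indicator and any(indicator in values for values in IOC_WATCHLIST.values()):
            notify(event, indicator)

check_event({"dest_ip": "203.0.113.7", "domain": "internal.example.com"})
```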
A lovely example of a recurring hunt is a periodic sort | uniq over all user agents that appeared in infrastructure over some window. If a new user agent appears, it will either inform the goings-on of infrastructure or reveal some obscure GUI tool an adversary uses to browse your infrastructure, like Cloudberry, elasticwolf, or s3 browser. New user agents aren’t worth waking up for, but they are an informative tool for someone looking to hunt for weird stuff. (Oh, look, a new data product is being used in production.)
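A minimal sketch of that recurring hunt, assuming CloudTrail-style events with a "userAgent" field and a baseline file saved from the previous run; both are assumptions you would adapt to your own log source.

```python
import json
from collections import Counter

# Hypothetical: compare this run's user agents against last run's baseline.
def new_user_agents(events, baseline_path="user_agent_baseline.json"):
    current = Counter(e.get("userAgent", "unknown") for e in events)
    try:
        with open(baseline_path) as f:
            baseline = set(json.load(f))
    except FileNotFoundError:
        baseline = set()  # first run: everything is "new"
    unseen = {ua: count for ua, count in current.items() if ua not in baseline}
    with open(baseline_path, "w") as f:
        json.dump(sorted(current), f)  # persist the full set for the next run
    return unseen  # review by hand; not worth paging anyone

# Example: hand it the Records from a CloudTrail export and eyeball the result.
# print(new_user_agents(json.load(open("cloudtrail_events.json"))["Records"]))
```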
We’re still not ready for a dedicated detection engineering team. We are still fighting for generalized staff. This limitation is intentional, as you’ll soon see.
4. ✋ Spend time on management and plan to come back to detection.
This is where quickly growing security teams trip up hard. Engineers build detections, get excited, send them to a newly formed on-call concept, and everyone buries each other with their cool detection ideas while nothing gets detected. Detection becomes a growing fire that burns everyone out.
Many things must precede a full investment in a detection engineering program. All of them are tied together by operational management problems.
Detection has a classic operational management pattern that exists across Security.
These core questions must be settled before anyone commits to an entire detection program (and they are the primary basis of the original detection engineering essay):
- What does it take to “productionize” an alert?
- How much noise do we tolerate in an alert?
- How closely will the detection creator and the alert closer work together?
These aren’t new questions. We do the same for vulnerability management (What is critical, really?), compliance churn (How many checklists do I fill out?), and offensive work. Detection is especially vulnerable to flying wild.
5. 🏁 Fully embracing an engineering approach to detection.
Full-blown detection engineering makes the most sense when the work it produces can be throttled or accelerated without disrupting your other goals. Every step is intentionally taken to manage the work involved.
Here’s how we accelerate:
- Introducing externally created detections and products.
- Promoting validated ideas from hunts into alertable detections.
- Eliciting scenarios from the team: Offense, threat intelligence, compliance, and risk assessments.
…or slow down:
- Demoting alerts that produce false positives or unnecessary work.
- Forcing retrospectives on false positives with their creators.
- Sharing peer-review notes on alert creation.
You can’t get to this phase until you feel confident throttling the work it will inevitably produce.
Staffing priority for detection engineering
Why was I so conservative on staffing?
First, detection engineering may never need more than one dedicated headcount. Detection pipelines are not the nightmares they used to be at younger companies, especially fully cloud-native ones. Hosted tools are increasingly simple and collaborative.
Only staff a detection engineering team after significant progress on the fundamentals. The tools to mitigate the management risks need to be mature as well. There are many reasons to expand headcount here, but not if it will instantly create management risk.
…What management risk?
If a healthy management environment hasn’t already been established, detection will immediately create the wrong kind of stress. In response, you don’t want to build a shadow management structure with separate task management, severity levels, escalation patterns, etc.
Task management and prioritization are challenges everywhere. Find your organization's existing threads on these issues, collaborate, and build on them rather than invent them. Talk to managers and discuss how work flows through the company. Detection engineering creates lots of work. Early headcount grows this problem faster than you can fix the underlying management issue.
Detection engineering is cursed with more management work than people realize.
Management is the top challenge of a detection program. So, prioritize the fundamentals outside of detection first. Using invariants simplifies the overall detection space and builds confidence that your management strategy is compatible with the company’s existing management philosophy.
Demonstrate your capabilities with the incremental milestones I described. Then, see how much work you must manage and if you can effectively tune it. If you still need to, staff it.
Maybe you disagree. You may think detection engineers can come in and build a full-stack detection program from scratch. That is backward. Sure, it may be possible with the right people involved. It’s more likely that anti-patterns will form with each new management concept you suddenly introduce that is redundant with an existing one.
Why was this the correct prioritization for detection?
The work involved in detection can snowball; here’s why.
Detection is a problem I describe as deceptively tractable. Several points lead me to this conclusion. A variety of anti-patterns emerge when detection is made the premature centerpiece of a security program:
First, detection work operates independently. A detection platform is often drawn strictly from the security budget and avoids engineering-wide approval. A detection outage rarely impacts production, and the larger engineering organization may barely know it exists.
Second, detection needs less production expertise.
A lot can be done in detection without the complete expertise of an application stack, product, or infrastructure. A detection engineer can specialize solely in detection problems and be helpful at many companies. This portability means you can progress on detection without getting into the same weeds your engineers have entered in production.
Lastly, detection demarcates victory. The news that we caught an attack in progress is an instant cause for celebration. The anticipation of this victory is genuine and makes work very fun. Security is usually thankless, winless work. Detection is not. You catch them, you win.
This may sound so nice that you want to stop what you’re doing and work strictly in detection. Detection is an appealing, safe space to work, with an attractive victory scenario. Imagine a guerrilla workstream where a security team that isn’t finding leverage or budget to work on production mitigations can move fast in a well-contained area.
Organizational issues often force detection into the forefront. Perfectly justifiable situations support detection as a strategy. Building dedicated, functional engineering resources for each area of a complicated business may not be possible or efficient. Engineering boundaries may take risks entirely out of Security’s jurisdiction, leaving detection as one of the rare ways a security team can contribute to mitigation.
Unless you have a solid reason, over-indexing on detection takes directly from mitigation work that requires extensive collaboration, production consequences, and technical parity with your builders. Leadership gets sucked into this. A team’s destiny slowly transforms into a shadow-engineering organization rather than being indistinguishable from engineering.
Don't make this mistake if you are building a security team from the ground up. Prioritize your work with engineers on mitigation solutions in parity and consider detection a component of the whole picture. Don’t let detection become a safe, professional alternative to working on production mitigations, which often asks for more direct accountability from us.
Ryan McGeehan writes Starting Up Security on scrty.io