Detection Engineering Metrics Building Blocks

Series: Measuring the effectiveness of a detection engineering program

Gary Katz
Jun 27, 2023 · 7 min read

This blog series is based upon excerpts from a book I am writing with Megan Roddie and Jason Deyalsingh.

True positives and false positives come up frequently in detection engineering and SOC metrics. In this blog post we will provide formal definitions for these and other building blocks that form a necessary foundation for detection effectiveness metrics. We will also use these definitions to calculate precision and recall, and to graph ROC curves that visualize the cost/benefit of changes to our detections. These terms and definitions are common in statistics and data analytics, so the equations apply to any related field, but this article will focus on their use within detection engineering. This is the second post in my series on detection effectiveness metrics. If you would like to read the previous post, on the limitations of Mean Time to Detect and on performing a basic grouping of your detections to identify which types of effectiveness metrics to employ, you can read it here.

When we discuss these metrics, we need to distinguish between whether the detection correctly identified what it was supposed to and whether what it detected was actually malicious:

  • Did the detection create an alert for what it was supposed to detect?
  • Did the detection create an alert for something malicious?

Consider a detection rule that identifies multiple incorrect logins from a new location. If the rule incorrectly checked whether the location was new, it would fail to detect what it was supposed to. Even if the rule was written correctly, though, such activity may or may not be indicative of malicious activity. Within this article we will refer to the latter definition, i.e., is something malicious, but both should be considered when evaluating your detections. Within the context of detecting maliciousness, we can define False Positive, True Positive, True Negative and False Negative:

  • True Positive: An alert fired for actual malicious activity
  • False Positive: An alert fired for non-malicious activity
  • True Negative: An alert did not fire for non-malicious activity
  • False Negative: An alert did not fire for malicious activity

These four data points are useful for calculating many of the metrics we will discuss. Unfortunately, you may need to look across multiple data sources to gather this information. Analysts can disposition alerts as a true or false positive within the native security device or within a centralized event and alert system, such as a SIEM. The organization may also have a ticketing, SOAR, or case management system specifically used for managing the team's processes. Even if such a system exists, you should check that the internal processes of the SOC align to your expectations. Some SOCs may only create a ticket for alerts they choose to investigate, while others may choose not to ingest all alerts into a centralized system even if one exists. These metrics support overall SOC management, so it may be possible to convince leadership to alter processes in support of more consistent metrics gathering.
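To make the bookkeeping concrete, here is a minimal sketch of how these counts might be tallied from exported alert records. The field names and disposition labels are hypothetical; a real SIEM or case management export will differ.

```python
from collections import Counter

def confusion_counts(alerts, missed_malicious=0, benign_without_alert=0):
    """Tally true/false positives from analyst-dispositioned alerts.

    `alerts` is assumed to be an export of alert records (dicts) with a
    'disposition' field set during triage. False negatives and true
    negatives cannot be derived from the alerts alone, so they are
    passed in from whatever process identifies them (if any).
    """
    counts = Counter(a["disposition"] for a in alerts)
    return {
        "TP": counts.get("true_positive", 0),
        "FP": counts.get("false_positive", 0),
        "FN": missed_malicious,      # e.g. from forensic review
        "TN": benign_without_alert,  # rarely measurable in practice
    }

# Example with made-up records for a single rule
alerts = [
    {"rule": "multiple_failed_logins_new_location", "disposition": "true_positive"},
    {"rule": "multiple_failed_logins_new_location", "disposition": "false_positive"},
    {"rule": "multiple_failed_logins_new_location", "disposition": "false_positive"},
]
print(confusion_counts(alerts, missed_malicious=1))
# {'TP': 1, 'FP': 2, 'FN': 1, 'TN': 0}
```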

False negatives are more difficult to identify. In a future blog, we will look at a metric called Detection Drift, which helps automatically identify false negatives for a set of detections related to similar activity, such as a specific technique. An organization can also identify false negatives manually, by forensically analyzing an attack to identify earlier stages or by reviewing artifacts that reveal later techniques of an attack, and comparing that information against the rules designed to detect those portions of the attack. For example, an analyst could execute malware found in a phishing attack in a sandbox to determine whether its C2 protocol would have been detected. A false negative is identified if detections designed to alert on that activity did not fire during the test.
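As a minimal sketch of that comparison, assuming we already know which rules should fire for a given test (the rule names below are invented for illustration):

```python
# Rules expected to fire on the sandboxed sample's C2 traffic (hypothetical names)
expected = {"c2_beacon_interval", "c2_dga_domain", "c2_rare_ja3"}

# Rules that actually alerted during the sandbox detonation
fired = {"c2_beacon_interval"}

false_negatives = expected - fired
print("False negatives for this test:", sorted(false_negatives))
```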

A key metric we would love to understand is the accuracy of our detections. Unfortunately, we cannot easily calculate it. In statistics, accuracy is defined as the number of correct predictions divided by the total number of predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Reviewing the equation, we can see it is not practical to calculate the total accuracy of a detection program: doing so requires counting True Negatives, i.e., the number of times that we did not fire an alert for non-malicious activity. We can, however, calculate a detection's Precision. Precision answers the question, "What proportion of alerts fired by a detection were for something actually malicious?"

Precision is the proportion of a detection's alerts that are true positives, defined as:

Precision = TP / (TP + FP)

The opposite of this is noisiness, or the false positive rate, which answers the question, "What proportion of alerts created for this detection were not malicious?"

Noisiness = FP / (TP + FP) = 1 - Precision

Recall answers the question, "What proportion of the malicious activity the detection was supposed to detect did it actually detect?"

Recall = TP / (TP + FN)
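Putting the three definitions together, a small helper along these lines can compute them from raw counts; note that true negatives are not needed for any of them, which is what makes these metrics practical:

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, noisiness, and recall from raw counts."""
    alerts = tp + fp
    precision = tp / alerts if alerts else 0.0
    noisiness = fp / alerts if alerts else 0.0   # equal to 1 - precision
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "noisiness": noisiness, "recall": recall}

# Example: 40 true positives, 10 false positives, 25 missed detections
print(detection_metrics(tp=40, fp=10, fn=25))
# {'precision': 0.8, 'noisiness': 0.2, 'recall': 0.6153846153846154}
```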

False negatives are malicious activity for which no alert fired. Like MTTD, this is a historical view of the problem. We can identify false negatives through forensic analysis if an attack was detected at a different stage, or if we layer detections for the same activity, such as network and endpoint devices using separate telemetry to identify the same thing. Usually, as we attempt to improve our recall, i.e. detect more variations of an attack, we will also increase the false positive rate, resulting in more alerts for our SOC analysts to review. There is a tradeoff between potentially identifying more badness and wearing down our SOC analysts with the fatigue of reviewing false positives. If a detection is too noisy, we may find that the SOC starts auto-dispositioning its alerts.

Plotting Detection Performance:

ROC (short for receiver operating characteristic) curves are a common data analytics technique and a useful construct for evaluating changes in your detection approach for a procedure or technique. At its core, a ROC curve plots True Positive against False Positive values for different thresholds of a machine learning model or other analytic. In our case, we can use them to plot how the changes we make affect a detection's performance. A standard ROC curve looks something like the one below. On the X axis we plot the False Positive values and on the Y axis the True Positive values. As we change the threshold of the analytic, more activity is caught, increasing the true positives, but the noisiness usually increases as well, increasing the false positives.

[Figure: Sample ROC curve]
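For reference, in the machine-learning setting such a curve is typically produced by sweeping a score threshold over labeled data. A minimal sketch using scikit-learn and matplotlib, with synthetic scores standing in for a real analytic:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Synthetic labels (1 = malicious) and analytic scores, for illustration only
rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(100), np.zeros(900)])
scores = np.concatenate([rng.normal(0.7, 0.2, 100), rng.normal(0.4, 0.2, 900)])

fpr, tpr, _ = roc_curve(y_true, scores)

plt.plot(fpr, tpr, label="analytic")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("Sample ROC curve")
plt.legend()
plt.show()
```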

Adapting ROC curves to our use case, we can evaluate multiple variations of a detection, or overlapping detections of the same activity, to identify whether the changes provide enough of a performance improvement to justify the additional work for SOC analysts triaging false positives. Consider an example where we have four approaches to a detection. Each approach adds conditions to catch additional variations of the attack we wish to detect (analogous to adjusting the threshold); a sketch of how such layered conditions might be composed follows the list below.

  • Approach 1 (A1): Condition 1
  • Approach 2 (A2): Condition 1, Condition 2
  • Approach 3 (A3): Condition 1, Condition 2, Condition 3
  • Approach 4 (A4): Condition 1, Condition 2, Condition 3, Condition 4
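A rough sketch of how such layered conditions might be composed, assuming events are already parsed into dictionaries (every condition, field name, and value here is hypothetical):

```python
# Hypothetical condition checks over a parsed authentication event
SUSPICIOUS_AGENTS = {"curl", "python-requests"}
ANONYMIZING_ASNS = {64496, 64511}

def cond1(e): return e["failed_logins"] >= 5
def cond2(e): return e["geo"] not in e["known_locations"]
def cond3(e): return e["user_agent"] in SUSPICIOUS_AGENTS
def cond4(e): return e["asn"] in ANONYMIZING_ASNS

APPROACHES = {
    "A1": [cond1],
    "A2": [cond1, cond2],
    "A3": [cond1, cond2, cond3],
    "A4": [cond1, cond2, cond3, cond4],
}

def alert(event, approach):
    """Fire if any condition matches: each added condition catches more variations."""
    return any(check(event) for check in APPROACHES[approach])

event = {"failed_logins": 3, "geo": "US", "known_locations": {"US"},
         "user_agent": "curl", "asn": 13335}
print({name: alert(event, name) for name in APPROACHES})
# A1 and A2 miss this variation; A3 and A4 catch it via the user-agent condition
```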

During our testing phase we would like to determine which of these approaches should be deployed within our environment. We retrieve the true positive and false positive counts after each test. Any alert fired against data identified specifically for this detection is categorized as a true positive. Any alert fired against our known-good dataset, whether previously curated or from live-environment tests, is tagged as a false positive. By plotting the True Positive and False Positive values, as shown in the figure below, we can create a ROC curve to help our analysis.

[Figure: ROC curve for the data points above]

Looking at the graph, we can see that our initial detection had few false positives but also missed a sizable portion of the malicious activity. Adding rules for conditions 2 and 3 greatly improved our true positive rate while increasing the false positive rate as well. The fourth approach provided a minimal increase in detection while greatly increasing the noise. As detection engineers, we might decide that Approach 4 does not add enough value over Approach 3 to justify the false positives it creates.
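A sketch of how those four data points might be plotted; the counts below are illustrative and not taken from the figure above:

```python
import matplotlib.pyplot as plt

# Illustrative (FP, TP) counts gathered from testing each approach
approaches = {"A1": (1, 10), "A2": (4, 35), "A3": (9, 55), "A4": (25, 58)}

fps = [fp for fp, tp in approaches.values()]
tps = [tp for fp, tp in approaches.values()]

plt.plot(fps, tps, marker="o")
for name, (fp, tp) in approaches.items():
    plt.annotate(name, (fp, tp), textcoords="offset points", xytext=(5, 5))
plt.xlabel("False positives")
plt.ylabel("True positives")
plt.title("Detection approaches A1-A4")
plt.show()
```

The elbow between A3 and A4 is what we are looking for: the point where additional true positives stop justifying the extra noise.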

ROC curves are one way to leverage true positives and false positives to understand how changes to our detections affect their precision and noisiness. In a follow-on blog we will discuss how tagging your detections can be used to identify when adversaries have made changes to evade detection, i.e., identifying false negatives. In the next blog, though, we will take a step back to look at how we can determine the coverage of individual detections. Detection coverage is a key metric for measuring the value of a detection engineering program and, in my view, the primary way to look at its effectiveness.

The third blog in this series covers detection coverage metrics and can be viewed here.

References:

If you would like to dive deeper into these metrics, Google's machine learning crash course is a great place to start and was used as a reference for this blog.

https://developers.google.com/machine-learning/crash-course
