Quantifying Detection Coverage with Validation

Series: Measuring the effectiveness of a detection engineering program

Gary Katz
Jul 11, 2023 · 10 min read

This blog series is based upon excerpts from a book I am writing with Megan Roddie and Jason Deyalsingh.

In our first blog in the series we broke down the techniques we are creating detections for into three categories: low impact, medium to high impact, and high impact choke points. The approaches described in this article are designed to be implemented for the last of these, i.e. techniques for which the detection engineer has investigated all the known approaches the adversary can take to implement the technique and determined how to detect them.

The purpose of validation is to prove that your detections will identify the adversary. It is not to prove how well you can detect the validation tool. The topic gets a little meta, so apologies in advance. We’re going to attempt to keep it grounded and focused on how we can practically define and understand the detection coverage for a technique to make sure your validations are providing metrics that accurately quantify the expected performance of your detections.

To start we need to talk about the durability of a detection, which describes how long we expect a detection to remain effective. If a detection is identifying an indicator on the lower echelons of the pyramid of pain, or anywhere there is an almost infinite number of variations the adversary can choose from to execute the attack, the durability is limited by the ops tempo at which the adversary changes that part of the attack. As an example, if your detection is limited to an IP address or domain, there is almost unlimited variation in the IPs or domains the adversary can use. The durability is thereby tied to how often the adversary changes this infrastructure. Similarly, if we look at the open-source tool Remcom, which can be used by an attacker to execute programs on remote systems, we notice that the tool creates the log file ‘RemComSvc_Logs.log’. This may seem like a good indicator: it is fairly unique in name and is a constant artifact of the tool. Unfortunately, since the file is created by an open-source tool, any adversary can easily change the filename (or not produce a log file at all). The durability of the detection is limited only by the laziness of the adversary. Our goal therefore is to create detections that are based upon a set of features (indicators) with a definable set of variations, and to make sure our detection encompasses that full set.
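To make the durability problem concrete, here is a minimal sketch of a detection keyed on that default Remcom log filename. The event schema and function name are assumptions for illustration, not part of any real product or of the book; the point is only that an exact match on a tool default is trivially evaded.

```python
# Hypothetical event schema: each event is a dict with a "file_name" field.
# A detection keyed on Remcom's default log file name works only until the
# adversary recompiles the tool with a different name (or no log at all),
# so its durability is limited to the adversary's willingness to change it.

def detect_remcom_by_log_name(event: dict) -> bool:
    """Brittle, low-durability detection: exact match on a default artifact."""
    return event.get("file_name", "").lower() == "remcomsvc_logs.log"


if __name__ == "__main__":
    print(detect_remcom_by_log_name({"file_name": "RemComSvc_Logs.log"}))  # True
    print(detect_remcom_by_log_name({"file_name": "svc_debug.log"}))       # False: trivially evaded
```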

Ideally, we want to identify not just a feature with a defined set of variability, but the feature with the least amount of variability. If we were to think about this on an X/Y graph, one axis would hold the set of indicators that can be used to identify a procedure, and the other axis the variability of each indicator. The indicator with the least amount of variability is the choke point for that procedure. It is the thing the adversary cannot get around doing, and yet one that gives us a well-defined way to tie the artifact or action to the malicious activity. Our goal therefore is to write a detection around that indicator’s artifact, or artifacts, to accurately detect the malicious activity. In real scenarios, we may compose a detection from multiple indicators, and an indicator may have infinite variability.
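One way to picture that graph is as a simple ranking exercise. The sketch below is illustrative only; the indicator names and variability labels are made up, and a real analysis would rest on research rather than hard-coded scores.

```python
# Hypothetical candidate indicators for a single procedure, each with a rough
# estimate of how freely the adversary can vary it. The choke point is the
# indicator with the least room for variation.

candidate_indicators = [
    {"name": "c2_domain",          "variability": "unbounded"},
    {"name": "tool_log_file_name", "variability": "high"},
    {"name": "named_pipe_created", "variability": "low"},
]

VARIABILITY_RANK = {"low": 0, "medium": 1, "high": 2, "unbounded": 3}

choke_point = min(candidate_indicators,
                  key=lambda i: VARIABILITY_RANK[i["variability"]])
print(f"Write the detection around: {choke_point['name']}")
```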

Chart identifying choke points using atomic indicators

A procedure is the set of steps the adversary takes to implement a technique. We can capture telemetry about that procedure at varying levels of fidelity. This is similar to describing a user’s interaction with a website. We could capture a video of the user clicking on buttons. We could capture the button clicks as events, or the network traffic between the website server and the web browser. We could look at the API calls made between the web browser and the server, or at the functions called within the web browser itself. Depending on which procedures we want to track, different types of telemetry will be useful. The same is true for the telemetry and indicators used to identify malicious procedures.

This means the choke point may not be consistent across the procedures used to accomplish a technique. During the detection development’s discovery phase, we identify the multiple procedures, and their choke points, that could be used to achieve the technique. This information defines our detection space: how much can the adversary vary their attack across those choke point parameters, and how tightly can our detections be defined so that they include that variation without being so broad as to produce an unacceptable rate of false positives?

In order for our validation tests to provide accurate coverage metrics for a procedure, they must consider what we are detecting against. The variance of the tests should occur at the same level as the detection’s indicators. A misalignment can result in validation tests showing greater detection coverage than is actually in place. As an extreme example, consider a tool which can implement three separate procedures for performing the same attack. A single detection could be written to detect the tool itself. The underlying implementation of the procedures could vary heavily, and yet the validation tests would show complete coverage.
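A toy illustration of that misalignment, with entirely made-up tool and procedure names: a single detection keyed on the tool’s process name “passes” a validation test for every procedure the tool implements, even though none of the procedure-level behavior is actually covered.

```python
# Three hypothetical procedures, all launched through the same tool binary.
procedures = {
    "proc_a_token_theft":    {"process_name": "attacktool.exe"},
    "proc_b_dump_lsass":     {"process_name": "attacktool.exe"},
    "proc_c_create_service": {"process_name": "attacktool.exe"},
}

def tool_level_detection(event: dict) -> bool:
    # The detection operates at the tool level, not the procedure level.
    return event["process_name"] == "attacktool.exe"

# Every validation test "passes", reporting 100% coverage...
coverage = sum(tool_level_detection(e) for e in procedures.values()) / len(procedures)
print(f"Reported coverage: {coverage:.0%}")
# ...yet renaming the binary (or using a different tool) evades all three at once.
```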

As a real-world example, consider a SOC which actually used this approach to skew the validation results of a red team. The SOC successfully identified the red team by researching the red team’s C2 prior to the test and creating a detection for that C2. The red team was thwarted at every turn. The SOC detected whatever was thrown at it. If we looked at the validation results, the SOC’s detection capability against the techniques attempted by the red team was perfect. They detected everything.

The SOC had correctly identified a high-fidelity indicator of the red team: a C2 which the red team was not updating. From a black-box validation test perspective, the red team could attempt any procedure they wished and the SOC’s detections would be successful in identifying the attack. Their actual detection coverage, though, was completely unknown, despite these results. This is because the variability of change performed by the red team was at the technique, procedure, and tool level, while the detection logic performed by the SOC was at the network C2 level. The SOC passed the red team test with flying colors, but the results showed nothing about their ability to detect an actual adversarial attack.

We could possibly claim that only the C2 technique was discovered, and not the individual procedures executed on the infrastructure. While technically true, the point can be re-emphasized by imagining a hypothetical network traffic decoder providing a list of the commands executed by the red team. The impact is the same, just with greater fidelity in detecting the red team’s actions.

We can use our detections and the attack space to identify what we need to validate. Our validation tests should ideally encompass the entire attack space of a technique. The attack space can be defined by the procedures that can be executed to achieve that technique. The variance within a procedure can be defined by how much the values of the indicators used by the detection can change while still achieving the objective of the technique. Therefore, we only need a single validation test using a procedure not covered by our detections to identify that procedure-level gap in coverage, while we need a set of validation tests that vary across the allowable values of the indicator(s) to confirm our detection will identify all variations of an individual procedure covered by our detection rules.
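A rough sketch of how those two kinds of validation tests might be enumerated from the attack space. The procedure names, indicator names, and allowable values below are hypothetical; the structure is the point.

```python
# Attack space for a technique: each procedure notes whether a detection
# covers it, which indicator that detection keys on, and a representative
# set of allowable values the adversary could still use.
attack_space = {
    "proc_scheduled_task": {"covered": True,
                            "indicator": "task_name",
                            "allowable_values": ["Updater", "GoogleUpdate", "x9f2"]},
    "proc_wmi_subscription": {"covered": False,
                              "indicator": None,
                              "allowable_values": []},
}

validation_plan = []
for name, proc in attack_space.items():
    if not proc["covered"]:
        # A single test is enough to expose a procedure-level gap.
        validation_plan.append((name, "any_single_variation"))
    else:
        # Covered procedures need tests that sweep the allowable indicator values.
        for value in proc["allowable_values"]:
            validation_plan.append((name, f"{proc['indicator']}={value}"))

for test in validation_plan:
    print(test)
```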

While a single rule can be written to detect multiple procedures (or vice versa), we will equate one detection to one procedure to simplify our explanations. The approach, though, holds true in either circumstance.

Example of one rule detecting multiple procedures of a technique

These definitions allow us to identify four sets of procedures that can exist when validating a technique.

By definition, if a detection fails to identify a validation test, the test must be testing either a procedure that is not covered by our detections or a variation of a procedure which our detection incorrectly does not cover. This means that third party (black-box) validation will identify missed coverage for any procedure it tests that we do not have a detection for. There is no guarantee third party validation would identify partial coverage (unless the validation suite varies its tests along the same parameters at which your detections are working).

Validation can therefore be approached with the following processes:

1. Validate if detections exist for at least one variation of a procedure:

a. Unit Tests: Map the procedures identified during the Discovery phase to existing validation tests or create new validation tests for each procedure.

b. Third Party Validation: Leverage third party validation either to implement any of the above unit tests or to identify procedures not found through research. Third party validation tests that do not result in a successful detection are indicative of either partial detection coverage for a procedure or missed coverage.

2. Validate whether the procedures covered by a detection are fully covered:

a. Edge Cases: Test the edge cases of the procedures by altering the value of the indicator artifacts across a representative set of allowable values that would still achieve the adversary’s goals. If the technique can be achieved without that artifact being created or changed, that is, by definition, a different procedure and should be captured by its own detection. (A sketch of what such tests could look like appears after this list.)

b. Peer Review: If it is not practical to write validation tests across the allowable values, a peer review system can be used: document, based upon research, how the adversary could alter their attack, and have another detection engineer check that the detection would encompass those possible evasions.
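As a sketch of what the unit-test and edge-case steps above could look like in practice, assuming a pytest-based harness: the detection function, procedure names, indicator values, and allow-listed paths are all invented for illustration, not taken from a real rule set.

```python
import pytest  # assumes a pytest-based validation harness

# Hypothetical detection under test: flags service creation whose binary
# path falls outside an allow-listed set of directories.
def detect_remote_service_install(event: dict) -> bool:
    allowed_dirs = ("c:\\windows\\system32\\", "c:\\program files\\")
    path = event.get("service_binary_path", "").lower()
    return event.get("action") == "service_created" and not path.startswith(allowed_dirs)

# 1a. One unit test per procedure identified during the Discovery phase.
def test_procedure_sc_exe_remote_install():
    event = {"action": "service_created",
             "service_binary_path": "C:\\Users\\Public\\rc.exe"}
    assert detect_remote_service_install(event)

# 2a. Edge cases: vary the indicator across allowable values the adversary
# could still use while achieving the goal of the technique.
@pytest.mark.parametrize("path", [
    "C:\\Users\\Public\\rc.exe",
    "C:\\ProgramData\\svc.exe",
    "D:\\temp\\a.exe",
])
def test_edge_case_binary_paths(path):
    assert detect_remote_service_install({"action": "service_created",
                                          "service_binary_path": path})
```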

There are some limitations to this approach which can impact a detection engineering team’s ability to fully implement the above processes. As noted at the start of this blog, the goal of these processes is to help quantify your detection coverage, not necessarily to achieve complete coverage for a technique. Below are some common roadblocks that teams may encounter.

1. Requires mature processes and resources: Implementing a validation process like the one above requires a large enough team to support both detection creation and validation. If a team is simply attempting to put a base level of detection in place, it would be impractical for them to have the resources to fully investigate a technique and iterate through a detection and validation workflow. Even with a reasonably large detection engineering team it is still necessary to choose where to focus resources. Not every technique can be fully investigated and validated.

2. False Positive Acceptance: Fully encompassing a procedure may result in an unacceptable number of false positives, especially when implementing behavioral detections which overlap with acceptable actions from network administrators or other users within your environment. Therefore, while it may be possible to fully validate that all variations of a technique are detected, the detections may result in unacceptable noisiness.

3. Identifiable Choke Points: There is an implicit assumption that a technique has a set of choke points which can be identified to constrain how an attack can be implemented. There is no guarantee this is the case. It may take a significant amount of time and skill during the discovery phase to perform this analysis, and in narrowing the choke point(s) to a definable scope we may increase the number of procedures to an unacceptable level.

4. Limited set of procedures: The approach assumes that the ways to execute the technique can be captured in a ‘reasonable’ set of procedures. Similar to the issue with choke points, if the detection engineer is unable to group the variations of a technique into a limited scope, they will be unable to create validation tests that encompass those procedures.

5. Available Telemetry: The approach assumes that once the procedures and choke points are identified, there is sufficient telemetry to provide the visibility needed to write detections that identify those choke points.

Let’s compare this coverage metric to the limitations of Mean Time to Detect (MTTD) and of efficiency metrics, such as the number of detections created, which we discussed in the first article. MTTD suffers from the following issues:

1. The metric provides a historical view of performance which may not be indicative of the SOC’s ability to respond to future attacks.

2. The metric only includes the attacks that have been detected; if the adversary is completely successful, those attacks are not included.

3. The parameters and number of data points which define MTTD are determined by the adversaries that have historically attacked the network. They do not reflect the potential variability of attacks.

4. The metric does not reflect the work the organization is continuously performing to prevent future attacks.

In contrast, the validation metrics discussed above do not suffer from the same issues. They are not based upon historical performance or the adversary. Instead, they are based upon the investigative research performed by the detection engineering team to understand how an attack can occur. Rather than representing how well the SOC succeeded at detecting the specific procedures used by the adversary in a specific attack, they represent how well the SOC will succeed at detecting all known variations of that technique. Since the coverage metrics are based upon testing performed by the SOC, versus attacks by the adversary, they create a meaningful way to track a team’s progress and inform leadership of the continuous value the team is providing.
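To make the contrast with MTTD concrete, a coverage metric of the kind described here can be computed from the validation results themselves rather than from historical incidents. The sketch below uses invented numbers purely for illustration; the two ratios (procedure-level and variation-level coverage) are one possible way to summarize the results, not a prescribed formula.

```python
# Hypothetical validation results for one technique: per procedure, whether a
# detection exists and how many variation tests it passed.
validation_results = {
    "proc_a": {"has_detection": True,  "variation_tests_passed": 8, "variation_tests_total": 10},
    "proc_b": {"has_detection": True,  "variation_tests_passed": 5, "variation_tests_total": 5},
    "proc_c": {"has_detection": False, "variation_tests_passed": 0, "variation_tests_total": 1},
}

procedure_coverage = (sum(r["has_detection"] for r in validation_results.values())
                      / len(validation_results))
variation_coverage = (sum(r["variation_tests_passed"] for r in validation_results.values())
                      / sum(r["variation_tests_total"] for r in validation_results.values()))

print(f"Procedure-level coverage: {procedure_coverage:.0%}")  # exposes gaps like proc_c
print(f"Variation-level coverage: {variation_coverage:.0%}")  # exposes partial coverage like proc_a
```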

In the next article, Tracking Detection Drift, we look at a way to identify when all those amazing detections for a technique or specific procedures start to fray and the adversary begins to find ways around them. How can we automatically identify when our detections are becoming less effective?
