Part 2: Would a Detection by Any Other Name Detect as Well?

In part 1 of this blog post, I described some of the limitations and nuances of ATT&CK Evaluations, as well as how we settled on the nine categories that we used to articulate detections. Given the complexity of describing detections, this blog post will highlight some of the rationale behind the detection categories to hopefully remove some of the confusion.

As we created detection categories, we divided them into two types: “main” categories, which would describe the detection overall, and “modifier” categories, which provide additional information to supplement the “main” category.

Round 1 Detection Categories

Detection Categories: Main

One of the challenges we encountered when choosing and delineating main categories was that analytics vary in their logic, as well as in the amount of information they provide to the analyst. Both of these dimensions are important, but in general, our focus was on the information presented to the analyst rather than the detection logic. We took this approach because different vendors revealed different levels of underlying logic behind detections. Some tools provide their analytic logic, but others only provide high-level descriptions within the user interface. To ensure we could evaluate vendors consistently, we chose to focus on what data was presented to the analyst.

The amount of information provided to the user was critical for how we assigned categories to detections. Without enough explanation, we could not determine if a detection applied to the technique being tested, or what detection category was applicable. For example, one question we’ve received is about the difference between the Enrichment and Specific Behavior categories. If the rule appeared to be a simple mapping (that is, “I saw command X, which maps to ATT&CK technique Y”), we considered it to be Enrichment. To assign Specific Behavior, we looked for more than a “technique-level” description: the description also had to provide additional context about the event, and there had to be some visual indication that the event was noteworthy.

Another distinction that may cause confusion is the difference between General Behavior and Specific Behavior. General Behaviors require similar complex logic to Specific Behaviors, but they provide a more generic description of the activity, such as the potential intent/goal of the adversary actions as described by ATT&CK Tactics or other high-level behavior that didn’t relate to a specific technique.
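The distinctions above can be sketched as a small decision function. This is only an illustration of the reasoning described in this post, not the procedure the evaluation team actually used: the Evidence fields, the main_category function, and the "Unclassified" fallback are all hypothetical names, and the sketch covers only the three categories discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """What the tool showed the analyst (hypothetical fields, for illustration)."""
    names_technique: bool      # output maps the event to the technique under test
    adds_context: bool         # description goes beyond a simple command-to-technique mapping
    flagged_noteworthy: bool   # some visual indication that the event matters
    generic_description: bool  # tactic-level or other high-level behavior description


def main_category(e: Evidence) -> str:
    """Pick a main category from the information presented to the analyst."""
    if e.names_technique and e.adds_context and e.flagged_noteworthy:
        return "Specific Behavior"
    if e.names_technique:
        # Simple mapping: "I saw command X, which maps to technique Y".
        return "Enrichment"
    if e.generic_description:
        return "General Behavior"
    # Placeholder: the remaining categories are outside this sketch.
    return "Unclassified"
```

For instance, main_category(Evidence(True, False, False, False)) yields "Enrichment", because the output names the technique but adds nothing beyond the mapping.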

Detection Categories: Modifiers

First and foremost, modifiers are not inherently bad. We realize the words we chose may have negative connotations, but it is important to read the category definitions and examine the detection details to know if the modifier makes the detection more or less useful to you. Since the default perception of modifiers appears to be negative, we want to provide reasoning for our choices and some of the benefits they convey.

Configuration Change: Endpoint security products are often configurable by design. Data, rules, and sensitivity are some of many configurable elements. Even though the vendor set the configuration of their tool for the evaluation, it is expected that, with advance knowledge of test specifics, they could have configured their solution to detect more procedures during the evaluation. For example, during an evaluation a vendor could identify a potential way of detecting a behavior and create a new rule. Even if it was something that would be default in their product going forward, it wasn’t part of the original configuration when we began the evaluation, so we would note it as a Configuration Change. The Configuration Change category is not meant to convey whether a newly available rule is useful or easily accessible to the vendor’s customers; that is up to the vendor and their customers to decide.

Delayed: Upon initial consideration, it’s understandable to think that getting a real-time alert is better than getting it at some point in the future. Time is incredibly important in reducing the impact of the adversary. But what if the logic requires more activity to occur, and thus more time to pass, before an alert can be produced? For example, consider an analytic for SMB Copy and Execution. This analytic was originally created to reduce the false positives produced by alerting on SMB copy events alone. The alert for SMB Copy and Execution would not fire without the later process execution event. Here, the delay improves the detection fidelity because you get additional context with less noise than an SMB Copy alert alone. The same reasoning applies to Delayed detections in our evaluation results. Again, you would need to review the detection details and follow up with the vendor to determine if a Delayed detection is useful for your organization.
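To make the SMB Copy and Execution example concrete, here is a minimal sketch of a correlation analytic that stays silent on a copy alone and alerts only once the copied file is executed. It is not the logic of any real product or published analytic: the Event and Alert types, the event kind labels, and the five-minute window are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    kind: str         # "smb_copy" or "process_start" (hypothetical labels)
    host: str
    path: str
    timestamp: float  # seconds since some reference point


@dataclass(frozen=True)
class Alert:
    host: str
    path: str
    copied_at: float
    executed_at: float

    @property
    def delay(self) -> float:
        """Time the analytic had to wait before it could alert."""
        return self.executed_at - self.copied_at


def correlate_smb_copy_and_execution(
    events: Iterable[Event], window: float = 300.0
) -> List[Alert]:
    """Alert only when a file copied over SMB is later executed on the same host."""
    pending: Dict[Tuple[str, str], float] = {}  # (host, path) -> copy time
    alerts: List[Alert] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.host, ev.path.lower())
        if ev.kind == "smb_copy":
            pending[key] = ev.timestamp  # remember the copy, but do not alert yet
        elif ev.kind == "process_start" and key in pending:
            copied_at = pending.pop(key)
            if ev.timestamp - copied_at <= window:
                alerts.append(Alert(ev.host, ev.path, copied_at, ev.timestamp))
    return alerts
```

The alert necessarily arrives after the copy, which is the trade-off the Delayed modifier records: the wait buys a higher-fidelity detection that carries both events as context.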

Tainted: For this category in particular, we understand that the word “tainted” can have negative connotations. In our context, we chose this word because it best expressed the “trickle down” relationship we wanted to convey. For Tainted detections, the detection did not rely on a previous alert, but rather a previous alert provided the analyst with additional context based on its relationship to the new detection. A Tainted Specific Behavior would still have been a Specific Behavior had the Tainted relationship not existed. A process tree by itself does not result in a Tainted modifier. Rather, we look for visual evidence in the tool that would allow an analyst to realize that the relationship to a prior alert makes the new detection suspicious.

For example, if a tool can identify ports of network connections, it would receive a Telemetry category for T1043 Commonly Used Port. If there was an alert associated with the PowerShell process that created the network connection and that relationship is easily evident to an analyst through the product’s interface, then the network connection inherits the suspicion placed on the PowerShell process and would be a Tainted Telemetry detection. By seeing the relationship to the previous suspicious alert, the analyst can automatically assume a level of suspicion beyond “fact-of” information provided by the isolated event. This additional information provides useful context to that event and the overall incident that is being analyzed.
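The inheritance described in this example can be sketched as follows. This is an illustrative model only, not how results were actually recorded: Detection, ProcessInfo, and categorize_network_connection are hypothetical names, and relationship_visible stands in for the evaluator’s judgment that the link to the prior alert is evident in the product’s interface.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Detection:
    main: str                                      # e.g. "Telemetry"
    modifiers: Set[str] = field(default_factory=set)

    def label(self) -> str:
        """Render modifiers before the main category, e.g. 'Tainted Telemetry'."""
        return " ".join(sorted(self.modifiers) + [self.main])


@dataclass(frozen=True)
class ProcessInfo:
    name: str
    alerted: bool  # a prior alert exists on this process


def categorize_network_connection(
    source: ProcessInfo, relationship_visible: bool
) -> Detection:
    """A port observation is Telemetry; it becomes Tainted only if the tool
    visibly ties it to a process that already has an alert."""
    detection = Detection(main="Telemetry")
    if source.alerted and relationship_visible:
        detection.modifiers.add("Tainted")
    return detection
```

Note that the main category never changes in this sketch: the connection is Telemetry either way, and the modifier only records the extra context inherited from the earlier alert. A prior alert that the interface does not visibly connect to the event adds nothing.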

To MSSP or Not to MSSP?

From the start, we made a concerted effort to avoid evaluating the analysts who use the capabilities, as we wanted to assess the capabilities themselves. This motivated our “open-book” test methodology, where we said what the emulated adversary did, how we did it, and then asked the vendor how it was detected by their tool. We were agnostic about which analyst shared the detection with us or how they provided the detection, since we focused on whether the capability could detect the procedure. Given that focus, it was a deliberate choice to allow Managed Security Service Providers (MSSPs) to participate in the evaluations. We allowed each vendor to use whatever capabilities they saw fit for the evaluation (within our infrastructure limitations), and this included MSSPs.

The question then became how we would process the content provided by the MSSPs in a way that would be fair to vendors without an MSSP. The approach we took was to treat them as if they were any other black-box detection capability. Just as with every other detection, the onus was on the vendor to demonstrate to our team that they were able to detect the event and that the detection was relevant to the technique under test. Whether the detection came from the analyst in the room, an analyst who was remote, or analysis from machine learning, we required proof of the detection. In some cases, the MSSP may have reinforced events that were immediately available within the tool, and in other cases, they could provide new analysis based on additional human or machine-driven logic available to their analysts. We treated MSSP detections as we would any other detection by reviewing the information from MSSP-provided data, asking follow-up questions of the vendor, assigning applicable categories, and describing the detection using notes and screenshots.

Forensic Analysis

Forensic capabilities can be extremely helpful to an analyst in piecing together what might have happened in an intrusion. Some endpoint tools allow analysts to manually pull additional artifacts such as logs and memory from hosts so they can conduct additional analysis. Since our test was focused on detection of ATT&CK techniques as they were being executed, we considered data that would be manually pulled and analyzed by an analyst to be out of scope for our evaluations. Given their potential value to end users, we explained these capabilities throughout the evaluation results so consumers would be aware of their existence. The distinction between forensic capabilities that were out of scope versus Telemetry was whether the information was inherently available in the capability, or if the analyst had to initiate an additional process to retrieve it.

The Future of ATT&CK Evaluation Detections

We hope these blog posts have helped you gain more insight into how we think about detections for our ATT&CK Evaluations. As we move forward with our evaluations, we will make every attempt to improve how we describe detections in a useful way. This includes refining our definitions for the detection categories that we currently use, as well as adding or removing categories as necessary. We welcome your feedback. Should you have suggestions on how we can improve, please reach out to us via attackevals@mitre.org.

©2019 The MITRE Corporation. ALL RIGHTS RESERVED. Approved for public release. Distribution unlimited 18–03621–7