One of the major takeaways from our first round of ATT&CK Evaluations is that it’s tough to describe the detections made by endpoint security capabilities. This is even more challenging when we need to describe detections in a useful way for people who are not intimately familiar with the capabilities being tested. Add in the fact that we want to communicate capabilities in a way that captures the uniqueness of each, and you will begin to understand the challenges we faced in preparing our results for release. In this two-part post, we will clarify how and why we settled on our detection categories, describe some limitations and nuances of our approach, and explain how we applied these nine categories.
The Complexities of Comparison
While we hope our results are useful to evaluate vendors and know comparison is natural, we encourage you to consider factors beyond the detection categories as you do this. Let’s address a major point up front: one detection category is not necessarily “better” than other categories. While detection categories and descriptions might lead one to think that certain categories are better, the category alone is not enough to give a complete picture of the detection. It’s important to look at the technique under test, the detection details, and what’s considered normal behavior in your organization’s environment to help you understand what detections are most useful to you.
As a simple example, let’s explore PowerShell. Should a capability alert every time an encoded PowerShell runs? If your organization uses PowerShell for system administration, receiving a Specific Behavior or General Behavior alert on encoded PowerShell might not be meaningful to you and could overwhelm your analysts. However, perhaps the fact that Telemetry was available to show that a PowerShell process was run in the context of other potentially bad behavior would be useful to you instead of alerts. On the other hand, maybe your organization has PowerShell execution restricted, and its execution is something your analysts would want to receive alerts on. In this case, a Specific Behavior or General Behavior detection may be preferable to your organization.
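To make the PowerShell example concrete, here is a minimal, illustrative sketch (ours, not any vendor’s actual logic) of how a capability might flag a command line that launches PowerShell with an encoded command and recover the payload for an analyst. The function name and output format are hypothetical; real products implement far richer detection logic.

```python
import base64
import binascii

def flag_encoded_powershell(command_line: str) -> dict:
    """Illustrative check (hypothetical helper): flag a command line that
    launches PowerShell with an encoded command and recover the payload.
    Whether this warrants an alert or is simply useful telemetry depends
    on how common encoded PowerShell is in your environment."""
    tokens = command_line.split()
    lowered = [t.lower() for t in tokens]
    for flag in ("-encodedcommand", "-enc", "-e"):
        if flag in lowered and lowered.index(flag) + 1 < len(tokens):
            payload = tokens[lowered.index(flag) + 1]
            try:
                # PowerShell expects the encoded command as base64 of UTF-16LE
                decoded = base64.b64decode(payload).decode("utf-16-le")
            except (binascii.Error, UnicodeDecodeError):
                continue
            return {"suspicious": True, "decoded_command": decoded}
    return {"suspicious": False, "decoded_command": None}

# Encode "whoami" the way PowerShell's -EncodedCommand expects it
encoded = base64.b64encode("whoami".encode("utf-16-le")).decode("ascii")
result = flag_encoded_powershell(f"powershell.exe -EncodedCommand {encoded}")
# result["decoded_command"] is "whoami"
```

Whether the result of a check like this should fire an alert, or simply be recorded as telemetry for later hunting, is exactly the kind of environment-specific judgment discussed above.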
The vendor’s approach also matters when considering detection categories. Some vendors take a more preventative approach, choosing to alert and block on initial execution or other egregious events (e.g., credential dumping) and then provide supplemental data to enable additional analysis. Other tools focus less on protection and more on providing data that enables the analyst to hunt or respond to threats. This difference in approach is sometimes expressed using terms like endpoint detection and response (EDR) or endpoint protection platform (EPP). We made a conscious effort not to consider “market segment” in our evaluations, since we focus on detection of ATT&CK techniques regardless of product type. That said, we encourage consumers to remember that products have different focuses and strengths, and may therefore emphasize different detection categories in their implementations.
All this isn’t to say you shouldn’t compare one vendor’s results against another. What it does mean is that you have to explore these detections with additional considerations in mind. The goal of the results is to highlight how each product can uniquely detect ATT&CK techniques, and to do that in a transparent way — both in terms of test methodology and results. We want to enable people to know how to use their products better, motivate improvement in the post-exploit detection market, and provide a basis for people to build from in order to make informed choices on the products to use in their environment.
Detection Categories: The Origin Story
So how did we arrive at these detection categories? A common way of describing defensive capabilities related to ATT&CK is a color-coded ATT&CK matrix using green, yellow, and red in a stoplight chart. You color a cell green if you can detect that technique “out of the box,” yellow if you could detect with some additional effort, and red if you can’t detect it. This is great for summarizing capabilities in a simple way, which is why we often brief this approach.
But when trying to provide a detailed analysis of capabilities, this approach falls short, as others in the community have rightly pointed out. What does a “green” cell mean? Does it mean you can detect all the ways that ATT&CK technique could possibly be performed? Or does it mean you can detect it the one way it was tested? Are all green cells created equal? Should a product seek to be “all green” with fewer detections per technique, or detect a subset of ATT&CK techniques with more complex detections? These are just a few of the questions that reveal the limitations of the stoplight approach.
We found that the stoplight chart is often too abstract for detailed analysis because it can easily lead people to the wrong conclusion about how well a technique is detected, so we wanted to avoid it for our evaluation results. At the other extreme, a free-form description of each detection might allow a user of a tool to gain useful insights into how that tool performs detections. However, we realized that with no abstraction at all, prospective users would have difficulty understanding the differences in tool approaches, reducing the broad usability of the results.
To identify the right level of abstraction, we drew upon our collective experience in SOCs, post-exploit detection research, adversary emulation, and ATT&CK. We also looked at what others in the community were thinking and at the approaches vendors took. At the 2018 SANS Blue Team Summit, John Hubbard presented a 7-tiered approach to quantifying detection maturity, and Roberto Rodriguez has done extensive research on ATT&CK detections and scoring their effectiveness.
We decided that our goals of articulating evaluation results required a new approach to describe detections. Our initial approach was heavily influenced by our experience in SOCs. We considered what we had observed in tools as well as what data we found useful as network defenders. We started with broad categories that describe detections: General Behavior, Specific Behavior, Indicators of Compromise, Telemetry, and None. As our understanding matured, we recognized additional gaps, which we filled with additional categories: Enrichment, Delayed, Configuration Change, and Tainted. Definitions of each of our detection categories can be found here.
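As a rough sketch of how these nine labels can be applied in practice, the record below models each category as an enumerated value attached to a technique under test. The data model and field names here are our own illustration for this post, not the actual format of the published results.

```python
from dataclasses import dataclass
from enum import Enum

class DetectionCategory(Enum):
    # The five broad categories we started with
    NONE = "None"
    TELEMETRY = "Telemetry"
    INDICATOR_OF_COMPROMISE = "Indicator of Compromise"
    GENERAL_BEHAVIOR = "General Behavior"
    SPECIFIC_BEHAVIOR = "Specific Behavior"
    # The four categories added as our understanding matured
    ENRICHMENT = "Enrichment"
    DELAYED = "Delayed"
    CONFIGURATION_CHANGE = "Configuration Change"
    TAINTED = "Tainted"

@dataclass
class Detection:
    technique_id: str            # ATT&CK technique under test, e.g. "T1086"
    category: DetectionCategory  # one of the nine labels above
    detail: str                  # free-text detail accompanying the label

example = Detection(
    technique_id="T1086",  # PowerShell (ATT&CK technique ID at time of writing)
    category=DetectionCategory.SPECIFIC_BEHAVIOR,
    detail="Alert on execution of an encoded PowerShell command",
)
```

Pairing the category with the free-text detail reflects the point above: the label gives a common vocabulary across products, while the detail preserves what is unique about each detection.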
This is where we stand today. We recognize that these categories are not perfect, and we will continue to refine them. That said, they provide a high-level classification of different types of detection, and they allow us to communicate those types in a common way across different products. They provide more detail than simply saying a vendor alerted, could have detected, or missed, and we feel that detail is important to help users understand the data we provide with the evaluations as well as the capabilities the data represents.
In this first part of this blog post, we’ve explained some nuances of comparing vendors, including the potential pitfalls of looking only at detection categories to do so. We described why, despite the risk of over-simplistic comparisons, we took the “category” approach to articulating detections. In the next part of this post, we’ll share more details to clarify how we view the categories we chose.
©2019 The MITRE Corporation. ALL RIGHTS RESERVED. Approved for public release. Distribution unlimited 18–03621–7