Making Sense of ATT&CK Evaluations Data: Who Really Won and How to Avoid Common Pitfalls

Frank Duff
7 min read · Jul 15, 2021


The first question we get from anyone hearing about ATT&CK Evaluations for the first time is “who won?” It doesn’t take much to understand why people ask. On one hand, it is the million-dollar question with potentially million-dollar consequences: an answer that could guide consumers confidently toward better securing their networks and data. On the other hand, many of the evaluation participants each claim that the ATT&CK Evaluations showcase their unique capabilities and successes.

From a vendor perspective, showcasing ATT&CK Evaluation participation and results is understandable, and trying to explain how they are “better” may be inevitable. Most vendors aren’t just participating in ATT&CK Evaluations; they have embraced behavior-based detections and/or ATT&CK as an organization. ATT&CK has informed their research and development roadmap and has been deeply integrated into their products. The ATT&CK Evaluations show all that hard work and investment, which the participants should be proud of and highlight. They talk about how well they performed, using evaluations data and measurable statistics to back up their claims.

Unfortunately, while evaluations data should be informative and aid decision making for the community, it becomes difficult to differentiate the stories when many participants share slightly different interpretations of the results. If you don’t have a deep understanding of ATT&CK Evaluations (the strengths, limitations, lexicon, methodology, and all the nuance that ATT&CK brings), the data can be confusing, if not frustrating.

We unfortunately cannot give you a direct answer to which solution you should buy. Hopefully, though, this post will provide some context around the results that allows you to jump in and make sense of the data more effectively, make sense of all the marketing around ATT&CK Evaluations, and make these decisions for yourself.

Understanding the ATT&CK Evaluations Focus

Before you can make sense of the results, or even the analysis people are performing on the results, it is important to consider the origin of ATT&CK Evaluations. ATT&CK Evaluations was originally conceived with a very specific scope: to provide transparency around the ability of defensive solutions to address the behaviors described in ATT&CK, and to propel the enterprise security market forward. The Enterprise Evaluations methodology was specifically designed to be data-driven and focused on this very specific topic. Many other organizations use various other practices and philosophies to address the larger question of which products are “best.” ATT&CK Evaluations can help address some aspects of this question, but there are some key design choices, and resulting limitations, that you should be aware of as you dive into our results:

· We allow the vendor to select the tool(s) that are evaluated, which could be a single solution or multiple tools.

· We document but don’t heavily regulate configurations, and vendors may or may not select one that is applicable for you.

· The evaluation is executed as a collaborative purple team, so once we begin the vendor knows what we do and how we do it.

· The environment is small, with little to no user-generated noise, so finding the needle in the haystack is easier than in practice.

Note: our upcoming ICS Evaluations are similar in many regards, though they have their own set of limitations that should be considered; these will be explored in subsequent posts.

Each of these design choices was made for very specific reasons centered around consistency of execution and objectivity of results, and each introduces its own limitations that can make comparing results from vendor to vendor or round to round difficult. We focus on the potential capabilities of solutions, with the understanding that actual operational results may vary, and we try to eliminate biases from the evaluation methodology. In each release, we describe how we evaluated, what was evaluated, and what the results were, leaving interpretation of the data to the reader.

We do not rank or rate vendors. ATT&CK Evaluations enables a quantitative baseline understanding of potential defensive coverage through the lens of ATT&CK. We continue to explore evolutions to our methodology or result presentation to minimize and better articulate the limitations, and we welcome your ideas.

Despite these design choices, and the limitations that come with them, ATT&CK Evaluations is a source of deep technical data for understanding how solutions can potentially address known adversary behaviors. You can explore whether they can collect the right data to allow you to detect the activity (i.e., telemetry); whether they have built-in logic (i.e., analytics) to provide context as to what that detection means (e.g., alert descriptions and/or ATT&CK mappings); and whether they have the opportunity and flexibility to improve and adapt by adding new sensor data or logic (i.e., configuration changes).

A Caution Against Overgeneralization

This year we began releasing statistics around tool performance to summarize the underlying data at a higher level. These are by no means the only metrics you should consider. Keep in mind that each organization has its own needs, based on what other capabilities it owns, its budget, its number of users/hosts, its number of analysts, the skill of those analysts, and many other unique elements. This is the main reason we do not declare a winner or rank solutions. What is right for one user isn’t necessarily right for another.

Each metric has its advantages, and limitations, that will lead to different conclusions depending on your organization’s unique needs. For example, you might assume a higher analytic count is better, but a high number might indicate the possibility of flooding your analysts with false positives or potentially redundant alerts. Analytic coverage can fall victim to the same effect: many ATT&CK techniques may not warrant alerting due to noise or the (opportunity) cost of collecting that data. Telemetry coverage can be a good indicator of the potential capabilities of a product, but if you have a small or inexperienced SOC, relying solely on telemetry for detections might be a recipe for disaster. Visibility, a composite of the two previous statistics, faces the same challenges, but also gives a simplified glimpse into the mystical “ATT&CK coverage” metric.
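To make the relationship between these statistics concrete, here is a minimal sketch of how analytic coverage, telemetry coverage, and a composite visibility figure might be computed. The per-technique data, field names, and exact formulas here are illustrative assumptions, not MITRE Engenuity’s published schema or definitions:

```python
# Hypothetical per-technique evaluation results; the field names and
# values are made up for illustration, not MITRE Engenuity's schema.
results = [
    {"technique": "T1071", "analytic": True,  "telemetry": True},
    {"technique": "T1573", "analytic": False, "telemetry": True},
    {"technique": "T1059", "analytic": True,  "telemetry": False},
    {"technique": "T1105", "analytic": False, "telemetry": False},
]

total = len(results)
# Fraction of techniques with at least one analytic detection.
analytic_coverage = sum(r["analytic"] for r in results) / total
# Fraction of techniques with at least one telemetry detection.
telemetry_coverage = sum(r["telemetry"] for r in results) / total
# "Visibility" as a composite: techniques seen by either kind of detection.
visibility = sum(r["analytic"] or r["telemetry"] for r in results) / total

print(f"Analytic coverage:  {analytic_coverage:.0%}")
print(f"Telemetry coverage: {telemetry_coverage:.0%}")
print(f"Visibility:         {visibility:.0%}")
```

Note how, under these assumptions, two tools could post identical visibility while one leans entirely on raw telemetry and the other on analytics, which is exactly why a single headline number can mislead.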

Another point to consider is that for each evaluated technique, we define detection criteria that outline the minimal detection requirements to capture the essence of the ATT&CK technique and procedure under test. Application Layer Protocol and Encrypted Channel are techniques that provide interesting insight into our detection criteria process. Looking across rounds, you might notice that in previous results, established network sockets and DLL loads might have sufficed as a valid detection of adversary C2 traffic. In Carbanak and FIN7, we instead decided to focus on network analysis (e.g., detecting the data in transit) to provide definitive proof of the behaviors. This introduced challenges, as some tools do not readily provide information on how their data was generated, requiring us to rely on additional evidence and conversation to discern the source of this information. You may agree or disagree with this change to our criteria; either way, the statistics referenced above remain subject to your own interpretation.

For each statistic we provide, we are not implying higher is better (or worse). These numbers are simply intended to give you a quick quantitative summary of the results data so you can have expectations and specific questions in mind before diving into the full results.

Another important consideration when reviewing results is how each solution would fit into your operations. For example, in some environments certain data sources may be collectable and others not. The same alerts that might be useful for some may have high false positive rates or otherwise lower value for others. A user interface that empowers one analyst to make quick decisions might not offer the flexibility others want. Your needs define what type of product you should select. While it is in need of a refresh, I refer readers to our How-to guide released last February for understanding how you can use ATT&CK Evaluation results, as well as key limitations; it goes into some of these points in greater detail.

In the coming days, we will be releasing yet another round of ATT&CK Evaluations, this time focused on the new technology domain of industrial control systems (ICS). While there are key differences in these evaluations, similar concepts apply to understanding the results. Remember: if it were easy to accurately declare a winner, we would. Instead, you must consider your needs, weigh what our data says, read between the lines of evaluations data and vendor marketing, and come to the conclusion that makes the most sense for you.

© 2021 MITRE Engenuity. Approved for Public Release. Document number AT0018.


Frank Duff (@FrankDuff) is the Director of ATT&CK Evaluations for MITRE Engenuity, providing open and transparent evaluation methodologies and results.