Quantifying the MITRE ATT&CK Round 2 Evaluation

Jonathan Ticknor
Published in security analytics · Apr 23, 2020 · 8 min read

MITRE released the results for Round 2 of its EDR evaluation, this time emulating APT29. As you might have seen, nearly every vendor associated with the evaluation has issued a press release pronouncing its clear effectiveness and decisive victory over the competition. I want to avoid the marketing fluff and jump right into the data. What follows is an explanation of how I quantified the results, with layers of nuance that I hope will help customers find the right fit for their situation. Rather than provide a one-size-fits-all scoring methodology, I broke down the results with clear lines of separation between human-derived detection and machine-only detection. If you're trying to better understand the market or looking to choose a new EPP tool, what follows should be especially relevant to you. I've provided the GitHub link to the code and scoring files at the bottom. As always, I appreciate feedback that can help improve the methodology.

The views expressed in this analysis are mine alone and do not represent those of my employer, which partners with many of the vendors who participated in the MITRE evaluation.

Evaluation Basics

Let's take a quick look at the evaluation description before we jump into the results (please read the full evaluation information here). For Round 2 of the evaluation, MITRE focused specifically on emulating APT29. For those not familiar, this group is attributed to the Russian government and is believed to have been behind the DNC attack during the 2016 election cycle. To assess the detection capabilities of 21 EPP vendors, MITRE broke down the results using two key indicators: detection types and detection modifiers. It's worth outlining these types and modifiers because they have changed since the first round of evaluations.

Vendors that participated in Round 2 evaluation

Detection type is broken into six categories: Technique, Tactic, General, MSSP, Telemetry, and None. The categorical distinctions are based on the quality of the result, e.g. a specific alert notifying the analyst of exactly what happened and how it happened vs. raw telemetry that requires manual hunting. I defer to the MITRE page, which provides a more in-depth explanation of these categories.

Additionally, modifiers were included to provide context around the detection (i.e. was it an alert, delayed due to processing, a result of human investigation, etc.). MITRE provides six modifiers that I use here: Alert, Correlated, Delayed, Host Interrogation, Residual Artifact, and Configuration Change (I exclude the Innovative modifier; it's not yet clear to me what it really means).
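To make that structure concrete, here is a minimal sketch of how a single substep result could be represented. The class, field names, and the example substep ID are my own illustration, not MITRE's published schema or the layout of my scoring files.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubstepResult:
    substep_id: str              # hypothetical ID in the "1.A.1" style
    detection_type: str          # Technique, Tactic, General, MSSP, Telemetry, or None
    modifiers: List[str] = field(default_factory=list)  # e.g. ["Alert", "Delayed (Processing)"]

# e.g. a substep detected as a Technique with an Alert raised in the UI
example = SubstepResult("1.A.1", "Technique", ["Alert"])
```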

Scoring Methodology

As the press releases poured in, with quotes like "the most detections of any vendor" and "the highest fidelity results", I found myself more inclined to publish a scoring framework that could provide some necessary context for customers who don't follow these evaluations closely. It is a bit misleading when a "detection" could mean anything from generic telemetry that requires manual hunting to an alert with specific information presented directly to an analyst in the UI. My first thought, before trying to put numbers on paper, was to break the scoring down into a few distinct categories:

1. Immature SOC: an organization with limited security staff, often only IT staff. No manual hunting; relies heavily on the tool to provide alerts. No MSSP.

2. Mature SOC: an organization with multiple levels of SOC analysts, hunting capability (via tools or scripting), other security tools feeding a SIEM, and the ability to investigate and remediate.

3. MSSP: relevant to any maturity level; an attempt to quantify the value an MSSP adds to the base product (usually for a non-trivial fee).

Breaking down results into these distinct categories is the most important part of this evaluation. The results of a full-blown MSSP and manual hunting aren't particularly relevant to customers who can't afford those resources, and could ultimately steer them toward an incorrect product fit. Now that we have three scoring regimes, how do we quantify the detections themselves while taking the modifiers into account?

I settled on a 0–4 scoring scheme based on detection type, with a modifier weighting between 0 and 1. The more verbose the detection, the higher the initial score. The weighting is influenced by the speed with which the result gets to the analyst (e.g. an immediate alert vs. one delayed by cloud processing). Each of the three regimes above carries a different scoring key (see the GitHub link below for full details).

The final score for each vendor is computed by multiplying the detection score and the modifier weight for each of the 140 substeps and summing the results. In instances where there is more than one modifier, the modifier with the lowest weight is used (e.g. an Alert that is Delayed due to processing receives the Delayed weight). At this point there is no consideration for multiple detections of a single substep, although I believe multiple detections may indicate a higher probability of a positive result in a different test (just a hunch, with no quantitative justification at the moment).
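As a rough sketch of that computation (the actual code and scoring files are in the GitHub repo linked at the bottom), the per-substep and per-vendor scoring might look something like this in Python. The function names and the (detection type, modifiers) input format are assumptions for illustration, not the repo's exact interface.

```python
def score_substep(detection_type, modifiers, detection_scores, modifier_weights):
    """Score one substep: detection score times the (lowest) modifier weight."""
    base = detection_scores.get(detection_type, 0)
    if modifiers:
        # With multiple modifiers, the lowest weight wins (e.g. an Alert that is
        # Delayed due to processing receives the Delayed weight).
        weight = min(modifier_weights.get(m, 0.0) for m in modifiers)
    else:
        weight = modifier_weights.get("No Modifier", 1.0)
    return base * weight

def score_vendor(substeps, detection_scores, modifier_weights):
    """Sum the weighted scores across all 140 substeps for one vendor."""
    return sum(
        score_substep(detection_type, modifiers, detection_scores, modifier_weights)
        for detection_type, modifiers in substeps
    )
```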

Results

Immature SOC

This result can best be described as the EPP tool alone, without the support of real human analysis. In this instance, Telemetry is worth zero points since human intervention to detect adversarial behavior is realistically not occurring. Additionally, the manual modifiers are given a value of zero to reflect the lack of human intervention. I think these results are very important for small organizations that don't plan on purchasing managed hunt or MSSP services and have limited cycles to browse EDR logs. The top vendors in this category are those with the most Technique and Tactic detections, which are critical for organizations that need out-of-the-box triage automation.

Scoring:

Detection: {Technique: 4, Tactic: 4, General: 2, Telemetry: 0, None: 0}

Modifier: {Alert: 1.0, No Modifier: 1.0, Correlated: 0.75, Delayed (Processing): 0.5, all others: 0}
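Written out as the kind of dictionaries the scoring sketch above could consume (again, an illustration rather than the exact contents of my scoring files), the immature SOC key looks like this:

```python
IMMATURE_DETECTION = {"Technique": 4, "Tactic": 4, "General": 2, "Telemetry": 0, "None": 0}
IMMATURE_MODIFIER = {
    "Alert": 1.0,
    "No Modifier": 1.0,
    "Correlated": 0.75,
    "Delayed (Processing)": 0.5,
    # all other modifiers fall through to a weight of 0 in the scoring sketch
}

# A hypothetical Technique detection that is both an Alert and Delayed (Processing)
# scores 4 * 0.5 = 2.0 under this regime.
score = score_substep("Technique", ["Alert", "Delayed (Processing)"],
                      IMMATURE_DETECTION, IMMATURE_MODIFIER)
```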

Results for immature SOC scoring framework (score reflected below logo)

Mature SOC

This result reflects instances where a more mature security posture is in place, e.g. multi-tiered SOC analysts, an IR team, large-scale log aggregation, etc. Since human analysis is an advantage for these organizations, points are awarded for Telemetry detections. Additionally, modifiers that reflect human intervention (i.e. Delayed (Manual), Host Interrogation, and Residual Artifact) are now given a non-zero value to reflect their availability. I have purposefully left out the MSSP results as they are not necessarily indicative of an internal SOC capability. It's interesting to note that a few vendors make rather significant jumps once Telemetry, Host Interrogation, and Residual Artifact results are added (e.g. Carbon Black, SecureWorks). If you have a SOC or hunt team, these results better reflect what a human-machine tandem could hope to detect.

Scoring:

Detection: {Technique: 4, Tactic: 4, General: 2, Telemetry: 1, None: 0}

Modifier: {Alert: 1.0, No Modifier: 1.0, Correlated: 0.75, Delayed (Processing): 0.5, Delayed (Manual): 0.75, Host Interrogation: 0.75, Residual Artifact: 0.25, all others: 0}

* Host Interrogation and Residual Artifact are attached to the detection type None, so a default detection score of 3 is used.
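A small sketch of how that special case might be handled, assuming the same illustrative names as the earlier snippets (the default score of 3 is the value noted above, not something pulled from the repo):

```python
MANUAL_MODIFIERS = {"Host Interrogation", "Residual Artifact"}

def base_score(detection_type, modifiers, detection_scores, default_manual_score=3):
    """Detection score, with the None-plus-manual-modifier special case described above."""
    if detection_type == "None" and MANUAL_MODIFIERS & set(modifiers):
        return default_manual_score
    return detection_scores.get(detection_type, 0)

# e.g. a None detection carrying only a Residual Artifact modifier would score
# 3 * 0.25 = 0.75 under the mature SOC modifier key.
```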

Results for mature SOC scoring framework (score reflected below logo)

MSSP

For those considering whether to purchase an MSSP or managed hunt license, these results should help provide some justification. The scoring configuration matches the mature SOC, with the exception that the MSSP detection type receives a score of 4. The most interesting results in this category were the significant score jumps by SentinelOne (whose MSSP was able to detect nearly all of the previous Telemetry finds) and Microsoft (55 MSSP finds, with a high Telemetry conversion rate). If you are considering an MSSP purchase, looking at the conversion rate of Telemetry detections to MSSP detections gives you a plausible method for evaluating human capability. I'm not suggesting you can make significant differentiation decisions between 35 and 50 MSSP detections, but 50 vs. 5 is quite telling.
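As a rough illustration of that conversion-rate idea (the set-of-substep-IDs input format is my simplification, not the evaluation's data format):

```python
def telemetry_to_mssp_rate(telemetry_substeps, mssp_substeps):
    """Fraction of Telemetry-covered substeps that also produced an MSSP finding."""
    telemetry = set(telemetry_substeps)
    if not telemetry:
        return 0.0
    return len(telemetry & set(mssp_substeps)) / len(telemetry)
```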

Scoring:

Detection: {Technique: 4, Tactic: 4, MSSP: 4, General: 2, Telemetry: 1, None: 0}

Modifier: {Alert: 1.0, No Modifier: 1.0, Correlated: 0.75, Delayed (Processing): 0.5, Delayed (Manual): 0.75, Host Interrogation: 0.75, Residual Artifact: 0.25, all others: 0}

* Host Interrogation and Residual Artifact are attached to the detection type None, so a default detection score of 3 is used.

Results for MSSP scoring framework (score reflected below logo)

Final Thoughts

The level of participation in this evaluation suggests that the market is moving away from secretive testing and dubious detection metrics toward a more open and transparent evaluation process. I think each of the vendors should be applauded for taking the time to participate and be scrutinized. I want to be very clear: the results of this single test don't make or break a solution, and the vendor rankings above are not the be-all and end-all for making a decision. This evaluation and the first one are simply data points to better inform purchasing decisions. Efficacy, integration, UI capability, ease of deployment, and price are all important factors that can't be considered in isolation.

I do find it interesting that a handful of vendors appear to rise to the top regardless of deployment scenario (product only, MSSP, advanced SOC). I will point out that in the MSSP category, the difference between rank 9 and rank 13 is only 50 points, but the difference between the top vendors and the median is quite notable. If you dig into the data, it's clear that the MSSP offerings from these vendors were quite adept at turning their telemetry into findings, suggesting strong hunt teams. I hope that in the next round of evaluations some of the managed hunt providers (Red Canary, Root9B, etc.) can find a way in with their own or favorite EPP agent to better quantify what the managed hunt market is capable of.

I've taken a very specific scoring approach to this evaluation, so there will be gaps in my interpretation and in the complexity of the evaluation. For instance, two major factors that I would like to support are (1) multiple detections of a single substep (which may indicate a more robust solution) and (2) the attack stage of the substep (early detection vs. detection after full-scale compromise). However, I think this framework helps simplify a mountain of data that only the most hardened of us will dig through. My hope is that those trying to decide on an endpoint product, MSSP solution, or security stack have a way to inform their decision a little more, beyond pure marketing. Vendors can slice the data any way they like to sell a tool; my hope is to make the playing field a little more level.

Links

MITRE ATT&CK Evaluation: https://attackevals.mitre.org/APT29/

GitHub Code Link: https://github.com/jonticknor/maes
