ATT&CK Evaluations: Understanding the Newly Released APT29 Results
In late 2019, the ATT&CK Evaluations team evaluated 21 endpoint security vendors using an evaluation methodology based on APT29. Today we’re publicly releasing the results of those evaluations. We hope these results provide insight into how various endpoint security vendors detect behaviors aligned to ATT&CK.
The results comprise a robust dataset of detections, mapped to ATT&CK, that enable users to make informed decisions on what tools meet their needs, as well as how to improve detection capabilities of their current deployments.
The results do not provide a score or ranking of the participants. We do not declare a winner. As we did with our APT3 Evaluations, we present each vendor independently, and articulate their baseline capabilities. To assist with processing results, we have updated our technique comparison tool, adjusted the Joystick data analysis tool, and added a digital version of the Using Results Booklet. We’re also excited to release the APT29 Emulation plan and PowerShell scripts used for our evaluation, so you can run your own ATT&CK Evaluation based on your specific requirements.
We provide more context to the results and methodology in the following sections, so you can hit the ground running with the dataset.
Understanding What Qualifies as a Detection
The first step in our EDR evaluation methodology is identifying the adversary we’ll emulate. We build out the adversary profile through open source threat intelligence and translate observed actions of that adversary into behaviors, using ATT&CK. In some cases, ATT&CK mappings are present, but in many cases, we perform the mapping ourselves. For APT29, we also made a public call for intel to augment the open source intelligence and strengthen the profile.
Our next step is to develop a plan that is representative of the adversary behavior. The goal of an adversary emulation plan is not to replicate the adversary exactly, but to create a red team plan in the spirit of the adversary that matches our goal of evaluating endpoint security products. Our emulation plans go through many iterations, and the outcome is a step-by-step playbook we use for execution. As we implement each behavior at the procedural level (i.e., how exactly we will execute the techniques in the spirit of the adversary), we begin to analyze the behavior from a defensive analyst's perspective to generate baseline detection criteria.
With the plan broken down to the procedure level, we have a broad understanding of how the activity will affect the system from a blue team perspective, enabling us to create baseline criteria. The baseline criteria are subjective, but they represent the line in the sand for detecting our specific execution of each ATT&CK technique.
For the APT29 Evaluation, we heavily leveraged PowerShell scripting. Capturing script block logging theoretically supports the detection of a significant number of red team behaviors at some level. In Step 2A — Rapid Collection, there are five sub-steps, or different techniques that we look for. From the red team perspective, a single PowerShell script is run, which looks for a variety of file extensions on the host and compresses the matching files into an archive later used for Exfiltration. In 2.A.1 (File and Directory Discovery) and 2.A.2 (Automated Collection), our criterion was to observe PowerShell executing (Get-)ChildItem to locate the files of interest. However, for 2.A.3 (Data from Local System), our criterion was to establish evidence of PowerShell reading files off the user's system.
If a capability was only able to capture the PowerShell script block, it likely only received detections for 2.A.1 and 2.A.2, and it might have a "None" result for 2.A.3 if it did not capture file reads. We acknowledge that with analysis one could understand everything the script was doing, and hence the PowerShell script block could serve as evidence of Data from Local System, but in this instance we wanted to focus on the behavior of interacting with the files rather than the command that executed the interaction. For that reason, a "None" detection category in this scenario means only that it was not the detection we were looking for in the context of this evaluation.
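The actual Rapid Collection step was a single PowerShell script; as a language-neutral illustration of the same pattern (the function name and structure below are our own, not the evaluation's script), the three behaviors the criteria distinguish look roughly like this:

```python
import zipfile
from pathlib import Path

def collect_files(root, extensions, archive):
    """Sketch of the Step 2A "Rapid Collection" pattern: enumerate
    directories for files of interest (2.A.1 File and Directory
    Discovery), stage matches automatically (2.A.2 Automated
    Collection), and read their bytes off disk (2.A.3 Data from
    Local System) into an archive later used for exfiltration."""
    collected = []
    with zipfile.ZipFile(archive, "w") as zf:
        for path in Path(root).rglob("*"):       # directory enumeration (2.A.1)
            if path.is_file() and path.suffix in extensions:
                data = path.read_bytes()         # file read from the local system (2.A.3)
                zf.writestr(path.name, data)     # automated staging into the archive (2.A.2)
                collected.append(str(path))
    return collected
```

A script block log captures the enumeration commands (the 2.A.1/2.A.2 evidence), but only file-read telemetry shows the `read_bytes`-style access that 2.A.3's criterion asks for.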
Because the detection criteria are subjective, for the APT29 results we included the detection criteria in our final results tables for reference. Detection criteria were provided to vendors at the time of the evaluation, with minor adjustments occurring during review. For example, in 14.B.4, Credential Dumping, we decided command line execution of Mimikatz was more indicative of the tool than the behavior, so telemetry was adjusted to focus on the injection of Mimikatz into the Local Security Authority Subsystem Service (lsass.exe).
Understanding the Results
The results are generated over the course of an interactive evaluation where the vendor performs the role of detection analyst and the MITRE team performs the adversary and evaluator. For each step and sub-step, the evaluation starts with MITRE introducing then executing the adversary behavior. The vendor is told exactly what we did, where we did it, when we did it, and what potential detection data we are looking for. During the evaluation, we keep an open mind towards what defines a detection. We are driven by the detection criteria outlined above, but to ensure that we include comprehensive results, we ask that vendors show us all their detections. We walk through each detection, screenshot it, and take notes. We don’t declare valid detections during the evaluation, and instead hold that analysis until our internal review process is complete.
During our review process, we assign detection categories to the results data we deem valid for the given step. These categories were defined prior to the evaluation and are hierarchical. As you move through the categories from None through Technique, left to right, the analyst is provided more detail on what happened. This doesn't imply that every technique executed should aim for a Technique detection. Some techniques may only warrant basic telemetry, or even limited visibility, based on the vendor's detection strategy.
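The hierarchy can be modeled as an ordered enumeration. This is a minimal sketch using the main category names from this round (MSSP and the modifiers are tracked separately and omitted here; the numeric ordering is our simplification, not an official scoring):

```python
from enum import IntEnum

class DetectionCategory(IntEnum):
    """Main detection categories, ordered from least to most
    context provided to the analyst (illustrative ordering only)."""
    NONE = 0
    TELEMETRY = 1
    GENERAL = 2
    TACTIC = 3
    TECHNIQUE = 4

def richest_detection(categories):
    """A sub-step can have multiple detections; the one offering
    the most context is simply the highest category observed."""
    return max(categories, default=DetectionCategory.NONE)
```

Because the categories are hierarchical rather than additive, comparing sub-steps by their richest detection is one reasonable way to summarize results, though it discards the count of supporting detections.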
Beyond the definitions available on the site, there are some nuances to their application that are worth mentioning. First, as detection categories are hierarchical, there is some notion of implied detections. For example, many vendors have analytics developed at the technique level for a given behavior. When mapping to ATT&CK, they will list details that align with the technique name, as well as the ATT&CK tactics attributed to that technique. In these cases, we did not credit the vendor with separate Technique and Tactic main detection categories for the same detection. However, if they had multiple detections, we considered each independently.
Second, we decided not to use the Innovative category. Each solution has its own innovative components, some of which might not be easily captured by our methodology. For example, many vendors have impressive intrusion summary views, but we look at detections atomically, based around a single ATT&CK technique. Next, the Alert modifier was applied very liberally: even if something was low sensitivity, we still considered it an alert if it visually stood out or otherwise caught the eye of an analyst. Lastly, MSSP detections required human analysis as well as the appropriate proof of detection; if only one of these requirements was met, it wasn't included as an MSSP detection in our results.
Normalizing Results for Consistency and Objectivity
We assessed 21 vendors, many of which offer multiple user interfaces for exploring detections and raw data. Some views are dedicated to alerts, some are focused on correlation, and others on raw data access, but all were evaluated with the same detection categories and modifiers. Given the number of vendors and the differences between their interfaces, consistency and objectivity were key elements woven into our approach. When processing results and assigning categories, we initially considered each tool on its own. We then calibrated across all vendors for each sub-step, comparing and normalizing how detection categories were applied to both similar and unique detections. This ensured that each vendor was assessed independently, respecting their differences, while also maintaining consistency across the evaluations.
In each evaluation, not including configuration-change re-tests, we performed 20 steps consisting of 140 sub-steps (including Step 19, which was omitted after the fact). Each sub-step can have multiple detections. The final results are our interpretation of the data we collected during the evaluation and also incorporate the feedback vendors provided on our preliminary findings.
Using the Results
Similar to the APT3 Evaluations, our APT29 evaluation results are available in their raw form to end users as JSON files. You can explore each vendor's results, walk through them step-by-step, download them, and use the raw results. We also added an evaluation summary page for each vendor with Joystick data analysis tool visualizations. The graphics demonstrate the detection category distribution across steps and sub-steps, as well as the detection category modifier (e.g., Alert, Correlated, Delayed) distribution for each detection category. These visualizations provide an accessible view of overall evaluation performance. With this, you can more easily see whether detections were concentrated on a few critical steps or distributed across the day. You can also see if the vendor alerts on all their detections, or if they also implement an enrichment strategy (e.g., techniques with no Alert modifier).
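As a starting point for working with the raw JSON yourself, here is a minimal sketch of tallying detection categories. The field names below are hypothetical, chosen only for illustration; consult the published files for the real schema and adapt the accessors:

```python
import json
from collections import Counter

def category_counts(results_json):
    """Tally main detection categories across all sub-steps.
    NOTE: assumes a simplified, hypothetical layout (a list of
    sub-steps, each with a 'detections' list whose entries carry a
    'category' field). The published JSON schema differs."""
    counts = Counter()
    for substep in json.loads(results_json):
        for detection in substep.get("detections", []):
            counts[detection["category"]] += 1
    return counts
```

Counts like these feed naturally into the same distribution views the Joystick tool renders, which makes it easy to cross-check any custom analysis against the official visualizations.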
The Joystick tool was also updated. The tool improvements allow the same types of analysis as shown in the evaluation summary, but also enable you to interact with the data and focus on the detection categories most relevant to you. For example, if you don’t want to leverage a Managed Security Service, simply remove the MSSP category from your graphical analysis.
Performing Your Own Evaluations
To complement the results, which may or may not be entirely applicable to your environment due to tool configuration and network makeup, you can find our methodology here. We provide high-level details on what was executed and reference the intelligence we used to support the implementation decisions. We also provide links to the adversary emulation plan and scripts employed during the evaluation.
This provides greater context to our results — for example, why a vendor might get a detection for one credential dump, but not another. The emulation plan also allows you to replicate the evaluation under your own constraints for customized results. Do you want to see how PowerShell Script Block Logging performs in your environment? How about file reads? Want to include protections? How about evaluating your entire security stack? The emulation plan will help you do all these things. If human red teaming isn’t conducive to your environment, we also updated the Do It Yourself Evaluations with CALDERA to include an APT29 ATT&CK Evaluation profile.
The Future of Evaluations
We are in the Call for Participation for the next round of ATT&CK Evaluations, based on Carbanak and FIN7. Vendors can sign up by May 29, 2020 to be included in the next round. This upcoming round of ATT&CK Evaluations will be performed by MITRE Engenuity, MITRE's new tech foundation for public good. We will provide more details in the future, but we'll be focusing on making our results consumable and rapidly available to the public. If you're interested in participating in this evaluation, or would like to offer feedback so we can improve the content, please contact us.
©2020 The MITRE Corporation. ALL RIGHTS RESERVED. Approved for public release. Distribution unlimited 19–03607–5.