Managed Services Evaluations — Round 2 (2023): Attribution and Speed and Efficiency, Oh My!

Ashwin Radhakrishnan
Jun 7, 2023 · 14 min read

We are excited to announce some innovative updates to MITRE Engenuity’s upcoming ATT&CK® Evaluations: Managed Services — Round 2 (2023). For the first time, the Evaluations team followed a product management lifecycle process across a full cycle, which allowed us to maintain our foundation while iterating on the areas we learned needed improvement. Immediately after our OilRig release on November 9, 2022, we embarked on a structured “Feedback Gathering” stage dedicated to listening to our end-user and vendor communities. That feedback was incorporated into our strategy and validated through deep-dive sessions with whoever reached out to us. This release also serves as our roadmap update, so strap in: there is a lot to cover.

Components of this Release Announcement

First, let us cover the individual components of this release. This article is broken down into the following sections, each covering a key aspect of our upcoming Evaluation:

  1. Components of this Release Announcement — individual aspects of this release.
  2. Solving the Problem — description of the problems relevant to Managed Services Evaluations.
  3. Improvements in Round 2 — deep dive on the major improvements featured this round to address the problems and questions raised.
  4. Next Steps — next steps and timelines for the remainder of this round.

As always, please refer to our Overview Page to get a breakdown of the approach we are taking this year. The Overview Page includes:

  1. Evaluation Summary — the general purpose of this Evaluation.
  2. Methodology — the methodology that will be used in this Evaluation.
  3. Environment — the environment that will be used in this Evaluation.
  4. Target Organization Profile — an example of the target based on the publicly available CTI of the adversary or adversaries selected for this Evaluation.
  5. Reporting — this section describes the “Reporting Statuses” that will be assigned to the reports sent to the MITRE Engenuity Execution Team during the Evaluation.
  6. Additional Metrics — this section describes the new metrics we are introducing in this Evaluation.
  7. Additional Resources — this section contains all the published collateral for this round.

If you are a managed service provider interested in hearing more, please fill out this Participation Form, and we will reply shortly.

Solving the Problem

The hardest problem to solve in Evaluations is maintaining the principles of objectivity, consistency, and rigor within each participant’s engagement, while publishing results that are sufficiently nuanced without leading to inadvertent editorialization. Sentence structure aside, if that statement was a lot to absorb, consider the complexity of cultivating an effective Evaluation methodology that checks every box in that very narrow niche. While Evaluations have other intrinsic value, solving the problem as stated is incredibly important because it underpins our community’s collective ability to be proactive in fighting adversaries.

Goals for ATT&CK Evaluations

As shared in our initial release article, the ATT&CK Evaluations team intends to solve that problem in pursuit of these three goals:

  1. Empower end-users with objective insights into leveraging specific commercial cybersecurity capabilities to address known adversary behaviors.
  2. Provide transparency around the true capabilities of commercial cybersecurity offerings to address known adversary behaviors.
  3. Drive the cybersecurity vendor community to enhance their offerings to better address known adversary behaviors.

Each of our goals is structured to advance the cybersecurity community’s ability to address known adversary behaviors, which in turn allows all of us to be more strategic in preventing adversaries from committing data breaches, bringing down critical infrastructure, committing fraud, and more.

It will always be far easier to demolish than to build. Even in Aristotelian logic, a universal affirmative proposition must be proven true categorically, while disproving one requires only a single contradictory counterexample. This asymmetry holds in most regards, and it is doubly apparent when we examine how breaches occur in the real world. We are successful when we defend categorically, while adversaries are successful when they find just one way to breach our defenses. This is an inherent imbalance, and the unfortunate reality is that adversaries are more successful than we would like them to be. Nonetheless, we are tasked with preventing breaches, and a resilient strategy is the best way to defend against that imbalance because it allows us to proactively cover adversary tactics and techniques rather than reactively defend against the technical observables adversaries leverage. The MITRE ATT&CK Framework serves as a foundation for the development of specific threat models and methodologies in the private sector, in government, and in the cybersecurity product and service community. Leveraging the ATT&CK Framework to design and implement a threat-informed defense allows you to chip away at that imbalance and fortify your defenses on a more strategic basis.

Though describing how to design and implement a threat-informed defense is a deep conversation in and of itself, at its most abstract level any strategic threat-informed cybersecurity posture requires a tailored composition of people, process, and technology. The upcoming Managed Services Evaluation Round 2 is structured to evaluate commercial offerings regarding the people (and parts of the process) in that triangle. Any substantive Evaluation of managed services offerings must therefore be inherently different from a technology (i.e., product) Evaluation so that the elements of a holistic threat-informed defense not covered by our Enterprise Evaluations can be evaluated.

The most consistent feedback we received from our community regarding the OilRig round was that it was far too focused on detection coverage, even though detection coverage is not what primarily contributes to the decision-making calculus of actual and prospective clients of managed services.

If you are in this camp, I would urge you to re-read the content from our last release, because the published results dive far deeper than detection coverage. The insights our methodology surfaces in each Evaluation round are distinctive, and merely examining the colors of each sub-step within the Emulation Plan is an ineffective way to interpret the core of what is published. To discourage this type of analysis, our strategy for this round is to emphasize far more than mere detection coverage by implementing significant changes to the Core Platform of ATT&CK Evaluations. We invest consistently in the Core Platform, which comprises the Evaluations methodology, the technology we use to collect data during each Evaluation (called Arcade), and the website where we publish results.

Improvements in Round 2

Now let us readdress the problem.

How do we publish results that are sufficiently nuanced, without leading to inadvertent editorialization?

Our initial answer was to publish everything we collected and let the community interpret the results. As we learned, however, that has led to analyses that are broadly focused on detection coverage — or whatever else is presented at the surface — even though deeper analysis is achievable with the information that was published. To make this deeper analysis more accessible, our new and improved approach is to publish additional, objectively and consistently collected qualitative and quantitative data that will help the community interpret the results in a way that is relevant to their own strategy for a threat-informed defense.

To capture those nuances in each participant’s Evaluation, it is first important for each managed service to be given the opportunity to define their unique approach. This way, participants can be measured against what they would like to emphasize to their prospective and existing clients. Once their approach has been defined, they must be evaluated against an extremely rigorous Emulation Plan so that they can truly be challenged. After the data has been collected in an objective and consistent manner, the results must then be published in a way that resonates with the end-users who want to use them. To confirm that we were on the right path, we announced an open call for strategy validation sessions in our initial release article. After engaging with interested parties, the strategy validation sessions were in fact very validating, and the following are the conclusions we drew from that exercise.

What are ways we can make the scenario more complex?

As we investigated how to make the Emulation Plan more challenging this round, we prioritized increasing the breadth and depth of the scenario. While breadth refers to the variety of adversary behavior emulated in the scenario, depth refers to the sophistication and technical complexity of those emulated behaviors. Though we cannot share the specifics of these improvements due to the black box nature of our Evaluation, significant resources have been put into improving both aspects. Additionally, since our Evaluation methodology is structured to discover how vendors can find the proverbial needle in the haystack, another way we are making the scenario more challenging is by increasing the size and realism of that haystack. To that end, this round will feature more realistic noise generation that will make it harder for participants to discern suspicious and malicious behavior from benign activity.

How do we approach qualitative information collected in each Evaluation?

The key qualitative categories introduced in this round are attribution and remediation. Attribution is a key component of a strategic defense because it directs teams to allocate resources in a more concerted manner based on observed activity. Many of the managed service participants in the OilRig round identified the adversary, and this round we are officially supporting the display of that content in the published results. We follow a black box methodology in Managed Services Evaluations, which means that we will not publish the Technique Scope until we publish the results from the round. This ensures that participants do not have any insight into which adversary behavior we are emulating.

That said, our CTI research has yielded intriguing information, so we are introducing the Target Organization Profile, generated based on the adversary behaviors that will be emulated in this round. Remember, for both end-users and managed services, we perform adversary emulation, not simulation. This means that the Evaluation and ensuing results are relevant for your organization regardless of whether this Target Organization Profile represents your focus, because the emulated behavior should be defended against even if it is not being used by the adversaries we choose. We will be releasing more content regarding the Target Organization Profile in the coming months, so stay tuned for a deeper dive.

Target Organization Profile

The next major improvement is capturing remediation guidance. We investigated a variety of strategies to support implementing remediation, and ultimately determined that actually implementing the remediation suggestions provided by participants would prevent us from collecting consistent data on each subsequent step in the Emulation Plan. For instance, if a participant told us to prevent “gosta” from downloading and opening the Word document with the malicious macro in the very first step of last year’s OilRig scenario, we would not have been able to capture results for the remainder of the Evaluation. For this round we will allocate real estate on our results website to publish remediation suggestions sent to us in an organized format. While validating the efficacy of these remediations is out of scope for this round, they will represent the managed service ‘value add’ to a client in the context of what is being emulated.

What are some quantitative metrics we can capture and publish this round?

There are four main categories of metrics we are looking to capture this round: reporting status, type of report, speed, and efficiency.

Reporting Statuses were purposefully basic in the OilRig round, and we learned that greater depth is required to describe reported activity. While we believe in the importance of elevating tactics and techniques in most regards, clients of managed services do not require that level of depth in a standard engagement. For example, if you are a CIO who is tasked with cybersecurity and are onboarding a managed service to help, you likely won’t appreciate being told about every single technique/sub-technique an adversary used. This may get frustrating and become noisy in ways you were looking to avoid by onboarding a managed service in the first place. Moreover, every CIO will have different requirements for the tactics and techniques they would like in their reports. On the flip side, if you are a client who is augmenting your existing cybersecurity analyst team with a managed service, the entire catalogue of tactics and techniques could be incredibly valuable. Evaluations are intended to collect data that is helpful in all circumstances, while remaining relevant enough to allow for nuanced analysis in specific circumstances. Since each managed service vendor may have a specific approach that caters to a specific clientele, we do not want to restrict their strategy by defining which approach we think is best. To that end, we have introduced two descriptors of “Reported” behavior; the definitions are highlighted here, followed by a brief illustrative sketch:

Reporting Status

1. Reported — MITRE Engenuity was notified of the adversary behavior being evaluated and sufficient context was given to explain the activity.

a) General — activity was reported to the Evaluations team with sufficient context, though no additional information was provided regarding the tactic and/or technique of that behavior.

b) Tactic/Technique — activity was reported to the Evaluations team with sufficient context, and additional information was provided regarding the tactic and/or technique of that behavior.

2. Not Reported — MITRE Engenuity was not notified of the adversary behavior being evaluated or the information provided did not contain sufficient context to be considered “Reported.”

3. Not Applicable — Execution failed, so the managed service provider did not have the opportunity to report on the activity.
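
To make these statuses easier to work with when analyzing published results, here is a minimal Python sketch of the taxonomy above. The class and member names are our own illustration, not an official schema or tooling from the Evaluations team.

```python
from enum import Enum

class ReportingStatus(Enum):
    """Illustrative encoding of the Round 2 Reporting Statuses (names are hypothetical)."""
    REPORTED_GENERAL = "Reported (General)"                     # sufficient context, no tactic/technique detail
    REPORTED_TACTIC_TECHNIQUE = "Reported (Tactic/Technique)"   # sufficient context plus tactic/technique detail
    NOT_REPORTED = "Not Reported"                               # not notified, or insufficient context
    NOT_APPLICABLE = "Not Applicable"                           # execution failed, no opportunity to report
```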

Managed services are diversified in many ways, and one of the most important characteristics of that diversification is the set of form factors (instant message, email, PDF reports, dashboard access, etc.) each uses to deliver its service. Each client may only be able to support specific form factors, and even within client organizations, individual team members may have a variety of preferences. We will highlight that diversity by categorizing the form factor of each report that is deemed to have a “Reported” status.

Participants will define the form factors to be used before the Evaluation starts. For example, if a vendor participant uses email and daily PDF reports to conduct their service, each report given a “Reported” status will also include either “Email” or “PDF Report” as an additional level of context for end-users to see, as sketched below. The full list of supported form factors will be determined after the cohort closes, and it is our intent to include as many form factors as possible. With these added metadata points, each individual report will have a far higher level of definition than in last year’s Evaluation.
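
Continuing the earlier illustration, a single report could then carry both its status and its participant-defined form factor as metadata. The field names, the sub-step ID format, and the example values below are hypothetical; the supported form factors will only be finalized after the cohort closes.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class EvaluationReport:
    """One report received from a participant, tagged with illustrative metadata."""
    substep_id: str        # Emulation Plan sub-step the report addresses (hypothetical ID format)
    status: str            # a Reporting Status, e.g. "Reported (Tactic/Technique)"
    form_factor: str       # participant-defined form factor, e.g. "Email" or "PDF Report"
    received_at: datetime  # when MITRE Engenuity viewed the report

# Example: an emailed report with tactic/technique context for a hypothetical sub-step
report = EvaluationReport(
    substep_id="1.A.1",
    status="Reported (Tactic/Technique)",
    form_factor="Email",
    received_at=datetime(2024, 1, 15, 9, 42),
)
```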

Managed services offerings are also typically evaluated by more than just their ability to report adversary behavior. They are evaluated by the speed and efficiency of their service as well. Within the methodology of our Evaluation, the following timestamps can be objectively and consistently observed in each Evaluation:

Additional Metrics
  1. Activity Executed — the specific timestamp that our Red Team executed a specific sub-step within the Emulation Plan.
  2. Activity Reported — the specific timestamp that we viewed the report of the executed activity sent by the participant.
  3. Remediation Suggested — the specific timestamp that we received the participant’s remediation suggestion.

Note: Some managed services may decide to report and provide a remediation suggestion at the same time. In this scenario, the time to detect and the time to respond will be the same, and the time to investigate will effectively be zero.

Since speed definitionally requires a value for “time elapsed,” this methodology allows us to display the speed of each managed service during the Evaluation in an objective and consistent way. Due to the structure and breadth of the Emulation Plan, each measurement of the mean times to detect, investigate, and respond will be shown at the Emulation Plan step level and in aggregate for the entire Evaluation. These are the definitions of the key metrics that will be published, with a worked illustration after them:

  • Mean Time to Detect (MTTD): average time between Activity Executed and when the Activity Reported content is viewed by MITRE Engenuity.
  • Mean Time to Investigate (MTTI): average time between when Activity Reported content is viewed and when the Remediation Suggestion is received by MITRE Engenuity.
  • Mean Time to Respond (MTTR): average time between Activity Executed and when the Remediation Suggestion is received by MITRE Engenuity.
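
As a worked illustration of these definitions (not the official calculation code), the three means fall directly out of the three timestamps collected per sub-step. The timestamps below are invented purely to show the arithmetic.

```python
from datetime import datetime
from statistics import mean

# Invented timestamps for two sub-steps: (executed, reported, remediation_suggested)
observations = [
    (datetime(2024, 1, 15, 9, 0), datetime(2024, 1, 15, 9, 30), datetime(2024, 1, 15, 10, 0)),
    (datetime(2024, 1, 15, 11, 0), datetime(2024, 1, 15, 11, 10), datetime(2024, 1, 15, 11, 40)),
]

# Time to Detect: Activity Executed -> Activity Reported viewed
ttd = [(reported - executed).total_seconds() / 60 for executed, reported, _ in observations]
# Time to Investigate: Activity Reported viewed -> Remediation Suggestion received
tti = [(remediated - reported).total_seconds() / 60 for _, reported, remediated in observations]
# Time to Respond: Activity Executed -> Remediation Suggestion received
ttr = [(remediated - executed).total_seconds() / 60 for executed, _, remediated in observations]

print(f"MTTD: {mean(ttd):.1f} min")  # (30 + 10) / 2 = 20.0 min
print(f"MTTI: {mean(tti):.1f} min")  # (30 + 30) / 2 = 30.0 min
print(f"MTTR: {mean(ttr):.1f} min")  # (60 + 40) / 2 = 50.0 min
```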

Even though we are not validating the remediation suggestions, we will still have the timestamp at which those suggestions were sent to our Execution Leads (who act as the pseudo-client in this format). That means we can capture Time to Respond and calculate the MTTR throughout each Evaluation. Participants may be able to offer inaccurate remediation suggestions to lower their MTTR and MTTI. This is a known limitation of our methodology, but it is important to note that the remediations themselves are verifiable, whether or not the Evaluations team verifies them during execution.

Measuring speed is unproductive without also considering efficiency. Spamming and other forms of trigger-happy alerting may lead to a lower MTTD, MTTI, and MTTR during an Evaluation, but would not be valuable to a client in the real world. There must be a control that encourages efficient communication within the construct of this Evaluation. One of the key reasons a client would bring on a managed service rather than onboarding another product is that their strategy requires offloading relevant workflows to an external team. If a managed service is inundating your inbox with emails, peppering you with instant messages, creating hundreds of notifications on a dashboard, and so on, the service is not being efficient, even if it is quick to tell you something is wrong. To that end, beginning this round, we will curate a running count of each report sent to us during the Evaluation, based on the form factor categories defined by the participant. This running count will be shown at the Emulation Plan step level and in aggregate for the entire Evaluation.
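
A minimal sketch of how such a running count could be tallied, assuming invented step labels and form factors; the grouping by Emulation Plan step and in aggregate mirrors the description above, and is not the Evaluations team's actual tooling.

```python
from collections import Counter, defaultdict

# Invented reports: (emulation_plan_step, form_factor)
reports = [
    ("Step 1", "Email"), ("Step 1", "Email"), ("Step 1", "PDF Report"),
    ("Step 2", "Email"), ("Step 2", "Instant Message"),
]

# Count reports per step, broken down by form factor
per_step = defaultdict(Counter)
for step, form_factor in reports:
    per_step[step][form_factor] += 1

# Aggregate count across the entire Evaluation
aggregate = Counter(form_factor for _, form_factor in reports)

for step, counts in per_step.items():
    print(step, dict(counts))
print("Aggregate:", dict(aggregate))
# Step 1 {'Email': 2, 'PDF Report': 1}
# Step 2 {'Email': 1, 'Instant Message': 1}
# Aggregate: {'Email': 3, 'PDF Report': 1, 'Instant Message': 1}
```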

With these improvements, we hope to steer the community away from interpretations that emphasize mere detection counts and empower deeper analyses of the unique offering of each managed service provider within the context of the Emulation Plan.

Next Steps

If you are interested in participating in this round, please fill out this form. As an update to the approximate timelines we shared, here are the confirmed timelines for the next phases:

  1. Orientation Phase — onboarding all participants to the engagement to prepare them for subsequent phases. This phase will begin two weeks after closing the Call for Participation, and we will be reaching out to participants directly to coordinate.
  2. Setup and Execution Phases — what most consider the actual Evaluation, the first black box engagement will commence with participants on January 15, 2024.
  3. Results Phase — once we finish collecting the results for all participants in the cohort during execution, our team commences a processing and calibration phase. The start date and length will be determined by the size of the cohort.
  4. Publication Phase — after results have been processed and shared back with the participants, we dedicate this phase for our stakeholders to prepare release collateral with access to all the finalized results. This date will also be determined by the size of our cohort, and the projected timeframe for final release is currently the third quarter of 2024.

Thank you, and we are looking forward to our most engaging Evaluation yet!

© 2023 MITRE Engenuity, LLC. Approved for Public Release. Document number AT0045
