The Limitations of Mean Time to Detect

Series: Measuring the effectiveness of a detection engineering program

Gary Katz
Jun 21, 2023

This blog series is based upon excerpts from a book I am writing with Megan Roddie and Jason Deyalsingh.

Within this blog series we will look at effectiveness from multiple perspectives:

  1. Historical Detection Effectiveness: How effective was the organization at detecting previous attacks? This is primarily shown through mean time to detect.
  2. The Building Blocks: Defining true positives, false positives, etc. and using them to calculate precision, recall and graph ROC curves.
  3. Detection Coverage: How effective are the organization’s detections at identifying different attack techniques?
  4. Detection Drift: Automatically identifying false negatives to determine how much our detections are becoming less effective.
  5. Detection Volatility: How long are detections effective once they are deployed or updated?

The Limitations of Mean Time to Detect

There are few metrics that describe the effectiveness of a detection engineering program with high fidelity. This blog series will dive into metrics that attempt to answer that question. Effectiveness is a measure of success: how well is the detection engineering team actually helping to achieve the SOC’s goals? That should correlate directly with the value the team provides. Quantifying the value of a cyber security program is always difficult. The cost of a cyber-attack cannot be quantified until after it has occurred and the damage has been assessed, so it is hard to quantify how much stopping an attack has saved the organization. Should the value of a cyber security organization be judged based upon the number of attacks performed by adversaries? Should an attack attempting to steal a random user’s credit card number be counted the same as an advanced adversary attempting to steal the company’s intellectual property or hold it for ransom? Stopping thousands of indiscriminate phishing attacks may not matter as much to the company’s bottom line as stopping one determined adversary.

Rather than attempting to calculate the financial cost of successful attacks against an organization, or the more difficult task of calculating the potential cost of a prevented attack, we will evaluate effectiveness as how difficult it is for an adversary to evade our detections. In this blog, we will first review one of the most popular metrics for historical SOC effectiveness and understand its limitations. Next, we will review definitions for true positives, false positives, true negatives, and false negatives. These definitions are the building blocks for many metrics within a DE program. We will look at high-fidelity validation coverage metrics to evaluate how well the detection engineering program has supported the SOC in preventing future cyber-attacks against the organization. The series will then introduce additional metrics that identify when detections begin to decrease in effectiveness and how to determine which detections remain effective over longer periods of time.

Historical effectiveness of a SOC is a look at how well the organization was protected against past cyber-attacks. Traditionally this is measured with Mean Time to Detect (MTTD), defined as the mean time for the organization to detect an attack performed by the adversary. The metric has the benefit of being easy to calculate: take the duration of each attack (detection time minus start time) and then take the average of those values. The metric is popular for good reason. The goal of the SOC is to prevent an intrusion, and when one occurs, we need to quantify how quickly that attack was identified. MTTD is an excellent way to depict that understanding. It is also easy to compute and well understood by the industry.

MTTD = (1/n) * Σ (detection time_i − start time_i), where n is the number of attacks and i is the attack instance (i = 1…n).
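As a quick illustration, here is a minimal sketch of the calculation in Python. The incident records and their field names are hypothetical; real values would come from your incident tracking system.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when each attack began and when it was detected.
# In practice these timestamps would come from your incident tracking system.
attacks = [
    {"start": datetime(2023, 5, 1, 9, 0),   "detected": datetime(2023, 5, 1, 13, 30)},
    {"start": datetime(2023, 5, 7, 22, 15), "detected": datetime(2023, 5, 9, 8, 0)},
    {"start": datetime(2023, 5, 20, 3, 45), "detected": datetime(2023, 5, 20, 4, 10)},
]

def mean_time_to_detect(attacks: list) -> timedelta:
    """Average of (detection time - start time) across all *detected* attacks."""
    durations = [a["detected"] - a["start"] for a in attacks]
    return sum(durations, timedelta()) / len(durations)

print(mean_time_to_detect(attacks))  # 12:53:20 for the sample data above
```

Note that only attacks that were actually detected can appear in this list, which is exactly the limitation discussed next.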

Despite its popularity, MTTD suffers from being a historical evaluation of effectiveness. You can only calculate this metric AFTER the adversary has performed an attack, and it assumes the SOC identified the attack in the first place. If the adversary is fully successful, the attack is not included in the MTTD metric: MTTD only includes attacks you have detected and says nothing about attacks you never detect. This doesn’t mean that mean time to detect is not a valuable metric. It is. But if you are attempting to provide metrics on how valuable your team’s work is in preventing the next attack, MTTD does not describe that. It describes historical facts, not future ones.

Historical performance is only valuable within the context of understanding the adversary. The adversary is lazy: they will reuse the same attack as long as it is successful. This means either that the attack was successful on your network and they believe it will be successful again (with some minimal variation), or that the attack will be successful on someone else’s network, in which case they will bother those people and leave you alone. The attacker will, however, alter their procedures if they find them unsuccessful within those parameters, which the above metric does not reflect. It says nothing about how your SOC will perform against significantly new (relative to your organization) types of attacks. The metric only defines how well the SOC performed against yesterday’s attacks.

MTTD is also defined by the adversaries. The adversaries define the variation in the tests used for the computation, and there is no guarantee that their tests are complete. In fact, the opposite is true. The ‘test’ performed by the adversary is one attack for the technique that was detected. Forensic analysis will identify additional techniques that the adversary performed successfully prior to detection, and that analysis can be used to extrapolate further along the kill chain using the processes defined in Lockheed Martin’s Intelligence-Driven Defense paper. Even so, this analysis only identifies the attack procedures from that one attack.

The distinction is a key issue limiting MTTD’s use as an effectiveness metric for detection engineering. The metric is essentially the equivalent of a User Acceptance Test (UAT) performed on a software system. The users play around, the developers see if anything breaks, and some feedback is provided on the usability or performance of the system, but a development team would never use UAT as their primary test performance metric. Instead, software systems have test coverage and pass-rate metrics defined by testing teams identifying edge cases. Developers build unit tests to check that nothing breaks as updates are made to the code. Automated static and dynamic analysis is used to test the system for security vulnerabilities. Detection engineering effectiveness metrics should similarly reflect the coverage and pass rates for detecting or mitigating the full range of potential attacks rather than only those seen in previous attacks.
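To make the analogy concrete, a detection can be exercised the same way a unit test exercises code: replay known samples against the rule logic and assert the expected outcome. The rule, event fields, and samples below are hypothetical, intended only as a sketch of the idea.

```python
# A minimal sketch of "unit testing" a detection, assuming a hypothetical
# rule implemented as a Python predicate over a parsed log event.
def detects_encoded_powershell(event: dict) -> bool:
    """Hypothetical detection: flag PowerShell launched with an encoded command."""
    cmdline = event.get("command_line", "").lower()
    return "powershell" in event.get("process_name", "").lower() and "-enc" in cmdline

def test_fires_on_known_bad_sample():
    event = {
        "process_name": "powershell.exe",
        "command_line": "powershell.exe -enc SQBFAFgA...",  # truncated sample payload
    }
    assert detects_encoded_powershell(event)

def test_stays_quiet_on_benign_admin_activity():
    event = {
        "process_name": "powershell.exe",
        "command_line": "powershell.exe -File Get-Inventory.ps1",
    }
    assert not detects_encoded_powershell(event)
```

Run with a test runner such as pytest; every time the rule or the underlying log pipeline changes, the same samples can be replayed to confirm the detection still behaves as expected.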

To define and track true effectiveness metrics for detection engineering, we need to put some limits on what we track and at what fidelity. There are over a dozen MITRE ATT&CK matrices for various environments, each with techniques and sub-techniques numbering into the hundreds, which can in turn be executed using a range of procedures. It is therefore impractical to track effectiveness metrics for each of these techniques and sub-techniques. Any organization needs to prioritize where to focus its energy, and those focus areas also define the fidelity at which metrics are tracked. MITRE has released tools to support this effort, including MITRE Top ATT&CK Techniques (https://top-attack-techniques.mitre-engenuity.org/), which proposes the most impactful techniques a team should focus on detecting based upon a short survey about the organization. Organizations can also use internal and external threat intelligence, as well as knowledge of their own infrastructure and assets, to prioritize their detection engineering efforts. This analysis should result in grouping techniques into one of three categories.

1. Top 10 or 20 High Impact Techniques: Techniques the organization has identified as high value and that should receive a concentrated detection effort. These are techniques for which we have either completed, or plan to complete, an in-depth investigation into the attack procedures, resulting in high-coverage detections. As the team works through these techniques, additional ones can be identified from the second group and migrated into this group.

2. Medium to High Impact: Important techniques that we want to track but for which we have accepted the risk of not performing an in-depth analysis. For these techniques we will create detections based upon open-source reporting and leverage 3rd party detections, but they are not currently part of our backlog for in-depth analysis.

3. Low Impact or Low Fidelity/Visibility: Low impact, not applicable, or rarely used techniques (e.g., reconnaissance is low impact).

Let’s review each of these, starting with the least critical and working our way upward. Low impact or low fidelity/visibility techniques are techniques for which the organization has accepted having minimal detection. This may also include techniques that are not applicable to the organization, such as ICS-specific techniques in a Windows-only environment, or that have been mitigated in other ways. Alerts from detections that do exist for these techniques will most likely be informational rather than reviewed by an analyst. Medium to low maturity SOCs may ignore default alerting or put no effort into creating detections for these techniques. For example, we may have tools that detect scanning of our network but only use this information for forensic analysis. In these circumstances coverage metrics do not make sense; we have accepted the risk of not having visibility. Instead, we only want to capture efficiency metrics. This allows us to identify whether SOC analysts are still triaging these alerts and how much valuable time is being spent on them. Even though we are not building detections for these techniques, identifying and categorizing them is still valuable so they are not included in our medium and high impact technique metrics.
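As one example of such an efficiency metric, the sketch below totals analyst triage time spent on alerts mapped to each tier. The alert records and field names are hypothetical; a real implementation would pull them from your SIEM or case management system.

```python
from collections import defaultdict

# Hypothetical triage records: each alert carries the ATT&CK technique it maps to,
# the tier that technique was assigned to, and the analyst minutes spent triaging it.
triaged_alerts = [
    {"technique": "T1595", "tier": 3, "analyst_minutes": 6},   # Active Scanning
    {"technique": "T1595", "tier": 3, "analyst_minutes": 4},
    {"technique": "T1566", "tier": 1, "analyst_minutes": 35},  # Phishing
]

def minutes_per_tier(alerts):
    totals = defaultdict(int)
    for alert in alerts:
        totals[alert["tier"]] += alert["analyst_minutes"]
    return dict(totals)

# Time spent triaging tier-3 alerts is effort we intended not to spend.
print(minutes_per_tier(triaged_alerts))  # {3: 10, 1: 35}
```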

Medium to High Impact techniques are those that the organization has identified as valuable but has not selected for in-depth analysis. In these circumstances we may be reliant upon detections included within 3rd party tools or open-source detections. For most SOCs, the majority of techniques will likely fall within this category. Here we may create some low-fidelity coverage metrics, such as the number of detections per technique, or efficacy metrics, such as false-positive rates.

High Impact Techniques are techniques the SOC has identified as commonly used by adversaries that target their organization’s infrastructure and thus are marked for in-depth analysis. These are the techniques the SOC is building custom detections for, and for which it should track metrics that answer the difficult question: how well are we doing?

Tiering techniques allows a detection engineering team to prioritize their resources and capture metrics that reflect those prioritizations. The breakdown below summarizes which metric types could be captured per tier, based on the descriptions above.

Tier 1, High Impact (Top 10 or 20): high-fidelity coverage metrics for custom detections.
Tier 2, Medium to High Impact: low-fidelity coverage metrics (e.g., detections per technique) and efficacy metrics (e.g., false-positive rates).
Tier 3, Low Impact or Low Fidelity/Visibility: efficiency metrics only (e.g., analyst time spent triaging).
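One way a team might encode this tiering so that metric collection can key off it is a simple lookup structure. The technique IDs, tier assignments, and metric names below are purely illustrative.

```python
from enum import IntEnum

class Tier(IntEnum):
    HIGH_IMPACT = 1         # top 10-20 techniques: in-depth analysis, custom detections
    MEDIUM_HIGH_IMPACT = 2  # tracked via open-source / 3rd-party detections
    LOW_IMPACT = 3          # accepted risk, informational only

# Illustrative tier assignments keyed by ATT&CK technique ID.
technique_tiers = {
    "T1566": Tier.HIGH_IMPACT,         # Phishing
    "T1059": Tier.HIGH_IMPACT,         # Command and Scripting Interpreter
    "T1105": Tier.MEDIUM_HIGH_IMPACT,  # Ingress Tool Transfer
    "T1595": Tier.LOW_IMPACT,          # Active Scanning
}

# Which metric families we bother to compute per tier, mirroring the breakdown above.
metrics_per_tier = {
    Tier.HIGH_IMPACT: ["high-fidelity coverage"],
    Tier.MEDIUM_HIGH_IMPACT: ["detections per technique", "false-positive rate"],
    Tier.LOW_IMPACT: ["analyst time spent (efficiency)"],
}
```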

A common low-fidelity coverage visualization is mapping your detections to a MITRE ATT&CK matrix, with each technique colored according to the number of detections that have been created for it. This visualization is easy to produce, and many tools, including MITRE’s ATT&CK Navigator, will automatically provide some form of the matrix. The number of detections, though, is a poor representation of coverage: one well-crafted detection may provide higher coverage of a technique than ten detections, and ten detections of individual procedures may provide little value if the adversary can choose from hundreds of variations. If a leader trusts the quality of the detections created by their organization, the visualization does have value, especially when paired with intelligence. Comparing your detection attack matrix to other matrices can help prioritize new detection creation, support gap analysis, and identify which techniques should be included in your coverage calculations. There are numerous examples of using MITRE ATT&CK in this fashion.
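As a sketch of how such a heat map might be produced, the snippet below counts detections per technique and emits a JSON layer that ATT&CK Navigator can load. The detection-to-technique mapping is hypothetical, and the layer version field may need adjusting to match the Navigator release you use.

```python
import json
from collections import Counter

# Hypothetical mapping of detection rules to the ATT&CK technique IDs they cover.
detections = {
    "suspicious_powershell_encoded_cmd": ["T1059.001"],
    "office_spawning_shell": ["T1059", "T1566.001"],
    "inbound_phishing_attachment": ["T1566.001"],
}

counts = Counter(tid for techniques in detections.values() for tid in techniques)

layer = {
    "name": "Detections per technique",
    "domain": "enterprise-attack",
    # The layer version below is an assumption; align it with your Navigator instance.
    "versions": {"layer": "4.4"},
    "techniques": [
        {"techniqueID": tid, "score": count} for tid, count in counts.items()
    ],
    "gradient": {
        "colors": ["#ffffff", "#66b1ff"],
        "minValue": 0,
        "maxValue": max(counts.values()),
    },
}

with open("detection_coverage_layer.json", "w") as f:
    json.dump(layer, f, indent=2)
```

Loading the resulting file in Navigator shades each technique by detection count, bearing in mind, as noted above, that raw counts say little about true coverage.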

Properly implementing metrics around the value provided by a detection engineering team can be difficult, and some metrics are misleading. The most straightforward of these are counts, such as the number of detections created per MITRE technique or the number of hunts performed per month. Our goal is not to create metrics on how many things we did (detections created, hunts performed) but on how that work relates to stopping the adversary, such as a SOC tracking the percentage of devices patched per patch criticality, or a detection engineering team tracking how well it can detect the variations of a technique.
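For instance, a simple variation-coverage ratio could be computed per technique as detected variations divided by known variations. The numbers below are purely illustrative.

```python
# Illustrative only: known procedure variations for a technique (e.g., from
# red team exercises or threat intel) versus those our detections catch today.
known_variations = 24
detected_variations = 18

variation_coverage = detected_variations / known_variations
print(f"Variation coverage: {variation_coverage:.0%}")  # 75%
```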

In addition to directly relating to how well the organization can respond to future attacks, these metrics have the added quality of describing something the detection engineering team controls and can compute at any time, versus MTTD, which is defined by the variability of the attacks the SOC has detected and how often those attacks occur.

Some variation of the patch metric is used in many organizations today, as it is relatively easy to calculate. It is much more difficult to produce a robust calculation of detection coverage. In the next several blogs we will explore detection engineering metrics further, including how to accurately calculate detection and hunt coverage for high impact techniques.

As a summary, while MTTD is an important metric for any SOC to use, the following limitations should be considered if you are using it as your primary metric:

1. The metric provides a historical view of performance which may not be indicative of the SOC’s ability to respond to future attacks.

2. The metric only includes the attacks that have been detected; if the adversary is completely successful, the attacks are not included.

3. The parameters and number of data points which define MTTD are created by the adversaries that have historically attacked the network. They do not reflect the potential variability of attacks.

4. The metric does not reflect the work the organization is continuously performing to prevent future attacks.

Hope you enjoyed this topic and please reach out with any feedback or thoughts.

The second blog in this series covers building block metrics and can be viewed here.

