Are We Using MITRE ATT&CK Data Sources Wrong?
Summary: MITRE ATT&CK data sources do not offer information about the volume of the data, how many “sensors” need to be monitored, or indeed if the data source can be collected at all. This blog offers additional metadata for each data source to help you better prioritize your data collection.
Let’s say you’re setting up a SOC from scratch and someone asks you to prioritize the data sources you’d like to collect, preferably using as objective a method as possible. How would you do it?
Well, if you use MITRE ATT&CK’s data sources as a guide, you can choose to rank your data sources by the number of Techniques or Sub-Techniques they cover. If you do, you would end up with a list like this:
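That ranking can be reproduced from ATT&CK’s own STIX data. A minimal sketch in Python, assuming you have downloaded `enterprise-attack.json` from MITRE’s attack-stix-data repository, where each technique (`attack-pattern` object) lists its sources in the `x_mitre_data_sources` field as strings like `"Process: Process Creation"`:

```python
import json
from collections import Counter

def count_data_sources(stix_bundle):
    """Count how many (Sub-)Techniques reference each ATT&CK data source."""
    counts = Counter()
    for obj in stix_bundle.get("objects", []):
        # Only live techniques; skip revoked/deprecated entries
        if obj.get("type") != "attack-pattern" or obj.get("revoked"):
            continue
        for source in obj.get("x_mitre_data_sources", []):
            counts[source] += 1
    return counts

# Usage (hypothetical local path):
# with open("enterprise-attack.json") as f:
#     counts = count_data_sources(json.load(f))
# for source, n in counts.most_common(10):
#     print(f"{n:4d}  {source}")
```

A fuller version would count Techniques and Sub-Techniques separately (Sub-Techniques have `x_mitre_is_subtechnique` set to true), but the idea is the same.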
Based on this, “Process Creation” and “Command Execution” would be a good place to start, since they cover the most Techniques and Sub-Techniques respectively. My question is: if you really had absolutely nothing else collected, would you start with “Command Execution”?
Let me explain. To have full coverage for “Command Execution” across your network, you would need to collect data from each and every host, and the data volume can be quite high, even for a few hosts. But the thing that troubles me most about “Command Execution” is that there are many different ways to do the exact same thing. The fact that you can execute code within the data source itself (that is, attackers can embed obfuscated scripts in their commands that hide what the command is meant to do) means that the data source is effectively polymorphic.
So, does this mean that “Command Execution” is a bad data source? Not really. But then again, in a hypothetical situation where you don’t have any other data sources collected yet, would you recommend “Command Execution” as the first one to collect? I’ve seen a few organizations buckle under the pressure of capturing high-volume data sources while low-volume but high-value ones went uncollected.
So in order to create a better prioritization, we need more data to base it on. I’m proposing four new fields: Collectable, Number of Sensors, Volume, and Polymorphic. Below is a sample of what I’m proposing. You can find the full list for all data sources in CSV format at my GitHub repo here:
TareqAlKhatib/ATTACK_data_sources (github.com)
- Collectable: Whether a data source can be collected into a SIEM solution. In the example above, “Firmware Modifications” is an example of a source that does not fit into a typical SIEM. Similarly, data from “Malware Repositories” (think VirusTotal) can be used to enrich other data sources, but is rarely a data source unto itself in a SIEM.
- Number of Sensors: The number of devices that need to be monitored to cover this data source completely. For example, to monitor “[Cloud] Image Creation”, you would only need to monitor cloud API usage. Similarly, to monitor “[Domain] User Account Creation”, you would only need to monitor the domain controllers. On the other hand, monitoring “WMI Execution” would require monitoring every [Windows] machine on the network.
- Volume: The real answer to “how much data can we expect?” is always “it depends”. That said, even a coarse “High / Low” split is a good start for this use case. (P.S. I had to make “Network Traffic Content” the single “Very High”-volume data source simply to give it a more realistic placement on the list.)
- Polymorphic: Whether the data source can carry scripts / commands / programs / etc. that can be used for obfuscation.
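To make the proposal concrete, here is one way a row of the extra metadata could be modeled in code. A minimal sketch in Python; the field names follow the four proposed columns, and every value in the example is illustrative rather than taken from the actual CSV:

```python
from dataclasses import dataclass

@dataclass
class DataSourceMeta:
    """One row of the proposed metadata (field names and values are
    illustrative; adapt them to the actual CSV headers in the repo)."""
    name: str
    techniques: int        # number of Techniques covered
    subtechniques: int     # number of Sub-Techniques covered
    collectable: bool      # can it land in a typical SIEM at all?
    sensors: str           # e.g. "Cloud API", "Domain Controllers", "All Hosts"
    volume: str            # "Low" / "High" / "Very High"
    polymorphic: bool      # can it carry obfuscated scripts/commands?

# Hypothetical entry -- the coverage counts are placeholders, not real figures
command_execution = DataSourceMeta(
    name="Command Execution",
    techniques=100, subtechniques=155,
    collectable=True, sensors="All Hosts",
    volume="High", polymorphic=True,
)
```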
With these, I hope you can create a better-prioritized list of the data sources you want to collect based on your own requirements. That said, while my goal is not to tell you what you should do, I should at least show you how I would prioritize these data sources:
- Prioritize sources that can be collected, for obvious reasons.
- Prioritize sources with lower volume.
- Prioritize sources that do not require collection from all hosts.
- ~~Prioritize sources that are not polymorphic.~~ This rule would have put “Command Execution” practically at the end of the list, so I skipped it.
- Prioritize sources by the number of Sub-Techniques covered.
- Prioritize sources by the number of Techniques covered.
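Applied in order, these rules amount to a lexicographic sort. A sketch in Python, assuming each data source is a dict carrying the proposed metadata; the field names here are my own shorthand, not the CSV’s actual headers:

```python
def priority_key(ds):
    """Sort key implementing the rules above (smaller sorts first)."""
    volume_rank = {"Low": 0, "High": 1, "Very High": 2}
    return (
        not ds["collectable"],        # rule 1: collectable sources first
        volume_rank[ds["volume"]],    # rule 2: lower volume first
        ds["all_hosts"],              # rule 3: single-sensor sources first
        -ds["subtechniques"],         # rule 4: more Sub-Techniques first
        -ds["techniques"],            # rule 5: more Techniques first
    )

# Usage: prioritized = sorted(data_sources, key=priority_key)
```

Because Python compares tuples element by element, each rule only breaks ties left over from the rules before it, which is exactly the behavior the ordered list above describes.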
With these rules, the top 5 data sources would be (full list in the github repo):
- Active Directory Object Modification
- User Account Metadata
- Container Creation
- User Account Modification
- User Account Creation
With this prioritization, “Command Execution” sits at number 78! That might be closer to where I want it to be, but even I can admit there are a few placements I’m not sure about. For example:
- Both “Network Traffic Flow” and “Network Connection Creation” sit above “Command Execution”. And while I can argue why you should do one of these before “Command Execution”, I can’t really argue for having both. The two are fundamentally the same thing, collected from a network or a host point of view respectively. Collecting both would be overkill even without other data sources on the waiting list.
- Both “Cloud Storage Access” and “Cloud Storage Modification” cover only 2 techniques each. I was tempted to push them lower either manually or through some extra special case rule, but I thought it would be better to keep the rule set simple, even if it meant a few warts here and there.
- While the initial assumption was that we need full coverage of a data source before moving on to the next one, that does not necessarily have to be the case. For example, you may choose to collect “Command Execution” from your server segments but not your workstation ones. That said, within your server networks, would you still prioritize the low-volume data sources before you get to “Command Execution”?
- The proposed system does not cover whether a data source can reliably detect a technique. For example, the now-notorious “Command Execution” can detect “Service Creation”, but not as reliably as the “Service Creation” data source (AKA Windows EID 4697). If “Command Execution” really detects 255 techniques and sub-techniques, but not reliably, would we prioritize it as highly as we do?
- Finally, the proposed system does not take into account how important a technique is, especially for each different organization. Data collection might differ based on each organization’s unique challenges.
So the proposed system might be a step in the right direction, but it is perhaps still short of a truly objective prioritization of the Data Sources. Some “art” is still required in Data Source prioritization.
P.S. If you’re interested in Threat Hunting or Detection Engineering, you may be interested in checking out our newsletter at the link here: https://threathuntersdigest.substack.com