How To Objectively Measure A Detection Rule’s Strength

Tareq Alkhatib
12 min read · Aug 3, 2022


Summary: Rule strength is a function of the level of control the attacker has over the rule fields, blacklisting vs. whitelisting, data source coverage, host coverage, and data volume.

There are different ways to test your detection rules. The most common are automated tests like Atomic Red Team or human testers in the form of Red or Purple Teams. In either case, the tester tries to figure out how to execute the attack while circumventing your detection. In that sense, if a new attack method goes undetected, the rule is considered incomplete until the failed test is addressed.

That said, it is not always possible to create a full range of tests for rules, especially when they are in their early stages. Tests are not cheap either, so trying to test every limit of every rule can quickly get prohibitively expensive. As such, we need a systematic method of measuring the strengths and weaknesses of rules, at least until it becomes feasible to create tests for them.

Over the years, I’ve developed a few general rules of thumb for writing detection rules. This is my first attempt at codifying them into something more concrete.


Host Coverage

I’m going to discuss this first because it usually gets missed. A detection rule is useless if we are not collecting the logs it needs, so measuring the completeness of data collection should be part of measuring the rule’s coverage.

Data sources, however, differ in the number of hosts that need to be monitored. For example, some logs are only relevant to domain controllers, email servers, or networking devices, while others need to be collected from every endpoint. It is worth defining the total number of nodes that should be sending the relevant logs and then monitoring whether those logs are actually received. For this particular metric, sources that require fewer monitored endpoints should be considered better than those requiring more.

For example, let’s say you want to detect T1136.002 (Create Account: Domain Account). ATT&CK’s page for the technique lists three data sources for detection:

  1. Command
  2. Process
  3. User Account

That is, you can detect the creation of new domain accounts either by monitoring the use of the net command or by using Event ID 4720 (A user account was created). I’ve made the argument before that Command and Process are the same thing as defined by ATT&CK, but for the sake of argument, let’s just agree that both require monitoring far more endpoints than User Account, which only needs to be monitored on domain controllers for domain account creation (local account creation would require monitoring all hosts, similar to Command and Process).
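To make the comparison concrete, here is a minimal Sigma-style sketch of the User Account approach. It is illustrative only: the field names assume the standard Windows Security log, and a production rule would still need tuning (for example, filtering expected provisioning activity).

```yaml
title: New Domain Account Created (illustrative sketch)
status: experimental
description: Illustrative sketch only - alerts on every user account creation event
logsource:
    product: windows
    service: security
detection:
    selection:
        EventID: 4720    # "A user account was created"
    condition: selection
falsepositives:
    - Legitimate account creation by IT staff
level: medium
```

Because Event ID 4720 for domain accounts is generated on the domain controllers, this rule only depends on logs from a handful of hosts, which is exactly what this metric rewards.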

You can find a complete list of the number of hosts that need to be monitored per ATT&CK data source on my GitHub repo here.

Data Volume

The volume of data is usually, but not always, related to the number of endpoints the source is monitoring. For example, monitoring Service Creation requires monitoring all endpoints in a network, but it usually generates less volume than monitoring Network Traffic and definitely less than Network Traffic Content.

While the volume of most data sources depends on the network, a simple “high/low” estimate is still a good start. You can find such a list on my GitHub repo here.

Data Source Coverage

Next, we need to investigate whether a data source will always generate the relevant logs when the attack is executed. Going back to our new domain account example, one can assume that the creation of a new domain account will always generate an Event ID 4720. There are, however, many ways to create a new domain account without using the net command (PowerShell, for example), so we cannot assume that the Process/Command data sources are sufficient to detect this technique. (Also, for those pedantic enough, ATT&CK does not list PowerShell Event ID 4104 under the Command data source but rather under the Script one.)
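For contrast, a hedged sketch of the Command/Process approach is below. It has to be collected from every endpoint, and, as noted above, it misses account creation performed through PowerShell cmdlets or direct API calls; the command-line strings are just common examples.

```yaml
title: Domain Account Created via Net Command (illustrative sketch)
status: experimental
logsource:
    category: process_creation
    product: windows
detection:
    selection_image:
        Image|endswith:
            - '\net.exe'
            - '\net1.exe'
    selection_cli:
        CommandLine|contains|all:
            - 'user'
            - '/add'
            - '/domain'
    condition: selection_image and selection_cli
level: medium
```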

It is worth dwelling on this point a bit. I’ve seen a lot of emphasis on using command-line arguments to detect a wide range of attacks. This can be part of better coverage, but we should always look for a “canonical” data source that is always generated when the attack happens, which usually means relying on logs generated closer to the application than to the shell. This is why I previously argued that LOLBINs should generate their own logs.

This is also one of the many reasons I said that we cannot detect techniques in the Execution tactic. These techniques (PowerShell, WMI, Native API, etc.) mostly specify the data source rather than any specific malicious activity. That is, they focus on the method and not the goal. There is no way to look for a better data source because the data source is the whole point of the technique.

With this, we have three fields in our coverage metric: host coverage, data volume, and data source coverage.

Rule Fields

Let us now discuss the fields used in the rule itself. Specifically, we want to focus on the level of control the attacker has over a field. As such, we define these categories:

  1. Privileged: The attacker needs special privileges to change the value in these fields. In the 4720 example, the Subject Account (the account the attacker is using when creating the new account) would be an example of such a field. If only a certain set of administrators is allowed to create new accounts, attackers would need to gain access to one of those accounts before they can create new accounts on the domain. Privileged fields are particularly good when whitelisted because of the added cost to attackers.
  2. Limited: The attacker does not need special privileges to change the value in these fields, but the list of available options is limited. For example, there might be quite a few unprivileged processes that an attacker can leverage in their attack, but the list is limited by which processes are already running on the machine.
  3. Random: The attacker can set the value in these fields to anything, without special privileges. Using the 4720 example, the name of the newly created account would fit into this category. Some threat actors may create accounts with names matching a specific pattern that can be used for detection, but once the threat actor knows the pattern is exposed, it costs them nothing to change it to something else.
  4. Polymorphic: These are scriptable or programmable fields. Examples would be PowerShell, Bash, or JavaScript scripts. The fact that the field is a program gives the attacker far more room to obfuscate their actions. Polymorphic fields are particularly bad when blacklisted since there are usually many ways to accomplish the same thing; trying to blacklist these methods one at a time amounts to playing whack-a-mole on your way to good coverage.

The term “Polymorphic” is inspired by the issues faced by anti-malware vendors. A malware binary can be changed as many times as necessary to avoid detection, and the fact that malware authors also have access to the anti-malware software means they can keep fuzzing their malware until it breaks the detection of the target vendors. This is why anti-malware as a technology is on the decline despite the threat never really going away: trying to blacklist a binary that can be manipulated infinitely is ultimately a losing battle.

Putting these definitions to use, we can assume that a rule restricting who can create new accounts (a privileged field) will be stronger than one restricting when the new account is created (a limited field).
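As a rough sketch of what the stronger option could look like, the rule below alerts on Event ID 4720 unless the creating account is on an allowed list. The account names are placeholders; a real rule would be built from your environment’s actual provisioning accounts.

```yaml
title: Domain Account Created by Unauthorized Account (illustrative sketch)
status: experimental
logsource:
    product: windows
    service: security
detection:
    selection:
        EventID: 4720
    filter_allowed_creators:
        SubjectUserName:    # placeholder list of admins allowed to create accounts
            - 'it-admin-01'
            - 'it-admin-02'
    condition: selection and not filter_allowed_creators
level: high
```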

Blacklisting vs. Whitelisting

Blacklisting vs. whitelisting is probably the first question that comes up when writing new rules. Blacklisting is great for quick and dirty rules, while whitelisting rules may take longer to set up, but once tuned, you can expect them to have a longer shelf life than their blacklisting counterparts.

And just to make sure we’re on the same page: blacklisting essentially defines a list of values that cause an alert to trigger. For example, I can create a rule that triggers an alert if the image name from process creation matches a list of common malware or potentially unwanted tool names (think mimikatz.exe or psexec.exe). Whitelisting is the opposite: it defines an allowed list of values, and anything else causes an alert to trigger. For example, I can create a rule that triggers an alert if anyone other than an allowed set of administrators logs into the domain controller.
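A minimal sketch of the blacklisting example might look like the rule below (the image names are just the examples from above). The whitelisting example would instead invert the logic with a “condition: selection and not filter” pattern, as in the earlier account creation sketch.

```yaml
title: Known Offensive Tool Image Name (illustrative sketch)
status: experimental
logsource:
    category: process_creation
    product: windows
detection:
    selection:
        Image|endswith:
            - '\mimikatz.exe'
            - '\psexec.exe'
    condition: selection
level: high
```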

Without getting into set theory, and assuming an “AND” operator between conditions, we can agree that adding more conditions to a whitelist makes it stronger. For example, “only services signed by these vendors are allowed on these servers AND only installed by these administrators AND only installed on these days” is stronger than any of these conditions alone. On the other hand, adding conditions to a blacklisting rule makes it weaker. For example, “an executable with this name AND this hash value” is weaker than either condition alone (assuming the conditions are strong enough not to cause false positives on their own).

Yara rules are the most common abusers of too many blacklisting conditions, possibly due to the “x of them” syntax. Consider the Yara rule from The DFIR Report that detects a fake version of msiexec.exe.

The rule requires quite a few matches to trigger. That said, most of the strength of the rule comes from the two $x strings targeting the malware’s PDB string. The real msiexec.exe has a PDB of just msiexec.exe, so both PDB strings in the rule should be strong enough on their own. We can safely assume that the real msiexec.exe is not built on a computer where the developer is logged in as the “Administrator” user, as shown in $x1. $x2 is even more damning, with M:\work\shelll (with three Ls) being unique enough for detection on its own. Still, I understand the rule author’s intention of erring on the side of caution when it comes to false positives.

With all this in mind, we can summarize a rule along five dimensions: host coverage, data volume, data source coverage, blacklisting vs. whitelisting, and the level of control the attacker has over each rule field.

Putting it all together

Example 1:

Let’s try a few examples. We can take a look at one of my favourite rules first: “Shells Spawned by Web Servers” by Thomas Patzke.
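The core logic is roughly the following simplified sketch. This is not the published rule itself, which is more thorough; it is just meant to show the shape of the logic.

```yaml
title: Shells Spawned by Web Servers (simplified sketch)
status: experimental
logsource:
    category: process_creation
    product: windows
detection:
    selection:
        ParentImage|endswith:    # common web server processes
            - '\w3wp.exe'
            - '\httpd.exe'
            - '\nginx.exe'
            - '\php-cgi.exe'
        Image|endswith:          # "shells" spawned as child processes
            - '\cmd.exe'
            - '\powershell.exe'
            - '\sh.exe'
            - '\bash.exe'
            - '\bitsadmin.exe'
    condition: selection
level: high
```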

Data Source Analysis:

  • Data Source Coverage: Although most webshells spawn processes from the web server process (w3wp.exe, for example) to a child shell process (cmd.exe, for example), it is possible that a more sophisticated attacker could bypass process creation altogether by injecting into a different process. This is obviously not trivial, but since the possibility exists, we cannot say that all webshells will result in a process creation event. As such, the coverage for the data source itself is listed as “Partial”.
  • Data Source Volume: Like most things, volume differs between environments and configurations, but monitoring process creation is usually on the noisier side.
  • Host Coverage: Process creation information can be collected from most hosts in the network. That said, since the rule is focused on webshells, one only needs to monitor process creation on web servers for this particular rule.

Detection Fields Analysis:

  • Blacklisting vs. Whitelisting: The rule blacklists five “shells” from being spawned by the web server process (bitsadmin.exe being the odd one out for not being a shell, though it can still download and execute scripts and binaries). This means that the rule is “blacklisting” one field. We can debate whether the parent image is another blacklisted field or whether it is part of the rule’s “setup”, since the rule focuses on these web server processes.
  • Fields (Image): While the attacker can run any number of executables on the target machine, the list of executables they can use is “Limited”. The attacker does not typically need special privileges to run a shell, so the field is not considered “Privileged”. And since the script or command being executed is not part of the rule, the field is not “Polymorphic” either.

Since the rule is a blacklisting rule, the easiest way to break it is to spawn a process that is not on the blacklist. For example, an attacker can use rundll32.exe, which is not on the list, to do their execution.

Had the rule instead whitelisted the processes that the web server is allowed to spawn, the attacker would have to either figure out a way to execute their commands through those allowed processes or bypass the rule at the data source level by using other methods, like process injection, to execute their commands.
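A hedged sketch of that whitelisting variant is below. The allowed child processes are placeholders; a real allow-list would have to be built and tuned per web application.

```yaml
title: Web Server Spawned an Unexpected Child Process (illustrative sketch)
status: experimental
logsource:
    category: process_creation
    product: windows
detection:
    selection:
        ParentImage|endswith: '\w3wp.exe'
    filter_expected_children:
        Image|endswith:    # placeholder list of children this application legitimately spawns
            - '\csc.exe'
            - '\conhost.exe'
    condition: selection and not filter_expected_children
level: medium
```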

Example 2:

Let’s try something different this time: a rule based on named pipes. Specifically, “CobaltStrike Named Pipe Pattern Regex” by Florian Roth.
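The published rule matches a set of regular expressions against the names of newly created pipes. A heavily simplified sketch of the idea, using a few well-known Cobalt Strike default pipe name patterns (the published rule uses its own, more specific regexes), might look like this:

```yaml
title: CobaltStrike Default Named Pipe Patterns (simplified sketch)
status: experimental
logsource:
    category: pipe_created
    product: windows
detection:
    selection:
        PipeName|re:
            - '\\msagent_[0-9a-f]{2}'      # default SMB beacon pipe pattern
            - '\\MSSE-[0-9]{1,4}-server'   # default Artifact Kit pipe pattern
            - '\\postex_[0-9a-f]{4}'       # default post-exploitation job pipe pattern
    condition: selection
level: high
```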

Data Source Analysis:

  • Data Source Coverage: Since the rule explicitly focuses on named pipes, we can assume that the data source has complete coverage for the purposes of this rule. If, however, the rule claimed to detect any Cobalt Strike inter-process communication, we could make the argument that there are different methods of achieving that without using named pipes.
  • Data Source Volume: The volume for named pipe usage is relatively low.
  • Host Coverage: All hosts need to be monitored since Beacon can be installed on any host.

Detection Fields Analysis:

  • Blacklisting vs. Whitelisting: The rule uses different regexes to blacklist the pipe name field. That is, a single field is blacklisted.
  • Fields (Pipe Name): The attackers can set the pipe name to whatever they want without restriction. That said, they cannot actively obfuscate the field using scripts or similar. As such, we list the field as “Random”.

Cobalt Strike is notoriously “malleable”, which means attackers can easily set their pipe names in a way that avoids these detections. This does not mean the rule is not useful, just that an attacker can easily avoid it once they know this is an avenue for detection.

Open Questions

I don’t claim the proposal above is complete by any means. Below are some of the questions that I am still struggling with:

  1. What about threshold-based rules? That is, rules that trigger if a certain event occurs more than X times in Y minutes. These don’t fit neatly into our blacklisting/whitelisting dichotomy. More importantly, how would one objectively determine whether the threshold is too high or too low? And at what point should one create different thresholds for different departments, technologies, or servers?
  2. What about ML-based rules? How would one determine the strength of a rule if there is no clear logic defined for it? Most organizations do not have enough true positives to properly train ML models on their own data, which can create confirmation bias or blind spots.
  3. Should we generate a score for rules based on the above? Maybe. My goal was to determine the blind spots of a rule; comparing the strength of rules is more complex. See the next question.
  4. What about strategic vs. tactical rules? For example, if a piece of malware always uses the same PDB string, would it be okay to create a Yara rule to detect said string, knowing full well that it is easy for the attacker to break the detection? Renaming files is about as simple as evasion gets, but wouldn’t you want to know if someone ran an executable called mimikatz.exe in your network? These rules might not be the most versatile, but they are far from useless.
  5. What about the strength of multiple rules covering the same technique? I assume you can call rules complementary if they use different data sources or rely on different fields. Rules that use the same sources and the same fields should be considered duplicates and can be merged into one.
  6. How does this fit into coverage graphs? I wish I knew. I started this whole thing because I wanted a method to highlight which techniques require more attention and which ones are okay. To be fair, if you draw the metric box for each technique, you may get something resembling that. But, as mentioned above, the whole system breaks down once you introduce more than one rule per technique.

P.S. If you’re interested in Threat Hunting or Detection Engineering, you may want to check out our newsletter here: https://threathuntersdigest.substack.com

