Threat Hunting with Data Science: Registry Run Keys

Published in Blu Raven, Mar 25, 2021 (4 min read)

In data science, most of the time is spent cleaning and normalizing data. Just as preparation and reconnaissance are the most important steps in red teaming and penetration testing, data preparation is the most important step in data science. In this post, I will explain how to apply some data science basics to threat hunting and detect suspicious Registry Run keys at scale with KQL in Azure Sentinel/MDATP/MDE/M365D. The same method also works in Splunk and other tools that can rewrite values based on regex matches (tip: the rex command in Splunk does the same job).

Hypothesis

If there is a malicious item in the Registry persistence locations, it should be somewhat unique in the environment. (A malicious item can masquerade as a well-known item and invalidate this hypothesis; I’ll provide a solution for that at the end of the post.)

The problem

When you apply statistical analysis to RegistryValueData as-is, you will probably get a very large number of distinct values, depending on the size of your environment.

If you look at the results themselves rather than just the row count, you will see items like these:

"C:\Users\testuser1\AppData\Local\Temp\test.exe"
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise" --runOnce --installSessionId 25a56748-489d-4257-814e-fa884df0599
"%ProgramData%\Microsoft\Windows Defender\platform\4.18.2101.9-100\DefenderCSP.dll"
"C:\ProgramData\Package Cache\{29f85b7a-f685-45c3-a213-e306549e95a4}\Setup.exe"

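The variable parts of these values (shown in bold in the original post), such as the user name, the install session ID, the version number, and the package GUID, make the items unique and the result set difficult to analyze.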

Solution

If we can normalize these values, frequency analysis will probably give better results. To do that, we can use the replace function (replace_regex() in current KQL, rex in Splunk) to create a new field that holds the normalized value. Before doing that, we need to analyze the result set and find out what kinds of values cause the uniqueness. Then we need to write the regular expressions that will perform the normalization. Normalizing the values in the same data set gives the result below:

Result from 30d of data after normalization

80% reduction in the result set!
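The normalization step can be sketched outside of KQL as well. The Python snippet below is only an illustration of the idea, not the query from this post: the three regular expressions, the placeholder tokens, and the sample values are assumptions made for the example, and the real patterns have to be derived from your own data.

```python
import re
from collections import Counter

# Illustrative rules only (assumptions, not the patterns used in the post).
# Each rule replaces a variable part of the value with a fixed placeholder.
NORMALIZATION_RULES = [
    # Per-user profile folder: \Users\<name> -> \Users\<user>
    (re.compile(r"(\\Users\\)[^\\]+", re.IGNORECASE), r"\1<user>"),
    # GUIDs, with or without surrounding braces
    (re.compile(
        r"\{?[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\}?",
        re.IGNORECASE), "<guid>"),
    # Dotted version numbers such as 4.18.2101.9-100
    (re.compile(r"\d+(?:\.\d+){2,}(?:-\d+)?"), "<version>"),
]

def normalize(value: str) -> str:
    """Apply every rule in turn and return the normalized value."""
    for pattern, replacement in NORMALIZATION_RULES:
        value = pattern.sub(replacement, value)
    return value

# Hypothetical sample values modeled on the kinds of items shown earlier.
SAMPLE_VALUES = [
    r'"C:\Users\alice\AppData\Local\Temp\test.exe"',
    r'"C:\Users\bob\AppData\Local\Temp\test.exe"',
    r'"%ProgramData%\Microsoft\Windows Defender\platform\4.18.2101.9-100\DefenderCSP.dll"',
    r'"%ProgramData%\Microsoft\Windows Defender\platform\4.18.2102.4-0\DefenderCSP.dll"',
    r'"C:\ProgramData\Package Cache\{29f85b7a-f685-45c3-a213-e306549e95a4}\Setup.exe"',
    r'"C:\ProgramData\Package Cache\{0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0}\Setup.exe"',
]

raw_counts = Counter(SAMPLE_VALUES)
normalized_counts = Counter(normalize(v) for v in SAMPLE_VALUES)

print("distinct raw values:       ", len(raw_counts))
print("distinct normalized values:", len(normalized_counts))
```

Every raw value in the sample is distinct, while the normalized values collapse into a few groups that can be counted. The same rules translate to one regex replacement per pattern in KQL or Splunk.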

Finally, we can apply an anomaly condition to the frequency counts, pull the full details of the rare items back from the dataset, and display only those seen in the last 1d:

Result without manual whitelisting (except one)

From 50,000 to 1,000!
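As a rough sketch of this last step, the Python function below stacks normalized values over a lookback window and keeps only the rare ones that were seen recently. The post does not spell out the anomaly condition, so the one used here (a value seen on at most `max_devices` distinct devices) is an assumption, as are the field names and the sample events.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def rare_recent_events(events, now, max_devices=1,
                       lookback=timedelta(days=30),
                       recent=timedelta(days=1)):
    """Return events from the recent window whose normalized value is rare.

    events: iterable of dicts with the hypothetical keys
            'timestamp' (datetime), 'device', and 'normalized_value'.
    A value counts as rare when it appears on at most max_devices distinct
    devices within the lookback window (an assumed anomaly condition).
    """
    window = [e for e in events if now - e["timestamp"] <= lookback]

    # Frequency analysis: which devices has each normalized value been seen on?
    devices_per_value = defaultdict(set)
    for e in window:
        devices_per_value[e["normalized_value"]].add(e["device"])

    # Keep the full event details, but only for rare values seen recently.
    return [
        e for e in window
        if len(devices_per_value[e["normalized_value"]]) <= max_devices
        and now - e["timestamp"] <= recent
    ]

# Hypothetical events for illustration.
NOW = datetime(2021, 3, 25, 12, 0)
EVENTS = [
    {"timestamp": NOW - timedelta(days=10), "device": f"pc{i}",
     "normalized_value": "updater.exe"} for i in range(5)
] + [
    {"timestamp": NOW - timedelta(hours=2), "device": "pc0",
     "normalized_value": "updater.exe"},
    {"timestamp": NOW - timedelta(hours=3), "device": "pc9",
     "normalized_value": r"C:\Users\<user>\AppData\Local\Temp\test.exe"},
    {"timestamp": NOW - timedelta(days=20), "device": "pc8",
     "normalized_value": "old-rare-item.exe"},
]

for event in rare_recent_events(EVENTS, NOW):
    print(event["device"], event["normalized_value"])
```

Counting over the long window and displaying only the short one keeps the baseline stable while limiting the output to what is new enough to act on.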

If we analyze the result set further, we will probably find specific folders or conditions that still cause uniqueness. We can filter or normalize them to reduce the result set further. We can also analyze the processes, find the ones that legitimately modify the Registry, and filter them out.

Software installations and deployments in the environment often modify the Registry Run items. We can generate a list of well-known/trusted processes and exclude them. Microsoft Defender has a built-in function we can use for this purpose: FileProfile(), called with the invoke operator, retrieves prevalence information for a SHA1 value. Using it to filter out processes that are likely well known (except some important ones, because they can be abused for this very purpose) reduces the result set even further.

Just 4 results to investigate!
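The prevalence filter can be illustrated with a small Python sketch. Here a plain dictionary stands in for the prevalence data that FileProfile() would return; the threshold, the field names, and the list of processes that are always kept for review are all assumptions chosen for the example, not values from the post.

```python
# Processes that stay in the results even when highly prevalent, because they
# can be used to write Run keys maliciously (hypothetical example list).
ALWAYS_REVIEW = {"powershell.exe", "cmd.exe", "reg.exe", "wscript.exe"}

def needs_review(event, prevalence_by_sha1, min_prevalence=1000):
    """Decide whether an event survives the trusted-process filter.

    event: dict with hypothetical keys 'initiating_process' and 'initiating_sha1'.
    prevalence_by_sha1: stand-in for prevalence enrichment, mapping SHA1 -> count.
    Unknown hashes are treated as prevalence 0 and therefore kept.
    """
    if event["initiating_process"].lower() in ALWAYS_REVIEW:
        return True
    prevalence = prevalence_by_sha1.get(event["initiating_sha1"], 0)
    return prevalence < min_prevalence

# Hypothetical enrichment data and events (hashes shortened for readability).
PREVALENCE = {"aaa111": 250000, "bbb222": 3, "ccc333": 900000}
CANDIDATES = [
    {"initiating_process": "msiexec.exe", "initiating_sha1": "aaa111"},
    {"initiating_process": "dropper.exe", "initiating_sha1": "bbb222"},
    {"initiating_process": "PowerShell.exe", "initiating_sha1": "ccc333"},
    {"initiating_process": "unknown.exe", "initiating_sha1": "ddd444"},
]

to_investigate = [e for e in CANDIDATES if needs_review(e, PREVALENCE)]
for event in to_investigate:
    print(event["initiating_process"])
```

Treating unknown hashes as rare is a deliberately cautious choice: a file with no prevalence data is not evidence of a trusted process.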

If the registry value is masquerading as a well-known item, we can apply the same method to the RegistryValueData-InitiatingProcess pair. For example, if the Registry item Firefox.lnk was not added by the Firefox installer or updater itself, it is highly suspicious.
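A minimal Python sketch of stacking on the pair rather than on the value alone is shown below. The field names, the threshold, and the sample events are assumptions for illustration only.

```python
from collections import Counter

def rare_pairs(events, max_count=1):
    """Return (value, initiating process) pairs seen at most max_count times.

    events: iterable of dicts with the hypothetical keys
            'normalized_value' and 'initiating_process'.
    """
    counts = Counter(
        (e["normalized_value"], e["initiating_process"].lower())
        for e in events
    )
    return [pair for pair, n in counts.items() if n <= max_count]

# Hypothetical data: the same value is common overall, but one writer is unusual.
PAIR_EVENTS = (
    [{"normalized_value": "Firefox.lnk", "initiating_process": "firefox.exe"}] * 50
    + [{"normalized_value": "Firefox.lnk", "initiating_process": "powershell.exe"}]
)

for value, process in rare_pairs(PAIR_EVENTS):
    print(f"{value} written by {process}")
```

Stacking on the value alone would hide the odd entry among the 50 legitimate ones; adding the initiating process to the key makes it stand out again.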

Conclusion

As you can see, applying just a bit of data science can do wonders in threat hunting. Just by normalizing the data and performing frequency analysis (data stacking), it is possible to detect malicious activity that uses a technique that is otherwise difficult to detect.

You can find the query in my GitHub repo. If you want to learn more about data stacking, FireEye has a decent post here (or here).

Thanks for reading this article! If you have any questions, leave a comment below. Want to master KQL for Threat Hunting, Detection Engineering, and DFIR in a hyper-realistic environment? Visit my academy for a free course!

Mehmet is the founder of Blu Raven Academy. He brings over 15 years of experience in cybersecurity, with a unique blend of expertise in KQL, threat hunting, detection engineering, and data science to his courses to help others advance their skills. Recognized four times as a Microsoft Security MVP, he is renowned for adapting the RITA beacon analyzer to KQL, developing novel methods for detecting threats, and for his insightful presentations at key conferences like the SANS DFIR Summit.

Written by Mehmet Ergene