Cyber and AI: Separating Fact from Fiction at Peak Hype

Jonathan Ticknor · security analytics · Feb 1, 2019

By now, we have all seen an article claiming artificial intelligence (AI) is the solution to all of our detection problems or an unstoppable automated hacking force. AI has become the buzzword that every security company must include in its marketing material to keep up. So how do you separate strong technology from marketing hype? In this article I’m going to lay out common applications of machine learning (ML) in security and the questions you should ask every vendor to cut through the marketing hype and understand whether their technology will be useful to you. I’ll start with a definition of AI to ground the rest of the post, followed by a brief analysis of techniques and what you can do to better vet vendor engagements. I could fill a book with an in-depth analysis, so I will focus on the key points.

Defining AI

The most common request I receive from customers is a definition of AI. After analyzing hundreds of definitions from researchers, practitioners, and our own internal definitions, my team came up with a simple definition we feel helps customers:

Artificial intelligence is the design of agents that perceive their environment and act to meet an objective, often without being explicitly programmed to do so.

Although this definition is admittedly broad in technological scope, it’s the second part that is critical to understand in a security context. Early AI implementations relied on handcrafted knowledge that was programmed into the agent. This approach is brittle in two ways: 1) there is no learning or handling of uncertainty, and 2) it does not scale, because the agent must react to an ever-changing environment to be useful, which requires recurring human updates. This is a great proxy for the emergence of ML capabilities in security, as practitioners realized that building ever-growing rule and signature lists was an unsustainable approach to defending against an intelligent adversary. We can certainly squabble over semantics, but let’s go with this definition for the purpose of this article.

As our definition states explicitly, learning will be a key ingredient of the technology. There are three forms of learning that have become essential components for security ML vendors: unsupervised learning, supervised learning, and reinforcement learning. I will save a deep dive on each technique for another post, but a quick outline should inform the reader enough to dig deeper (a minimal code sketch of the first two follows the list):

Unsupervised learning: an approach used to model the underlying structure or distribution of the data without the benefit of labels (e.g. clustering users into similar groups)

Supervised learning: using labeled data to build a model which can classify or make predictions based on characteristics of new, unseen data points (e.g. classify malicious URLs based on a labeled data set of known malicious and benign URLs)

Reinforcement learning: goal-oriented algorithms which learn how to achieve a complex objective or maximize some notion of a cumulative reward (e.g. attempt to maximize the cumulative score in a game)
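To make the first two concrete, here is a minimal sketch using scikit-learn on synthetic data. The feature meanings in the comments (logins, bytes transferred, URL length) are purely hypothetical stand-ins, not a recipe from any vendor:

```python
# Minimal sketch: unsupervised clustering of "users" and supervised
# classification of "URLs" on synthetic data. Feature meanings are
# hypothetical stand-ins.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=42)

# Unsupervised: group users by behavior with no labels at all.
user_features = rng.normal(size=(200, 2))   # e.g. [daily logins, MB transferred]
groups = KMeans(n_clusters=3, n_init=10).fit_predict(user_features)
print("users per cluster:", np.bincount(groups))

# Supervised: learn from labeled examples, then predict on unseen points.
url_features = rng.normal(size=(200, 2))    # e.g. [URL length, digit ratio]
labels = (url_features.sum(axis=1) > 0).astype(int)   # toy labels
clf = LogisticRegression().fit(url_features, labels)
print("predictions:", clf.predict(rng.normal(size=(5, 2))))
```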

Now that we have a definition and a broad notion of the techniques we can deploy, let’s take a deeper dive into the market.

Areas of Application

The growth of distributed computing platforms and easy-to-use deep learning packages seems to have generated a surge in ML-based security products. In general, there are three major use cases for analytics in cybersecurity: endpoint, network, and security operations center (SOC) automation. Each of these categories contains two or more subcategories, some examples of which appear in the vendor lists below. Let’s take a dive into each of these disciplines to better understand use cases and possible (and sometimes common) pitfalls.

Endpoint Detection

Endpoint detection saw some of the earliest growth in the use of ML in security, highlighted by unicorns like Cylance and CrowdStrike. However, the same techniques have been deployed by other upstarts, including Endgame, FireEye, Palo Alto, SentinelOne, and Sophos, as well as the more established anti-virus (AV) tools from Symantec, McAfee, Trend Micro, Kaspersky, Bitdefender, and Microsoft. The growth of “next-gen” AV is in part a function of the problem it is trying to solve. With the rise of open source intelligence repositories like VirusTotal and VirusShare, it has never been easier to get vast quantities of labeled data. This allows companies without a large user base (e.g. startups) to download examples of malicious documents from which a model can be built. As is the case for all supervised ML applications, a large quantity of labeled data that accurately represents the population is critical to the learning process of the algorithm.
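As a rough illustration of that workflow, here is a minimal sketch of training a classifier on a labeled corpus and evaluating on held-out samples. The features and labels are synthetic placeholders; real products extract far richer features from millions of actual samples:

```python
# Minimal sketch of the supervised endpoint workflow: train on a labeled
# corpus, evaluate on held-out samples. Features/labels are synthetic
# placeholders for hand-engineered file features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(5000, 4))     # e.g. [file size, entropy, import count, packed]
y = rng.integers(0, 2, size=5000)  # 1 = malicious, 0 = benign (toy labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```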

[Image: Example post by Nick Carr of FireEye]

You may have noticed I added the phrase “accurately represents the population” in describing the procurement of data. This is a critical point that in many ways limits the capabilities of these solutions. Although there is a wealth of data available, the challenge for endpoint solutions is building a robust model that performs well in the face of an intelligent and ever-changing adversary. This places pressure on the vendor to accurately represent these behavioral changes in their training set, and thus in the model. Examples of malicious files bypassing AV solutions are frequently posted on popular social media sites like Twitter by top notch security researchers (see example). And then there was the “Hello World” controversy, when a user submitted the simple “Hello World” program used to teach new coders and noticed many vendors flagging the file as malicious [1]. This caused some to suggest that these tools aren’t in fact understanding the underlying behavior of the file, but memorizing facts about the training set (e.g. small files are always bad). The vendor explanations are worth a read for anyone interested in ML-based endpoint detection tools.

If you follow many of these files over time you will often see an increase in vendors detecting the file as malicious. This may suggest that 1) the vendors are retraining the model with the new samples (always good) or 2) new signatures for the file have been added to the product (always good). This is not to suggest that these products are not useful, they are, but rather to highlight the challenge of making predictions in a feature space not well represented in the training data. There is a fantastic article [2] that highlights two common issues in malware prediction that lead to inflated accuracy reports; I believe these issues likely affect some commercial offerings and not just academic efforts.
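One of the issues raised in [2] is temporal bias: randomly splitting samples lets the model train on malware “from the future” of its test set, inflating reported accuracy. A minimal sketch of a time-aware split, with hypothetical sample names and dates, looks like this:

```python
# Sketch of a time-aware split: train only on samples first seen before a
# cutoff, test on strictly newer ones. Sample names/dates are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "sha256": [f"sample_{i}" for i in range(8)],
    "first_seen": pd.to_datetime([
        "2018-01-05", "2018-02-10", "2018-03-01", "2018-04-20",
        "2018-06-02", "2018-07-15", "2018-09-09", "2018-11-30",
    ]),
    "label": [0, 1, 0, 1, 1, 0, 1, 0],
})

cutoff = pd.Timestamp("2018-06-01")
train = df[df["first_seen"] < cutoff]    # the model only ever sees the past
test = df[df["first_seen"] >= cutoff]    # evaluation mimics future deployment
print(len(train), "train /", len(test), "test")
```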

[Image: Examples of Leading Endpoint Security Tools]

Finally, let me discuss the method most credited for creating these next-gen products: deep learning. Deep learning is a powerful form of ML that has been used in a range of disciplines, from beating Go champions to helping power self-driving cars. Like other ML models, deep learning requires a vector representation of the underlying data as input. In general, these features are hand engineered by subject matter experts to reflect important aspects of the underlying observed process. Another technique, embedding, which represents discrete variables as continuous vectors, has emerged as a means of allowing these models to “learn” relationships among the raw data variables. This allows, for instance, a user to feed the model the raw byte representation of an executable to predict whether it is a malicious file. The thesis is that this powerful technique will allow the model to generalize on behaviors without the bias created by subject matter experts (i.e. engineering features based on malware they have seen in their career). However, a number of experts in this field believe that the features learned via embedding tend to perform more poorly than hand engineered features. We don’t have time to go deep on this topic, but I would point you to the work of Scott Coull at FireEye, whose team has done amazing work on understanding what these deep learning models are actually learning and how that relates to well understood features of malicious files (see links below [3–4]). As powerful as these capabilities are, there are still plenty of avenues to explore to improve the technology.
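To ground the embedding idea, here is a minimal PyTorch sketch of a raw-byte classifier, loosely in the spirit of published raw-byte models like MalConv; the architecture and sizes are illustrative assumptions, not any vendor’s actual model:

```python
# Minimal sketch: embed raw bytes of an executable as learned vectors,
# convolve over them, and predict malicious/benign. Illustrative only.
import torch
import torch.nn as nn

class RawByteClassifier(nn.Module):
    def __init__(self, emb_dim=8, n_filters=64):
        super().__init__()
        self.embed = nn.Embedding(257, emb_dim, padding_idx=256)  # 256 byte values + pad
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=16, stride=4)
        self.fc = nn.Linear(n_filters, 1)

    def forward(self, byte_ids):                  # byte_ids: (batch, seq_len)
        x = self.embed(byte_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x))
        x = x.max(dim=2).values                   # global max pool over positions
        return torch.sigmoid(self.fc(x)).squeeze(1)

model = RawByteClassifier()
fake_bytes = torch.randint(0, 256, (2, 4096))     # two "files" of 4 KB each
print(model(fake_bytes))                          # P(malicious) per file
```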

Network Detection

Unlike endpoint detection, detecting adversarial behavior via network data requires a multi-faceted algorithmic approach. When I use the term network detection, I’m talking about firewall, proxy, DNS, etc. as data sources. The objective is to detect behaviors associated with an adversary’s presence on the network (e.g. phishing email, malware beacon, lateral movement). A wave of startups have entered this space since ~2013 promising smarter alerts and a better security information and event management (SIEM) system. Companies often brand themselves as SIEM replacements, UEBA, UBA, or MDR, and less often MSSP. Whereas endpoint detection has well defined problem statements and high value labeled data sets, network detection is more nebulous and has led to some of the more egregious misuses of the terms AI and ML.

[Image: Examples of Advanced Detection Vendors]

Because network detection encompasses so many unique behaviors, the range of applied techniques varies significantly by provider. However, there is one common thread among many in the network detection market: unsupervised learning. As I described earlier, unsupervised learning is an approach that seeks to model the underlying structure of the data in the absence of labels. The most commonly used (and abused) phrase to describe unsupervised learning in security is that it can “find the unknown unknowns” (any Rumsfeld fans?). What is generally meant by that statement is that the model can learn “normal” from the data itself, without being explicitly told what normal is via historical labeled information (as with supervised learning). Anything that deviates then appears as a statistical anomaly, representing a behavior previously unknown to the customer.
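To see both the appeal and the limits, here is a minimal sketch of that pitch using an isolation forest, one common unsupervised technique, over synthetic per-device features (the feature meanings are hypothetical). Note that the output is a score, not an explanation, which foreshadows the difficulties below:

```python
# Minimal sketch of the "unknown unknowns" pitch: fit an unsupervised model
# to per-device features and flag statistical outliers. Feature meanings
# are hypothetical; the output is just a score, with no explanation.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=7)
# e.g. per-device [distinct destination IPs, distinct ports, MB sent]
devices = rng.normal(loc=[50, 10, 100], scale=[10, 2, 20], size=(500, 3))
devices[0] = [500, 200, 5000]  # one device behaving very differently

model = IsolationForest(contamination=0.01, random_state=7).fit(devices)
scores = model.decision_function(devices)    # lower = more anomalous
print("most anomalous device index:", scores.argmin())  # anomalous != malicious
```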

Without dismissing unsupervised learning as a viable strategy, let me explain some of the difficulties and why the approach is challenging in a security context. Feature generation, as described previously, is a critical aspect of unsupervised learning. First, the features themselves impart bias on what can be found in the data by selecting a subset of information from which to group users, devices, etc. For example, if I choose to consider only the IPs and ports that a device connects to, I’m already taking a very narrow view of the data and limiting the behaviors I can detect (e.g. without data transfer volumes, it will be impossible to detect data exfiltration). Additionally, each of the anomalous entities will lack a proper description of why the behavior/user is strange; the output is often just information like cluster characteristics or an anomaly score. Finally, anomalous does not equal malicious. This is one of the most difficult concepts to impart on data scientists new to the field. Those with experience know that lots of anomalous behavior occurs on a network, purposefully or not, that is completely benign.

So why am I making these points? First, the results, when considered in aggregate, tend to have high false positive rates. Since the intention is to model the underlying structure of the data, there is limited room to filter noise like we can with endpoint detection or more targeted behavioral detection. Additionally, the inability to specify the reason for an anomaly often leads to frustration for SOC analysts during investigation. With the deluge of alerts SOC personnel face from existing devices, adding low-information alerts that require time-heavy manual investigation to the queue is incredibly expensive. This combination leads to a lack of trust in the solution and, ultimately, to the alerts being turned off altogether.

So how do you build a robust network detection pipeline that reduces false positives, decreases total alerts, and provides reasoning/justification for each alert? I have always advocated for an approach I call targeted analytics. This approach focuses on building analytics for specific behaviors (e.g. malicious URLs, beaconing, network enumeration) rather than trying to find generically anomalous entities. By focusing on individual behaviors, analytics can be built using subject matter expertise to inform feature generation and data source usage. This not only improves detection capability, but also allows for robust noise filtering that reduces alert generation (e.g. filtering beacons based on ISP), as the sketch below illustrates.
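As a toy example of a targeted analytic, here is a minimal beacon-detection sketch: score each source/destination pair by the regularity of its connection intervals, then filter known-benign destinations. The column names, threshold, and allowlist are all hypothetical:

```python
# Minimal sketch of a targeted beaconing analytic: regular inter-arrival
# times suggest automated check-ins; known-benign destinations are filtered
# as a noise-reduction step. Names and thresholds are hypothetical.
import numpy as np
import pandas as pd

BENIGN_DESTINATIONS = {"update.vendor.example"}   # e.g. ISP/CDN/update traffic

def beacon_score(timestamps):
    """Low coefficient of variation in inter-arrival times => beacon-like."""
    deltas = np.diff(np.sort(timestamps))
    if len(deltas) < 5 or deltas.mean() == 0:
        return np.inf
    return deltas.std() / deltas.mean()

conns = pd.DataFrame({
    "src": ["10.0.0.5"] * 12,
    "dst": ["evil.example"] * 12,
    "ts": [i * 300 + np.random.normal(0, 2) for i in range(12)],  # ~5 min apart
})

for (src, dst), grp in conns.groupby(["src", "dst"]):
    if dst in BENIGN_DESTINATIONS:
        continue                                   # noise filtering step
    if beacon_score(grp["ts"].to_numpy()) < 0.1:
        print(f"possible beacon: {src} -> {dst}")
```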

The goal is not to replace devices like firewalls, proxies, and IDS/IPS, but rather to utilize their data to provide a comprehensive view of malicious behavior that may be missed by these traditional solutions. Although this approach requires more subject matter expertise and the ability to outline critical behaviors in the attack sequence, the increased fidelity of results pays significant dividends for the user. In a future post, I will address this philosophy and how organizations can use it to supercharge their SOC.

SOC Automation

The final notable area of ML support in security is at the SOC level. As I hinted above, SIEM was the original promise to aggregate all relevant network and security logs to more quickly assess and stop malicious activity. If you’ve made it this far, you already know that SIEM has failed to deliver on the automation promise. As the number of devices on the network and the sophistication of actors increase, quickly triaging the most important threats has become a paramount challenge in modern SOCs. This problem has opened the door to security startups promising an AI solution to automate away the human SOC as we know it. How realistic are these claims? Let’s take a deeper dive.

[Image: Examples of SIEM and SOC Automation Tools]

Most SOC automation platforms pitch the idea of automated aggregation of alerts to display only the most critical events facing the organization. So how exactly would an AI solution solve this problem? For this use case the alerts themselves are the raw data, not the telemetry from the network. Thus, feature engineering revolves around relationships among alerts in both time and substance. This seems like an obvious application for supervised learning where the SOC could label aggregated alerts entering the system to build a robust training set.

Herein lies the first major hurdle for AI SOC automation. Let’s assume we used the MITRE ATT&CK matrix techniques as features, of which there are more than 200. We need enough data to effectively model the possible feature space (combinations of these behaviors), which grows exponentially with the addition of each new feature (analytics folks know this as the curse of dimensionality [5]); the snippet below makes the scale concrete. It’s not only incredibly time consuming to accurately label data in security, but even more difficult to get many examples of behavioral combinations. One solution is to simulate behaviors of known adversaries in a cyber range and inject that data into an existing data pool. It may be obvious to some readers that this creates an artificial behavioral distribution which does not reflect the attack environment, but that’s for another post. Suffice it to say, to properly build an ML solution we need to observe the actual distribution of attack techniques.
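A quick back-of-the-envelope snippet shows the scale of the problem, assuming each of 200 techniques is treated as a binary present/absent feature:

```python
# Back-of-the-envelope: with ~200 binary ATT&CK-technique features, the
# number of possible behavior combinations dwarfs any labeled alert corpus
# a SOC could realistically collect.
n_features = 200
combinations = 2 ** n_features
print(f"possible combinations: {combinations:.3e}")        # ~1.6e60
print(f"a very generous labeled corpus: {1_000_000:.1e}")  # 1.0e+06
```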

A second issue we quickly run into is distinguishing between malicious/benign and low/critical severity examples of identical behavioral patterns. A real world example of this occurred at a large organization my team worked with that had Carbon Black telemetry. We saw process spawning of rundll32.exe using the advpack.dll “LaunchINFSection” function, followed by net.exe user commands and other persistence mechanisms, which matched the behavioral signature of an advanced adversary that had targeted the same industry just a month before (if interested, check out [6]). After further investigation by our team and the customer, we tracked it down to a Windows upgrade and a separate rollout of a new specialized company software tool. Because the exact command line information was not available in the alert, it was impossible to make an accurate assessment of the severity of this event.

The point here is that working at the alert level is one step above the raw data, where you lose important information necessary for the decision making process. A natural reaction I often see is, “well, I will just add that information to the features of my model.” You are now introducing a non-numeric and effectively unbounded feature space to a classification model (this is not good). As we make the feature space more complex, the amount of data required to accurately define the underlying process grows as well (the curse of dimensionality strikes again).
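As a concrete illustration of that feature explosion, consider pushing raw command lines through a standard hashing vectorizer, a common trick for unbounded text; the command strings below are illustrative:

```python
# Sketch of why folding raw command lines into a classifier is problematic:
# even a standard hashing trick turns one text field into an enormous,
# sparse feature space, and the data needed to cover it grows accordingly.
from sklearn.feature_extraction.text import HashingVectorizer

cmds = [
    'rundll32.exe advpack.dll,LaunchINFSection install.inf,DefaultInstall',
    'net.exe user backupadmin P@ssw0rd /add',
]
vec = HashingVectorizer(n_features=2**20)   # ~1M dimensions for one field
X = vec.transform(cmds)
print(X.shape)                              # (2, 1048576), almost all zeros
```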

So you may say, “AI SOC automation doesn’t currently exist.” I haven’t vetted every tool on the market, but this is likely not a bad bet. This is a bit tongue-in-cheek, but the reality is that a one-size-fits-all solution to get rid of your human SOC doesn’t exist. I would argue that ML is unnecessary for SOC automation at this stage. Given the ineffectiveness of most detection tools, the data available to any SIEM has significant gaps and false alarms. Smart algorithms for reducing the noise and highlighting interesting patterns from existing devices have proven very effective (e.g. Red Canary). More robust detection capabilities to replace legacy tools help bring higher fidelity data into the SIEM. Being able to classify commodity threats, which encompass most of the alerts SOC analysts deal with, helps to automate the triage process without ML.

I’m not suggesting that we should not endeavor to build this technology, just that it isn’t ready yet. A heuristic engine built with subject matter expertise can help classify most threats to the network, freeing up time for analysts to manually inspect more nuanced, never before seen behavior combinations (a minimal sketch follows). It is unlikely that an AI solution will be able to handle these rare edge cases anyway, so the argument is really about how best to automate low level, repetitive tasks out of the queue. Advanced threats will never represent an appreciable amount of the training data for a model, as their tactics constantly shift, so the goal should never be to automatically classify these threat actors.
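A minimal sketch of such a heuristic engine might look like the following; the rules and alert fields are entirely hypothetical, the point being that ordered, expert-written rules can auto-close commodity alerts and escalate the rest:

```python
# Minimal sketch of heuristic triage: encode analyst knowledge as ordered
# rules that auto-classify commodity alerts and route everything else to a
# human. Rule contents and alert fields are hypothetical.
COMMODITY_RULES = [
    ("adware", lambda a: a.get("category") == "PUP" and a.get("signed") is True),
    ("coinminer", lambda a: "xmrig" in a.get("process", "").lower()),
    ("phish_known_kit", lambda a: a.get("url_reputation") == "known_bad"),
]

def triage(alert):
    for label, rule in COMMODITY_RULES:
        if rule(alert):
            return f"auto-closed: commodity/{label}"
    return "escalate to analyst"   # rare/novel combinations stay with humans

print(triage({"category": "PUP", "signed": True}))
print(triage({"category": "unknown", "process": "weird.exe"}))
```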

What to Ask

So now that I’ve discussed the applications, their capabilities, and disadvantages, how do you navigate the market? Your first step is to assess your security footprint, SOC maturity, and risk profile (obviously budget as well). Although some vendors may suggest that they are the only product you need, the reality is that every tool checks only some of your security boxes. It should be obvious at this point that there is no such thing as a perfect product, even an ML solution touting 99% detection.

In general, if you are a small organization with no real intellectual property, your major threats will be opportunistic (users opening malicious attachments, visiting bad websites), and you thus require strong endpoint protection with a low cost SIEM solution for compliance. Larger organizations, critical infrastructure, and companies with intellectual property concerns (e.g. semiconductor manufacturers, banks, utilities, oil & gas) will require strong endpoint protection and robust advanced network detection capabilities. I’ve been in engagements with Fortune 500 companies where we detected hundreds of malicious communications passing through their existing tools. As I said before, no solution is perfect and a defense-in-depth strategy is necessary. Obviously there are nuances to all of this, as supply chain attacks become a lucrative approach for more advanced adversaries to use small companies as a means to breach larger organizations. The key is to understand that no product is a magic bullet and buying every product is an overreaction to fear based sales tactics. Understand what your gaps are and ask yourself if the vendor will help fill that gap.

So what questions should you ask a potential vendor? It depends on the category, but here are a few that will help you dig deeper. If they are uncomfortable or unwilling to answer these questions, you should really reconsider the engagement.

Endpoint Protection

1) What data have you used to train your ML model?
2) How do you handle model decay? How often is my solution updated to take advantage of retraining?
3) Are signatures also built into the platform to block well known tactics/files?
4) Do you participate in 3rd party testing? If so, what were the false positive and false negative rates for new malware?
5) Is the solution analyzing the file pre- or post-execution?

Network Detection

1) What exactly do you mean by AI?
2) Do you use unsupervised learning? Supervised learning? Reinforcement learning?
3) Can you show me examples of behaviors detected by your analytics?
4) What type of algorithms are you using and what behaviors do those techniques detect?
5) How do you balance alert fatigue caused by false positives vs the potential for false negatives?

SOC Automation

1) How is the system using AI to automate SOC processes?
2) How did you gather the requisite data to build a robust model?
3) Does the process require humans in the loop?
4) Can the system adjust to my specific needs (behaviors I care about)?
5) How does the system distinguish between severity among a specific behavior group (e.g. PUP/Adware spawning suspicious process vs. zero-day spawning suspicious process)?

This is not an exhaustive list of questions, but a great place to get started. Security is complicated and vendors should be willing to discuss their capabilities in depth without giving away intellectual property. I would recommend having a trial period to put the solution to the test. Bake-offs are a great opportunity for all involved to show off what they can do. Use-case documents are also a great way to understand a product, but be sure the vendor can describe specifically what they have detected and how they detected it with proper IOCs (IP addresses, domains, file hashes, etc.). Ultimately it’s about balancing budget and risk reduction. If the pitch sounds too good to be true, it likely is.

I hope that this post has been informative and will help you in your next vendor engagement. As always, look for partners who exhibit transparency, honesty, and humility. It’s a tough discipline and none of us are perfect, but together we can build on existing knowledge and better defend networks. I look forward to pivoting to a more technical article next time, so stay tuned.

[1] https://www.csoonline.com/article/3216765/security/heres-why-the-scanners-on-virustotal-flagged-hello-world-as-harmful.html

[2] https://arxiv.org/pdf/1807.07838.pdf

[3] https://www.fireeye.com/blog/threat-research/2018/12/what-are-deep-neural-networks-learning-about-malware.html

[4] https://www.camlis.org/scott-coull/

[5] https://www.kdnuggets.com/2017/04/must-know-curse-dimensionality.html

[6] https://securelist.com/muddywater/88059/
