5 Dangerous Misconceptions about Using Machine Learning in Cybersecurity

By Dr. Deepali Arora, Lead Data Scientist at Cmd


Using machine learning (ML) to derive business value is the new norm in the technology sector. According to the International Data Corporation (IDC), the artificial intelligence (AI) and machine learning market is expected to grow from $8 billion in 2016 to $47 billion by 2020. Every industry is adopting machine learning to gather intelligence at a rapid pace, and buzzwords like AI, ML, and deep learning are making their way into advertisements to sell products and attract investors. The cybersecurity industry is no different: many vendors now describe ML as central to their core product offering.

With so much hype around machine learning, companies are rushing to find ways to include it in their products and marketing. In the rush, crucial but subtle details are often overlooked. In some industries the consequences of these oversights might be tolerable; in the world of security, even a small error can be disastrous.

What needs to be considered when using ML in a cybersecurity context? In this article, I introduce five common misconceptions held by cybersecurity companies when attempting to incorporate ML into their product offering.

Misconception 1: Open-source, well-defined datasets are readily available for designing new ML-based solutions

Many organizations, in both the government and private sectors, have made their datasets readily available for research purposes; however, it is still difficult to find high-quality data, especially data you’re hoping to use to develop models for security. Even if you manage to find data that you believe may help you train or build models, you will still need to put it through a rigorous vetting process to answer questions such as the following (a short sketch of such checks appears after the list):

  1. Do I have enough data to provide meaningful insights for my application?
  2. Does this data provide all the variables needed to design a given model?
  3. Should I use these variables in their current format, or could I derive more meaningful information by combining them into something new?
  4. Is there any relationship between these variables, i.e., are they correlated or uncorrelated?
  5. Are there any missing values? If so, what technique should I use to fill or otherwise handle them?
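
As a concrete starting point, here is a minimal sketch of how some of these checks might look in practice, assuming a labeled CSV of security events; the file name and all column names below are hypothetical placeholders:

```python
# A minimal sketch of the vetting checks listed above, assuming a labeled
# CSV of security events; "events.csv", the "label" column, and the
# "bytes"/"packets" columns are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("events.csv")

# 1. Enough data? Check overall size and the balance of classes.
print(len(df), df["label"].value_counts(), sep="\n")

# 2. Are the variables I need present at all?
print(df.columns.tolist())

# 3. Can I derive something more meaningful by combining variables?
df["bytes_per_packet"] = df["bytes"] / df["packets"]  # hypothetical derived feature

# 4. Are variables correlated? Inspect pairwise correlations of numeric columns.
print(df.select_dtypes("number").corr())

# 5. Any missing values? Count them per column before picking a strategy.
print(df.isna().sum())
```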

Even if you use the best techniques to prepare open-source data for modeling, at the end of the day the practice of using open-source data to develop cybersecurity ML models can do more harm than good. Why? In the same way that we use our knowledge of ML to keep companies safe, attackers are busy developing new techniques and models to bypass our protections, and they often draw on the same ML methods and open-source data to design new and innovative ways to carry out malicious activity. Therefore, one must be very careful in selecting and preparing any publicly available data before feeding it into models designed to prevent cybercrime. Ideally, one should avoid open-source datasets altogether and instead use real data from the clients for whom the models are being designed.

Misconception 2: A data scientist can begin building new ML models on day one

Often, companies know they want ML models but don’t know what goes into actually creating them. They hire ML experts and expect them to start building new models immediately. But ML models are highly dependent on the nature of the data and the type of problem being addressed. If models are built on wrong or outdated data, they may not be of any use; after all, “garbage in” leads to “garbage out.” Identifying the data needed to address a specific problem, understanding its nature, and then finding the algorithms that work best for that data are crucial steps before designing the final ML models. This process takes time upfront, and it is important to understand your data before jumping into the selection of ML algorithms. ML algorithms will no doubt yield results, but the question is how valuable those results are.
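
To make this concrete, here is a minimal sketch of the kind of upfront comparison this implies: trying a few algorithm families on the data before committing to one. The synthetic dataset stands in for real, already-vetted data:

```python
# A minimal sketch of comparing candidate algorithm families before
# committing to one; make_classification stands in for a real, vetted dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in candidates.items():
    # 5-fold cross-validation gives a first read on which algorithm family
    # suits the data, before any serious tuning.
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```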

Misconception 3: ML algorithms that work today will also work in the future

In cybersecurity, data is non-stationary by nature. To keep systems safe from hackers, security engineers must continuously improve their techniques, because identifying malicious activity is a moving target: the variable types and values are constantly changing. For these reasons, designing ML models on old, static data can actually do more harm than good, as accomplished attackers can easily work out strategies to bypass detection. Moreover, with cybercriminals creating new and different types of malware on a regular basis, ML models developed for cybersecurity need continuous training on updated data to ensure they keep detecting suspicious activity as its nature evolves.
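
One lightweight way to act on this is to monitor the data itself for drift. The sketch below applies a two-sample Kolmogorov-Smirnov test to a single numeric feature; the synthetic arrays stand in for the feature as sampled at training time versus in production, and the 0.01 cutoff is an illustrative choice:

```python
# A minimal sketch of detecting distribution drift in one numeric feature
# with a two-sample Kolmogorov-Smirnov test; the synthetic arrays stand in
# for the feature as sampled at training time versus today.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_values = rng.normal(loc=0.0, scale=1.0, size=5000)   # feature at training time
recent_values = rng.normal(loc=0.4, scale=1.2, size=5000)  # same feature in production

stat, p_value = ks_2samp(train_values, recent_values)
if p_value < 0.01:
    # A significant shift suggests the model should be retrained on fresh data.
    print(f"Drift detected (KS statistic={stat:.3f}); schedule retraining")
```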

As your organization’s security strategy and tech stack continue to grow and evolve, your ML models will need to change, too. Models developed to monitor behavioral analytics on a single machine may not be applicable to identifying anomalies in data residing across the cloud. Similarly, techniques that work for datasets with limited variables may start exhibiting their limitations as the size and volume of your dataset expands. ML models therefore require constant testing, iteration, and tweaking to scale efficiently (and some may not scale at all).

Misconception 4: ML algorithms are safe and robust

In the world of security, ML models aren’t “set and forget.” To detect and defend against adversarial attacks on machine learning models, we first need to identify the possible motives of attackers. Commonly, attackers are either attempting to avoid detection (known as “evasion attacks”) or looking to manipulate a vendor’s or company’s model during training to degrade its detection capabilities (known as “poisoning attacks”). Our models need to adapt faster than our attackers can pivot, and we need to focus on high-quality signals that remain capable of detecting adversaries even under such manipulation.
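
To see why evasion is worth worrying about, consider a toy example: against a simple linear detector, an attacker who can probe the model only needs to nudge the most heavily weighted feature to push a malicious sample’s score down. Everything below is synthetic and purely illustrative:

```python
# A toy illustration of an evasion attack against a linear detector: the
# attacker nudges the feature the model weights most heavily until the
# malicious sample's score drops. All data here is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

sample = X[y == 1][0].copy()  # a sample from the "malicious" class
print("score before:", model.predict_proba([sample])[0, 1])

# Step against the largest-magnitude weight to push the score down.
w = model.coef_[0]
i = np.argmax(np.abs(w))
sample[i] -= np.sign(w[i]) * 3.0
print("score after: ", model.predict_proba([sample])[0, 1])
```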

Furthermore, the results obtained from ML models need to be thoroughly analyzed for false positives, as false positives can have severe consequences, especially in the security domain. Improving, training, and re-designing models on a continuous basis requires an intimate understanding of the limitations of existing models.
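
One practical way to do this analysis is to sweep the alert threshold on a held-out set and inspect the false positive rate and precision at each setting, so the operating point can be chosen against a realistic alert budget. This sketch uses synthetic labels and scores as stand-ins:

```python
# A minimal sketch of weighing false positives before deployment: sweep the
# alert threshold and report the false positive rate and precision at each.
# Labels and scores here are synthetic stand-ins for a held-out evaluation set.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=5000)
scores = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=5000), 0.0, 1.0)

for threshold in (0.5, 0.7, 0.9):
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold}: FPR={fp / (fp + tn):.3f}, "
          f"precision={precision_score(y_true, y_pred):.3f}")
```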

Misconception 5: Data scientists and security experts don’t need to interact

Of all the misconceptions, this last one is perhaps the most dangerous. Data science is still an emerging field, and the pool of employable data scientists remains small, so data scientists working for companies in a specific industry often are not experts in that field. This by itself isn’t necessarily a problem. But if data scientists design ML models without understanding the real logic behind the problem at hand, their models could end up doing more harm than good. Security engineers, on the other hand, usually lack a deep understanding of data science and are unaware of the ways ML could ease their workload by replacing traditionally manual processes. This gap in communication within an organization can lead to problems, friction, and overall inefficiency.

Ideally, companies should look for security engineers with some background in data science and for data scientists with an understanding of cybersecurity, but people with these combined skill sets are rare today. It is therefore essential that your data scientists and security experts communicate and share knowledge with one another. Security engineers can share insight into the problems they are trying to address, while data scientists can help security engineers understand the real potential of ML for handling large, complex datasets. Together, they can build solutions to address today’s complex cybercrime landscape.

Conclusion

In this article, I have discussed five major misconceptions about using ML in the security domain. To properly protect computer systems against cybercrime, security companies should take advantage of ML technology to build better, faster, and more efficient security solutions. But it’s imperative that ML is used in the right way, with the right considerations.

At Cmd, we have formed a team of highly qualified, multi-talented individuals with backgrounds in both security and data science who are working together to develop innovative security solutions for Linux systems. You can find out more about my academic and professional background as Cmd’s lead data scientist in my previous article on the Cmd blog. Stay tuned: in the coming weeks I’ll be sharing more about how we’re using machine learning within our own software.

. . .

About the author

Over the course of Dr. Deepali Arora’s rich career as a data scientist, she has created machine learning models that affect the operation of many companies around the world today.

Holding a PhD in electrical and computer engineering, she has published over 30 papers in some of the world’s top journals and conference proceedings. Her machine learning breakthroughs have saved companies millions of dollars in resources and continue to impact the lives of millions of users daily. Today she works as the lead data scientist at Cmd, a proactive Linux security company whose software protects the sensitive data of some of the world’s top enterprise organizations.
