Machine Learning + Security: Still All Hype or a Real Concern?

Lorny P.
Published in LatinXinAI
10 min read · Aug 16, 2018

An overview of Machine Learning & Security (“MLSec”)

(Image*) lolz…because all hackers wear hoodies and AI = robots

It’s that time of year again, with DefCon and BlackHat taking place last week. Hot topics like hacking IoT devices and the security of autonomous vehicles were on the roster this year. Yet one area of security that is garnering a larger following is MLSec, the fusion of machine learning and security. An increase in research and literature, like Clarence Chio’s Machine Learning and Security book and OpenAI’s paper on the Malicious Use of Artificial Intelligence earlier this year, signals growing interest in the field.

More and more MLSec talks at infosec conferences are addressing ML applications in the security field. Take this year’s BSidesSF conference with Phil Roth’s talk on An Open Source Malware Classifier and Dataset, or Microsoft’s Protecting the Protector: Hardening Machine Learning Defenses Against Adversarial Attacks discussion at BlackHat. This is all great signaling, but where does the MLSec field actually stand? How much is hype, and how much is real application?

First off, are we talking about AI or Machine Learning?
Experts, like John McClurg from Cylance, specify that the security industry is using applied artificial intelligence (aka machine learning) applications. With machine learning, a computer can learn from its inputs and decide how to act without explicit programming. A machine learning algorithm builds a model that represents the behavior of a real-world system from data samples of that behavior. Rapid developments in machine learning allow algorithms to automatically sift through large amounts of data and identify anomalies. The algorithms are self-trained, adaptable, and don’t require an expert for setup. As more data is aggregated, the model can retrain itself to include new behaviors and adjust its findings. Training can be supervised or unsupervised and should be representative of real-world data; otherwise the algorithm can’t offer useful insights.
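To make the anomaly-detection idea above concrete, here is a minimal sketch (the feature, threshold, and data are all hypothetical) of a model that learns what “normal” looks like from a baseline and flags anything far outside it:

```python
import statistics

def fit_baseline(values):
    """Learn a simple model of normal behavior: mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def is_anomaly(value, mean, stdev, z_threshold=3.0):
    """Flag any observation more than z_threshold standard deviations from the mean."""
    return abs(value - mean) > z_threshold * stdev

# Hypothetical baseline: daily login counts observed for one user
normal_logins = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
mean, stdev = fit_baseline(normal_logins)

print(is_anomaly(14, mean, stdev))  # a typical day -> False
print(is_anomaly(90, mean, stdev))  # a sudden burst of logins -> True
```

Real systems use far richer features and models, but the shape is the same: fit a model of normal behavior from data, score new events against it, and route the outliers to a human.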

For a recent primer on neural networks and security, watch Thomas Phillips’ BSidesSF presentation Machine Learning: Too Smart for its Own Good.

Machine Learning Applications in Security

Machine learning’s main use in security is to understand what is normal for a system, flag anything unusual, and route it to humans for review. How does that work though? Algorithms are used to predict whether a program is malicious based on millions of feature sets. The algorithms train on large data sets, learning what to watch for on networks and how to react to different situations. These algorithms have evolved over time and will continue to develop as new generations of malware and cyber-attacks become harder to detect using traditional cybersecurity protocols.
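A toy sketch of the idea of predicting maliciousness from feature sets: the feature names, training data, and smoothed-ratio scoring below are all invented for illustration, and production classifiers use millions of features rather than a handful, but the workflow (learn per-feature weights from labeled programs, then score unseen ones) is the same:

```python
from collections import Counter

# Hypothetical training data: behavior features extracted from known programs
MALICIOUS = [
    {"writes_registry", "disables_av", "contacts_c2"},
    {"disables_av", "encrypts_files", "contacts_c2"},
    {"writes_registry", "encrypts_files", "contacts_c2"},
]
BENIGN = [
    {"writes_registry", "reads_config"},
    {"reads_config", "opens_window"},
    {"writes_registry", "opens_window"},
]

def train(malicious, benign):
    """Learn, per feature, how much more often it appears in malware than in benign code."""
    mal, ben = Counter(), Counter()
    for features in malicious:
        mal.update(features)
    for features in benign:
        ben.update(features)
    # Laplace-smoothed ratio of occurrence rates per feature
    return {f: ((mal[f] + 1) / (len(malicious) + 2)) / ((ben[f] + 1) / (len(benign) + 2))
            for f in set(mal) | set(ben)}

def predict(weights, feature_set, threshold=1.0):
    """Multiply per-feature ratios; a score above the threshold means 'probably malicious'."""
    score = 1.0
    for f in feature_set:
        score *= weights.get(f, 1.0)  # unseen features carry no evidence
    return score > threshold

weights = train(MALICIOUS, BENIGN)
print(predict(weights, {"disables_av", "contacts_c2"}))    # -> True
print(predict(weights, {"reads_config", "opens_window"}))  # -> False
```

Note that "writes_registry" appears in both classes and so carries no weight either way; the model learns which combinations of behaviors distinguish malware, which is exactly what a static signature cannot do.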

MIT’s AI²

With regard to solutions, machine learning can stop malware before it infects a system, while traditional antivirus systems can only react after victims have already been affected by an attack. Traditional antivirus systems are signature-based: a security company identifies a malicious program, extracts a unique fingerprint for it, and then monitors customer devices to make sure those signatures don’t appear. In practice, use cases for machine learning in security are (thus far) mainly in pattern recognition and anomaly detection, with solutions in antivirus defense and malware scanning.
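For contrast, the signature-based approach described above can be sketched in a few lines (the "signature database" here is just hypothetical sample bytes; real products use curated feeds of fingerprints):

```python
import hashlib

# Hypothetical signature database: SHA-256 fingerprints of known malware samples
KNOWN_BAD_HASHES = {
    hashlib.sha256(b"malicious payload v1").hexdigest(),
    hashlib.sha256(b"malicious payload v2").hexdigest(),
}

def is_known_malware(file_bytes: bytes) -> bool:
    """Signature check: exact-match the file's fingerprint against known threats."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_HASHES

print(is_known_malware(b"malicious payload v1"))  # known sample -> True
# Even a one-byte change defeats a pure signature match -- the gap ML aims to close:
print(is_known_malware(b"malicious payload v3"))  # unseen variant -> False
```

The second call is the whole story: a signature only catches what has already been fingerprinted, which is why ML-based detection of behavioral patterns is attractive for novel variants.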

ML-based Security Attacks

When people hear the buzzwords “AI and security” they often think that some AI system is going to attack their computer and take all of their data. Is this a possibility in the future, and if so, how far off are we from this? Researchers are testing out ML-based applications on malicious attack methods like data poisoning and spear phishing.

CarbonBlack’s “Beyond the Hype” Report

Back in November 2017, Darktrace (a cybersecurity company that uses machine learning algorithms to detect and respond to cyber-threats) identified an ML-based attack. The attack used rudimentary machine learning to observe and learn user behavior patterns inside one of their clients’ networks. The software was capable of blending into the background to remain undetectable and exhibited the potential to learn to mimic behaviors. While Darktrace didn’t mention the motive of the threat, they stress that the use of machine learning in security breaches gives intruders the ability to scan networks for unpatched ports or to learn the tone and writing style of an intended target and eventually send automated malicious emails.

Data Poisoning
Data poisoning occurs when attackers introduce misleading data about what web content or traffic is non-malicious versus malicious. With machine learning, a poisoning attack adds poisoned instances to the training set and introduces new errors into the model. One example of data poisoning applied to email is when an attacker runs campaigns on thousands of accounts to mark malicious messages as “Not Spam” in an attempt to skew an algorithm’s perspective.
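The “Not Spam” campaign described above can be demonstrated end to end with a toy word-frequency spam filter (the messages, the classifier, and the poisoning volume are all invented for illustration):

```python
from collections import defaultdict

def train(messages):
    """messages: list of (text, is_spam). Returns per-word spam probability."""
    spam_count, total_count = defaultdict(int), defaultdict(int)
    for text, is_spam in messages:
        for word in set(text.split()):
            total_count[word] += 1
            spam_count[word] += is_spam
    return {w: spam_count[w] / total_count[w] for w in total_count}

def is_spam(model, text):
    """Classify by the average spam probability of known words."""
    scores = [model[w] for w in text.split() if w in model]
    return bool(scores) and sum(scores) / len(scores) > 0.5

clean = [
    ("win free prize now", True),
    ("claim free prize today", True),
    ("meeting notes attached", False),
    ("lunch plans today", False),
]
model = train(clean)
print(is_spam(model, "free prize now"))  # -> True: the filter works

# Poisoning: the attacker mass-reports spam-like messages as "Not Spam"
poisoned = clean + [("win free prize now", False)] * 5
model = train(poisoned)
print(is_spam(model, "free prize now"))  # -> False: the model has been skewed
```

Five poisoned labels are enough to flip this toy model; real filters need far more poison, but the mechanism, drowning out honest labels with malicious ones, is the same.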

Check out Battista Biggio’s (University of Cagliari, Italy) MLSec research

Spear Phishing
Attackers are expected to take advantage of machine learning to drive more phishing attacks, in particular attacks targeted at a specific individual, also known as “spear phishing.” Spear phishing tricks people by using carefully targeted messages to install malware or share sensitive data. Machine-learning models can create hyper-realistic, machine-written messages and also churn out these messages en masse with little effort.

Clarence Chio on how a targeted attack works

Researchers from Cyxtera (a cloud security firm) built a machine learning-based phishing attack generator. They trained the generator on over 100 million historic attacks to optimize and auto-generate effective scam links and emails. Their findings revealed that the average phishing attacker can bypass an AI-based detection system 0.3 percent of the time, but an attacker using ML would be able to bypass the system more than 15 percent of the time.

Go buy Clarence Chio’s ML & Security book, then watch his presentation on AI and Infosec.

Importance of Data Quality

High-quality data is critical when it comes to training machine learning algorithms. Massive variation in the data makes a model less effective. Experts note that machine learning is not perfect: algorithms will make mistakes, generating false positives or failing to detect an attacker depending on the techniques used. An additional challenge is how well machine learning can adapt to variance. At the onset of an attack, especially with malware or spear-phishing emails, each attack appears different, which makes it extremely difficult to detect and classify with confidence. The availability of training data at scale poses another problem: if you’re training a model, you need a lot of data on real attacks, which isn’t always openly available.
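The false-positive problem is worth putting in numbers. Because attacks are rare relative to benign traffic, even a model with a seemingly low false positive rate buries real detections under false alarms. A quick back-of-the-envelope sketch (all the rates below are hypothetical):

```python
def alert_stats(events_per_day, attack_rate, detection_rate, false_positive_rate):
    """Show how a rare-attack base rate turns a small FPR into an alert flood."""
    attacks = events_per_day * attack_rate
    benign = events_per_day - attacks
    true_alerts = attacks * detection_rate            # real attacks caught
    false_alerts = benign * false_positive_rate       # benign events flagged anyway
    precision = true_alerts / (true_alerts + false_alerts)
    return true_alerts, false_alerts, precision

# Hypothetical numbers: 1M events/day, 0.01% of them attacks,
# and a model that catches 90% of attacks with a 1% false positive rate
true_alerts, false_alerts, precision = alert_stats(1_000_000, 0.0001, 0.90, 0.01)
print(true_alerts)   # ~90 real detections...
print(false_alerts)  # ...buried under ~10,000 false alarms
print(precision)     # under 1% of alerts are real
```

This base-rate effect is a big part of why analysts distrust noisy models and why ML output is routed to humans rather than acted on automatically.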

Data quality = key (Image Credit)

Open Source Training Data

With more and more MLSec data sets becoming open source, like TU Braunschweig’s MLSec, Endgame’s EMBER, or PRA Lab’s (soon to be released) Secure ML Library, it’s becoming difficult to keep these resources out of the hands of malicious users. Researchers have worked hard to reduce the blind spots in machine learning systems so they can be hardened against attacks on those weaknesses. One reason a lot of security data remains closed is that it may contain people’s identifying information or could give attackers information about a company’s network architecture. The EMBER dataset, for example, was completely sanitized so that researchers and defenders could work together on MLSec research and stay ahead of attackers.

Speaking of Open Source MLSec Data, watch Phil Roth’s presentation on Open Source Malware Classifier and Dataset

Doubting MLSec

Although ML is a critical component of next-generation endpoint security technologies, experts are skeptical of how quickly ML will become a core component of security solutions, particularly because organizations have been focusing on the hype. While ML-based solutions can effectively identify non-obvious relationships in data sets, these solutions are still nascent and will only work as well as they’ve been trained. Security researchers are hesitant to fully trust a machine, as many solutions are still flawed with high false positive rates. For now, machine learning remains a key component of security solutions, but it should be used primarily to augment human decision making.

MLSec Industry is Growing

The number of research programs focusing on MLSec is growing around the globe. Carnegie Mellon has a group of researchers partnering with Symantec to make machine learning-based solutions more secure, while Berkeley’s electrical engineering and computer science department is expanding research on deep learning and security. Organizations like IEEE hosted their first-ever Deep Learning and Security Workshop this past spring in San Francisco. Similarly, institutions overseas, like the University of Cagliari’s Security of Machine Learning program or Ben-Gurion University in Israel, are growing their focus on MLSec. TU Braunschweig in Germany even launched the MLSec Project, an open-source initiative to test malicious programs on multiple datasets. On the private-sector side, the number of MLSec startups is growing rapidly, with loads of capital being poured into ML-powered companies like Darktrace (raised $179.5M) and Sift Science (raised $106M). With all this growth, is the realm of machine learning and security all hype? In short: human involvement is still very much needed to monitor, develop, and advance the MLSec industry forward.

Clarence Chio’s Map of MLSec Startups

TL;DR The MLSec realm is growing on the startup and research front around the globe. Experts advise not to believe the fear-based hype; only a few ML-based attacks have occurred thus far. Human oversight is still very much needed.

Further Reading

Pop open a cold can of TAB and read some of my fav. papers

The Definitive Security Data Science and Machine Learning Guide by Jason Trost (a treasure trove of MLSec resources!)

A Dynamic-Adversarial Mining Approach to the Security of Machine Learning by Tegjyot Singh Sethi, Mehmed Kantardzic, Lingyu Lyua, Jiashun Chen

Security Consideration For Deep Learning-Based Image Forensics by Wei Zhao, Pengpeng Yang, Rongrong Ni, Yao Zhao, Haorui Wu

Deep Learning for Malicious Flow Detection by Yun-Chun Chen, Yu-Jhe Li, Aragorn Tseng, Tsungnan Lin

Chiron: Privacy-preserving Machine Learning as a Service by Tyler Hunt, Congzheng Song, Reza Shokri, Vitaly Shmatikov, Emmett Witchel

Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning by Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, Bo Li

Security Analysis and Enhancement of Model Compressed Deep Learning Systems under Adversarial Attacks by Qi Liu, Tao Liu, Zihao Liu, Yanzhi Wang, Yier Jin, Wujie Wen

Automated software vulnerability detection with machine learning by Jacob A. Harer, Louis Y. Kim, Rebecca L. Russell et al.

Evaluation of Machine Learning Algorithms for Intrusion Detection System by Mohammad Almseidin, Maen Alzubi, Szilveszter Kovacs, Mouhammd Alkasassbeh

Explaining Black-box Android Malware Detection by Marco Melis, Davide Maiorca, Battista Biggio, Giorgio Giacinto, and Fabio Roli

Adversarial Malware Binaries: Evading Deep Learning for Malware Detection in Executables by Bojan Kolosnjaji, Ambra Demontis, Battista Biggio, Davide Maiorca, Giorgio Giacinto, Claudia Eckert, Fabio Roli

AI²: Training a big data machine to defend by Kalyan Veeramachaneni, Ignacio Arnaldo, Alfredo Cuesta-Infante, Vamsi Korrapati, Costas Bassias, Ke Li

References

Chio, C. (2018). AI in Infosec. [video] Available at: https://vimeo.com/230187986 .

McClurg, J. (2018). Black Hat. [online] Blackhat.com. Available at: https://www.blackhat.com/sponsor-posts/06252018.html .

Crosby, S. (2017). Separating Fact From Fiction: The Role Of Artificial Intelligence In Cybersecurity. [online] Forbes. Available at: https://www.forbes.com/sites/forbestechcouncil/2017/08/21/separating-fact-from-fiction-the-role-of-artificial-intelligence-in-cybersecurity/#4ff059da1883 .

Newman, L. (2018). AI Can Help Cybersecurity — If It Can Fight Through the Hype. [online] Wired | Security. Available at: https://www.wired.com/story/ai-machine-learning-cybersecurity/ .

Brundage, M., Avin, S., Clark, J., Toner, H. and Eckersley, P. (2018). The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation. [online] OpenAI Publications. Available at: https://arxiv.org/pdf/1802.07228.pdf .

Carbon Black. (2017). Beyond The Hype Security Experts Weigh In On Artificial Intelligence, Machine Learning And Non-Malware Attacks. [online] Available at: https://www.carbonblack.com/wp-content/uploads/2017/03/Carbon_Black_Research_Report_NonMalwareAttacks_ArtificialIntelligence_MachineLearning_BeyondtheHype.pdf .

Rosenbush, S. (2017). The Morning Download: First AI-Powered Cyberattacks Are Detected. [online] WSJ — CIO Journal. Available at: https://blogs.wsj.com/cio/2017/11/16/the-morning-download-first-ai-powered-cyberattacks-are-detected/ .

Evans, D. (2018). secML Class 11: Poisoning. [online] secML. Available at: https://secml.github.io/ .

LatinX in AI Coalition

We are happy to feature Latinx in AI researchers, scientists, engineers, entrepreneurs, and writers in our Medium Publication. Thanks to Lauren Pfeifer for submitting this amazing piece to be shared with our network!

Want your work to be featured in our publication? Email us at latinxinai@accel.ai.

Check out our open source website: http://www.latinxinai.org/

Do you identify as latinx and are working in artificial intelligence or know someone who is latinx and is working in artificial intelligence?

Add to our directory: http://bit.ly/LatinXinAI-Directory-Form



Lorny P.

Investor. Engineering student. Latina. Space enthusiast. Obsessed w/ Rocketry, A.I., and all things mechatronics.