Machine Learning’s Promise for Cybersecurity

Owen Lystrup
Published in Shifted
Aug 3, 2016 · 6 min read
Source: Alex Knight (@agkdesign), Unsplash

A 2011 report from the McKinsey Global Institute included the prediction that machine learning would bring a “new wave of innovation,” and soon.1 We’re all riding the swell of that wave right now.

Google Now, Amazon’s Alexa, Apple’s Siri, Snapchat, Netflix and the thousands of other services we use daily are all built on the broad shoulders of data science. They all run on systems that can learn and adapt based on how we use them. And it is astounding how many services and devices are already reasoning for us, making decisions and taking action on our behalf.

The apps, systems, services and devices that foster this tidal wave of innovation are already assimilated into our daily life, and they will further integrate in new and meaningful ways. They all create data, which will fuel a rapid advancement, adoption and use of machine learning. The more data we create with our connected toasters and self-driving cars, the more complex problems machine learning can tackle. It is already used for tasks like speech and facial recognition, natural language processing and predictive analytics. But does it apply to everything? What is machine learning’s promise for an arguably unsolvable problem like cybersecurity?

Proving Value: A Million Dollar Bounty
In 2006, Netflix showcased machine learning’s potential value through a competition with a $1 million grand prize. To win, a person or team had to demonstrate a 10 percent improvement over Netflix’s recommendation engine. A year after the competition opened, after 2,000 work hours and a solution that combined 107 different algorithms, the first progress prize went to a team that achieved an 8.47 percent improvement. The top prize, though, the million dollars, would take a different team two more years to claim.2 Netflix’s recommendation system has since become a canonical case study in academic settings.
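The contest’s yardstick was quantitative: entries were scored by root-mean-square error (RMSE) on held-out ratings, and “improvement” meant the percentage reduction in RMSE relative to Netflix’s own Cinematch baseline. A minimal sketch of that scoring, with illustrative figures:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and true ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def improvement_pct(candidate_rmse, baseline_rmse):
    """Percentage reduction in RMSE relative to a baseline system."""
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse

# Illustrative numbers: Cinematch's published baseline RMSE of 0.9514 on the
# quiz set, and a hypothetical candidate at 0.8563, just past the 10% bar.
print(round(improvement_pct(0.8563, 0.9514), 2))  # prints 10.0
```

The single-number metric is a large part of why the contest worked: every entrant, whatever their technique, could be ranked on the same scale.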

Within cybersecurity though, machine learning is still fairly green. Some in the industry are even a little skeptical about its promise to make us all more secure. And not without reason. Security brings unique challenges that rival even cancer research. Cancer cells, after all, do not actively evade biomedical researchers.


Hackers and online criminals are in a constant cat and mouse game with law enforcement and the companies they attack. Each attack has its own complexities, characteristics, malware types, attack methods, registration info, IP addresses, and hosting infrastructure, much of which may only be used or seen once and then discarded.

The good news here is that security companies and law enforcement now have unprecedented volumes of data on their side. A Cisco report earlier this year declared that we have reached the zettabyte era, meaning that by the end of 2016, global annual IP traffic will surpass one zettabyte, or one trillion gigabytes.3

Data science and machine learning experts argue that, at this scale, it is not feasible for any human — or any team of humans for that matter — to keep pace with cyberattacks in the connected reality in which we live. Not without help from data science.

The Big Race to Big Data Security
The space for machine learning startups is growing fast, really fast. Dozens have popped up in the last half decade or so, looking to solve problems with data models and algorithms. Microsoft Ventures alone is fostering 10 startups in the industry.4 And CrunchBase’s directory lists 865 startups and investment firms in the machine learning category. More than 30 of them list “security” as their focus area.5 But just a few years ago, this was not the reality.

“In 2011 at [the RSA security conference] there were virtually no startups in the machine learning field — maybe two or three,” Cisco Principal Engineer Martin Rehak said. “By 2013, there were maybe five or six, and I knew all the founders personally. Now, I don’t even know how many because everyone is talking about it.”

Many security startups, and even some established companies looking to expand their offerings, see opportunity in using data science to make customers more secure. But some believe the rush to offer a machine learning solution may flood the marketplace with ineffectual products that promise more than they can deliver.

A Layered Sifting Process
“Once someone says I am going to use neural networks or deep learning and solve cybersecurity,” Rehak said, “you know that person is lying.” He added that it’s never as simple as taking a problem and a dataset and applying machine learning to solve it. The approach has to be more layered, with multiple algorithms and data models applied to a large data set.

Rehak is an authoritative name in the machine learning field. His team at Cisco was one of the first to use data science to detect and classify attacks. He founded the company Cognitive Security in 2009, and Cisco acquired it in 2013. Since then, his team at Cisco has specialized in spotting network anomalies and malicious behavior, then categorizing the bad events, grading their severity, and even predicting what a piece of malware may do next. His team, called Cognitive Threat Analytics (CTA), just released a new dashboard that gives a visual spot check of a company’s security health.


Key to his work, he said, are the enormous amounts of data to which his team has access. CTA analyzes about a billion NetFlows daily. Each NetFlow, Rehak said, goes through about 30 to 70 different data models to “clean” the data. This process returns about 70 values to check and get an initial sense of the NetFlow’s legitimacy. From there, the tool can use its historical knowledge of past user and network behavior to give context around the traffic, which Rehak said cuts down on the false positives that other tools might flag.

The process can be likened to water purification, he said. The team even uses the measurement PPM — or parts per million — when enumerating threats. Only the process runs in reverse: instead of filtering out the dirty water, Rehak and his team want their data models to filter out the legitimate traffic and leave the malicious traffic behind for analysis.
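The shape of that reverse filtration can be sketched in a few lines. Everything below is invented for illustration: the stage scores, threshold, and flow fields are hypothetical stand-ins, not CTA’s actual models. The point is only the structure the analogy describes: several independent scoring stages, a combined verdict, and a retained residue measured in parts per million.

```python
def stage_scores(flow):
    """One anomaly score per detection stage (higher = more suspicious).
    The stages and flow fields here are hypothetical, for illustration only."""
    return [
        flow["bytes_out"] / max(flow["bytes_in"], 1),        # upload-heavy traffic
        1.0 if flow["dest_port"] not in (80, 443) else 0.0,  # unusual destination port
        float(flow["new_domain"]),                           # never-before-seen domain
    ]

def is_suspicious(flow, threshold=1.5):
    """Combined verdict across all stages."""
    return sum(stage_scores(flow)) >= threshold

def purify(flows):
    """Filter out legitimate flows; return the suspicious residue and its PPM."""
    kept = [f for f in flows if is_suspicious(f)]
    return kept, 1_000_000 * len(kept) / len(flows)

benign = {"bytes_out": 10, "bytes_in": 1_000, "dest_port": 443, "new_domain": 0}
odd = {"bytes_out": 5_000, "bytes_in": 100, "dest_port": 6667, "new_domain": 1}
kept, ppm = purify([benign] * 999 + [odd])
print(len(kept), ppm)  # prints: 1 1000.0
```

A real deployment would replace these toy heuristics with statistical models fit to historical traffic; the filtration structure, not the individual rules, is what the water-purification analogy captures.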

This refining process is key to getting the desired results, but it is also the art behind the science. That is to say, there is no standard way of solving a given problem, so each approach might differ. Data scientists find different paths to a solution, using different models and schools of thought within machine learning to train and tune systems to be smarter and return the desired results. To some, this interpretive way of reaching solutions is also what makes certain processes in machine learning a bit of a black box.

The Dark Art of Machine Education
Of the startups entering the field, a number are as-a-service platforms offering data-gathering services and machine learning technology like anomaly detection, visualization tools and predictive analytics. But according to CSIRO Chief Scientist Bob Williamson, it’s important to exercise caution around these types of services.

Williamson told an industry panel in May that machine learning and data science solutions are “very technique driven.”6

“Pretty well every provider of analytics solutions will say ‘look at the techniques I’ve got — I’ve got some support vector machines, I invented one of the support vector machine algorithms, it’s a great technique’,” he said. “It’s still a technique. How do you know for your problems that it’s useful? You don’t.”

For some, however, especially data scientists in the field like Rehak, machine learning nevertheless holds great promise for making the Internet more secure. Some argue, in fact, that it’s not just possible that machine learning will improve security; it’s inevitable.

Part two of this series will discuss the different models and approaches that are making this all possible; how to separate the pretenders from the real deal; the differences between supervised and unsupervised learning; the ubiquitous “false positive” metric; and more.

Stay tuned.

Special thanks to OpenDNS Security Researcher Jeremiah O’Connor, Martin Rehak, and Dazhuo Li for their participation in this article.

Sources:
1. http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
2. http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
3. http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/vni-hyperconnectivity-wp.html
4. http://www.geekwire.com/2016/microsoft-seattle-accelerator-startups/
5. https://www.crunchbase.com/category/machine-learning/5ea0cdb7c9a647fc50f8c9b0fac04863
6. http://www.cio.com.au/article/600547/machine-learning-still-cottage-industry/

Owen Lystrup is Digital Content Director for Western Digital.