Researchers Apply Machine Learning to Catch Phish Before They Hatch

By Owen Lystrup
Published in Shifted
Nov 21, 2016

Security researchers at OpenDNS have figured out a way to apply machine learning and automation to shorten the verification of phishing sites to mere seconds, and block others that have not even launched.

Phishing is a swift-moving business. Criminals seeking to harvest credentials, or to spoof an executive’s e-mail to request a fraudulent transfer of funds, do not stage their attacks to linger very long. Relying on cheap domains with convincing-looking login pages, phishing sites typically live only a few days, or sometimes hours, before they disappear or move to a new URL.

This makes the task of tracking and blocking phishing sites incredibly difficult, researchers at OpenDNS say, but it’s where machine learning models and automation can provide a lot of help.

The Security Soft Spot

Even as we all rapidly evolve how we communicate, with more and more people opting for mobile messaging and social media channels, age-old e-mail remains at the top, especially for business. And phishing remains one of the top attack vectors.

Research from PhishLabs recently showed that although the industries targeted and methods of attack have fluctuated and changed over the years, phishing remains one of the top choices for stealing usernames and passwords.

“Organizations today are spending far more on preventing, detecting and responding to cyberattacks than ever before,” PhishLabs wrote in a research report. “But amid all of this change, the use of phishing to exploit the people that use the technology continues to be the most effective way to attack organizations and individuals.”

It remains a truth in cybersecurity that no matter how much a company invests in security products, and no matter how much training an organization puts its employees through, you cannot stop users from clicking. Call it curiosity or impulse, blue hyperlinks in a targeted message will always entice a gullible cursor.

The Power of Automation

The research team at OpenDNS Labs has for years managed a public phishing database called PhishTank that has served as a positive resource for security professionals to use for purposes like building block lists. Anyone can submit a link they suspect is a phish to PhishTank. From there, a group of approved moderators looks over each submitted site. False positives are weeded out manually as a result.

In the new process, links submitted to PhishTank first go through automated verification. The submitted URL is checked against pre-existing whitelists and ASN filters that OpenDNS manages, which weeds out false positives and spam. If the URL passes that check, the algorithm scrapes its source code and page content and feeds them to a machine learning model. That model examines the source code, the images used on the site and the site’s language, among other elements, and compares them to a corpus of curated data. The model then assigns a score and, if the score is above the threshold, the URL is added to the block list.

This is all processed in a matter of seconds, whereas it might take a handful of human moderators days to get to a URL and verify it.

“When a link is submitted, NLPRank will verify the phish and if it has a score of, say, 97 percent or higher, it gets automatically blocked,” OpenDNS Security Researcher Jeremiah O’Connor said in an interview.
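The flow described above (filter first, then score, then compare against a threshold) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OpenDNS implementation: the `Submission` fields, the `triage` function, the outcome labels and the keyword-based `toy_score` stand-in for the real model are all hypothetical, and the 0.97 threshold simply mirrors the "say, 97 percent" figure in the quote.

```python
from dataclasses import dataclass
from typing import Callable, Set
from urllib.parse import urlparse

# Mirrors the "say, 97 percent" figure from the quote; not a confirmed setting.
BLOCK_THRESHOLD = 0.97


@dataclass
class Submission:
    url: str
    asn: int          # ASN of the hosting network, assumed already resolved
    page_source: str  # scraped page content, assumed already fetched


def triage(
    sub: Submission,
    allowlist: Set[str],
    filtered_asns: Set[int],
    score_page: Callable[[str], float],
    threshold: float = BLOCK_THRESHOLD,
) -> str:
    """Return 'filtered', 'blocked' or 'needs_review' for one submission."""
    host = urlparse(sub.url).hostname or ""
    # Step 1: cheap list checks weed out known-good hosts and networks.
    if host in allowlist or sub.asn in filtered_asns:
        return "filtered"
    # Step 2: the model (here, any callable returning 0.0 to 1.0) scores the page.
    score = score_page(sub.page_source)
    # Step 3: only high-confidence results are blocked automatically.
    return "blocked" if score >= threshold else "needs_review"


def toy_score(page_source: str) -> float:
    """Hypothetical stand-in for the model: fraction of lure phrases present."""
    phrases = ("password", "verify your account", "login")
    text = page_source.lower()
    return sum(1 for p in phrases if p in text) / len(phrases)


if __name__ == "__main__":
    sub = Submission(
        url="http://paypa1-login.example/signin",
        asn=64500,  # ASN from the range reserved for documentation
        page_source="Login: verify your account password",
    )
    print(triage(sub, {"example.com"}, {64496}, toy_score))
```

Passing the scorer in as a callable keeps the list checks and the threshold decision independent of whatever model sits behind them.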

Stereotype the Netblocks, with Prejudice

To take the model one step further, the team conducts ancillary searches through WHOIS info, registrant data, IP addresses and host information to find out what other sites the same hacker might be using, now or in the future.

“We are truly predictive now,” O’Connor said, “meaning we are vetting URLs that are submitted from the community and finding out that the model has already blocked them, sometimes months before.”

If the right clues can be correlated, the model can expose an entire host of additional bad URLs and IP addresses that can be blocked as well, even if they are not currently in use. Imagine a game of Minesweeper. Click the right square, and suddenly the game board opens up to reveal the scale of treachery around you.

Image source: Imgur.com

“By taking the verified results of the NLPRank process,” researchers wrote in their blog post, “and pivoting through their server IPs using Investigate, we are able to uncover handfuls of other registered phishing domains acting as targets for the very phishing campaign that was initially discovered.”

O’Connor and his team call this process of pivoting Rogue Infrastructure Classification. It’s a way of combining all sorts of associated domains, IPs and WHOIS records from known, bad domains and automating the process of finding the rest before they’re even used.
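The pivoting idea can be pictured as a walk over shared infrastructure: start from verified phishing domains, then follow shared server IPs and registrant records to neighboring domains. The Python sketch below is a minimal illustration of that idea, assuming a flat list of hypothetical `Record` entries. It does not use the Investigate API or any real OpenDNS data, and the `pivot` function and its `max_hops` limit are assumptions made for the example.

```python
from collections import defaultdict, deque
from typing import Iterable, NamedTuple, Set


class Record(NamedTuple):
    # Hypothetical flattened view of hosting and WHOIS data for one domain.
    domain: str
    ip: str
    registrant: str


def pivot(seeds: Iterable[str], records: Iterable[Record], max_hops: int = 2) -> Set[str]:
    """Return domains linked to the seeds by shared IPs or registrants."""
    by_attr = defaultdict(set)   # (kind, value) -> domains that use it
    attrs_of = defaultdict(set)  # domain -> its (kind, value) attributes
    for r in records:
        for attr in (("ip", r.ip), ("registrant", r.registrant)):
            by_attr[attr].add(r.domain)
            attrs_of[r.domain].add(attr)

    seed_set = set(seeds)
    seen = set(seed_set)
    queue = deque((s, 0) for s in seed_set)
    # Breadth-first search, stopping once a domain is max_hops from a seed.
    while queue:
        domain, hops = queue.popleft()
        if hops == max_hops:
            continue
        for attr in attrs_of[domain]:
            for neighbor in by_attr[attr]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, hops + 1))
    return seen - seed_set


if __name__ == "__main__":
    records = [
        Record("bad-login.example", "203.0.113.10", "a@mail.example"),
        Record("bank-verify.example", "203.0.113.10", "b@mail.example"),
        Record("parked-soon.example", "198.51.100.7", "b@mail.example"),
        Record("unrelated.example", "192.0.2.1", "c@mail.example"),
    ]
    print(sorted(pivot({"bad-login.example"}, records)))
```

In this toy data, one verified domain leads to a second through a shared IP and to a third, not yet active, through a shared registrant. A real system would presumably need extra safeguards, since shared hosting can put unrelated legitimate sites on the same IP.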

“We can predict the infrastructures phishers will use as they are being setup, and we are now blocking phishing sites even before they go live with spoofed content,” the team wrote.

For more info on the data models used and to see examples of results, check out the OpenDNS Security Labs blog.

Owen Lystrup is Digital Content Director for Western Digital.