A Phishing Guide: Lessons Learned on the Journey to Detecting Phishing Domains

Jonathan Ticknor
Published in Security Analytics · Jan 24, 2019 · 8 min read

Welcome to the first post of a series that will chronicle my personal experiences building detection capabilities for some of the largest organizations in the world. When addressing detection capabilities, phishing is often the first subject that comes to mind and is an essential tactic used by proficient adversaries. Hence, the series will start here with A Phishing Guide: Lessons Learned on the Journey to Detecting Phishing Domains.

Here I will highlight my philosophies and document many of the common pitfalls to which the industry has succumbed. Opacity and a lack of in-depth knowledge of phishing techniques have led to a general mistrust of advanced analytic vendors and practitioners in today’s market. This article will review phishing’s historically successful approaches and clarify the importance and techniques of building detection models to defend against these malevolent methods. I hope the following insights are useful for those looking to build a robust detection platform, and helpful to those working to figure out what the new “Advanced AI” vendors actually do (often it is not actually AI) and what they are trying to accomplish in the future.

The first thing I want to do is define what I mean by phishing detection. Phishing has broad connotations in cyber security, including malicious email, attachment, and domain detection, among others. For this post, I want to focus on malicious domain detection because it’s an important component for any corporate security team and can be built using readily available data. Anyone with web proxy or DNS data can begin to detect malicious traffic in their own environment, and with a little extra effort and ingenuity, can use open source data to build their own domain threat intelligence feed.

An example of a Microsoft One Drive phishing page

A savvy reader is probably aware of the many academic articles, vendor use cases, and Github repositories claiming to detect >95% of malicious domains with incredible false positive rates (often below 0.1%). By no means am I the expert on phishing detection, nor do I have all of the answers on how to implement a detection strategy. But my experience in the field, particularly related to phishing, tells me to first critically assess any solution that purports “too good to be true” detection metrics. There are two critical questions that must be asked when assessing any solution: 1) What data was used to build the model? and 2) What data was used to test the model? This may seem trivial, but mistakes at this stage lead to unreasonable expectations upon deployment, and ultimately customer/user disappointment. I don’t believe biased datasets are used on purpose; they are often the result of limited domain knowledge or lack of access to robust data. Let’s first jump into a taxonomy of phishing techniques, and then we can better understand how bias can be introduced through improper data collection.

There are seven common approaches that I am going to introduce below to get started. For the expert reader, I am aware there are more, but highlighting the most common techniques will help the broadest audience.

1. Domain generation algorithms (DGAs): A technique used by malware authors in which an algorithm is used to constantly change the command and control (C2) domain to avoid blocking mechanisms. This technique has fallen out of favor as even the most rudimentary security tools often provide DGA detection.

2. Well known brands: A popular approach to phishing domain generation that is exceptional at tricking users into believing they are accessing a legitimate site owned by Apple, Google, Bank of America, etc. In the age of smartphones with small url bars on the screen, the use of brand names has been wildly successful at stealing social media, personal mail, and financial credentials.

3. Well known keywords: A popular approach used in spam and the classic “You have a virus, call 1–888-” tech support scams. By using keywords like virus, detected, and login, the adversary creates fear in the user that the machine has in fact been compromised.

4. Homoglyph attacks: A powerful technique in which the adversary makes very minor changes to a popular domain so that the user “sees” the proper domain and clicks. This technique has become more difficult to deploy as large corporate security teams register many of the most successful homoglyphs via protection providers like MarkMonitor.

5. Punycode: An old trick that has become more of an emerging threat. Again, an approach that uses the appearance of the domain to trick users.

6. Legitimate hijacked pages: Among the more difficult techniques to detect, particularly for models that use enrichment from WHOIS data to filter false positives. In this approach, an adversary compromises a legitimate website and hosts malicious subdomains or folder structures leading to malicious pages. It is common to see this approach paired with one of the techniques above (particularly brand and keyword spoofing).

7. Standard domain: The most difficult technique to detect and one often employed by more advanced actors. The domain generally has no discernible lexical features that make it unique. To prevent detection via WHOIS data (registration time, owner, registration country, etc.), the domain is registered in advance of the attack.

Examples of phishing techniques
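To make techniques 4 and 5 concrete, here is a minimal sketch (Python standard library only) showing how a homoglyph label maps to the Punycode form a browser would display. The Cyrillic “а” (U+0430) is visually identical to the Latin “a”, which is exactly what makes these attacks effective in a small URL bar; the `to_punycode` helper below is illustrative, not a full IDNA implementation.

```python
# Sketch: how a homoglyph label maps to Punycode.
# Cyrillic "а" (U+0430) looks identical to Latin "a" on screen,
# but the encoded (ACE) form exposes the difference.

def to_punycode(label: str) -> str:
    """Encode a single domain label roughly the way browsers do for IDNs."""
    try:
        label.encode("ascii")
        return label                      # pure ASCII: no encoding needed
    except UnicodeEncodeError:
        return "xn--" + label.encode("punycode").decode("ascii")

latin = "apple"                           # all Latin characters
spoof = "\u0430pple"                      # Cyrillic "а" + "pple"

print(latin, "->", to_punycode(latin))    # unchanged
print(spoof, "->", to_punycode(spoof))    # xn-- form reveals the spoof
print("same code points:", latin == spoof)
```

The two strings render identically in most fonts, but they are different code points, so a detector working on the encoded `xn--` form can flag what the human eye cannot.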

Now that I have defined a taxonomy of common techniques, I can start to consider possible solutions. The first challenge is acquiring a sufficient amount of useful data to build a detection model. I need a substantial set of confirmed phishing domains as well as an equally sizable set of known benign domains. If your first instinct is to get the Alexa or Umbrella 1 million lists, fight the urge. Although these lists can be particularly useful in filtering false positives, they are not strong datasets for building a robust detection model.

First, I want examples of legitimate traffic to domains with multiple subdomains (which you won’t get from the Alexa 1M). Second, I want a representative dataset for lexical analysis. If the model thinks that character transitions related to word concatenation in domains are rare among our benign set, there will be an issue. Although uncommon in the top 1M datasets, this behavior occurs frequently in domains people visit on a daily basis (think auto dealers, hospitals, local shopping). Finally, I want broad geographic diversity in domains, particularly if I am building the model for an international company (e.g. flagging every Brazilian or Polish site that has unique lexical characteristics won’t work). I’m not going to give away all my techniques, but there are ways to get these robust benign datasets for free (e.g. Common Crawl, customer historical data).

For malicious domains, there are a number of resources I can use to build a robust dataset. Some of the more common ones include PhishTank, hpHosts, FireHOL, and VirusTotal (requires a license to gather many domains). Once I gather these lists, it is critical to apply a filter that removes duplication (particularly at the second-level domain). I have seen many training datasets with significant duplication of malicious infrastructure, which leads to artificially high reported detection rates (500 examples of *.baddomain[.]com across the training and test sets). It is also important that I try to gather as many possible examples of the seven techniques listed above to understand how my model will perform against a variety of adversaries.
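The second-level-domain de-duplication step can be sketched in a few lines. Note the two-label heuristic below is an assumption for brevity; a real pipeline should consult the Public Suffix List, since registered domains under suffixes like co.uk span three labels.

```python
# Sketch: de-duplicate a malicious-domain feed at the second-level
# domain, so 500 subdomains of one bad site count as one example.
# The two-label heuristic is illustrative; use the Public Suffix List
# in production (e.g. "co.uk" breaks this simple split).

def registered_domain(fqdn: str) -> str:
    labels = fqdn.lower().rstrip(".").split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else fqdn.lower()

def dedupe(feed):
    seen, out = set(), []
    for d in feed:
        sld = registered_domain(d)
        if sld not in seen:
            seen.add(sld)
            out.append(d)
    return out

feed = ["login.baddomain.com", "mail.baddomain.com",
        "baddomain.com", "other-bad.net"]
print(dedupe(feed))   # one entry per second-level domain
```

Applying this before the train/test split is the important part: if duplicates of one infrastructure land on both sides of the split, the test set leaks the training set.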

Now that I have collected enough data, it is time to build a model. But let’s hold on for a second and think about the seven techniques I highlighted above. The first technique, DGAs, is a trivial task to solve as I mentioned earlier. We can either hand-engineer features and use a random forest model or use a deep learning model with character embeddings to detect this behavior. Techniques 2 & 3 rely on tricking users, and although these domains are often abnormally long, use .com in the middle of the domain, and include dashes, there are many examples that don’t use these obvious indicators. It should be obvious that the character embedding DNN we used for DGAs may struggle to detect these techniques (spoiler alert: it will). Technique 4 goes a step further and attempts to look just like a legitimate domain. You may say, “but wait, if the adversary switches an l for a 1 in google[.]com or adds an extra letter, that will be lexically strange.” However, if you analyze enough traffic you will realize there are plenty of legitimate domains that use strange lexical features just like that. This will lead to false positive issues when using ML models to detect this technique. Techniques 5 & 7 go a step further still, becoming nearly impossible to detect using lexical models.
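To ground the hand-engineered-features route, here is a sketch of the kind of lexical features a random forest might consume. The exact feature set is my own illustration (length, Shannon entropy of the leftmost label, digit ratio, dash count, embedded “.com”), not a prescription; real systems tune and extend this list heavily.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    return -sum(c / len(s) * math.log2(c / len(s)) for c in counts.values())

def lexical_features(domain: str) -> dict:
    """Hand-engineered features of the kind a random forest might use
    for DGA/phishing detection; the feature set here is illustrative."""
    name = domain.split(".")[0]                  # leftmost label
    return {
        "length": len(domain),
        "entropy": round(shannon_entropy(name), 3),
        "digit_ratio": sum(ch.isdigit() for ch in name) / max(len(name), 1),
        "dash_count": domain.count("-"),
        "embedded_com": ".com" in domain[:-4],   # ".com" not at the end
        "num_labels": domain.count(".") + 1,
    }

print(lexical_features("x9t3kq0dfu2.net"))       # DGA-like: high entropy
print(lexical_features("apple.com-login.verify-account.ru"))
```

A DGA label lights up the entropy and digit-ratio features, while a brand-spoof lights up the dash count and embedded “.com” flag, which is why the two classes need different features (and why techniques 5 and 7 defeat all of these).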

Example WHOIS record used for enrichment

We can go a step further and begin to add enrichment to our domains. For instance, we can add registration date, hosting IP, hosting ISP, and hosting country to our lexical features. Although at first glance this seems like an obvious solution to the lexical feature problem, I need to address how expensive it is to gather these features. At a typical Fortune 500 customer, the number of never-before-seen unique domains often exceeds 500,000 in a given day. A typical service like whoisxml offers 2 million WHOIS lookups per month for ~$1,800. That license alone can be cost prohibitive for some organizations (I highly recommend these services though; they save a lot of headaches). On top of that, as new privacy laws are enacted we are seeing less registration information available, reducing the efficacy of the enrichment. I generally recommend smart filtering before enrichment to reduce the number of domains analyzed, despite the risk of increasing the false negative rate.
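The economics make the filtering step concrete: 500,000 new domains a day is roughly 15 million a month, against a 2-million-lookup quota. A pre-enrichment filter can be as simple as the sketch below; the allowlist contents, cache policy, and the crude second-level-domain split are all illustrative assumptions, not a production design.

```python
# Sketch: cut WHOIS lookup volume by filtering before enrichment.
# Names and policies here are illustrative, not a production design.

def should_enrich(domain: str, allowlist: set, seen_cache: set) -> bool:
    sld = ".".join(domain.lower().split(".")[-2:])   # crude SLD heuristic
    if sld in allowlist:        # top-1M style allowlist: skip the lookup
        return False
    if sld in seen_cache:       # already enriched recently: reuse result
        return False
    seen_cache.add(sld)
    return True

allowlist = {"google.com", "wikipedia.org"}
cache = set()
stream = ["mail.google.com", "login.unknown-site.ru",
          "cdn.unknown-site.ru", "en.wikipedia.org"]
to_look_up = [d for d in stream if should_enrich(d, allowlist, cache)]
print(to_look_up)               # only the first unknown-site.ru hit
```

This is exactly where the false negative risk enters: anything an adversary can get onto the allowlist (a hijacked popular site, for example) is never enriched.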

This is not to say that ML, and in particular deep learning, with or without enrichment, does not have its place in security analytics or malicious domain detection. ML is a really awesome hammer, but not every problem is a nail. In the case of phishing domain detection, I am arguing for a multi-analytic solution to protecting the environment (which is what my team currently deploys). Anyone who has worked with SOC analysts understands the impact false positives have on their workflow. We have to remember that each analytic we build is just one alert among the hundreds or thousands of unique alerts the SOC will see on a given day. It is essential that we do the heavy lifting before it gets to them, increasing the chances that our alerts will be used. I won’t spell out exactly what my team has implemented, but here are a few cool ways other companies/individuals are trying to solve this challenge:

DGAs: random forest model on hand engineered features, deep learning models

Well known brands/keywords: heuristic models, keyword detectors (see Cisco Umbrella)

Homoglyph attacks: dnstwist (https://github.com/elceef/dnstwist)

Punycode: https://www.endgame.com/blog/technical-blog/detecting-phishing-computer-vision-part-1-blazar

Hijacked domains: heuristic models, domain length, etc.

Standard domain: WHOIS data, hosting information (major challenge)
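For a flavor of what a dnstwist-style tool does for the homoglyph case, here is a minimal permutation generator. The substitution table is a small illustrative subset (real tools cover Unicode confusables, transpositions, bitsquatting, and more); the output is the candidate list a defender would monitor or pre-register.

```python
# Minimal sketch of dnstwist-style homoglyph permutations: generate
# lookalike candidates for a brand domain so they can be monitored
# or pre-registered. The substitution table is a tiny illustrative
# subset of what real tools cover.

GLYPHS = {"o": ["0"], "l": ["1", "i"], "i": ["1", "l"], "e": ["3"]}

def homoglyph_variants(domain: str) -> list:
    name, _, tld = domain.partition(".")
    out = set()
    for i, ch in enumerate(name):
        for sub in GLYPHS.get(ch, []):
            out.add(name[:i] + sub + name[i + 1:] + "." + tld)
    return sorted(out)

print(homoglyph_variants("google.com"))
```

Feeding these candidates into passive DNS or certificate transparency monitoring turns a generation tool into a detection one: a newly registered variant is a strong early-warning signal.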

As you can see, we need a broad range of approaches to tackle the ingenuity of phishing attacks. I think it’s critical to ask a vendor how they are detecting these types of domains, what they are doing to reduce false positives, and how they plan to adapt to adversary tactics. This is a very challenging problem, particularly as adversaries adjust their behavior with the knowledge of our detection techniques.

I hope you all found this read to be educational or, at the least, a bit entertaining. My next post will focus on AI/ML for security in the more general sense, sifting through the hype and providing an understanding of the current state of the market. Future posts will include in-depth technical analysis (some with code) as well as more general state-of-the-discipline analyses. Stay tuned!

Any feedback is more than welcome. Thank you for your time.
