URL Phishing Detection using Machine Learning and Graphs

Avyah Sharma
ACM at UCSD
7 min read · Jan 14, 2022

A Basic Application of Mathematics in Cyber Security

Look, two clowns looking at each other! — credit to Ishaan Kavoori

What is Phishing?

Phishing is a social engineering cybercrime in which the attacker attempts to lure confidential information from a target, often by impersonating a more reputable organization or individual. Most phishing attacks revolve around a hacker spoofing a website or sending messages, giving them the ability to do anything from installing malicious software like spyware to gaining access to secure information through a backdoor.

URL phishing, the practice where hackers build fraudulent websites designed to look like a legitimate institution in order to deceive the target into revealing sensitive data, is the focus of this article. By mapping the ego network of a given webpage, it is possible to create a binary classification model that predicts whether or not the sourced webpage is malicious.

Designing the Model

The model takes a series of URLs, each strictly in the form of “https://www.example.com”, and assigns each one a binary label: the source website is marked 1 if malicious or 0 if safe. To start, observe what techniques make phishing attempts successful. In other words, what patterns in hyperlink manipulation are commonly used not only to remain undetected by phishing detectors, but also to seem authentic? Authenticity is defined as the ability to masquerade as a target or legitimate website. Remaining undetected, on the other hand, relies more on the linking structure of the source webpage. These two characteristics are the main basis for how the features of the classification model are defined.

Sourcing the Dataset

Malicious URLs were sourced from PhishTank and OpenPhish, collectively. Safe URLs were based on the most commonly visited websites on the World Wide Web. All of these URLs were downloaded and saved into a CSV file, and a Pandas dataframe was created from that file containing all the links along with their respective labels. Next, any URL that was not valid was removed from the dataset: URLs not in the correct format, phishing links that were no longer functional, any URL returning a status code greater than 200, and links that yielded timeout errors when connecting. The resulting dataset contained 300 malicious and 300 safe URLs. Lastly, the Python Requests library was used to fetch each URL’s HTML and extract its hyperlinks. Those links were then saved into another CSV file containing a series of dataframes, one per source URL.
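A minimal sketch of this collection step is shown below. The file names, column names, helper functions, and timeout threshold are assumptions for illustration, not the repository’s exact code.

import pandas as pd
import requests
from html.parser import HTMLParser

df = pd.read_csv("urls.csv")  # assumed columns: "url", "label" (1 = malicious)

def is_reachable(url, timeout=5.0):
    """Keep only URLs that respond with HTTP 200 within the timeout."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:  # timeouts, DNS failures, bad schemes
        return False

class LinkExtractor(HTMLParser):
    """Collect every href found in a page's anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

# Enforce the expected URL format, then drop anything unreachable.
df = df[df["url"].str.startswith("https://www.")]
df = df[df["url"].apply(is_reachable)].reset_index(drop=True)

# Fetch each page and record its outgoing hyperlinks as (source, target) edges.
edges = []
for url in df["url"]:
    parser = LinkExtractor()
    parser.feed(requests.get(url, timeout=5.0).text)
    edges.extend((url, link) for link in parser.links)
pd.DataFrame(edges, columns=["source", "target"]).to_csv("edges.csv", index=False)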

Building the Networks

First, a pandas dataframe was created from the CSV file containing all the queried URLs. After splitting the dataframe into multiple dataframes, each representing a source URL and its respective links, the program used NetworkX to create a graph from the edge list. The resulting graph represents the ego network as a whole, with the source URL as the center node. The surrounding nodes represent webpages, while each directed edge symbolizes a hyperlink connection. The graph was then colored using the following algorithm:

for each node in G do
    if node URL is the source URL then
        color node red
    else if source and node URL have the same domain and subdomain then
        color node orange
    else if source and node URL have different subdomains (but the same domain) then
        color node yellow
    else if node URL is invalid then
        color node green
    else        ▷ likely other valid URLs or links to external webpages
        color node blue

To put it simply, the algorithm iterates through each node, assigning it a color depending on its relationship with the source node; the domain, subdomain, and validity of each node’s URL determine its color. Note that the ordering of the nodes does not matter, and the center node is always colored red because it is the source node. Consider the colored ego network of https://www.google.com as an example:

Ego Network of Google’s Webgraph
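The coloring pass might look like the sketch below, which builds the ego network from the edge list produced earlier. The naive split_host helper is an assumption for illustration (real code would likely use a library such as tldextract), and the scheme/netloc check is only a syntactic proxy for the broader notion of an invalid link, which also covers dead or timing-out URLs.

from urllib.parse import urlparse
import networkx as nx
import pandas as pd

def split_host(url):
    """Naively split a hostname into (subdomain, registered domain),
    e.g. "https://www.google.com" -> ("www", "google.com")."""
    parts = urlparse(url).netloc.lower().split(".")
    return ".".join(parts[:-2]), ".".join(parts[-2:])

def color_ego_network(G, source):
    src_sub, src_dom = split_host(source)
    for node in G.nodes:
        parsed = urlparse(node)
        if node == source:
            color = "red"
        elif not parsed.scheme or not parsed.netloc:
            color = "green"        # syntactic proxy for an invalid URL
        else:
            sub, dom = split_host(node)
            if dom == src_dom and sub == src_sub:
                color = "orange"   # same domain and subdomain
            elif dom == src_dom:
                color = "yellow"   # same domain, different subdomain
            else:
                color = "blue"     # external webpage
        G.nodes[node]["color"] = color

# Build the ego network of one source URL from the saved edge list.
edges = pd.read_csv("edges.csv")
ego = edges[edges["source"] == "https://www.google.com"]
G = nx.from_pandas_edgelist(ego, "source", "target", create_using=nx.DiGraph)
color_ego_network(G, "https://www.google.com")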

Binary Classifier

By exploiting the properties of the network structure, a series of features can be extracted so that a classification model may be trained. In this list, the formula for each feature is given along with an explanation of why that attribute is included in the classifier. Features that share a similar justification are grouped into a category strictly for readability. In total, there are 11 features.

Category 1

Percent of Orange Nodes

Percent of Yellow Nodes

Percent of Orange and Yellow Nodes

Reasoning: Each of these percentages measures how dominant the source domain and its subdomains are within the ego network. Websites with higher percentages of orange or yellow nodes tend to be more legitimate. The combined percentage of orange and yellow nodes is also calculated, because authentic websites often link heavily within their own domain and across many of their own subdomains.
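As a minimal sketch, assuming the "color" node attribute set by the coloring pass above (the green and blue percentages and raw counts in the next two categories follow the same pattern):

def color_fraction(G, *colors):
    """Fraction of nodes in G whose assigned color is one of `colors`."""
    matches = sum(1 for _, c in G.nodes(data="color") if c in colors)
    return matches / G.number_of_nodes()

pct_orange        = color_fraction(G, "orange")
pct_yellow        = color_fraction(G, "yellow")
pct_orange_yellow = color_fraction(G, "orange", "yellow")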

Category 2

Percent of Green Nodes

Percent of Blue Nodes

Percent of Green and Blue Nodes

Reasoning: A high percentage of green or blue nodes may suggest that the website is implementing some level of hyperlink manipulation. In fact, certain phishing websites add a series of unnecessary links to seem far more believable to unsuspecting users, an approach that is popular due to its ease of implementation. [3] Notably, it is also critical that the percentage of blue nodes is calculated, because it is not guaranteed that blue nodes represent malicious external URLs. Likewise, green nodes may simply represent URLs that are experiencing downtime, though this is incredibly unlikely and is usually a sign of suspicious activity.

Category 3

Size of Green Nodes

Size of Blue Nodes

Reasoning: It is essential that the model accounts for the difference between percentages and the total number of external or invalid links. A safe website may have a high percentage of blue nodes, but that does not necessarily correlate with the URL being malicious. These features allow the model to consider instances where there are few blue nodes in absolute terms, despite a high percentage. In addition, a website with a strangely high number of null links or redirects is another reason to suspect malicious intent.

Category 4

Out-Degree of Source Node

Reasoning: Legitimate webpages tend to have a reasonable number of outgoing links. If a webpage has an unusually high number of hyperlinks, this usually suggests a form of hyperlink manipulation.

Category 5

Out-Degree Centrality Mean

Density

Reasoning: The density feature is similar in spirit to Google’s PageRank [6] in the sense that it relates the total number of edges to the number of nodes. In this dataset, the ego networks of malicious URLs tend to be denser than those of safe URLs.
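A sketch of these graph-level features with NetworkX, assuming G and source from the coloring sketch above (the Category 4 out-degree is included for completeness):

import networkx as nx

out_degree_source = G.out_degree(source)                    # Category 4
out_centrality_mean = (sum(nx.out_degree_centrality(G).values())
                       / G.number_of_nodes())               # Category 5
density = nx.density(G)    # for a directed graph: |E| / (n * (n - 1))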

Simplified Version of Formulas

Training the Model

After using NetworkX to perform each of the listed calculations, the results were saved to a CSV file, where each row represented one URL, each column a feature, and the last column the 1 or 0 label. With a proper dataset in hand, the machine learning model was ready to be trained. The data was split into training and test sets, and XGBoost and scikit-learn were the libraries used to create the supervised model. The code for the whole process can be found at https://github.com/avyahsharma/phishing-detector.
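A minimal sketch of the training step is shown below; the file name, split ratio, and hyperparameters are assumptions for illustration rather than the repository’s exact choices.

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

data = pd.read_csv("features.csv")   # 11 feature columns plus a "label" column
X, y = data.drop(columns=["label"]), data["label"]

# Hold out a stratified test set so both classes appear in evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = XGBClassifier(n_estimators=100, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)
predictions = model.predict(X_test)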

Results

Utilizing sklearn.metrics, a number of metrics were used to measure the performance of the model. An image containing some relevant statistics is attached below:

Performance Metrics

Each metric was calculated using the given formulas, where TN is True Negative, TP is True Positive, FN is False Negative, and FP is False Positive.

Metric Formulas
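The exact figures live in the image above; for reference, the standard definitions of the common sklearn.metrics scores in terms of these counts are:

Accuracy  = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = (2 × Precision × Recall) / (Precision + Recall)
Jaccard   = TP / (TP + FP + FN)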

Improvements

A number of improvements can be made to optimize this model. To start, consider the Jaccard score: while its value may seem unreasonable, most hackers employ the same techniques when designing phishing websites, which is why the entries in the dataset are so similar.

Another key improvement would be adding the ability to query for an additional level of webpages. By scraping for more hyperlinks, a larger graph can be built, allowing the model to collect more information on phishing websites that are better at maneuvering around detectors. Competing models implemented a similar approach and achieved stronger results, because it also allowed them to test for more features, such as a stronger PageRank algorithm, checking whether graphs are semiconnected, and searching for more loops. In fact, when competing models included these features, they saw an increase in performance that depended on the machine learning technique used.

As mentioned before, the code can be found at https://github.com/avyahsharma/phishing-detector. Any changes or improvements would be greatly appreciated; kindly leave comments, questions, or suggestions and I’ll do my best to reply!

[1]: Jones, Caitlin. “50 Phishing Stats You Should Know in 2021.” Expert Insights, 26 Oct. 2021, https://expertinsights.com/insights/50-phishing-stats-you-should-know/.

[2]: “How to Recognize and Avoid Phishing Scams.” Consumer Information, 18 Oct. 2021, https://www.consumer.ftc.gov/articles/how-recognize-and-avoid-phishing-scams.

[3]: “What Is Phishing? Examples and Phishing Quiz.” Cisco, 29 Oct. 2021, https://www.cisco.com/c/en/us/products/security/email-security/

[4]: Tan, Choon Lin, et al. “A Graph-Theoretic Approach for the Detection of Phishing Webpages.” Computers & Security, vol. 95, 2020, p. 101793., https://doi.org/10.1016/j.cose.2020.101793.

[5]: Aleksandersen, Daniel. “Most of Alternate Web Browsers Don’t Have Fraud and Malware Protection.” 16 Aug. 2016, https://www.ctrl.blog/entry/fraud-protection-alternate-browsers.html.

[6]: Page, Lawrence, et al. “The PageRank Citation Ranking: Bringing Order to the Web.” Stanford InfoLab Publication Server, Stanford InfoLab, 11 Nov. 1999, http://ilpubs.stanford.edu:8090/422/.

[7]: “Introduction to Boosted Trees.” Introduction to Boosted Trees — Xgboost 1.5.1 Documentation, https://xgboost.readthedocs.io/en/stable/tutorials
