A Comparative Study of Algorithms for Phishing Website Classification — part 1

Mahesh Kumar SG
7 min read · Jul 14, 2023


Introduction

The Internet has become an essential part of our lives. We use it for everything from shopping to socializing to working. But with the increasing use of the Internet, there are also increasing risks. Cyberattacks are becoming more sophisticated, and personal information is more valuable than ever before.

There are different kinds of cyberattacks carried out around the world. A few of them are listed below.

  • Malware: This is software that is designed to harm a computer system. Malware can take many forms, including viruses, worms, trojans, and ransomware.
  • Phishing: This is a type of social engineering attack that is designed to trick users into giving up their personal information. Phishing attacks often involve sending emails that appear to be from legitimate sources, such as banks or credit card companies. The emails will often contain a link or attachment that, when clicked, will take the user to a fake website that looks like the real website. Once the user enters their personal information on the fake website, the phisher can steal it.
  • DoS (Denial-of-Service) and DDoS (Distributed Denial-of-Service) attacks: These attacks are designed to overwhelm a computer system with traffic, making it unavailable to legitimate users. DoS attacks are typically launched from a single computer, while DDoS attacks are launched from multiple computers.
  • SQL injection: This is a type of attack that is used to exploit vulnerabilities in databases. SQL injection attacks can be used to steal data from databases, modify data in databases, or even take control of databases.
  • Zero-day attacks: These attacks exploit vulnerabilities in software that the software vendor is not aware of. Zero-day attacks are often the most dangerous type of attack because there is no patch available to protect against them.
  • Man-in-the-middle (MITM) attacks: These attacks are designed to intercept communications between two parties. MITM attacks can be used to steal data, modify data, or even take control of communications.
  • Ransomware: This is a type of malware that encrypts a victim’s files and demands a ransom payment in order to decrypt them. Ransomware attacks are often very successful because victims are often willing to pay the ransom in order to get their files back.

In this blog we are going to focus on one such attack, known as phishing.

Why is it important to detect a phishing website?

Phishing websites are designed to trick users into entering their personal information, such as passwords, credit card numbers, and social security numbers. Once this information is obtained, the phishers can use it to commit identity theft, fraud, and other crimes.

The financial losses caused by phishing attacks are significant. In 2020, the FBI’s Internet Crime Complaint Center (IC3) received over 241,000 complaints about phishing attacks, with reported losses exceeding $54 million. By some estimates, the large majority of data breaches begin with a phishing attack.

Phishing website

That’s a lot, isn’t it!

So we must protect ourselves from visiting such phishing websites.

In our project, we are developing a machine-learning model to detect whether a website is phishing or not.

Problem Statement

The objective of this project is to develop an accurate and efficient system for classifying websites as phishing or non-phishing based on a given set of characteristics. Phishing websites are malicious online platforms that aim to deceive users into revealing sensitive information, such as login credentials or financial details, by imitating trustworthy websites. Detecting and identifying phishing websites is crucial in safeguarding users from potential fraud, privacy breaches, and financial losses.

The Dataset used to build the machine-learning models is from the UC Irvine Machine Learning repository.

Mohammad, Rami and McCluskey, Lee (2015). Phishing Websites. UCI Machine Learning Repository. https://doi.org/10.24432/C51W2X.

Data Cleaning

The data was provided in ARFF format, so the first step was to clean it and convert it into a more convenient format for further use, which in our case is CSV.

We cleaned the data and stored it in a CSV file.
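
The conversion step above can be sketched with scipy's ARFF reader. This is a minimal sketch, not the project's actual script; the function name and file paths are my own, and I assume the usual UCI file layout where every attribute is a nominal value in {-1, 0, 1}.

```python
from scipy.io import arff
import pandas as pd

def arff_to_csv(arff_path: str, csv_path: str) -> pd.DataFrame:
    """Load an ARFF file, decode its nominal values, and write it as CSV."""
    data, _meta = arff.loadarff(arff_path)
    df = pd.DataFrame(data)
    # scipy loads nominal ARFF attributes as bytes (e.g. b'-1'),
    # so decode them and cast to integers before saving.
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].str.decode("utf-8").astype(int)
    df.to_csv(csv_path, index=False)
    return df
```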

Exploratory Data Analysis

  • The dataset consisted of 31 columns.
  • All the values present in the dataset are numeric values.
  • There were 11055 rows present in the dataset.
  • There were no null values present in the dataset.
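
The checks behind these bullets can be reproduced with a few lines of pandas. This is a sketch; the helper name is my own, and in the project it would be run on the DataFrame loaded from the cleaned CSV.

```python
import pandas as pd

def eda_summary(df: pd.DataFrame) -> dict:
    """Basic EDA checks: row/column counts, missing values, numeric dtypes."""
    return {
        "n_rows": len(df),                                # expect 11055
        "n_cols": df.shape[1],                            # expect 31
        "n_missing": int(df.isnull().sum().sum()),        # expect 0
        "all_numeric": all(
            pd.api.types.is_numeric_dtype(t) for t in df.dtypes
        ),
    }
```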

Description of Features

  1. having_ip_address
  • If the domain part has an IP address then -1
  • If the domain part doesn’t have an IP address then 1

2. URL_Length

  • URL length < 54 then 1
  • URL length >= 54 and URL length <= 75 then 0
  • URL length > 75 then -1

3. Shortining_Service

  • If the URL uses a shortening service such as TinyURL then -1
  • Normal URL then 1

4. having_At_Symbol

  • URL having @ symbol then -1
  • URL doesn’t contain @ then 1

5. double_slash_redirect

  • The position of the last occurrence of “//” in the URL > 7 then -1
  • Otherwise 1

6. Prefix_Suffix

  • Domain part includes symbol ‘-’ then -1
  • Domain doesn’t contain symbol ‘-’ then 1

7. having_Sub_Domain

  • If there is one dot in the domain part then 1
  • If there are two dots in the domain part then 0
  • If there are more than two dots in the domain part then -1

8. SSLfinal_State

  • Uses HTTPS, the issuer is trusted, and the age of the certificate ≥ 1 year then 1
  • Uses HTTPS but the issuer is not trusted then 0
  • Otherwise -1

9. Domain_registration_length

  • Domain expires in ≤ 1 year then -1
  • Otherwise 1

10. Favicon

  • Favicon Loaded From External Domain then -1
  • Otherwise 1

11. port

  • Port # is of the preferred status then 1
  • Otherwise -1

12. HTTPS_token

  • If the “https” token is used in the domain part of the URL then -1
  • Otherwise 1

13. Request_URL

Request_URL examines whether the external objects contained within a webpage, such as images, videos and sounds, are loaded from another domain. In legitimate webpages, the webpage address and most of the objects embedded within the webpage share the same domain.

  • % of Request URL <22% then 1
  • otherwise -1

14. URL_of_Anchor

  • If % of URL of anchor < 31% then 1
  • If % of URL of anchor >= 31% and <= 67% then 0
  • If % of URL of anchor > 67% then -1

15. Links_in_tags

  • % of links in <Meta>, <Script> and <Link> tags < 17% then 1
  • If >= 17% and <= 81% then 0
  • Otherwise -1

16. SFH

  • Server Form Handler (SFH)
  • If the SFH is “about:blank” or empty then -1
  • If the SFH refers to a different domain then 0
  • Otherwise 1

17. Submitting_to_email

  • If website is using “mail()” or “mailto:” function to submit user information then -1
  • Otherwise 1

18. Abnormal_URL

  • If the host name is not included in the URL then -1
  • Otherwise 1

19. Redirect

  • If the number of redirects is at most 1 then 1
  • Otherwise -1

20. on_mouseover

  • If the status bar changes on mouseover then -1
  • Otherwise 1

21. RightClick

  • If rightclick is disabled then -1
  • Otherwise 1

22. popUpWindow

  • If the popup window contains text then -1
  • Otherwise 1

23. Iframe

  • If the website is using iframe then -1
  • Otherwise 1

24. age_of_domain

  • If the age of the domain is greater than 6 months then 1
  • Otherwise -1

25. DNSRecord

  • If there is no DNS record for the given domain then -1
  • Otherwise 1

26. web_traffic

  • If the website’s traffic rank < 100,000 then 1
  • If the website’s traffic rank > 100,000 then 0
  • If the website has no recorded traffic rank then -1

27. PageRank

  • If PageRank < 0.2 then -1
  • Otherwise 1

28. Google_Index

  • If webpage indexed by Google then 1
  • Otherwise -1

29. Links_pointing_to_page

  • If the number of links pointing to the webpage is 0 then -1
  • If the number of links pointing to the webpage is 1 or 2 then 0
  • Otherwise 1

30. Statistical Report

  • If Host belongs to top Phishing IPs then -1
  • otherwise 1

31. Result

  • If the website is non-phishing then 1
  • If the website is phishing then -1

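A few of the lexical features above can be computed directly from a raw URL. The sketch below follows the thresholds listed in the post and the UCI feature definitions (shorter URLs and hyphen-free domains are coded toward 1 = legitimate); the function names are my own, not from the original project.

```python
import re
from urllib.parse import urlparse

def having_ip_address(url: str) -> int:
    """-1 if the domain part is a raw IPv4 address, else 1."""
    host = urlparse(url).netloc
    return -1 if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}(:\d+)?", host) else 1

def url_length(url: str) -> int:
    """1 if the URL is shorter than 54 chars, 0 if 54-75, else -1."""
    n = len(url)
    if n < 54:
        return 1
    return 0 if n <= 75 else -1

def having_at_symbol(url: str) -> int:
    """-1 if the URL contains '@' (browsers ignore everything before it)."""
    return -1 if "@" in url else 1

def prefix_suffix(url: str) -> int:
    """-1 if the domain part contains a hyphen, else 1."""
    return -1 if "-" in urlparse(url).netloc else 1
```

These are only the purely lexical checks; features such as SSLfinal_State or web_traffic need certificate lookups and external services and are not sketched here.
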
Univariate Analysis

We plotted bar plots to check how the data is distributed.

As we can see, our target variable Result is almost balanced, with only a slight skew toward one class.
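
The class-balance check can be sketched as below. The toy Result series here stands in for the real column, which in the project would come from the cleaned CSV; 1 = legitimate, -1 = phishing.

```python
import pandas as pd

# Toy stand-in for the dataset's Result column.
result = pd.Series([1, -1, 1, 1, -1, 1, -1, -1, 1, 1], name="Result")

# Fraction of each class; feeding this to a bar plot gives the
# univariate view described above.
balance = result.value_counts(normalize=True)
print(balance)
```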

Bi-variate analysis

We plotted each feature with the Result column as the hue.

Observations

  • There is a strong relation between Prefix_Suffix and Result. If the website does not have ‘-’ in the domain name, there is a higher chance of it being a legitimate website.
  • If there is more than one subdomain, the chance of the website being a phishing website increases.
  • SSLfinal_State has a strong relation with Result and can be an important factor in deciding whether a website is phishing.
  • If the domain registration length is more than one year, the chance of that website being a phishing website is lower.
  • The URL_of_Anchor feature can be critical in determining whether a website is phishing or not.

We then performed the Chi-square test to find the relationship between each variable and the Result column.

We got these features as significant features.

having_IP_Address, URL_Length, Shortining_Service, having_At_Symbol, double_slash_redirecting, Prefix_Suffix, having_Sub_Domain, SSLfinal_State, Domain_registeration_length, port, HTTPS_token, Request_URL, URL_of_Anchor, Links_in_tags, SFH, Abnormal_URL, Redirect, on_mouseover, age_of_domain, DNSRecord, web_traffic, Page_Rank, Google_Index, Links_pointing_to_page, Statistical_report.
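
The Chi-square screening described above can be sketched with scipy. This is a minimal version under my own naming, shown on a toy feature/target pair; in the project it would be looped over every column against Result, keeping the features whose p-value falls below the chosen significance level (commonly 0.05).

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_pvalue(df: pd.DataFrame, feature: str, target: str = "Result") -> float:
    """p-value of a Chi-square independence test between a feature and the target."""
    # Build the contingency table of feature values vs. target values.
    table = pd.crosstab(df[feature], df[target])
    _chi2, p, _dof, _expected = chi2_contingency(table)
    return p
```

A feature is then flagged as significant when `chi2_pvalue(df, col) < 0.05`.
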

The next step is model building, which I will discuss in part 2 of this blog.

Click here for part 2.

Link to the GitHub repository:

https://github.com/MaheshKumarsg036/streamlit-website-phishing
