A Comparative Study of Algorithms for Phishing Website Classification — part 1

Mahesh Kumar SG
7 min read · Jul 14, 2023


Introduction

The Internet has become an essential part of our lives. We use it for everything from shopping to socializing to working. But with the increasing use of the Internet, there are also increasing risks. Cyberattacks are becoming more sophisticated, and personal information is more valuable than ever before.

There are different kinds of cyberattacks carried out around the world. A few of them are listed below.

  • Malware: This is software that is designed to harm a computer system. Malware can take many forms, including viruses, worms, trojans, and ransomware.
  • Phishing: This is a type of social engineering attack that is designed to trick users into giving up their personal information. Phishing attacks often involve sending emails that appear to be from legitimate sources, such as banks or credit card companies. The emails will often contain a link or attachment that, when clicked, will take the user to a fake website that looks like the real website. Once the user enters their personal information on the fake website, the phisher can steal it.
  • DoS (Denial-of-Service) and DDoS (Distributed Denial-of-Service) attacks: These attacks are designed to overwhelm a computer system with traffic, making it unavailable to legitimate users. DoS attacks are typically launched from a single computer, while DDoS attacks are launched from multiple computers.
  • SQL injection: This is a type of attack that is used to exploit vulnerabilities in databases. SQL injection attacks can be used to steal data from databases, modify data in databases, or even take control of databases.
  • Zero-day attacks: These attacks exploit vulnerabilities in software that the software vendor is not aware of. Zero-day attacks are often the most dangerous type of attack because there is no patch available to protect against them.
  • Man-in-the-middle (MITM) attacks: These attacks are designed to intercept communications between two parties. MITM attacks can be used to steal data, modify data, or even take control of communications.
  • Ransomware: This is a type of malware that encrypts a victim’s files and demands a ransom payment in order to decrypt them. Ransomware attacks are often very successful because victims are often willing to pay the ransom in order to get their files back.

In this blog we are going to focus on one such attack, known as phishing.

Why is it important to detect a phishing website?

Phishing websites are designed to trick users into entering their personal information, such as passwords, credit card numbers, and social security numbers. Once this information is obtained, the phishers can use it to commit identity theft, fraud, and other crimes.

The financial losses caused by phishing attacks are significant. In 2020, the FBI’s Internet Crime Complaint Center (IC3) received over 241,000 complaints about phishing attacks, with reported losses exceeding $54 million. By some estimates, the large majority of data breaches begin with a phishing attack.

Phishing website

That’s a lot, isn’t it!

So we must protect ourselves from visiting such phishing websites.

In our project, we are developing a machine-learning model to detect whether a website is phishing or not.

Problem Statement

The objective of this project is to develop an accurate and efficient system for classifying websites as phishing or non-phishing based on a given set of characteristics. Phishing websites are malicious online platforms that aim to deceive users into revealing sensitive information, such as login credentials or financial details, by imitating trustworthy websites. Detecting and identifying phishing websites is crucial in safeguarding users from potential fraud, privacy breaches, and financial losses.

The Dataset used to build the machine-learning models is from the UC Irvine Machine Learning repository.

Mohammad, Rami and McCluskey, Lee (2015). Phishing Websites. UCI Machine Learning Repository. https://doi.org/10.24432/C51W2X.

Data Cleaning

The data was provided in ARFF format, so the first step was to clean it and convert it into a more convenient format for further use, which in our case is CSV.

We cleaned the data and stored it in a CSV file.
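
The conversion step above can be sketched with scipy's ARFF reader. This is a minimal sketch, not the project's actual script; the function name and file paths are my own, and I assume the usual UCI file layout where every attribute is a nominal value in {-1, 0, 1}.

```python
from scipy.io import arff
import pandas as pd

def arff_to_csv(arff_path: str, csv_path: str) -> pd.DataFrame:
    """Load an ARFF file, decode its nominal values, and write it as CSV."""
    data, _meta = arff.loadarff(arff_path)
    df = pd.DataFrame(data)
    # scipy loads nominal ARFF attributes as bytes (e.g. b'-1'),
    # so decode them and cast to integers before saving.
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].str.decode("utf-8").astype(int)
    df.to_csv(csv_path, index=False)
    return df
```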

Exploratory Data Analysis

  • The dataset consisted of 31 columns.
  • All the values present in the dataset are numeric values.
  • There were 11055 rows present in the dataset.
  • There were no null values present in the dataset.
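
The checks behind these bullets can be reproduced with a few lines of pandas. This is a sketch; the helper name is my own, and in the project it would be run on the DataFrame loaded from the cleaned CSV.

```python
import pandas as pd

def eda_summary(df: pd.DataFrame) -> dict:
    """Basic EDA checks: row/column counts, missing values, numeric dtypes."""
    return {
        "n_rows": len(df),                                # expect 11055
        "n_cols": df.shape[1],                            # expect 31
        "n_missing": int(df.isnull().sum().sum()),        # expect 0
        "all_numeric": all(
            pd.api.types.is_numeric_dtype(t) for t in df.dtypes
        ),
    }
```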

Description of Features

  1. having_ip_address
  • If the domain part has an IP address then -1
  • If the domain part doesn’t have an IP address then 1

2. URL_Length

  • URL length < 54 then 1
  • URL length >= 54 and URL length <= 75 then 0
  • URL length > 75 then -1

3. Shortining_Service

  • If the URL uses a shortening service such as TinyURL then -1
  • Normal URL then 1

4. having_At_Symbol

  • URL having @ symbol then -1
  • URL doesn’t contain @ then 1

5. double_slash_redirect

  • The position of the last occurrence of “//” in the URL > 7 then -1
  • Otherwise 1

6. Prefix_Suffix

  • Domain part includes symbol ‘-’ then -1
  • Domain doesn’t contain symbol ‘-’ then 1

7. having_Sub_Domain

  • If there is one dot in the domain part then 1
  • If there are two dots in the domain part then 0
  • If there are more than two dots in the domain part then -1

8. SSLfinal_State

  • Uses HTTPS, the issuer is trusted, and the age of the certificate ≥ 1 year then 1
  • Uses HTTPS but the issuer is not trusted then 0
  • Otherwise -1

9. Domain_registration_length

  • Domain expires in ≤ 1 year then -1
  • Otherwise 1

10. Favicon

  • Favicon Loaded From External Domain then -1
  • Otherwise 1

11. port

  • Port # is of the preferred status then 1
  • Otherwise -1

12. HTTPS_token

  • If the “https” token is used in the domain part of the URL then -1
  • Otherwise 1

13. Request_URL

Request_URL examines whether the external objects contained within a webpage, such as images, videos and sounds, are loaded from another domain. In legitimate webpages, the webpage address and most of the objects embedded within the webpage share the same domain.

  • % of Request URL <22% then 1
  • otherwise -1

14. URL_of_Anchor

  • If % of URL of anchor < 31% then 1
  • If % of URL of anchor >= 31% and <= 67% then 0
  • If % of URL of anchor > 67% then -1

15. Links_in_tags

  • % of links in <Meta>, <Script> and <Link> tags < 17% then 1
  • If >= 17% and <= 81% then 0
  • Otherwise -1

16. SFH

  • Server Form Handler (SFH)
  • If the SFH is “about:blank” or empty then -1
  • If the SFH refers to a different domain then 0
  • Otherwise 1

17. Submitting_to_email

  • If website is using “mail()” or “mailto:” function to submit user information then -1
  • Otherwise 1

18. Abnormal_URL

  • If the host name is not included in the URL then -1
  • Otherwise 1

19. Redirect

  • If the number of redirects is at most 1 then 1
  • Otherwise -1

20. on_mouseover

  • If the status bar changes on mouseover then -1
  • Otherwise 1

21. RightClick

  • If rightclick is disabled then -1
  • Otherwise 1

22. popUpWindow

  • If the popup window contains text then -1
  • Otherwise 1

23. Iframe

  • If the website is using iframe then -1
  • Otherwise 1

24. age_of_domain

  • If the age of the domain is greater than 6 months then 1
  • Otherwise -1

25. DNSRecord

  • If there is no DNS record for the given domain then -1
  • Otherwise 1

26. web_traffic

  • If the website’s traffic rank < 100,000 then 1
  • If the website’s traffic rank > 100,000 then 0
  • If the website has no recorded traffic rank then -1

27. PageRank

  • If PageRank < 0.2 then -1
  • Otherwise 1

28. Google_Index

  • If webpage indexed by Google then 1
  • Otherwise -1

29. Links_pointing_to_page

  • If the number of links pointing to the webpage is 0 then -1
  • If the number of links pointing to the webpage is 1 or 2 then 0
  • Otherwise 1

30. Statistical Report

  • If Host belongs to top Phishing IPs then -1
  • otherwise 1

31. Result

  • If the website is non-phishing then 1
  • If the website is phishing then -1

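A few of the lexical features above can be computed directly from a raw URL. The sketch below follows the thresholds listed in the post and the UCI feature definitions (shorter URLs and hyphen-free domains are coded toward 1 = legitimate); the function names are my own, not from the original project.

```python
import re
from urllib.parse import urlparse

def having_ip_address(url: str) -> int:
    """-1 if the domain part is a raw IPv4 address, else 1."""
    host = urlparse(url).netloc
    return -1 if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}(:\d+)?", host) else 1

def url_length(url: str) -> int:
    """1 if the URL is shorter than 54 chars, 0 if 54-75, else -1."""
    n = len(url)
    if n < 54:
        return 1
    return 0 if n <= 75 else -1

def having_at_symbol(url: str) -> int:
    """-1 if the URL contains '@' (browsers ignore everything before it)."""
    return -1 if "@" in url else 1

def prefix_suffix(url: str) -> int:
    """-1 if the domain part contains a hyphen, else 1."""
    return -1 if "-" in urlparse(url).netloc else 1
```

These are only the purely lexical checks; features such as SSLfinal_State or web_traffic need certificate lookups and external services and are not sketched here.
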
Univariate Analysis

We plotted bar plots to check how the data is distributed.

As we can see, our target variable Result is almost balanced, with only a slight skew toward one class.
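
The class-balance check can be sketched as below. The toy Result series here stands in for the real column, which in the project would come from the cleaned CSV; 1 = legitimate, -1 = phishing.

```python
import pandas as pd

# Toy stand-in for the dataset's Result column.
result = pd.Series([1, -1, 1, 1, -1, 1, -1, -1, 1, 1], name="Result")

# Fraction of each class; feeding this to a bar plot gives the
# univariate view described above.
balance = result.value_counts(normalize=True)
print(balance)
```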

Bi-variate analysis

We plotted each feature with the Result column as the hue.

Observations

  • There is a strong relation between Prefix_Suffix and Result. If the website does not have ‘-’ in the domain name, there is a higher chance of it being a legitimate website.
  • If there is more than one subdomain, the chance of the website being a phishing website increases.
  • SSLfinal_State has a strong relation with Result and can be an important factor in deciding whether a website is phishing.
  • If the domain registration length is more than one year, the chance of that website being a phishing website is lower.
  • The URL_of_Anchor feature can be critical in determining whether a website is phishing or not.

We then performed the Chi-square test to find the relationship between each variable and the Result column.

We got these features as significant features.

having_IP_Address, URL_Length, Shortining_Service, having_At_Symbol, double_slash_redirecting, Prefix_Suffix, having_Sub_Domain, SSLfinal_State, Domain_registeration_length, port, HTTPS_token, Request_URL, URL_of_Anchor, Links_in_tags, SFH, Abnormal_URL, Redirect, on_mouseover, age_of_domain, DNSRecord, web_traffic, Page_Rank, Google_Index, Links_pointing_to_page, Statistical_report.
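
The Chi-square screening described above can be sketched with scipy. This is a minimal version under my own naming, shown on a toy feature/target pair; in the project it would be looped over every column against Result, keeping the features whose p-value falls below the chosen significance level (commonly 0.05).

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_pvalue(df: pd.DataFrame, feature: str, target: str = "Result") -> float:
    """p-value of a Chi-square independence test between a feature and the target."""
    # Build the contingency table of feature values vs. target values.
    table = pd.crosstab(df[feature], df[target])
    _chi2, p, _dof, _expected = chi2_contingency(table)
    return p
```

A feature is then flagged as significant when `chi2_pvalue(df, col) < 0.05`.
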

The next step is model building, which I will discuss in part 2 of this blog.

Click here for part 2.

Link to the GitHub repository:

https://github.com/MaheshKumarsg036/streamlit-website-phishing
