Phishing Websites EDA (Exploratory Data Analysis)

“Is this a trusted link?”

In today’s digital age, with the digital transformation we are witnessing not only locally in our country (Saudi Arabia) but across the whole world, cybercrime affects all of us directly or indirectly, because as individuals and companies we all have information that is worth something to cybercriminals.

The most common way for cybercriminals to attack an individual is through phishing websites, especially lately, since the use of the internet, e-commerce, and e-governance sites has increased due to COVID-19.

Dr. Khalid Bin Abdullah Alsabti, the governor of the Cybersecurity Authority, announced on 7 April 2021 at the opening of the global cybersecurity conference that phishing sites had increased by 300%! This is a huge increase, and we must stand against it.

This inspired my idea for the first project in the T5 Data Science Bootcamp, presented by SDAIA (Saudi Data and Artificial Intelligence Authority), where I can combine data science and cybersecurity.

The aim of this analytical project is to find the characteristics shared by links suspected of being phishing links, and to measure awareness of this common cybercrime method.

Good protection comes first by educating myself and the people around me.

Data collection

I used two datasets:

1. Phishing Websites Dataset

  • The biggest challenge I faced was the lack of local datasets about phishing websites, so I used a worldwide one to find the common features of phishing links.
  • Dataset published in 2019.
  • Dataset source: https://www.sciencedirect.com/science/article/pii/S2352340920313202
  • It has 112 columns (features) and 58,646 rows (records), which I will shrink later to fit the scope of my project.
  • Data type: (String, Integer)

2. Cybercrime Awareness Among Saudi Nationals

  • This is the only dataset I found that measures the level of cybercrime awareness in Saudi Arabia. I wish I had found a dataset about local phishing websites or emails, but maybe there will be one in the future! Although the sample has fewer than 1,500 records, it was useful at this stage.
  • Dataset published in June 2021.
  • Dataset Source: https://www.sciencedirect.com/science/article/pii/S2352340921002493
  • It has 64 columns (Features) and 1,231 rows (Records).
  • Data type: (String)

Tools

  • Programs: Python in a Jupyter Notebook.
  • Libraries: Pandas, NumPy, Matplotlib, Seaborn
  • Plots: Bar Charts, Pie Charts.

Data Preprocessing and Exploration

First, I import all the libraries that I will use.

  1. Phishing Websites Dataset Cleaning

The dataset has 112 features and 58,646 rows, divided into 6 tables. Each table covers a specific part of the link.

Note first that any link is a string, and this string is divided into 4 substrings, as shown in the figure below.

The dataset splits the symbol counts by the part of the link they occur in. Some data scientists want to check the domain and some want to check the directory, while my main goal was to check the whole URL, so I shrank the dataset to match that goal.

Drop

Here I keep only the features that include “URL” in the header, such as “URL length”, “Quantity of $ in URL”, “Quantity of & in URL”, and so on, along with the “Phishing” column, which is the target.
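As a rough sketch of this step, the selection could look like the following. The toy frame and its column names (`qty_dot_url`, `length_url`, `phishing`, and so on) are assumptions for illustration, not necessarily the exact headers in the dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset; column names are assumed.
df = pd.DataFrame({
    "qty_dot_url": [2, 3],
    "length_url": [25, 80],
    "qty_dot_domain": [1, 2],
    "qty_slash_directory": [0, 4],
    "phishing": [0, 1],
})

# Keep every column whose header mentions "url", plus the target column.
url_cols = [c for c in df.columns if "url" in c.lower()]
url_df = df[url_cols + ["phishing"]]

print(list(url_df.columns))
```

Matching on the header text keeps the selection short, but it would also pick up any other column that happens to contain "url", so the resulting list is worth checking by eye.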

Replace

All values should be non-negative (value ≥ 0), but some were -1, which doesn’t make sense for a count: a symbol either appears in the link or it doesn’t. So I replaced every -1 with 0.
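A minimal sketch of the replacement, again on a toy frame with assumed column names:

```python
import pandas as pd

# Toy frame with assumed column names; -1 marks a meaningless count.
url_df = pd.DataFrame({
    "qty_dollar_url": [0, -1, 2],
    "qty_and_url": [-1, 1, 0],
    "phishing": [1, 0, 1],
})

# Replace every -1 with 0 across the whole frame.
cleaned = url_df.replace(-1, 0)

print(cleaned)
```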

Rename

Rename the column headers so they are readable when displayed, replacing every underscore with a space and each “qty” with “Quantity”.
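One way to sketch the renaming is to pass a small function to `DataFrame.rename`. The helper name `readable` and the sample headers are hypothetical.

```python
import pandas as pd

# Toy frame with assumed raw headers.
url_df = pd.DataFrame({
    "qty_dot_url": [2, 3],
    "length_url": [25, 80],
    "phishing": [0, 1],
})

def readable(name: str) -> str:
    """Turn a raw header such as 'qty_dot_url' into 'Quantity dot url'."""
    return name.replace("qty", "Quantity").replace("_", " ")

# rename() applies the function to every column label.
renamed = url_df.rename(columns=readable)

print(list(renamed.columns))
```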

After these steps, the dataset shrank to 22 columns (features) and 30,647 rows.

Each column contains a symbol and how many times it’s repeated in a phishing link.

I summed the symbol counts in each column, then sorted the totals.

I built new_dct to hold the top 7 columns for a graph that we will see later in the visualization and deliverables sections.
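The summing and sorting could be sketched as below. The toy data, the column names, and the explicit filter on phishing rows are assumptions; only the name `new_dct` comes from the project itself.

```python
import pandas as pd

# Toy frame with assumed, already-renamed headers.
df = pd.DataFrame({
    "Quantity dot url": [2, 3, 4],
    "Quantity slash url": [1, 5, 2],
    "Quantity hyphen url": [0, 1, 0],
    "phishing": [1, 1, 0],
})

# Assumption: only rows labelled as phishing are counted.
phishing_rows = df[df["phishing"] == 1].drop(columns="phishing")

# Sum each symbol column, then sort the totals from largest to smallest.
totals = phishing_rows.sum().sort_values(ascending=False)

# Keep at most the 7 largest totals as a plain dict for plotting.
new_dct = totals.head(7).to_dict()

print(new_dct)
```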

2. Cybercrime Awareness Among Saudi Nationals Dataset Cleaning

The dataset is simply an Excel survey file about cybercrime awareness with 64 columns (questions) and 1,211 rows. Only two questions meet my project goal:

“How much do you check the source of a website before accessing it?” and “Have you been a victim of cybercrime?”

Reading the Dataset

As I mentioned, I use only two columns from the Excel file to meet the goal of my EDA project. I divided the users into categories based on their responses to the question “How much do you check the legitimacy of a website before accessing it?”, which has 5 possible responses (Always, Sometimes, Often, Seldom, Never).

The second question was “Have you been a victim of cybercrime?”, answered with either “Yes” or “No”.
So I counted the users for each pair of answers: “Always” and “Yes” = sum1, “Sometimes” and “Yes” = sum2, … “Always” and “No” = sumN, “Sometimes” and “No” = sumN2, and so on.

This will help me later when analyzing the data.
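The pairwise counts described above can be sketched with a single cross-tabulation instead of one variable per pair. This swaps in `pd.crosstab` for the separate sums; the column names `check_source` and `victim` and the sample answers are made up for illustration.

```python
import pandas as pd

# Toy survey answers; column names and values are assumed.
survey = pd.DataFrame({
    "check_source": ["Always", "Often", "Always", "Never", "Sometimes", "Always"],
    "victim": ["Yes", "No", "No", "Yes", "No", "Yes"],
})

# One row per response category, one column per Yes/No answer.
counts = pd.crosstab(survey["check_source"], survey["victim"])

print(counts)
```

Each cell plays the role of one of the sums, for example `counts.loc["Always", "Yes"]` corresponds to sum1.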

Data Analysis and Visualization

At first, I was curious how many users in my local sample answered “Always”, meaning they always check the legitimacy (source) of a link before accessing it.

38% of Saudis in this sample always check the source and almost 42% “Often” do, which is enough to indicate high awareness of the cybercrime that might occur through phishing links.
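Shares like these can be computed with `value_counts(normalize=True)`. The sketch below uses made-up answers, so its percentages are not the survey's.

```python
import pandas as pd

# Toy responses; the real survey column is much longer.
responses = pd.Series([
    "Always", "Always", "Always",
    "Often", "Often", "Often", "Often",
    "Sometimes", "Seldom", "Never",
])

# Fraction of each answer, expressed as a percentage.
shares = responses.value_counts(normalize=True).mul(100).round(1)

print(shares)
```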

Let’s check how many of them got phished.

Almost 22% of the users got phished at least once! Why?

Let’s check whether even those who make sure to check the source of the link are among the users who got phished.

Yes, they got phished, as we can see in the graph above, and the number of phished users who always check the source of the link is even close to the number who often check it.
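Raw counts depend on how many users fall into each group, so it can also help to compare the share of victims within each response group. A minimal sketch on made-up data, with assumed column names:

```python
import pandas as pd

# Toy survey answers; column names and values are assumed.
survey = pd.DataFrame({
    "check_source": ["Always", "Always", "Always", "Always",
                     "Often", "Often", "Often", "Often"],
    "victim": ["Yes", "No", "Yes", "No",
               "Yes", "No", "No", "No"],
})

# normalize="index" turns each row into fractions that sum to 1.
rates = pd.crosstab(survey["check_source"], survey["victim"], normalize="index")

print(rates["Yes"])
```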

What do you think is the reason behind that? And how can we prevent this from happening in the future?

Communication and Deliverables

I took all the symbols that are potentially stuffed into phishing links and plotted them in a bar chart to see which factor could be a red flag to the user. As we can see below, URL length is the most prominent factor.

URL length was the highest, with a total of over 2 million, so it should be a big red flag to any user who is about to click on a link.

Let’s see more…

Here I show the top 7 most repeated symbols after URL length. They are:

  1. Dots (.), appearing over 130,000 times
  2. Slashes (/)
  3. TLD (.com)
  4. Hyphens (-)
  5. Equals (=)
  6. Ands (&)
  7. Underscores (_)

The repetition of these symbols should warn the user to be careful before accessing the link.

Future Work

Due to the limited time we have in this Bootcamp to deliver the outcomes of the EDA project, which is less than 2 weeks, I stopped at this point, but in future work I’m planning to:

  1. Build a model that lets the user enter any link they are suspicious about and reports how likely, out of 100%, it is to be phishing.
  2. Build a large local dataset of phishing websites, similar to PhishTank.org.

Data Scientist, Interested in Machine learning | Data analysis | Artificial intelligence | Cybersecurity