Phishing Email Classification

Sharanbasav Sumbad
11 min read · May 12, 2022


Phishing Email Detection using ML and NLP techniques.

Fishing Phishing Emails

Machine-learning-based phishing detection mainly uses labeled phishing and legitimate emails to train a classification algorithm, producing a classifier model that can sort incoming email.

PROBLEM STATEMENT:

Scam emails have become a common occurrence in today’s culture. The goal of this project is to create a system that uses machine learning and natural language processing techniques to determine whether or not an email is trustworthy.

For this task, we built a machine learning classifier that can calculate the phishing probability of an email. The model input consists of features and attributes of a specific email, and the desired output is “phishing” or “not phishing”.

Basic ML Approach to the Problem Statement

DATA COLLECTION:

The data for this project was obtained from the authors of the paper cited below:

“Verma, R. M., Zeng, V., & Faridi, H. (2019). Data Quality for Security Challenges: Case Studies of Phishing, Malware and Intrusion Detection Datasets. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (pp. 2605–2607). London, United Kingdom. doi:10.1145/3319535.3363267”

This proprietary data can be obtained by contacting the authors. The authors manually collected every phishing and legit email and stored them as text files (.txt extension), both with and without the email headers. The data can be verified either by contacting the authors or through the citation given above.

Data Description:

The initial data we obtained came as text files. There were about 4,000 text files labeled as legit and around 500 text files labeled as phishing. Each individual text file had email headers, which we use in our analysis.

In an email, the body (content text) is always preceded by header lines that identify the routing information of the message, including the sender, recipient, date, and subject. Some headers are mandatory, such as the FROM, TO, and DATE headers. Others are optional but very commonly used, such as SUBJECT and CC. Other headers include the sending and receiving time stamps of all mail transfer agents that have received and sent the message. In other words, any time a message is transferred from one user to another (i.e., when it is sent or forwarded), the message is date/time stamped by a mail transfer agent (MTA), a computer program or software agent that facilitates the transfer of email messages from one computer to another. This date/time stamp, like FROM, TO, and SUBJECT, becomes one of the many headers that precede the body of an email.

Here is a sample of an Email text file:

We can see the email contains various headers. For a description of these headers, refer to: https://mediatemple.net/community/products/dv/204643950/understanding-an-email-header
Initial text file sample of an email

Data Preparation:

For the initial analysis, we ran a Python script that reads through all the text files, loads their contents into a data frame, and labels each individual text file as 0 for legit and 1 for phish. We then processed the data frame and extracted the headers of each text file using Python's email parser library, keeping only the headers we initially planned to use.
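Here is a minimal sketch of that ingestion step. The directory layout and file patterns are hypothetical placeholders, not the actual paths from the project; the parsing uses Python's standard-library email module:

```python
import email
import glob

import pandas as pd

rows = []
# Hypothetical layout: one folder of legit emails, one of phishing emails.
for label, pattern in [(0, "data/legit/*.txt"), (1, "data/phish/*.txt")]:
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8", errors="ignore") as f:
            msg = email.message_from_file(f)  # parses headers and body
        body = msg.get_payload()
        # For multipart messages, concatenate the parts (simplified).
        if msg.is_multipart():
            body = " ".join(str(p.get_payload()) for p in msg.get_payload())
        rows.append({
            "file_name": path,
            "From": msg["From"],
            "Subject": msg["Subject"],
            "Date": msg["Date"],
            "To": msg["To"],
            "body": body,
            "Phish": label,  # 0 = legit, 1 = phish
        })

df = pd.DataFrame(rows)
```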

Initial Data Frame:

All the required data from text files put into a data frame for further processing

Of all the headers present in an email, we chose to work with only FROM, DATE, TO, SUBJECT, and BODY, as these are the main headers and can be found across all the emails.

Data Columns:

file_name: The name of the text file from which the data was extracted.

From: This displays who the message is from; however, this can be easily forged, making it the least reliable header.

Subject: This is what the sender placed as a topic of the email content.

Date: This shows the date and time the email message was composed.

To: This shows to whom the message was addressed but may not contain the recipient’s address.

body: This is the actual content of the email itself, written by the sender.

Phish: This indicates whether the email was phishing or legit.

Exploratory Data Analysis:

Once the data was prepared, the next step was cleaning. We did not have any missing columns or records, since the ingestion was done manually. Because we are dealing with text data, we used NLP techniques for cleaning and text analysis.

Data cleaning steps:

1. Removal of stop words

2. Removal of punctuation marks

3. Removal of special characters

4. Tokenization

5. Capitalization of words

The above steps were applied to the subject and body columns of the data frame for the initial analysis, as sketched below.
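A minimal sketch of this cleaning, assuming NLTK and the data frame from the ingestion sketch above:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

STOP_WORDS = set(stopwords.words("english"))

def clean_text(text):
    """Tokenize and drop stop words, punctuation, and special characters."""
    tokens = word_tokenize(str(text))
    tokens = [t for t in tokens if t.isalpha()]  # drops punctuation/special chars
    return [t for t in tokens if t.lower() not in STOP_WORDS]

df["subject_tokens"] = df["Subject"].apply(clean_text)
df["body_tokens"] = df["body"].apply(clean_text)
```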

Words that appear in the Phishing emails
Phishing Mails Subject Word Cloud
Bar Chart of Top 20 words that appear in the subject of the Phishing Emails

The above bar chart shows the top 20 most frequent words appearing in the subjects of the non-legit emails. From our analysis of the corpus, the word ‘Account’ appears most frequently (148 times) in the phishing subjects. That makes a lot of sense, as most phishing emails target users' bank account details or other financial account details. The next most frequent word is ‘PayPal’ (63 times); PayPal is a payment platform that deals with banks and customers on transaction-related queries.

Legit Mails Subject Word Cloud
Bar Chart of Top 20 words that appear in the subject of the Legit Emails

We also looked at the words that appear in the subjects of the legit emails. Words such as ‘Video’, ‘New’, ‘Trump’, and ‘Call’ appear most frequently, which seem random for any corpus. As of now, we cannot derive a relation between these words as we did for the phishing emails, but through further analysis we may be able to pinpoint their characteristics. We can surely say the word ‘video’ appeared in the legit documents the most; this may be attributed to the data being imbalanced, but by that logic the word ‘video’ should have appeared far more often still.

Bar Chart of Top 20 words that appear in all the emails

Having analyzed the subjects, we also analyzed the whole corpus, including both phishing and non-phishing emails. We see that the words ‘Trump’, ‘Donald’, ‘State’, ‘Republican’, and ‘Democratic’ appear in the top 20 of the corpus and can be tagged as a political category, and ‘Account’ again appears among the top twenty words in the corpus, as it did in the phishing corpus.
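A hedged sketch of how these frequency counts could be produced, continuing from the cleaned token columns created above:

```python
from collections import Counter

# Token frequencies for phishing vs. legit subjects.
phish_subject = Counter(
    t for tokens in df.loc[df["Phish"] == 1, "subject_tokens"] for t in tokens
)
legit_subject = Counter(
    t for tokens in df.loc[df["Phish"] == 0, "subject_tokens"] for t in tokens
)
# Frequencies over the whole corpus (subjects and bodies together).
corpus = Counter(
    t for col in ("subject_tokens", "body_tokens") for tokens in df[col] for t in tokens
)

print(phish_subject.most_common(20))
print(legit_subject.most_common(20))
print(corpus.most_common(20))
```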

Class Distribution of the Emails

This is the data distribution of the corpus across legit and phish. In the data frame, ‘0’ labels legit and ‘1’ labels phish. As initially pointed out, we have around 4,000 legit emails and 500 phish emails to train our classifier.

FEATURE ENGINEERING:

From the analysis, we can think of simple approaches to add features that may improve performance. The following features were extracted:

  • Frequency of top 5 words in Phishing emails
  • Frequency of top 5 words in legit emails
  • Frequency of uppercase letters
  • Frequency of punctuations
  • Frequency of stop words
  • Date-time split into hour and minute
Data frame after Feature Engineering

The above figure shows all the features that we were able to extract from our text preprocessing.

body_stop_frqq: Frequency of stop words in the body of a particular email

sub_stop_frqq: Frequency of stop words in the subject of a particular email

datehour: Hour of the day at which a particular email was sent

sub_uppercase_cnt: Frequency of Capital letters in the Subject of a particular email

body_uppercase_cnt: Frequency of Capital letters in the body of a particular email

sub_punc_cnt: Frequency of punctuation in the subject of a particular email

body_punc_cnt: Frequency of punctuation in the body of a particular email

body_top5_legit_cnt: Count of the top 5 legit-email words appearing in a particular email

bosy_top5_phish_cnt: Count of the top 5 phishing-email words appearing in a particular email

dateminute: Minute of the hour at which a particular email was sent

As we analyzed, we found that most phishing emails have an unfamiliar tone of greeting, improperly used words, grammar and spelling errors, inconsistencies such as uppercase characters in the email subject, threats or a sense of urgency, and unusual requests. These observations are supported by the analysis of the top 20 words: as we saw above, phishing emails carry words related to banking terms, like Payment, Account, and so on. Based on this intuition, we chose to extract the count of the top 5 words in phishing emails and in legit emails and use them as features, as sketched below.
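A hedged sketch of how these count features could be computed. The top-5 word lists below are illustrative placeholders, not the actual lists from the data, and the column names follow the data frame shown above:

```python
import string

import pandas as pd
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))
# Illustrative top-5 lists; in practice, take these from the EDA frequency counts.
TOP5_PHISH = ["account", "paypal", "bank", "update", "security"]
TOP5_LEGIT = ["video", "new", "trump", "call", "time"]

def uppercase_cnt(text):
    return sum(ch.isupper() for ch in str(text))

def punc_cnt(text):
    return sum(ch in string.punctuation for ch in str(text))

def stop_frq(text):
    return sum(w.lower() in STOP_WORDS for w in str(text).split())

def top5_cnt(text, top5):
    words = str(text).lower().split()
    return sum(words.count(w) for w in top5)

df["sub_uppercase_cnt"] = df["Subject"].apply(uppercase_cnt)
df["body_uppercase_cnt"] = df["body"].apply(uppercase_cnt)
df["sub_punc_cnt"] = df["Subject"].apply(punc_cnt)
df["body_punc_cnt"] = df["body"].apply(punc_cnt)
df["sub_stop_frqq"] = df["Subject"].apply(stop_frq)
df["body_stop_frqq"] = df["body"].apply(stop_frq)
df["body_top5_legit_cnt"] = df["body"].apply(top5_cnt, top5=TOP5_LEGIT)
df["bosy_top5_phish_cnt"] = df["body"].apply(top5_cnt, top5=TOP5_PHISH)

# Hour and minute features from the Date header.
dates = pd.to_datetime(df["Date"], errors="coerce", utc=True)
df["datehour"], df["dateminute"] = dates.dt.hour, dates.dt.minute
```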

MODELING AND RESULTS:

From the analysis, we went ahead with the count vectorization method and added features such as the count of punctuation, count of capital letters, and so on. Since count vectorization would result in high-dimensional data (more than 14,000 words in the corpus), we used word frequency to keep only the top 1,000 words and used their frequencies as features.

For data sub-setting while training our model and testing it, we use scikit-learn's train_test_split. Also, as mentioned before, when using count vectorization we don't use every word in the text data: we remove stop words, punctuation, all numeric characters, empty strings, and other non-English words, and use only the top 1,000 words and their frequencies for classification. This is done to avoid the curse of dimensionality in our ML algorithm.
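A minimal sketch of this step with scikit-learn. The 80/20 split, stratification, and random seed are assumptions, and the extra filtering of numeric and non-English tokens is omitted here for brevity:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Keep only the 1,000 most frequent words, dropping English stop words.
vectorizer = CountVectorizer(stop_words="english", max_features=1000)
X_text = vectorizer.fit_transform(df["body"].astype(str))

# Stack the hand-crafted numeric features next to the word counts.
extra = df[["sub_uppercase_cnt", "body_punc_cnt", "body_stop_frqq"]].fillna(0).values
X = hstack([X_text, extra]).tocsr()
y = df["Phish"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```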

In this section, we talk about the ML algorithm we will use for classification. As a first-instinct approach, we went with the Multinomial Naïve Bayes algorithm. There are thousands of software tools for the analysis of numerical data, but very few for text. Multinomial Naive Bayes is one of the most popular supervised learning classifiers used for the analysis of categorical text data.

The Multinomial Naive Bayes algorithm is a probabilistic learning method mostly used in Natural Language Processing (NLP). The algorithm is based on the Bayes theorem and predicts the tag of a text, such as an email or a newspaper article. It calculates the probability of each tag for a given sample and outputs the tag with the highest probability. A naive Bayes classifier is a collection of algorithms that all share one common principle: each feature being classified is independent of every other feature. The presence or absence of one feature does not affect the presence or absence of another.

Naive Bayes is a powerful algorithm used for text data analysis and for problems with multiple classes. To understand how Naive Bayes works, it is important to understand the Bayes theorem first, as Naive Bayes is built on it. Bayes' theorem, formulated by Thomas Bayes, calculates the probability of an event occurring based on prior knowledge of conditions related to that event. It is based on the following formula:

P(A|B) = P(A) · P(B|A) / P(B), where we are calculating the probability of class A when predictor B is already provided.

Multinomial Naive Bayes can be seen as a probabilistic approach to classifying documents based on the frequency of each word in a document. The term “bag of words” is widely used in this context: the document is depicted as a bag and each word in the text as an item in the bag, permitting multiple occurrences. To classify properly, the vocabulary of the corpus must be known beforehand. This classifier performs well on discrete features such as word counts in a document. An example use of this approach is predicting the category of a document from the occurrences of the words in it. The document itself is represented as a vector composed of integer word-frequency values.
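A minimal sketch of fitting this classifier on the vectorized data from the split above:

```python
from sklearn.naive_bayes import MultinomialNB

# MultinomialNB expects non-negative counts, which is exactly what
# CountVectorizer plus our hand-crafted count features produce.
nb = MultinomialNB()
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)                   # hard labels: 0 = legit, 1 = phish
phish_prob = nb.predict_proba(X_test)[:, 1]   # phishing probability per email
```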

Before diving into our results, let's see how we can evaluate our model.

Metrics and Evaluation

One of the most common ways to evaluate the performance of a text classifier is cross-validation. In this method, the training dataset is randomly divided into equal-sized folds. Then, for each fold, the classifier is trained on the remaining folds and tested on the held-out fold. The predictions are compared with the human-tagged data, which exposes false positives and false negatives.

These results lead to valuable metrics that demonstrate how effective a classifier is:

· Accuracy: percentage of texts that were predicted with the correct tag.

· Precision: percentage of texts the classifier got right out of the total number of examples it predicted for a specific tag.

· Recall: the percentage of examples the classifier predicted for a specific tag out of the total number of examples it should have predicted for that tag.

· F1 Score: a harmonic mean between precision and recall.

Since we are dealing with an imbalanced dataset, we will focus more on the precision and recall of the model.
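A hedged sketch of computing these metrics for the fitted model above; the fold count and scoring choice are assumptions:

```python
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Per-class precision, recall, and F1 on the held-out test set.
print(classification_report(y_test, y_pred, target_names=["legit", "phish"]))

# 5-fold cross-validation scored on F1, which is more informative than
# accuracy on an imbalanced dataset.
scores = cross_val_score(MultinomialNB(), X, y, cv=5, scoring="f1")
print(scores.mean(), scores.std())
```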

Naive Bayes Results

The above snippet shows the results of the base Multinomial Naïve Bayes model. The results at this stage are good enough to continue with this approach, but let's also see the results of a few more basic models, sketched below.
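A hedged sketch of how these baseline models could be fit and compared on the same split; the hyperparameters shown are illustrative, not the project's actual settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

for name, model in [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42)),
]:
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```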

Logistic Regression model:

Logistic regression model Results

Random Forest model:

Random Forest model Results
Feature importance from Random Forest
Bar chart of Feature Importance

From the above bar graph, we can see that the features we extracted are major contributors to the classification of the email.
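A sketch of how the importances could be read out of the fitted forest and mapped back to feature names, assuming the column layout from the vectorization sketch above (word counts first, hand-crafted features after):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# The 1,000 vectorizer words come first, then the stacked numeric columns.
feature_names = list(vectorizer.get_feature_names_out()) + [
    "sub_uppercase_cnt", "body_punc_cnt", "body_stop_frqq",
]
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(20))
```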


CONCLUSION AND FINAL WORDS

Going through the complete process, you might have realized by now that this project was more data-centric than model-centric, more focused on approaches that process the data to get better results. Following Occam's razor, we applied the simple, trivial checks that anyone makes when they suspect a fishy email.

Coming to the pain points of this project: parameter tuning would be a natural next step; domain knowledge, in both the literal and metaphorical sense, would have been an added advantage (as you might have noticed, user information such as sender and receiver details is missing to protect privacy); and most importantly, data in this area is limited, and the class imbalance does our cause no good. This is also one of the reasons why less work has been done on phishing email classification compared to spam email classification.

Overall, the current models yield fairly good results but can definitely be improved with other approaches such as neural nets, sentiment analysis, and so on.
