Creating a Spam Email Classifier using Python
Tired of receiving unwanted emails? By training an ML model to classify emails as spam or ham, you can filter the large volume of email that arrives every day. In this post I build a classifier that labels each email as either spam or ham (emails intended to be read). The data comes from https://spamassassin.apache.org/old/publiccorpus/. Using the easy_ham & spam files, we will parse the subject & body of every email in both folders. Let’s jump into some of the code & get started with the data preprocessing & cleaning steps.
We will start by importing every library we may need for this project and writing a script to parse the email files using BeautifulSoup. More information on parsing emails can be found in the Python documentation, as well as other sources: https://docs.python.org/3/library/email.parser.html. Our goal is to import 2500 ham emails & 500 spam emails (the sizes of the easy_ham & spam folders) and save them into a dataframe. We then add a column of binary values to mark each email as spam or ham. The parsing function reads every file in the easy_ham & spam folders and adds the parsed text to a list. We call that function with both the ham & spam paths and attach a series of 1s or 0s to each result. Finally, we combine the ham & spam emails with their respective binary identifiers into one dataframe, change the type of both columns and replace all URLs with ‘URL’. This completes our data cleaning & processing step.
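As a rough illustration of the parsing step, here is a minimal sketch that uses only Python's standard `email` package. It is not the original script: the function name `parse_email`, the URL regular expression and the sample message are all assumptions made for this example, and HTML stripping with BeautifulSoup is left out, so HTML-only emails would come back with an empty body.

```python
import re
from email import message_from_string, policy

# Assumption: a simple pattern for http/https links; the original regex is not shown.
URL_PATTERN = re.compile(r"https?://\S+")


def parse_email(raw_text):
    """Return 'subject body' for one raw email, with every URL replaced by 'URL'."""
    msg = message_from_string(raw_text, policy=policy.default)
    subject = msg["subject"] or ""
    # Only the plain-text part is used in this sketch.
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""
    text = f"{subject} {body}".strip()
    return URL_PATTERN.sub("URL", text)


# Hypothetical sample message, not taken from the corpus.
sample = (
    "From: promo@example.com\n"
    "Subject: Win a prize\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Click http://example.com/win now!\n"
)

parsed = parse_email(sample)
print(parsed)
```

Running the same function over every file in the ham & spam folders would give two lists of strings, which can then be labelled and combined into a dataframe. Real corpus files mix several character encodings, so reading them will likely need a tolerant encoding setting.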
In the end our dataframe looks like this:
Our next step is to split the emails into tokens (individual words) & convert them to tf-idf values. We can do this using the TfidfVectorizer from scikit-learn, adding the argument stop_words='english' to remove stop words (filler words). This is a very common practice in text processing since most ML algorithms require scaled numerical data to build a model of the dataset. We then place these values into a new dataframe called “vect” and split the dataframe using train_test_split.
After the data is split into tf-idf tokens we can run various algorithms and compare performance on the training & test sets. For this case we are looking for a model that has both high precision & recall on the training & test sets. I tried multiple algorithms and determined that SGDClassifier had the best results. Multinomial Naive Bayes & logistic regression gave mediocre results: the precision was fantastic, but the recall underperformed that of the other models. I then tried KNN with an n_neighbors value of 5, but the performance metrics were underwhelming & the computation time was too long. From there I continued tuning my model using the SGDClassifier.
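A comparison along these lines might look like the sketch below. The synthetic corpus, the word lists and every hyperparameter other than n_neighbors=5 are assumptions; on such an easy toy dataset all four models score almost perfectly, so the numbers say nothing about the real results described above.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical vocabulary used to generate toy emails.
rng = random.Random(0)
SPAM_WORDS = ["free", "winner", "prize", "offer", "click", "cash"]
HAM_WORDS = ["meeting", "report", "project", "lunch", "schedule", "thanks"]


def make_email(words):
    """Build a fake 8-word email from the given vocabulary."""
    return " ".join(rng.choices(words, k=8))


emails = [make_email(SPAM_WORDS) for _ in range(30)]
emails += [make_email(HAM_WORDS) for _ in range(30)]
labels = [1] * 30 + [0] * 30  # assumption: 1 = spam, 0 = ham

X = TfidfVectorizer().fit_transform(emails)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0, stratify=labels
)

models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNN (n_neighbors=5)": KNeighborsClassifier(n_neighbors=5),
    "SGDClassifier": SGDClassifier(random_state=0),
}

# Fit each model and record (precision, recall) on the held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (precision_score(y_test, pred), recall_score(y_test, pred))
    print(f"{name}: precision={results[name][0]:.2f}, recall={results[name][1]:.2f}")
```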
To view the confusion matrix I used the function shown below. In addition, I printed the precision, recall & accuracy of the model for both the training & test sets.
Using the confusion matrix plot function I displayed the confusion matrix for the training & test set shown below.
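For readers who want to reproduce the metrics, here is a small sketch using scikit-learn's metric functions. The helper name `report` and the hand-made labels are assumptions for this example and are not the original plotting function.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)


def report(y_true, y_pred):
    """Collect the confusion matrix, precision, recall & accuracy for one data split."""
    return {
        # Rows are true labels and columns are predictions: [[TN, FP], [FN, TP]].
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
    }


# Hypothetical labels: 4 spam (1) and 6 ham (0) emails, with two mistakes.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

metrics = report(y_true, y_pred)
for name, value in metrics.items():
    print(name, value, sep=": ")
```

Calling `report` once with the training labels and once with the test labels gives the two sets of numbers to compare. If a plot is wanted, scikit-learn's `ConfusionMatrixDisplay` can draw the matrix as a heatmap.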
I then wanted to try out different training set sizes to determine which would give the best performance on the SGD model. Using the script above & the plot learning curve function I was able to determine which sizes had the best test set accuracy. As seen from the y-axis here, the training set size did not affect the test set accuracy that much.
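One way to run this kind of experiment is scikit-learn's `learning_curve`, sketched below on a synthetic corpus. The toy data, the 5-fold cross-validation and the chosen training sizes are assumptions; the original plotting function is not reproduced, and the scores are printed instead of plotted.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

# Hypothetical vocabulary used to generate toy emails.
rng = random.Random(0)
SPAM_WORDS = ["free", "winner", "prize", "offer", "click", "cash"]
HAM_WORDS = ["meeting", "report", "project", "lunch", "schedule", "thanks"]


def make_email(words):
    """Build a fake 8-word email from the given vocabulary."""
    return " ".join(rng.choices(words, k=8))


# Interleave spam & ham so every subset of the data contains both classes.
emails, labels = [], []
for _ in range(50):
    emails.append(make_email(SPAM_WORDS))
    labels.append(1)  # assumption: 1 = spam
    emails.append(make_email(HAM_WORDS))
    labels.append(0)  # assumption: 0 = ham

X = TfidfVectorizer().fit_transform(emails)

# Integer train_sizes are absolute sample counts (5-fold CV leaves 80 for training).
sizes, train_scores, test_scores = learning_curve(
    SGDClassifier(random_state=0),
    X,
    labels,
    train_sizes=[16, 32, 48, 64, 80],
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

# Average over the folds to get one training and one test score per size.
for n, train_acc, test_acc in zip(
    sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)
):
    print(f"train size {n}: train acc={train_acc:.3f}, test acc={test_acc:.3f}")
```

Plotting the two averaged score columns against `sizes` gives the usual learning curve figure.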
Finally we would like to see the tokens or words that have the largest probability of indicating a spam email. To do this, I partitioned the dataset into ham & spam, then determined the count of nonzero values for each token & divided it by the total number of emails. I then put the values into a dataframe and created a new column with the log token value to show the top 5 words that are most significant in predicting a spam email. The results are shown in the dataframe below. They are a little hard to interpret, but it’s important to see which values are primarily driving your results.
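The exact formula behind the “log token value” is not spelled out here, so the sketch below makes an assumption: it scores each token by the log of the ratio between the fraction of spam emails containing it and the fraction of ham emails containing it, with a small constant `eps` to avoid dividing by zero. The function names and toy emails are hypothetical, and plain Python is used in place of a dataframe to keep the example short.

```python
import math
from collections import Counter


def document_frequency(emails):
    """Fraction of emails in which each token appears at least once."""
    counts = Counter()
    for text in emails:
        counts.update(set(text.lower().split()))
    return {token: count / len(emails) for token, count in counts.items()}


def top_spam_tokens(spam, ham, k=5, eps=1e-6):
    """Rank tokens by log(P(token | spam) / P(token | ham)); eps is an assumed smoother."""
    p_spam = document_frequency(spam)
    p_ham = document_frequency(ham)
    scores = {
        token: math.log((p + eps) / (p_ham.get(token, 0.0) + eps))
        for token, p in p_spam.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy emails.
spam = ["free prize click", "free cash now", "click now"]
ham = ["meeting now", "project report", "lunch meeting"]

top = top_spam_tokens(spam, ham, k=5)
for token, score in top:
    print(f"{token}: {score:.2f}")
```

Tokens that never appear in ham get very large scores under this definition, which is one reason the raw values can be hard to read.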
Link to the full code is attached here: https://github.com/colaso96/Spam_or_ham