Spam or Ham? Solving a problem on Kaggle

I’ve been learning some coding and data science skills in the past year and I thought I should document and share my progress to…

  1. Keep myself more accountable
  2. Potentially help anyone else that is trying to learn the same things I am!

So here we are. I have a GitHub up where I will post anything that goes on here, as well as a few projects that don’t. I work for a startup hedge fund and my job requires me to code a lot, but unfortunately I can’t share that part of my learning experience with you all.

I began with this Kaggle page and, if you don’t have access to Kaggle, you can find the dataset publicly here. Thanks to the UCI for the great work they do. I’m using Python 2.7, written and executed in Atom on a Surface Book.

First of all, three libraries are required: pandas, numpy, and sklearn (which pip knows as scikit-learn). If this is your first exposure to Python, these can be installed with the pip install command from the command prompt or within your IDE, like the following:

pip install pandas
pip install numpy
pip install scikit-learn

Make sure you install all three! Go ahead and import pandas and numpy like the following so we can use them.

import pandas as pd
import numpy as np

Our first step in this project, and in most, is to load our data. Download the dataset from either Kaggle or UCI above and make note of where it’s saved and what it’s called. Then, use pandas and the read_csv command to load the data. While we’re at it, let’s use .head() to look at the first few rows and make sure everything looks okay.

data = pd.read_csv("C:\\your\\drive\\here\\spam.csv", encoding='latin-1')
print(data.head())

Your output should show the first five rows under the column names v1, v2, Unnamed: 2, Unnamed: 3, and Unnamed: 4.

Not so pretty, but it works. Let’s tidy that up a bit. Delete the unused columns and rename the remaining ones to names that make more sense.

data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
data = data.rename(columns={"v1":"label","v2":"text"})

Great! Let’s perform a little exploratory analysis.


This tells us a few things. First, our tidying up worked and only two columns of data remain. Second, there are 5572 observations, all of which are ‘objects.’ Third, it gives us a breakdown of how much is ham and how much is spam: 4825 ham and 747 spam, or 86.6% ham and 13.4% spam.
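As a quick sanity check on that breakdown, here is the arithmetic sketched in plain Python. The two counts come straight from the summary; the variable names are only for illustration.

```python
# Class counts reported for the dataset
ham_count = 4825
spam_count = 747

total = ham_count + spam_count
spam_share = 100.0 * spam_count / total
ham_share = 100.0 * ham_count / total

print(total)                 # 5572
print(round(spam_share, 1))  # 13.4
print(round(ham_share, 1))   # 86.6
```

With classes this lopsided, a model that always answers ‘ham’ would already be right about 86.6% of the time, so that is the number our models need to beat.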

Our next step is to create a dummy variable that separates the spam from the ham. Effectively, we’re looking at the label column and saying “Is this observation spam? If so then let’s label it 1. If not label it 0.” I’m also displaying the first few observations to make sure everything is alright.

data['label_dummy'] = data['label'].map({'ham':0, 'spam':1})
print(data.head())

Things get a bit more complex here! To make sure our model isn’t simply overfitting, we split our dataset into training and test sets. We will ‘train’ a model using the training set and then ‘validate’ it using the test set, which the model has never seen. We’re using sklearn for this, so go ahead and add the following code to the top.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

For now we’re only using train_test_split but we will be using the rest of these later!

Here’s how the actual splitting into training and testing datasets happens:

x_train, x_test, y_train, y_test = train_test_split(data["text"], data["label_dummy"], test_size = 0.2, random_state = 42)

Not only are we splitting our data into training and testing datasets, we’re also dividing it up into X and Y datasets: X holds the texts and Y holds the labels. This will make it simpler to feed into our model! Also, to ensure we get the same ‘split’ of data each time we come back to this code, we use a particular random state, 42. Don’t dwell on this number very much; it just seeds the random shuffle that decides how the data is split up.

We also print out the shapes of our new datasets to confirm that the X and Y pieces line up! In this case the train datasets have the shape (4457,) and the test datasets have the shape (1115,).
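Here is a small sketch of that check. It assumes stand-in lists with the same number of rows as our dataset (the fake_texts and fake_labels names are made up for this example), so it can run without the CSV. With the real pandas objects you would print x_train.shape and friends instead of len().

```python
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as our dataset
fake_texts = ["message %d" % i for i in range(5572)]
fake_labels = [i % 2 for i in range(5572)]

x_tr, x_te, y_tr, y_te = train_test_split(
    fake_texts, fake_labels, test_size=0.2, random_state=42)
print(len(x_tr), len(y_tr))  # 4457 4457
print(len(x_te), len(y_te))  # 1115 1115

# Same random_state, same split
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    fake_texts, fake_labels, test_size=0.2, random_state=42)
print(x_te == x_te2)  # True
```

The second call shows what the random state buys us: the same seed gives the same rows in the test set every time.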

To perform further analysis we need to transform our data into a format that can be processed by our models. We’re going to create a list of all the words used in all of our training texts first and then print some of the entries.

vect = CountVectorizer()
vect.fit(x_train)

Your output should be:

['00', '000', '000pes', '008704050406', '0089', '0121', '01223585236', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07046744435', '07090201529', '07090298926', '07099833605', '07123456789', '0721072', '07732584351', '07734396839', '07742676969', '07753741225', '0776xxxxxxx', '07781482378', '07786200117', '077xxx', '07801543489', '07808', '07808247860', '07808726822', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '0800', '08000407165', '08000776320']
['aphex’s', 'apnt', 'apo', 'apologetic', 'apologise', 'apologize', 'apology', 'app', 'apparently', 'appeal', 'appear', 'applausestore', 'applebees', 'apply', 'applyed', 'appointment', 'appointments', 'appreciate', 'appreciated', 'approaches', 'approaching', 'approve', 'approx', 'apps', 'appt', 'appy', 'april', 'aproach', 'apt', 'aptitude', 'aquarius', 'ar', 'arab', 'arabian', 'arcade', 'ard', 'are', 'area', 'aren', 'arent', 'areyouunique', 'argh', 'argue', 'argument', 'arguments', 'aries', 'arises', 'arithmetic', 'arm', 'armand']

Our next step is to transform our data using this! My understanding of what we’re doing here is the following: we use the vocabulary we built above to look at each text in our data and count how many times each word appears. These counts are stored in one very, very wide row per text, which also records a zero for every vocabulary word that didn’t appear in that text, so if we printed it we would see mostly zeros.
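To make the counting idea concrete, here is a hand-rolled sketch in plain Python. It is not how CountVectorizer is implemented; the two sample texts are made up, and the tokenizing rule (lowercase, words of two or more characters) only loosely mirrors the library’s defaults.

```python
import re
from collections import Counter

texts = ["Free prize call now", "call me now now"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: the vocabulary is every distinct word, in sorted order
vocabulary = sorted(set(word for text in texts for word in tokenize(text)))

# Step 2: one row of counts per text, one column per vocabulary word
def count_vector(text):
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

rows = [count_vector(text) for text in texts]
print(vocabulary)  # ['call', 'free', 'me', 'now', 'prize']
for row in rows:
    print(row)
# [1, 1, 0, 1, 1]
# [1, 0, 1, 2, 0]
```

Even with five words there are zeros in every row. With thousands of words in the real vocabulary, almost every entry is zero.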


Luckily it’s quite easy for us to do this! Run the following command:

x_train_vect = vect.transform(x_train)
x_test_vect = vect.transform(x_test)
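A toy version of the same fit-then-transform pattern, as a sketch on made-up texts (toy_vect, train_texts and test_texts are illustrative names, not part of our project):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["free prize now", "see you at lunch"]
test_texts = ["free lunch tomorrow"]

toy_vect = CountVectorizer()
toy_vect.fit(train_texts)  # learn the vocabulary from the training texts only
train_counts = toy_vect.transform(train_texts)
test_counts = toy_vect.transform(test_texts)

print(sorted(toy_vect.vocabulary_))  # ['at', 'free', 'lunch', 'now', 'prize', 'see', 'you']
print(train_counts.shape)            # (2, 7)
print(test_counts.toarray())         # [[0 1 1 0 0 0 0]]
```

Notice that ‘tomorrow’ never appeared in the training texts, so it is simply ignored when the test text is transformed. That is why both the training and test matrices end up with the same number of columns.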

Now we can actually train our models! I’m going to try two approaches, using logistic regression and using K nearest neighbors. Let’s start with logistic regression because that’s a bit simpler.

Logistic regression is great if you’re working with binary variables, things that have two possible ‘states’ or ‘settings.’ In our case these are spam or ham! Logistic regression converts a lot of input variables into the probability of a ‘state,’ which is useful for us because we have a lot of data and just want to know whether there is a high probability that a text is spam. I won’t go into the maths here, but if you’re interested in statistics you should check out the Wikipedia page.
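For the curious, here is a minimal sketch of the core idea: a weighted sum of word counts is squashed into a probability between 0 and 1. The weights and intercept below are invented for illustration; in practice the model learns them from the training data.

```python
import math

def sigmoid(z):
    # Squash any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights: positive pushes towards spam, negative towards ham
weights = {"free": 2.0, "prize": 1.5, "lunch": -1.0}
intercept = -2.0

def spam_probability(word_counts):
    score = intercept + sum(weights.get(w, 0.0) * c for w, c in word_counts.items())
    return sigmoid(score)

p_spam = spam_probability({"free": 1, "prize": 1})  # score = 1.5
p_ham = spam_probability({"lunch": 1})              # score = -3.0
print(round(p_spam, 3))  # 0.818
print(round(p_ham, 3))   # 0.047
```

Anything above 0.5 would be called spam and anything below would be called ham.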

Using logistic regression is simple with sklearn. All we need to do is create the model and fit it with our training data like so:

model = LogisticRegression()
model.fit(x_train_vect, y_train)

We’re ready to make some predictions now! Since we’re going to be making a few predictions we’ll store them in a dictionary. The actual prediction happens on the second line. Notice how we’re using our test dataset.

prediction = dict()
prediction['Logistic'] = model.predict(x_test_vect)

Ok, but how good are our predictions? Let’s look.

print(accuracy_score(y_test, prediction['Logistic']))

Which outputs…


97.8%! That’s an excellent rate. Out of 1000 texts we’d only get 22 wrong.

Let’s see if we can do better. We’re going to use the K Nearest Neighbors method. Again, I won’t go into the maths here, but if you’re interested check out this YouTube video.

Normally a K-NN model looks like this:

model2 = KNeighborsClassifier(n_neighbors = 5)

However, we’re not sure if 5 is the right number! So let’s throw a lot of numbers at it and pick the best one. The below will generate an array of all integers from 1 to 19 for us to use (np.arange stops just before its second argument):

k_nums = np.arange(1,20)
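A quick check of what np.arange actually returns, since the end value is easy to get wrong (k_demo and k_to_20 are just names for this example):

```python
import numpy as np

k_demo = np.arange(1, 20)
print(k_demo[0], k_demo[-1], len(k_demo))  # 1 19 19

# To include 20 itself, the stop value has to be 21
k_to_20 = np.arange(1, 21)
print(k_to_20[-1], len(k_to_20))           # 20 20
```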

Great, now we need to manipulate it into a format that our model can use.

param_grid = dict(n_neighbors = k_nums)

We can initialize and run our model now! Since we are effectively fitting 19 models here, this will take more time to process than the logistic regression model did.

model2 = KNeighborsClassifier()
gridsearch = GridSearchCV(model2, param_grid)
gridsearch.fit(x_train_vect, y_train)

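Great! Let’s output the results from each of our 19 models and see which did the best. (On scikit-learn 0.20 and later, grid_scores_ has been replaced by cv_results_.)

print(gridsearch.grid_scores_)
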
Which outputs…

[mean: 0.94301, std: 0.00428, params: {'n_neighbors': 1}, 
mean: 0.92013, std: 0.00367, params: {'n_neighbors': 2},
mean: 0.92259, std: 0.00417, params: {'n_neighbors': 3},
mean: 0.90532, std: 0.00179, params: {'n_neighbors': 4},
mean: 0.90689, std: 0.00209, params: {'n_neighbors': 5},
mean: 0.89500, std: 0.00058, params: {'n_neighbors': 6},
mean: 0.89567, std: 0.00113, params: {'n_neighbors': 7},
mean: 0.88490, std: 0.00197, params: {'n_neighbors': 8},
mean: 0.88512, std: 0.00222, params: {'n_neighbors': 9},
mean: 0.87996, std: 0.00083, params: {'n_neighbors': 10},
mean: 0.88041, std: 0.00113, params: {'n_neighbors': 11},
mean: 0.87458, std: 0.00138, params: {'n_neighbors': 12},
mean: 0.87458, std: 0.00138, params: {'n_neighbors': 13},
mean: 0.87121, std: 0.00030, params: {'n_neighbors': 14},
mean: 0.87121, std: 0.00030, params: {'n_neighbors': 15},
mean: 0.86897, std: 0.00034, params: {'n_neighbors': 16},
mean: 0.86942, std: 0.00059, params: {'n_neighbors': 17},
mean: 0.86852, std: 0.00088, params: {'n_neighbors': 18},
mean: 0.86852, std: 0.00088, params: {'n_neighbors': 19}]

For our purposes, all we need to pay attention to is the mean, which is the average cross-validated accuracy. Clearly K = 1 is the best model we have for this data, with an accuracy of 94.3%! Not as good as logistic regression, but still excellent.
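On recent scikit-learn releases the per-candidate scores live in the cv_results_ dictionary. Here is a sketch of reading them, assuming a small synthetic two-cluster dataset (X, y and search are made up for this example) rather than our text data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated blobs of 30 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]}, cv=3)
search.fit(X, y)

# One entry per candidate value of n_neighbors
results = search.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print("mean: %.5f, std: %.5f, params: %s" % (mean, std, params))

print(search.best_params_)
```

The best_params_ attribute saves us from reading through the list by eye to find the winner.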

Let’s refit our model with K = 1 to make the rest of our data analysis a bit easier.

model3 = KNeighborsClassifier(n_neighbors = 1)
model3.fit(x_train_vect, y_train)

Now we’ll predict some values from our test dataset so we can compare the two models in a bit more depth.

prediction['KNN'] = model3.predict(x_test_vect)

We already know the accuracy of our models, but if we didn’t, here is where we would generate it! Instead we will make “confusion matrices” for both of our models to see how they compare. If this is your first time interacting with confusion matrices, here’s a rundown.

Confusion matrices are a way of visualizing how our models perform at classifying objects. They let us see how often we get things right and whether we get a particular type of object wrong more often. Perhaps our model is great at classifying regular texts but lets a lot of spam through; a confusion matrix would show that. Each row is the true class and each column is the predicted class. In our case positive is spam and negative is ham. Let’s generate one and print it out.

conf_mat_logist = confusion_matrix(y_test, prediction['Logistic'])
conf_mat_knn = confusion_matrix(y_test, prediction['KNN'])
print(conf_mat_logist)

Which gives us….

[[964   1]
[ 23 127]]

Great results! We classified (964 + 127) = 1091 out of 1115 texts correctly. Moreover, we only flagged 1 real message (~0.1%!) as spam. How did our other model do?

print(conf_mat_knn)

Which gives us…

[[965   0]
[ 48 102]]

Not as good at correctly identifying spam, but it never falsely identifies a real message as spam, so there’s that. Overall still good results.
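To put numbers on that comparison, here is a small sketch that turns each matrix into a few rates. It assumes the usual sklearn layout (rows are the true class, columns the predicted class, ham first), and summarize is a helper name I made up.

```python
def summarize(matrix):
    # Layout: [[true ham -> ham, true ham -> spam],
    #          [true spam -> ham, true spam -> spam]]
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / float(total),
        "spam_caught": tp / float(tp + fn),
        "ham_flagged": fp / float(tn + fp),
    }

logistic = summarize([[964, 1], [23, 127]])
knn = summarize([[965, 0], [48, 102]])
print(round(logistic["accuracy"], 3), round(logistic["spam_caught"], 3))  # 0.978 0.847
print(round(knn["accuracy"], 3), round(knn["spam_caught"], 3))            # 0.957 0.68
```

So logistic regression catches roughly 85% of the spam in the test set, while K-NN catches 68%.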

Let’s dive a bit deeper into the ‘23’ and ‘48’ in our confusion matrices, or the spam that we didn’t detect.

pd.set_option('display.max_colwidth', -1)  # on pandas 1.0+ pass None instead of -1
print(x_test[y_test > prediction['Logistic']][:10])
print(x_test[y_test > prediction['KNN']][:10])

The above commands are essentially saying ‘print x_test values where y_test is greater than our predicted value.’ If you recall, earlier we defined spam = 1 and ham = 0, so we’re asking Python to print the spam values of x_test that we thought were ham, since 1 > 0. To keep things clean we also set our max column width to be large and append [:10] to our print commands to only display 10 entries. Here’s what we get.
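The same comparison trick on four made-up messages, as a sketch (texts, actual and predicted are illustrative stand-ins for x_test, y_test and a prediction array):

```python
import numpy as np
import pandas as pd

texts = pd.Series(["win cash now", "lunch at noon?", "free ringtone", "call me later"],
                  index=[10, 11, 12, 13])
actual = pd.Series([1, 0, 1, 0], index=texts.index)  # 1 = spam, 0 = ham
predicted = np.array([1, 0, 0, 1])

# actual 1, predicted 0: spam that slipped through
missed_spam = texts[actual > predicted]
# actual 0, predicted 1: real messages flagged as spam
false_alarms = texts[actual < predicted]

print(list(missed_spam))   # ['free ringtone']
print(list(false_alarms))  # ['call me later']
```

Flipping the comparison to < gives the opposite kind of mistake, real messages that were flagged as spam.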

683     Hi I'm sue. I am 20 years old and work as a lapdancer. I love sex. Text me live - I'm i my bedroom now. text SUE to 89555. By TextOperator G2 1DA 150ppmsg 18+
4071 Loans for any purpose even if you have Bad Credit! Tenants Welcome. Call on 08717111821
3979 ringtoneking 84484
751 You have an important customer service announcement from PREMIER.
712 08714712388 between 10am-7pm Cost 10p
1268 Can U get 2 phone NOW? I wanna chat 2 set up meet Call me NOW on 09096102316 U can cum here 2moro Luv JANE xx Calls£1/minmoremobsEMSPOBox45PO139WA
730 Email AlertFrom: Jeri StewartSize: 2KBSubject: Low-cost prescripiton drvgsTo listen to email call 123
2662 Hello darling how are you today? I would love to have a chat, why dont you tell me what you look like and what you are in to sexy?
3130 LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.
1448 As a registered optin subscriber ur draw 4 £100 gift voucher will be entered on receipt of a correct ans to 80062 Whats No1 in the BBC charts
Name: text, dtype: object
1044    We know someone who you know that fancies you. Call 09058097218 to find out who. POBox 6, LS15HB 150p                                                            
15 XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap.
4071 Loans for any purpose even if you have Bad Credit! Tenants Welcome. Call on 08717111821
2312 (More games from TheDailyDraw) Dear Helen, Dozens of Free Games - with great prizesWith..
3979 ringtoneking 84484
1612 RT-KIng Pro Video Club>> Need help? or call 08701237397 You must be 16+ Club credits redeemable at! Enjoy!
865 Congratulations ur awarded either a yrs supply of CDs from Virgin Records or a Mystery Gift GUARANTEED Call 09061104283 Ts&Cs £1.50pm approx 3mins
2366 Tone Club: Your subs has now expired 2 re-sub reply MONOC 4 monos or POLYC 4 polys 1 weekly @ 150p per week Txt STOP 2 stop This msg free Stream 0871212025016
2877 Hey Boys. Want hot XXX pics sent direct 2 ur phone? Txt PORN to 69855, 24Hrs free and then just 50p per day. To stop text STOPBCM SF WC1N3XX
712 08714712388 between 10am-7pm Cost 10p

That’s it! You’re done. Sit back and relax for a bit.

Further work on improving these models should study the values that we got wrong. Perhaps a combination of the two models could yield better results? We could label texts that both systems classify as spam as spam for sure, and label a text as possible spam if only one of them flags it. A more advanced project could turn what we’ve made into a Chrome extension that automatically filters some content, or even a mobile app that filters your messages.
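A minimal sketch of that combination idea, assuming each model outputs 1 for spam and 0 for ham (combine and the two sample prediction lists are hypothetical, not something we built above):

```python
def combine(logistic_pred, knn_pred):
    # Count how many of the two models voted spam (1)
    votes = logistic_pred + knn_pred
    if votes == 2:
        return "spam"
    if votes == 1:
        return "possible spam"
    return "ham"

logistic_preds = [1, 1, 0, 0]
knn_preds = [1, 0, 1, 0]
labels = [combine(a, b) for a, b in zip(logistic_preds, knn_preds)]
print(labels)  # ['spam', 'possible spam', 'possible spam', 'ham']
```

Whether this actually beats either model on its own is something we would need to measure on the test set.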