Building offline iPhone spam classifier using CoreML

Darshan Sonde
Oct 30, 2017 · 8 min read

iOS 11 introduced a message filter extension for filtering spam messages and Core ML for running custom machine learning models on the device. In this article I’ll go over all the steps I took to build a machine learning model and add it to an iPhone project to predict whether a message is spam.

  1. Building a ML model to predict spam
  2. Converting ML model into coreml and embedding into iOS app

End Result

1. Building ML model to predict spam

You can view this part in notebook here

I use Python 2.7 for all code samples below.

Data

We will use the UCI SMS Spam Collection Data Set for training our classifier.

Let’s load this dataset and understand the data.

there are 5574 messages
(0, 'ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...')
(1, 'ham\tOk lar... Joking wif u oni...')
(2, "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's")
(3, 'ham\tU dun say so early hor... U c already then say...')
(4, "ham\tNah I don't think he goes to usf, he lives around here though")
(5, "spam\tFreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, \xc2\xa31.50 to rcv")
(6, 'ham\tEven my brother is not like to speak with me. They treat me like aids patent.')
(7, "ham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune")
(8, 'spam\tWINNER!! As a valued network customer you have been selected to receivea \xc2\xa3900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.')
(9, 'spam\tHad your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030')

In each line, the label (ham or spam) and the message text are separated by a tab, so we can read the file with a CSV loader.
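The loading step can be sketched with the standard library alone. This is a minimal sketch, not the loader used in the notebook: the function name `load_messages` is hypothetical, and the two sample lines are copied from the listing above. Quoting is switched off because the messages contain stray quote characters that should be read as plain text.

```python
import csv
import io

# Two raw lines copied from the listing above; the real file has 5574 of them.
SAMPLE = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tHad your mobile 11 months or more? U R entitled to Update to the "
    "latest colour mobiles with camera for Free! Call The Mobile Update Co "
    "FREE on 08002986030\n"
)

def load_messages(handle):
    """Read tab-separated (label, message) rows from an open text file."""
    # QUOTE_NONE: treat quote characters inside messages as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) == 2]

rows = load_messages(io.StringIO(SAMPLE))
print(len(rows), rows[0])
```

For the real data set, the same function would be called with the opened data file instead of the in-memory sample.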

Let us investigate grouping by ham and spam and get some details

messages.groupby('label').describe()
messages['length'] = messages['message'].map(lambda text:len(text))
messages.head()
messages.length.describe()

Let’s check the count, mean, and so on.
75% of the messages are 122 characters or shorter. The longest message is 910 characters, which is really long, so I would treat it as a one-off anomaly.

count    5574.000000
mean       80.478292
std        59.848302
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: length, dtype: float64

Let’s plot ham and spam message lengths separately.

messages.hist(column='length', by='label', bins=50)

Processing Data, Tokenize

We need to break the sentences into tokens and stem them before we can use them. Let’s create a method for tokenizing the words.

removing punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
0    [go, until, jurong, point, crazy, available, o...
1    [ok, lar, joking, wif, u, oni]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, dont, think, he, goes, to, usf, he, l...
Name: message, dtype: object
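A tokenizer that reproduces the output above could look like the sketch below. It is an assumption about the missing code, not the notebook's own function: it lowercases the text, removes the punctuation characters listed above (they are exactly `string.punctuation`), and splits on whitespace. No stemming step is included here.

```python
import string

def tokenize(text):
    """Lowercase, drop ASCII punctuation, then split on whitespace."""
    cleaned = "".join(ch for ch in text if ch not in string.punctuation)
    return cleaned.lower().split()

print(tokenize("Ok lar... Joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```

Whatever rules are chosen here have to be repeated exactly in the iOS tokenizer, otherwise words will not match the saved vocabulary.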

Feature Vector

We need to convert the tokenized words into a vector to feed into the ML algorithm. We will create a TF-IDF feature vector from each sentence. There are two steps to do this:

  1. Create count vector from tokenized string
  2. Convert count vector into tf-idf vector

from sklearn.feature_extraction.text import CountVectorizer

fv = CountVectorizer(analyzer=tokenize).fit(messages.message)
print("Total number of words in array", len(fv.vocabulary_))

Output

('Total number of words in array', 9642)

Let’s test how different words are represented in the count vector.

print(fv.transform(["U dun"]))
print("Second:")
print(fv.transform(["dun U"]))

Output

  (0, 3028)	1
  (0, 8719)	1
Second:
  (0, 3028)	1
  (0, 8719)	1

Notice that both sentences produce the same count vector: word order is ignored. Fitting the vectorizer builds a vocabulary that maps each word to a column index, and this vocabulary is then used to build the count vector for any sentence.
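The idea can be illustrated without scikit-learn. The sketch below is a simplified stand-in for CountVectorizer, with hypothetical helper names and a plain lowercase-and-split tokenizer: it builds a word-to-index vocabulary from a tiny corpus and counts the known words of a sentence per index.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(sentence, vocabulary):
    """Return {column index: count} for the known words of a sentence."""
    counts = Counter(tok for tok in sentence.lower().split() if tok in vocabulary)
    return {vocabulary[tok]: n for tok, n in counts.items()}

vocab = build_vocabulary(["U dun say so early hor", "Ok lar"])
# Word order does not matter: both sentences give the same counts.
print(count_vector("U dun", vocab) == count_vector("dun U", vocab))  # True
```

Words that are not in the vocabulary are simply dropped, which is also what has to happen on the device for words the model has never seen.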

Let’s try to vectorize a full sentence from our corpus.

print(fv.transform([messages.message[3]]))

Output

  (0, 1136)	1
  (0, 1931)	1
  (0, 3028)	1
  (0, 3051)	1
  (0, 4261)	1
  (0, 7281)	2
  (0, 7701)	1
  (0, 8369)	1
  (0, 8719)	2

Count Vector Vocabulary (words_array)

iOS does not have a CountVectorizer, so we have to do this step ourselves in the app whenever we analyze a sentence. We will take the vocabulary of the CountVectorizer and save it into a words array file.

The words array file maps each word to its column position in the count vector.

This is in turn used to calculate the term frequency.

tf = Ft / Count(F)

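Ft => frequency of term t in the current document
Count(F) => total number of words in corpus (max of Ft in words array)

Note that scikit-learn's TfidfTransformer uses the raw count Ft as the term frequency by default. Because the final TF-IDF vector is L2-normalized, dividing every count in a document by the same constant does not change the result.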

Let’s save this file

import json

# save the word -> column index mapping for the iOS app
with open('words_array.json', 'w') as fp:
    json.dump(fv.vocabulary_, fp)

Messages Feature Vector

Let’s compose the feature vector for our entire corpus.

messages_fv = fv.transform(messages.message)
print(messages_fv.shape)

Output

(5574, 9642)

We will use the TF-IDF transformer to transform the count vectors of the corpus into TF-IDF vectors.

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer().fit(messages_fv)

# test tfidf of same message as before.
t = tfidf.transform(fv.transform([messages.message[3]]))
print(t)

Output

  (0, 8719)	0.305629866389
  (0, 8369)	0.219784585189
  (0, 7701)	0.187878620247
  (0, 7281)	0.535840632872
  (0, 4261)	0.444712923541
  (0, 3051)	0.321265436126
  (0, 3028)	0.295945795183
  (0, 1931)	0.274800448767
  (0, 1136)	0.267920316436

The values are now normalized, and each word is weighted by its TF-IDF importance: words that are rare across the corpus get larger weights than common ones. This gives the model a more informative signal than raw counts.

Let’s compute the TF-IDF of the entire corpus.

messages_tfidf = tfidf.transform(messages_fv)
print(messages_tfidf.shape)

Output

(5574, 9642)

IDF (words_idf)

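The words IDF file is a simple list of the IDF values of the words, in vocabulary order.

idf = log(N/Nt)

N => number of documents
Nt => number of documents containing word t

scikit-learn's TfidfTransformer uses a smoothed variant of this formula by default, idf = ln((1 + N) / (1 + Nt)) + 1, which is why a word that occurs in only one of the 5574 messages gets the value 8.9329 in the array below.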

We need to compute the TF-IDF of a sentence on iOS, and the words_idf values are an input for computing that vector. Let’s save the IDF array into a file, which is later used in the iOS code.

('IDF of corpus :', array([ 8.23975324,  8.52743531,  8.93290042, ...,  8.52743531,
6.98699027, 8.93290042]))
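The printed values can be checked by hand. The sketch below assumes the smoothed IDF that scikit-learn's TfidfTransformer applies by default (`smooth_idf=True`); the function name is hypothetical.

```python
import math

def smooth_idf(n_documents, document_frequency):
    """Smoothed IDF: ln((1 + N) / (1 + Nt)) + 1."""
    return math.log((1.0 + n_documents) / (1.0 + document_frequency)) + 1.0

# A word that appears in exactly one of the 5574 messages:
print(smooth_idf(5574, 1))  # close to the 8.93290042 printed above
# A word that appears in two messages:
print(smooth_idf(5574, 2))  # close to the 8.52743531 printed above
```

Since the iOS code reads the saved IDF values from the file, it never has to recompute this formula; the check only confirms what the file contains.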

Model Training

For the model we will use a simple linear SVM. SVM seemed to give the most accurate results, and the model can easily be used on the iPhone as well. Let’s create and train a LinearSVC model.

CPU times: user 12.6 ms, sys: 1.53 ms, total: 14.1 ms
Wall time: 13.1 ms
('accuracy', 0.99910297811266591)
('confusion matrix\n', array(
[[4826, 1],
[ 4, 743]]))
(row=expected, col=predicted)

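The model reaches a very high accuracy, and the confusion matrix shows only 5 misclassified messages out of 5574. Keep in mind that these numbers are computed over the entire corpus; if the model was trained on the same messages, they measure fit to the training data rather than performance on unseen messages.

Let’s print a classification report for a more readable summary.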
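The reported accuracy can be reproduced from the confusion matrix with plain arithmetic. This sketch assumes the first row and column are ham and the second are spam, which agrees with the row sums of 4827 and 747; the variable names are made up for illustration.

```python
# Confusion matrix copied from the output above (row=expected, col=predicted).
ham_as_ham, ham_as_spam = 4826, 1
spam_as_ham, spam_as_spam = 4, 743

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
accuracy = (ham_as_ham + spam_as_spam) / float(total)
spam_precision = spam_as_spam / float(spam_as_spam + ham_as_spam)
spam_recall = spam_as_spam / float(spam_as_spam + spam_as_ham)

print(total)                     # 5574
print(round(accuracy, 6))        # 0.999103
print(round(spam_precision, 4))  # 0.9987
print(round(spam_recall, 4))     # 0.9946
```

For a spam filter the single ham message flagged as spam is the more costly kind of error, so spam precision is the number to watch when retraining.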

from sklearn.metrics import classification_report

print(classification_report(messages['label'], predictions))

Output

             precision    recall  f1-score   support

        ham       1.00      1.00      1.00      4827
       spam       1.00      0.99      1.00       747

avg / total       1.00      1.00      1.00      5574

Convert to CoreML

Converting the trained model to Core ML format saves it into an mlmodel file. We will now write code for iOS to load and use this model.

2. Converting ML model into CoreML and embedding into iOS app

Let us create a Spam class and add methods into that class.

I am using iOS 11 and Xcode 9 with Swift 4 for the code below.

vocabulary is the words_array that we saved earlier.

idf is the words_idf

norm or normalization holds whether the values should be normalized when generating tfidf.

Loading the files

Let’s load the JSON files and store them in class members.

Tokenize

The first step is to tokenize the message, so let’s write a function for it. It strips the same punctuation characters that we removed in the Python tokenizer earlier.

Count Vector

Next we need to create a count vector of the tokens. We’ll write a method which takes the entire sentence and returns the count vector.

TF-IDF Vector

Now it’s time to create a TF-IDF vector of the sentence.

Notice the second half of the TF-IDF computation. The Python TfidfTransformer we used earlier applies L2 normalization to its results. We need to apply the same L2 norm to the values before we feed them into the model, otherwise we will not get correct classifications.

It is worth verifying a few messages here to make sure the TF-IDF computed on iOS exactly matches the values from Python.
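For such a check, a small pure-Python reference of the same steps is handy, because it can be compared value by value with the Swift output. The sketch below is an assumption about how the steps fit together, not the article's Swift code: it multiplies raw counts by the saved IDF values and then L2-normalizes. The toy vocabulary and IDF list are made-up stand-ins for the two JSON files.

```python
import math
from collections import Counter

def tfidf_vector(tokens, vocabulary, idf):
    """Return {column index: weight}: raw count * idf, then L2-normalized."""
    counts = Counter(tok for tok in tokens if tok in vocabulary)
    weights = {vocabulary[tok]: n * idf[vocabulary[tok]]
               for tok, n in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return weights  # no known words: nothing to normalize
    return {idx: w / norm for idx, w in weights.items()}

# Toy stand-ins for the saved vocabulary and IDF list (made-up values).
vocabulary = {"free": 0, "entry": 1, "ok": 2}
idf = [2.0, 3.0, 1.5]

vec = tfidf_vector(["free", "free", "entry", "unknown"], vocabulary, idf)
print(vec)  # {0: 0.8, 1: 0.6}
```

Here "free" gets 2 * 2.0 = 4 and "entry" gets 1 * 3.0 = 3 before normalization; dividing by the norm 5 gives 0.8 and 0.6, and the unknown word is ignored.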

MLMultiArray

We are not done yet. Core ML expects its input as an MLMultiArray, so let’s write a small routine to convert our TF-IDF vector into an MLMultiArray.

Of course, it would be a lot more efficient if we used MLMultiArray from the beginning.

Predict

Prediction is as easy as

Make sure to add the mlmodel and the two JSON files created earlier to the app bundle.

Next Steps

  • Build a message extension and integrate the offline model into the extension
  • Collect more relevant, recent data and retrain the model on it

Source / Download Links

Data Set Details:
SMS Spam Collection UCI Data Set

References

Machine Learning Techniques in Spam Filtering

Y Media Labs Innovation

Engineering blog showcasing some innovation and creativity

Thanks to Prasad Pai

Written by Darshan Sonde, Director of Technology, Head of Innovation @ymedialabs
