Building an offline iPhone spam classifier using CoreML
iOS 11 introduced a message filter extension for filtering spam messages, and CoreML for running custom machine-learned models on device. In this article I'll go over all the steps I took to build a machine learning model and add it to an iPhone project to predict spam.
- Building a ML model to predict spam
- Converting ML model into coreml and embedding into iOS app
End Result
1. Building ML model to predict spam
You can view this part in the notebook here.
I use Python 2.7 for all the code samples below.
Data
We will use the UCI SMS Spam Collection Data Set for training our classifier.
Let's load this dataset and understand the data.
there are 5574 messages
(0, 'ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...')
(1, 'ham\tOk lar... Joking wif u oni...')
(2, "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's")
(3, 'ham\tU dun say so early hor... U c already then say...')
(4, "ham\tNah I don't think he goes to usf, he lives around here though")
(5, "spam\tFreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, \xc2\xa31.50 to rcv")
(6, 'ham\tEven my brother is not like to speak with me. They treat me like aids patent.')
(7, "ham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune")
(8, 'spam\tWINNER!! As a valued network customer you have been selected to receivea \xc2\xa3900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.')
(9, 'spam\tHad your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030')
The spam and ham labels are separated from the message by a tab, so we can use a CSV loader with a tab delimiter to read the file.
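As a sketch, reading the tab-separated file with pandas looks like this (shown here against a tiny inline sample instead of the downloaded SMSSpamCollection file):

```python
import pandas as pd
from io import StringIO

# A few lines in the same tab-separated format as the UCI file; in the
# real notebook, point read_csv at the downloaded SMSSpamCollection file.
sample = ("ham\tOk lar... Joking wif u oni...\n"
          "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts\n"
          "ham\tU dun say so early hor... U c already then say...\n")
messages = pd.read_csv(StringIO(sample), sep='\t', names=['label', 'message'])
print(messages.shape)  # (3, 2)
```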
Let's group by ham and spam and look at some summary statistics.
messages.groupby('label').describe()
messages['length'] = messages['message'].map(lambda text:len(text))
messages.head()
messages.length.describe()
Let's check the count, mean, etc.
75% of messages are 122 characters or shorter. The longest message is 910 characters long; that is really long, so it is most likely just a one-off anomaly.
count 5574.000000
mean 80.478292
std 59.848302
min 2.000000
25% 36.000000
50% 62.000000
75% 122.000000
max 910.000000
Name: length, dtype: float64
Let's plot the ham and spam message lengths separately.
messages.hist(column='length', by='label', bins=50)
Processing Data: Tokenizing
We need to break the sentences into tokens and stem them before we can use them. Let's create a method for tokenizing the words.
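A minimal sketch of such a tokenizer (the notebook's exact implementation may differ): strip the punctuation characters, lowercase, and split on whitespace.

```python
import string

def tokenize(message):
    # Drop punctuation, lowercase, then split on whitespace
    cleaned = ''.join(ch for ch in message if ch not in string.punctuation)
    return cleaned.lower().split()

print(tokenize("U dun say so early hor... U c already then say..."))
```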
removing punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
0 [go, until, jurong, point, crazy, available, o...
1 [ok, lar, joking, wif, u, oni]
2 [free, entry, in, 2, a, wkly, comp, to, win, f...
3 [u, dun, say, so, early, hor, u, c, already, t...
4 [nah, i, dont, think, he, goes, to, usf, he, l...
Name: message, dtype: object
Feature Vector
We need to convert the tokenized words into a vector to feed into the ML algorithm. We will create a TF-IDF feature vector from each sentence. There are two steps to do this:
- Create count vector from tokenized string
- Convert count vector into tf-idf vector
fv = CountVectorizer(analyzer=tokenize).fit(messages.message)
print("Total number of words in array", len(fv.vocabulary_))
Output
('Total number of words in array', 9642)
Let's test how different words are represented in the count vector.
print(fv.transform(["U dun"]))
print("Second:")
print(fv.transform(["dun U"]))
Output
(0, 3028) 1
(0, 8719) 1
Second:
(0, 3028) 1
(0, 8719) 1
Notice that both sentences produce the same count vector; word order is not preserved. When we fit the vectorizer it builds a vocabulary, and that vocabulary is used to create the count vector for every sentence.
Let’s try to vectorize a full sentence from our corpus.
print(fv.transform([messages.message[3]]))
Output
(0, 1136) 1
(0, 1931) 1
(0, 3028) 1
(0, 3051) 1
(0, 4261) 1
(0, 7281) 2
(0, 7701) 1
(0, 8369) 1
(0, 8719) 2
Count Vector Vocabulary (words_array)
iPhone does not have a CountVectorizer, so we need to redo this step ourselves when analyzing a sentence. We will save the vocabulary of the CountVectorizer into a words array file.
The words array file maps each word to its position (column index) in the count vector.
This is in turn used to calculate term frequency.
tf = Ft / Count(F)
Ft => frequency of term t in the current document; Count(F) => the total word count used for normalization (the max of Ft in the words array). Note that sklearn's TfidfTransformer simply uses the raw count as tf, so the iOS computation must match whatever the training pipeline actually did.
Let’s save this file
with open('words_array.json', 'wb') as fp:
json.dump(fv.vocabulary_, fp)
Messages Feature Vector
Let’s compose the feature vector for our entire corpus.
messages_fv = fv.transform(messages.message)
print(messages_fv.shape)
Output
(5574, 9642)
We will use TF-IDF transformer to transform the count vector of corpus into TF-IDF vector.
tfidf = TfidfTransformer().fit(messages_fv)
# test tfidf of the same message as before.
t = tfidf.transform(fv.transform([messages.message[3]]))
print(t)
Output
(0, 8719) 0.305629866389
(0, 8369) 0.219784585189
(0, 7701) 0.187878620247
(0, 7281) 0.535840632872
(0, 4261) 0.444712923541
(0, 3051) 0.321265436126
(0, 3028) 0.295945795183
(0, 1931) 0.274800448767
(0, 1136) 0.267920316436
We see that the values are normalized and weighted according to TF-IDF importance, which gives the model more relevant features to work with.
Let’s compute the TF-IDF of the entire corpus.
messages_tfidf = tfidf.transform(messages_fv)
print(messages_tfidf.shape)
Output
(5574, 9642)
IDF (words_idf)
words_idf is a simple list of words and their IDF values:
idf = log(N/Nt)
N => number of documents; Nt => number of documents containing the term t
We need to compute the TF-IDF of a sentence on iOS, and we will need the words_idf values as input for computing the vector. Let's save the IDF array into a file; it is used later in the iOS code.
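A sketch of that step: TfidfTransformer exposes the learned weights as `idf_`, shown here on a tiny stand-in corpus. Note that sklearn's IDF is a smoothed variant of the formula above, which the iOS code must reproduce exactly.

```python
import json
import math
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Tiny stand-in corpus; the notebook does this with the full message set.
docs = ["free entry to win", "ok lar see you", "free call now"]
cv = CountVectorizer().fit(docs)
tfidf = TfidfTransformer().fit(cv.transform(docs))

# The learned IDF weights live in tfidf.idf_. Note sklearn smooths the
# textbook formula: idf = ln((1 + N) / (1 + Nt)) + 1, not plain log(N / Nt).
with open('words_idf.json', 'w') as fp:
    json.dump(list(tfidf.idf_), fp)

# 'free' appears in Nt = 2 of N = 3 documents:
print(abs(tfidf.idf_[cv.vocabulary_['free']] - (math.log(4.0 / 3.0) + 1.0)) < 1e-9)  # -> True
```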
('IDF of corpus :', array([ 8.23975324, 8.52743531, 8.93290042, ..., 8.52743531,
6.98699027, 8.93290042]))
Model Training
For the model we will use a simple linear SVM. The SVM gets the most accurate results in my tests, and we can easily run this model on the iPhone as well. Let's create and train a LinearSVC model.
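The training step can be sketched as follows on a tiny stand-in corpus (TfidfVectorizer here is just shorthand for the CountVectorizer + TfidfTransformer pipeline used earlier):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Tiny stand-in corpus; the notebook trains on all 5574 messages.
texts = ["Free entry to win a prize call now",
         "Ok lar see you later",
         "WINNER claim your reward call today",
         "Nah I don't think he goes to usf"]
labels = ['spam', 'ham', 'spam', 'ham']

# TfidfVectorizer combines the count and TF-IDF steps shown above.
X = TfidfVectorizer().fit_transform(texts)
model = LinearSVC().fit(X, labels)
print(accuracy_score(labels, model.predict(X)))  # training accuracy
```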
CPU times: user 12.6 ms, sys: 1.53 ms, total: 14.1 ms
Wall time: 13.1 ms
('accuracy', 0.99910297811266591)
('confusion matrix\n', array(
[[4826, 1],
[ 4, 743]]))
(row=expected, col=predicted)
It looks like the model achieves really good accuracy, and the confusion matrix also shows great results. Keep in mind that these numbers are computed on the data the model was trained on, so accuracy on new messages will be lower.
Let's print a classification report to view this a little more clearly.
print(classification_report(messages['label'], predictions))
Output
precision recall f1-score support
ham 1.00 1.00 1.00 4827
spam 1.00 0.99 1.00 747
avg / total 1.00 1.00 1.00 5574
Convert to CoreML
Using coremltools we convert the trained model and save it into an .mlmodel file. We will now write the iOS code to load and use this model.
2. Converting ML model into CoreML and embedding into iOS app
Let us create a Spam class and add methods into that class.
I am using iOS 11 and Xcode 9 with Swift 4 for the code below.
vocabulary is the words_array that we saved earlier.
idf is the words_idf.
norm (normalization) holds whether the values should be normalized when generating the TF-IDF vector.
Loading the files
Let's load the JSON files and store them in class members.
Tokenize
The first step is to tokenize the message. Let's write a function for it that trims the same punctuation characters we removed earlier.
Count Vector
Next we need to create a count vector from the tokens. We'll write a method that takes the entire sentence and returns the count vector.
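The logic that method needs is small; here it is in Python for reference (names like `count_vector` are hypothetical), so the Swift port can be checked against it:

```python
def count_vector(tokens, vocabulary):
    # vocabulary maps word -> column index, as loaded from words_array.json
    vec = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:  # words outside the vocabulary are ignored
            vec[index] += 1
    return vec

vocab = {'already': 0, 'dun': 1, 'say': 2, 'u': 3}
print(count_vector(['u', 'dun', 'say', 'u', 'say'], vocab))  # -> [0, 1, 2, 2]
```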
TF-IDF Vector
Now it's time to create a TF-IDF vector of the sentence.
Notice the second half of the TF-IDF computation. When we used Python's TfidfTransformer earlier, it applied L2 normalization to the results, so we also need to apply the L2 norm to our values before feeding them into the model; otherwise we will not get correct classifications.
Verify some data here if needed to make sure the TF-IDF computation is exactly the same as the Python one.
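One way to do that check on the Python side is to recompute the transform by hand and compare it against sklearn; the Swift code must implement this exact formula:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["u dun say so early hor",
        "free entry to win",
        "u already then say"]
counts = CountVectorizer().fit_transform(docs).toarray()

# Reproduce TfidfTransformer's defaults by hand:
# idf = ln((1 + n_docs) / (1 + doc_freq)) + 1, tf = raw count, rows L2-normalized.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, sklearn_tfidf))  # -> True
```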
MLMultiArray
We are not done yet: CoreML expects its input as an MLMultiArray, so let's write a small routine to convert our TF-IDF vector into an MLMultiArray.
Of course, it would be a lot more efficient if we used an MLMultiArray from the beginning.
Predict
Prediction is as easy as
Make sure to add the .mlmodel file and the two JSON files created earlier to the app bundle.
Next Steps
- Build a message extension and integrate the offline model into it
- Collect more relevant, recent data and retrain the model on it
Source / Download Links
Data Set Details:
SMS Spam Collection UCI Data Set