Building offline iPhone spam classifier using CoreML

Darshan Sonde
Published in YML Innovation Lab
8 min read · Oct 30, 2017

iOS 11 introduced the Message Filter extension to filter spam messages and Core ML to run custom machine-learned models on device. In this article I’ll go over all the steps I took to build a machine learning model and how I added it to an iPhone project to predict spam.

  1. Building an ML model to predict spam
  2. Converting the ML model into CoreML and embedding it into an iOS app

End Result

1. Building an ML model to predict spam

You can view this part in the notebook here

I use Python 2.7 for all code samples below.

Data

We will use the UCI SMS Spam Collection Data Set for training our classifier.

Let’s load this dataset and understand the data.

there are 5574 messages
(0, 'ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...')
(1, 'ham\tOk lar... Joking wif u oni...')
(2, "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's")
(3, 'ham\tU dun say so early hor... U c already then say...')
(4, "ham\tNah I don't think he goes to usf, he lives around here though")
(5, "spam\tFreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, \xc2\xa31.50 to rcv")
(6, 'ham\tEven my brother is not like to speak with me. They treat me like aids patent.')
(7, "ham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune")
(8, 'spam\tWINNER!! As a valued network customer you have been selected to receivea \xc2\xa3900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.')
(9, 'spam\tHad your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030')

The spam and ham labels are separated from the messages by a tab, so we can use a CSV loader with a tab delimiter to read the file.
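A minimal sketch of such a loader using Python’s standard csv module, run here on an inline two-line sample (in Python 3 syntax) rather than the real dataset file; `QUOTE_NONE` is needed because the messages themselves can contain quote characters.

```python
import csv
import io

# Two hypothetical lines in the dataset's format: label <TAB> message
sample = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp\n"

messages = []
for label, message in csv.reader(io.StringIO(sample), delimiter='\t',
                                 quoting=csv.QUOTE_NONE):
    messages.append({'label': label, 'message': message})

print('there are %d messages' % len(messages))  # there are 2 messages
```

With the real file, the `io.StringIO(sample)` argument would simply be replaced by an open file handle.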

Let us group the messages by ham and spam and get some details.

messages.groupby('label').describe()
messages['length'] = messages['message'].map(lambda text:len(text))
messages.head()
messages.length.describe()

Let’s check the count, mean, etc.
75% of messages are at most 122 characters long. The longest message is 910 characters, really long, so I would think this is just one random anomaly.

count    5574.000000
mean       80.478292
std        59.848302
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: length, dtype: float64

Let’s plot the ham and spam message lengths separately.

messages.hist(column='length', by='label', bins=50)

Processing Data: Tokenize

We need to break the sentences into tokens and stem them before we can use them. Let’s create a method for tokenizing the words.

removing punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

0    [go, until, jurong, point, crazy, available, o...
1    [ok, lar, joking, wif, u, oni]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, dont, think, he, goes, to, usf, he, l...
Name: message, dtype: object
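The tokenizer itself is not shown above; a minimal sketch that reproduces this output (lower-case, strip punctuation, split on whitespace, written in Python 3 syntax and without a stemming step) would be:

```python
import string

def tokenize(message):
    # drop all punctuation characters, lower-case, and split on whitespace
    cleaned = message.translate(str.maketrans('', '', string.punctuation))
    return cleaned.lower().split()

print(tokenize("Ok lar... Joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```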

Feature Vector

We need to convert the tokenized words into a vector to feed into the ML algorithm. We will create a TF-IDF feature vector from each sentence. There are two steps to do this:

  1. Create count vector from tokenized string
  2. Convert count vector into tf-idf vector
fv = CountVectorizer(analyzer=tokenize).fit(messages.message)
print("Total number of words in array", len(fv.vocabulary_))

Output

('Total number of words in array', 9642)

Let’s test how different words are represented in the count vector.

print(fv.transform(["U dun"]))
print("Second:")
print(fv.transform(["dun U"]))

Output

  (0, 3028)	1
  (0, 8719)	1
Second:
  (0, 3028)	1
  (0, 8719)	1

Notice that both sentences generate the same count vector; word order does not matter. When we fit the vectorizer it creates a vocabulary, and this vocabulary is used to create a count vector for any sentence.

Let’s try to vectorize a full sentence from our corpus.

print(fv.transform([messages.message[3]]))

Output

  (0, 1136)	1
  (0, 1931)	1
  (0, 3028)	1
  (0, 3051)	1
  (0, 4261)	1
  (0, 7281)	2
  (0, 7701)	1
  (0, 8369)	1
  (0, 8719)	2

Count Vector Vocabulary (words_array)

The iPhone does not have a CountVectorizer, so we need to reproduce this step ourselves to analyze a sentence. We will take the vocabulary of the CountVectorizer and save it into a words array file.

The words array file holds the count-vector positions of the words and the frequency of their occurrence.

This is in turn used to calculate the term frequency.

tf = Ft / Count(F)

Ft => frequency of term t in the current document
Count(F) => total number of words in the corpus (max of Ft in the words array)

Let’s save this file

with open('words_array.json', 'wb') as fp:
    json.dump(fv.vocabulary_, fp)

Messages Feature Vector

Let’s compose the feature vector for our entire corpus.

messages_fv = fv.transform(messages.message)
print(messages_fv.shape)

Output

(5574, 9642)

We will use a TF-IDF transformer to transform the count vectors of the corpus into TF-IDF vectors.

tfidf = TfidfTransformer().fit(messages_fv)

# test tfidf of same message as before
t = tfidf.transform(fv.transform([messages.message[3]]))
print(t)

Output

  (0, 8719)	0.305629866389
  (0, 8369)	0.219784585189
  (0, 7701)	0.187878620247
  (0, 7281)	0.535840632872
  (0, 4261)	0.444712923541
  (0, 3051)	0.321265436126
  (0, 3028)	0.295945795183
  (0, 1931)	0.274800448767
  (0, 1136)	0.267920316436

We see that the values are all normalized, with weights assigned according to TF-IDF importance. This gives the model more relevant features to work with.

Let’s compute the TF-IDF of the entire corpus.

messages_tfidf = tfidf.transform(messages_fv)
print(messages_tfidf.shape)

Output

(5574, 9642)

IDF (words_idf)

The words IDF file is a simple list of words and their IDF values.

idf = log(N/Nt)

N  => number of documents
Nt => number of documents with word t

We need to compute the TF-IDF of a sentence on iOS, and we will need the words_idf values as input for computing the vector. Let’s save the IDF array into a file; it is used later in the iOS code.

('IDF of corpus :', array([ 8.23975324,  8.52743531,  8.93290042, ...,  8.52743531,
         6.98699027,  8.93290042]))
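The IDF computation and save step (its code is omitted above) can be sketched with the standard library, applying the article’s idf = log(N/Nt) formula to a toy stand-in corpus (note that sklearn’s stored `idf_` uses a smoothed variant, so its exact values differ slightly):

```python
import json
import math

# Toy corpus standing in for the tokenized messages
docs = [['free', 'entry', 'win'], ['ok', 'lar'], ['free', 'win', 'cash']]
N = len(docs)

# idf = log(N / Nt), where Nt = number of documents containing word t
vocab = sorted({w for d in docs for w in d})
words_idf = {w: math.log(float(N) / sum(1 for d in docs if w in d))
             for w in vocab}

with open('words_idf.json', 'w') as fp:
    json.dump(words_idf, fp)

print(words_idf['free'])  # appears in 2 of 3 docs -> log(3/2)
```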

Model Training

For the model we will use a simple linear SVM. SVM gave the most accurate results in my tests, and we can easily use this model on the iPhone as well. Let’s create and train a LinearSVC model.
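The training code itself is not shown above; a minimal sketch with scikit-learn’s LinearSVC, using a tiny hypothetical stand-in for `messages_tfidf` and the labels, looks like this:

```python
from sklearn.svm import LinearSVC

# Hypothetical stand-in rows for messages_tfidf and their labels
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = ['ham', 'ham', 'spam', 'spam']

model = LinearSVC()
model.fit(X, y)
print(model.predict([[0.95, 0.05]])[0])  # a ham-like vector
```

In the article, `X` would be `messages_tfidf` and `y` the label column, with predictions then compared against the labels to produce the accuracy and confusion matrix below.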

CPU times: user 12.6 ms, sys: 1.53 ms, total: 14.1 ms
Wall time: 13.1 ms
('accuracy', 0.99910297811266591)
('confusion matrix\n', array(
    [[4826,    1],
     [   4,  743]]))
(row=expected, col=predicted)

It looks like the model has really good accuracy, and the confusion matrix also shows great results (though note these numbers are measured on the same data the model was trained on).

Let’s print a classification report to view this a little more nicely.

print(classification_report(messages['label'], predictions))

Output

             precision    recall  f1-score   support

        ham       1.00      1.00      1.00      4827
       spam       1.00      0.99      1.00       747

avg / total       1.00      1.00      1.00      5574

Convert to CoreML

Converting the trained model saves it into an .mlmodel file. We will now write iOS code to load and use this model.
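The conversion code is omitted above; a hedged sketch with coremltools’ scikit-learn converter, assuming the trained LinearSVC is in `model` (the feature names and output filename are my own choices):

```python
import coremltools

# Convert the sklearn model; 'message_tfidf' / 'label' are hypothetical
# input and output feature names.
coreml_model = coremltools.converters.sklearn.convert(
    model, 'message_tfidf', 'label')
coreml_model.save('SpamMessageClassifier.mlmodel')
```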

2. Converting the ML model into CoreML and embedding it into an iOS app

Let us create a Spam class and add methods to that class.

I am using iOS 11 and Xcode 9 with Swift 4 for the code below.

vocabulary is the words_array that we saved earlier.

idf is the words_idf

norm (normalization) holds whether the values should be normalized when generating the TF-IDF vector.

Loading the files

Let’s load the JSON files and save them into class members.

Tokenize

The first step is to tokenize the message. Let’s write a function for it that trims the punctuation characters we collected earlier.

Count Vector

Next we need to create a count vector from the tokens. We’ll write a method which takes the entire sentence and returns the count vector.

TF-IDF Vector

Now it’s time to create a TF-IDF vector of the sentence.

Notice the second half of the TF-IDF computation. Earlier, the Python TfidfTransformer applied L2 normalization to its results, so we also need to apply the L2 norm to our values before feeding them into the model; otherwise we will not get correct classifications.

Verify some data here if needed, to make sure the TF-IDF computation is exactly the same as the Python TF-IDF.
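For that verification, a small pure-Python reference of the same count-vector → IDF → L2-norm pipeline helps; here `words_array` and `words_idf` are toy stand-ins for the two saved JSON files, not the real values.

```python
import math

# Toy stand-ins for the saved words_array / words_idf files
words_array = {'dun': 0, 'say': 1, 'u': 2}
words_idf = [2.0, 1.5, 1.0]

def tfidf_vector(tokens):
    # count vector over the fixed vocabulary
    counts = [0.0] * len(words_idf)
    for t in tokens:
        if t in words_array:
            counts[words_array[t]] += 1.0
    # scale by IDF, then L2-normalize as sklearn's TfidfTransformer does
    v = [c * idf for c, idf in zip(counts, words_idf)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

vec = tfidf_vector(['u', 'dun', 'say', 'u'])
print(vec)  # a unit-length vector
```

One useful property: because of the L2 normalization, dividing the counts by the document length in the tf step only rescales the vector and cancels out, so raw counts times IDF yield the same normalized vector.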

MLMultiArray

We are not done yet: CoreML expects an MLMultiArray as input. Let’s write a small routine to convert our TF-IDF vector into an MLMultiArray.

Of course, it would be a lot more efficient if we used an MLMultiArray from the beginning.

Predict

Prediction is as easy as

Make sure to add the .mlmodel and the two JSON files created earlier into the bundle.

Next Steps

  • Build a Message Filter extension and integrate the offline model into it
  • Collect more relevant data and train on a newer, more modern dataset

Source / Download Links

Data Set Details:
SMS Spam Collection UCI Data Set

References

Machine Learning Techniques in Spam Filtering
