Building a spam classifier: PySpark+MLlib vs SageMaker+XGBoost

In this article, I will first show you how to build a spam classifier using Apache Spark, its Python API (aka PySpark) and a variety of Machine Learning algorithms implemented in Spark MLlib.

Then, we will use the new Amazon SageMaker service to train, save and deploy an XGBoost model trained on the same data set.

All code runs in a Jupyter notebook, available on GitHub :)

PySpark + MLlib

The big picture

Our raw data set is composed of 1-line messages stored in two files:

  • the ‘ham’ file: 4827 valid messages,
  • the ‘spam’ file: 747 messages.

In order to classify these messages, we need to build an intermediate data set with two classes. For this purpose, we’re going to use a simple but efficient technique called Feature Hashing:

  • For each message in the data set, we first hash its words into a fixed number of buckets (say, 1000).
  • Then, we build a sparse vector holding the number of occurrences in each non-empty bucket: these are the features that will be used to decide whether a message is spam or not.
  • For a valid message, the corresponding label will be zero, i.e. the message is not spam. Accordingly, for a spam message, the label will be one.
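To make the three steps above concrete, here is a minimal pure-Python sketch of feature hashing. It is only an illustration: `zlib.crc32` is my stand-in for a deterministic hash (Spark's `HashingTF` uses its own hash function), and the names `hash_message` and `label_message` are hypothetical.

```python
import zlib

NUM_BUCKETS = 1000  # fixed number of buckets, as in the text


def hash_message(message, num_buckets=NUM_BUCKETS):
    """Map a message to a sparse {bucket: occurrences} dictionary."""
    vector = {}
    for word in message.split(" "):
        # crc32 is an assumption: it is stable across runs, unlike hash() on str
        bucket = zlib.crc32(word.encode("utf-8")) % num_buckets
        vector[bucket] = vector.get(bucket, 0) + 1
    return vector


def label_message(message, is_spam):
    """Return a (label, features) pair: 1 for spam, 0 for a valid message."""
    return (1 if is_spam else 0, hash_message(message))
```

Two different words can land in the same bucket. That is the price of a fixed-size vector, and with enough buckets it rarely hurts on short messages.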

Once we’re done, our intermediate data set will be:

  • 4827 word vectors labeled with a zero,
  • 747 word vectors labeled with a one.

We’ll split it 80/20 for training and validation and run it through a number of classification algorithms.

For prediction, the process will be similar: hash the message, send the word vector to the model and get the predicted result.

Not that difficult, hey? Let’s get to work!

Building the intermediate data set

Our first step is to load both files and split the messages into words.

Then, we hash each message into 1,000 word buckets: each message is turned into a sparse vector holding bucket numbers and occurrences.

The next step is to label our features: 1 for spam, 0 for non-spam. The result is a collection of labeled samples which are ready for use.

Finally, we split the data set 80/20 for training and test and cache both RDDs as we will use them repeatedly.
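The four steps above (load, hash, label, split) could look roughly like this with the RDD-based MLlib API. This is a sketch under assumptions, not the notebook's code: `sc` is an existing SparkContext, the file paths are whatever you pass in, and the seed is arbitrary.

```python
def tokenize(line):
    """Split a 1-line message into words."""
    return line.split(" ")


def build_datasets(sc, ham_path, spam_path, num_features=1000, seed=42):
    """Return (hashing_tf, training_rdd, test_rdd) built from the two files."""
    # Imported here so that tokenize() stays usable without Spark installed
    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.regression import LabeledPoint

    htf = HashingTF(numFeatures=num_features)

    # Load both files and split each message into words
    ham_words = sc.textFile(ham_path).map(tokenize)
    spam_words = sc.textFile(spam_path).map(tokenize)

    # Hash each message into a sparse vector of bucket occurrences
    ham_features = htf.transform(ham_words)
    spam_features = htf.transform(spam_words)

    # Label the vectors: 0 for ham, 1 for spam
    ham_samples = ham_features.map(lambda f: LabeledPoint(0, f))
    spam_samples = spam_features.map(lambda f: LabeledPoint(1, f))

    # Split 80/20 and cache both RDDs, since every model reuses them
    samples = ham_samples.union(spam_samples)
    training, test = samples.randomSplit([0.8, 0.2], seed)
    return htf, training.cache(), test.cache()
```

Returning the `HashingTF` object matters: new messages must be hashed with the same number of buckets at prediction time.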

Now we’re going to train a number of models with this data set. To measure their accuracy, here’s the scoring function we’re going to use: simply predict all samples in the test set, compare the predicted label with the real label and compute accuracy.
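A scoring function along these lines might look as follows. It is a sketch that only assumes `model` exposes `predict()` on an RDD of feature vectors and that `test_data` is an RDD of labeled points with `label` and `features` attributes.

```python
def score(model, test_data):
    """Fraction of test samples whose predicted label matches the real one."""
    # Predict every sample of the test set in one go
    predictions = model.predict(test_data.map(lambda p: p.features))
    # Pair each real label with its prediction
    labels_and_predictions = test_data.map(lambda p: p.label).zip(predictions)
    correct = labels_and_predictions.filter(lambda lp: lp[0] == lp[1]).count()
    return correct / float(test_data.count())
```

Keep in mind that with 4827 valid messages out of 5574, a model that always answers "not spam" would already score about 87%, so accuracy alone flatters weak models on this data set.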

Classifying the data set with Spark MLlib

We’re going to use the following classification algorithms: Logistic Regression, Support Vector Machines, three tree-based classifiers and Naive Bayes.

Logistic Regression

Let’s start with Logistic Regression, the mother of all classifiers.

Support Vector Machines

What about SVMs, another popular algorithm?
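Both linear models share the same `train()` entry point in the RDD-based API. Here is a sketch with default parameters; I am assuming the LBFGS trainer for Logistic Regression and the SGD trainer for SVMs, and the notebook may well use different ones.

```python
def train_logistic_regression(training_data):
    """Train a binary Logistic Regression model with default parameters."""
    # Assumption: LBFGS-based trainer from the RDD-based MLlib API
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    return LogisticRegressionWithLBFGS.train(training_data)


def train_svm(training_data):
    """Train a linear SVM with default parameters."""
    from pyspark.mllib.classification import SVMWithSGD
    return SVMWithSGD.train(training_data)
```

Here, `training_data` is the cached RDD of labeled points built earlier.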


Tree-based classifiers

Now let’s try three variants of tree-based classification. The API is slightly different from previous algos.
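The difference is that tree trainers use `trainClassifier()` and need to be told which features are categorical. The sketch below assumes the three variants are Decision Trees, Random Forests and Gradient Boosted Trees; the number of trees is an arbitrary value of mine.

```python
def train_tree_models(training_data):
    """Train three tree-based binary classifiers and return them by name."""
    from pyspark.mllib.tree import DecisionTree, GradientBoostedTrees, RandomForest

    # An empty dict declares every feature as continuous (bucket occurrences)
    no_categorical = {}
    return {
        "decision_tree": DecisionTree.trainClassifier(
            training_data, numClasses=2,
            categoricalFeaturesInfo=no_categorical),
        "random_forest": RandomForest.trainClassifier(
            training_data, numClasses=2,
            categoricalFeaturesInfo=no_categorical,
            numTrees=16),  # assumption: arbitrary forest size
        # Gradient Boosted Trees only do binary classification: no numClasses
        "gradient_boosted_trees": GradientBoostedTrees.trainClassifier(
            training_data, categoricalFeaturesInfo=no_categorical),
    }
```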

Naive Bayes

Last but not least, let’s try the Naive Bayes classifier.

It is vastly superior to all other algos. Let’s try to predict a couple of real-life samples.
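Training and single-message prediction could be sketched like this. `predict_message` is a hypothetical helper; it assumes `hashing_tf` is the same `HashingTF` object that was used to build the training set.

```python
def train_naive_bayes(training_data, smoothing=1.0):
    """Train a Naive Bayes model; smoothing=1.0 is the library default."""
    from pyspark.mllib.classification import NaiveBayes
    return NaiveBayes.train(training_data, smoothing)


def predict_message(model, hashing_tf, message):
    """Hash a raw message exactly like the training data, then classify it."""
    features = hashing_tf.transform(message.split(" "))
    return model.predict(features)
```

A result of 1 means spam, 0 means a valid message.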

They were predicted correctly. This looks like a pretty good model. Now why don’t you try to improve these scores? I’ve used default parameters for most of the algorithms, so surely there is room for improvement :) You’ll find links to all APIs in the notebook, so feel free to tweak away!

This is great, but…

So far, we’ve only worked locally. This raises some questions:

  1. how would we train on a much larger data set?
  2. how would we deploy our model to production?
  3. how could we know if our model would scale?

These questions — scalability and deployment — are often the bane of Machine Learning projects. Going from “it works on my machine” to “it works in production at scale 24/7” usually requires a lot of work.

There is hope. Read on :)

SageMaker + XGBoost

Solving these pain points is at the core of Amazon SageMaker. Let’s revisit our use case.

Built-in algorithms

As we saw previously, there are plenty of classification algorithms. Picking the “right” one and its “best” implementation (good luck trying to define “right” and “best”) is not an easy task. Fortunately, SageMaker provides you with several built-in algorithms. They have been implemented by Amazon, so I guess you can expect them to perform and scale correctly :)

You can also bring your own code, your own pre-trained model, etc. To be discussed in future articles! More SageMaker examples are available on GitHub: regression, multi-class classification, image classification, etc.

Here, we’re going to use XGBoost, a popular implementation of Gradient Boosted Trees, to build a binary classifier.

In a nutshell, the SageMaker SDK will let us:

  • create managed infrastructure to train XGBoost on our data set,
  • store the model in SageMaker,
  • configure a REST endpoint to serve our model,
  • create managed infrastructure to deploy the model to the REST endpoint,
  • invoke the model on a couple of samples.

Let’s do this!

Setting up storage and data

First things first: S3 will be used to store the data set and all artifacts (what a surprise). Let’s declare a few things, then. Hint: the S3 bucket must be in the same region as SageMaker.

This implementation of XGBoost requires data to be either in CSV or libsvm format. Let’s try the latter, copy the resulting files to S3 and grab the SageMaker IAM role.
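For reference, a libsvm line is a label followed by `index:value` pairs. Here is a minimal sketch of the conversion and the copy to S3. The helper names are hypothetical, `s3_client` is assumed to be a boto3 S3 client, and one-based indices are my assumption: whichever convention you pick, use the same one for training and prediction.

```python
def to_libsvm_line(label, features, one_based=True):
    """Format a sparse {index: value} vector as one libsvm line."""
    offset = 1 if one_based else 0
    pairs = ["%d:%g" % (index + offset, value)
             for index, value in sorted(features.items())]
    return " ".join([str(label)] + pairs)


def upload_to_s3(s3_client, path, bucket, key):
    """Copy a local file to S3 and return its s3:// URI."""
    s3_client.upload_file(path, bucket, key)
    return "s3://%s/%s" % (bucket, key)
```

In a SageMaker notebook, the IAM role is typically obtained with `get_execution_role()` from the `sagemaker` package.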

Looking good. Now let’s set up the training job.

Setting up the training job

Amazon SageMaker uses Docker containers to run training jobs. We need to pick the container name corresponding to the region we’re running in.
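One simple way to handle this is a per-region lookup table. The sketch below uses a placeholder entry on purpose: the image URI shown is hypothetical, and the real values should come from the SageMaker documentation.

```python
def training_image(region, images_by_region):
    """Return the training container image for a region, or fail loudly."""
    try:
        return images_by_region[region]
    except KeyError:
        raise ValueError("No XGBoost container listed for region %r" % region)


# Hypothetical placeholder: replace with the image URIs from the SageMaker docs
XGBOOST_IMAGES = {
    "eu-west-1": "<account-id>.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest",
}
```

Failing on an unknown region is deliberate: a container from the wrong region would only surface later as a failed training job.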

Easy enough. Time to configure training. We’re going to:

  • Build a binary classifier,
  • Fetch the training and validation data sets in libsvm format from S3,
  • Train for 100 iterations on a single m4.4xlarge instance.

That’s quite a mouthful, but don’t panic:

  • Parameters common to all algorithms are defined in the CreateTrainingJob API documentation.
  • Algorithm-specific parameters are defined on the algorithm page, e.g. XGBoost.
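To give an idea of the shape of such a configuration, here is a sketch of a CreateTrainingJob request built as a plain dictionary. The channel names, volume size and maximum runtime are my assumptions; the objective, number of rounds and instance type come from the list above.

```python
def s3_channel(name, s3_uri, content_type="libsvm"):
    """Describe one input channel read from an S3 prefix."""
    return {
        "ChannelName": name,
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": s3_uri,
            "S3DataDistributionType": "FullyReplicated",
        }},
        "ContentType": content_type,
        "CompressionType": "None",
    }


def training_job_request(job_name, image, role_arn,
                         train_uri, validation_uri, output_uri):
    """Build the keyword arguments for the CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {"TrainingImage": image,
                                   "TrainingInputMode": "File"},
        "RoleArn": role_arn,
        "InputDataConfig": [s3_channel("train", train_uri),
                            s3_channel("validation", validation_uri)],
        "OutputDataConfig": {"S3OutputPath": output_uri},
        "ResourceConfig": {"InstanceType": "ml.m4.4xlarge",
                           "InstanceCount": 1,
                           "VolumeSizeInGB": 5},  # assumption
        # Hyperparameter values must be strings
        "HyperParameters": {"objective": "binary:logistic",
                            "num_round": "100"},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # assumption
    }
```

Scaling out is then a matter of changing `InstanceCount` and `InstanceType`, nothing else.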

Training and saving the model

OK, let’s get this party going. Time to start training.

6 minutes later, our model is ready. Of course, this is a bit long for such a small data set :) However, if we had millions of lines, we could have started a training job on multiple instances with the exact same code. Pretty cool, huh?
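Starting the job and waiting for it could be sketched as follows, assuming `sm_client` is a boto3 SageMaker client (`boto3.client("sagemaker")`) and `request` holds the CreateTrainingJob parameters. `run_training_job` is a hypothetical helper name.

```python
def run_training_job(sm_client, request):
    """Start a training job and block until it completes or stops."""
    sm_client.create_training_job(**request)
    job_name = request["TrainingJobName"]

    # Built-in boto3 waiter: polls the job status until it is final
    waiter = sm_client.get_waiter("training_job_completed_or_stopped")
    waiter.wait(TrainingJobName=job_name)

    description = sm_client.describe_training_job(TrainingJobName=job_name)
    status = description["TrainingJobStatus"]
    if status != "Completed":
        raise RuntimeError(
            "Training job %s ended with status %s" % (job_name, status))
    return description
```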

OK, let’s save this model in SageMaker. Pretty straightforward with the CreateModel API.
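Here is a sketch of that call, assuming `sm_client` is a boto3 SageMaker client and `training_description` is the response of DescribeTrainingJob for the finished job. The helper name `save_model` is hypothetical.

```python
def save_model(sm_client, model_name, image, role_arn, training_description):
    """Register the trained artifact as a SageMaker model, return its ARN."""
    # Location of the model artifact written to S3 by the training job
    model_data = training_description["ModelArtifacts"]["S3ModelArtifacts"]
    response = sm_client.create_model(
        ModelName=model_name,
        ExecutionRoleArn=role_arn,
        # The same container is used for training and for inference
        PrimaryContainer={"Image": image, "ModelDataUrl": model_data},
    )
    return response["ModelArn"]
```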

Creating the endpoint

Here comes the really good part. We’re going to deploy this model and invoke it. Yes, just like that.

First, we need to create an endpoint configuration with the CreateEndpointConfig API: we’ll use a single m4.xlarge for inference, with 100% of traffic going to our model (we’ll look at A/B testing in a future post).
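A sketch of such a configuration, again assuming a boto3 SageMaker client. The variant name is arbitrary, and a single variant with a weight of 1 receives all the traffic.

```python
def create_endpoint_config(sm_client, config_name, model_name):
    """Create a single-variant endpoint configuration, return its ARN."""
    response = sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",  # assumption: arbitrary name
            "ModelName": model_name,
            "InstanceType": "ml.m4.xlarge",
            "InitialInstanceCount": 1,
            # Only one variant, so it gets 100% of the traffic
            "InitialVariantWeight": 1.0,
        }],
    )
    return response["EndpointConfigArn"]
```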

Deploying the model

Now we can deploy our trained model on this endpoint with the CreateEndpoint API.
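Endpoint creation is asynchronous, so a deployment helper would typically wait until the endpoint is in service. A sketch, assuming a boto3 SageMaker client and a hypothetical helper name:

```python
def deploy_endpoint(sm_client, endpoint_name, config_name):
    """Create the endpoint, wait until it is in service, return its status."""
    sm_client.create_endpoint(EndpointName=endpoint_name,
                              EndpointConfigName=config_name)
    # Built-in boto3 waiter: polls until the endpoint is ready to serve
    sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
    return sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
```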

Invoking the endpoint

We’re now ready to invoke the endpoint. Let’s grab a couple of samples (in libsvm format) from the data set and predict them.
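Invocation goes through a separate runtime client (`boto3.client("sagemaker-runtime")`). The sketch below rests on several assumptions: the content type string, the 0.5 threshold, and the response being a list of probabilities separated by commas or newlines. Check the algorithm documentation for the exact formats.

```python
def predict_libsvm(runtime_client, endpoint_name, samples,
                   content_type="text/x-libsvm", threshold=0.5):
    """Send libsvm-formatted samples to the endpoint, return 0/1 labels."""
    payload = "\n".join(samples)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,  # assumption: see the lead-in
        Body=payload,
    )
    # The body is a stream: read it once and decode it
    text = response["Body"].read().decode("utf-8")
    scores = [s for s in text.replace("\n", ",").split(",") if s.strip()]
    # binary:logistic yields probabilities: threshold them into labels
    return [1 if float(s) > threshold else 0 for s in scores]
```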

Both samples are predicted correctly. Woohoo.


As you can see, SageMaker helps you run your Machine Learning projects end to end: notebook experimentation, model training, model hosting, model deployment.

If you’re curious about other ways you can use SageMaker (and if you can’t wait for the inevitable future posts!), here’s an overview I recorded recently.

That’s it for today. Thank you very much for reading.

This monster post was written while listening over and over (it WAS a long post) to this legendary Foreigner show from 1981.