Apache Spark and Amazon SageMaker, the Infinity Gems of analytics

In a previous post, I showed you how to build a spam classifier by running PySpark on an Amazon SageMaker notebook instance.

This is a fine setup for experimentation, but it's neither scalable nor automated enough for production. Now, let's go all the way and use a proper Amazon EMR cluster running Spark, with the SageMaker Spark SDK firing up training jobs.

Data Science Thanos: “With the SageMaker and Spark gems, even the mightiest data sets will bow to my will!”.
We also briefly touched upon the “why?”. There’s more to it and I just recorded a webinar on this topic… but I don’t want to spoil it :) Just rest assured that half the living universe *doesn’t* die at the end. I’ll share the video when it’s live on YouTube.

SDK examples

The SageMaker documentation includes the following examples of using SageMaker and Spark:

  • Classifying MNIST with the built-in XGBoost algorithm (PySpark).
  • Clustering MNIST with the built-in K-Means algorithm (PySpark and Scala).
  • Clustering MNIST with a Spark pipeline, running the PCA algorithm in MLlib and the built-in K-Means algorithm in SageMaker (Scala).

Copying and pasting from web pages is unpleasant, so I did it for you. I'm happy to share these examples in text and Zeppelin format on GitHub.

Zeppelin is pre-installed on EMR. Here’s how to set up remote connectivity from your machine to the EMR web applications.

Just make sure you update the IAM role used in the code (account number, role name) and you’re all set. Don’t thank me: I’m just doing my job :*)

I would suggest running all these examples and reading the SageMaker Spark SDK documentation to make sure you understand the finer points. Not that there are many: this felt pretty straightforward when I did it, so it shouldn't take you long either.

Revisiting our spam classifier

We’ll use the exact same technique as in the previous post (shamelessly copying and pasting here, ha!).

Our raw data set is composed of 1-line messages stored in two files:
- the ‘ham’ file: 4827 valid messages,
- the ‘spam’ file: 747 messages.
In order to classify these messages, we need to build an intermediate data set with two classes. For this purpose, we're going to use a simple but efficient technique called Feature Hashing:
- For each message in the data set, we first hash its words into a fixed number of buckets.
- Then, we build a vector indicating non-zero occurrences for each word: these are the features that will be used to decide whether a message is spam or not.
- For a valid message, the corresponding label will be zero, i.e. the message is not spam. Accordingly, for a spam message, the label will be one.
- Finally, we split the data set 80/20 for training and validation.
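To make the steps above concrete, here's a plain-Python sketch of the idea (not the actual PySpark code, which uses `HashingTF`; bucket count and helper names are my own choices):

```python
import random

NUM_BUCKETS = 1024  # fixed number of hash buckets (an assumption; size it to your vocabulary)

def hash_features(message, num_buckets=NUM_BUCKETS):
    """Hash each word into a bucket and count occurrences."""
    vector = [0] * num_buckets
    for word in message.split():
        vector[hash(word) % num_buckets] += 1
    return vector

def build_dataset(ham, spam, seed=42):
    """Label ham as 0 and spam as 1, shuffle, then split 80/20."""
    labeled = [(0, hash_features(m)) for m in ham] + \
              [(1, hash_features(m)) for m in spam]
    random.Random(seed).shuffle(labeled)
    cut = int(0.8 * len(labeled))
    return labeled[:cut], labeled[cut:]

train, valid = build_dataset(
    ["win a free prize now", "see you at lunch"],  # 'ham' messages
    ["free prize !!!"] * 3,                        # 'spam' messages
)
```

In Spark, `HashingTF` does the bucketing for you and `randomSplit([0.8, 0.2])` handles the split, but the mechanics are the same.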

Nothing new here. I just added a bit of clean-up: convert everything to lower case, remove all punctuation and numbers, trim white spaces. In real life, you'd certainly want to do more (stemming, etc.).
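That clean-up step fits in a few lines. A minimal sketch (the function name is mine):

```python
import re

def clean(message):
    """Lower-case, drop punctuation and digits, collapse and trim whitespace."""
    message = message.lower()
    message = re.sub(r"[^a-z\s]", " ", message)   # remove punctuation and numbers
    return re.sub(r"\s+", " ", message).strip()   # trim white spaces

print(clean("WIN a £1000 prize!!!  Call NOW."))  # → "win a prize call now"
```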

Once we’re done, the data set looks like this: each line holds a label and a feature vector.

Training with XGBoost

Now we get to the good stuff: let’s use the high-level API in the SageMaker Spark SDK to train with XGBoost.

First, we have to convert the training set to libsvm format, which is what XGBoost expects. Fortunately, Spark has a friendly API for this.
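For reference, here's what the libsvm format itself looks like: one sample per line, the label first, then only the non-zero features as 1-based `index:value` pairs. A tiny sketch of the formatting (in Spark you'd let the MLlib utilities write the file rather than doing it by hand):

```python
def to_libsvm(label, features):
    """Format one sample as a libsvm line: '<label> <index>:<value> ...'.
    Indices are 1-based and only non-zero features are listed."""
    pairs = [f"{i + 1}:{v}" for i, v in enumerate(features) if v != 0]
    return " ".join([str(label)] + pairs)

print(to_libsvm(1, [0, 2, 0, 1]))  # → "1 2:2 4:1"
```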

Most if not all other built-in algorithms can train on protobuf data: the SageMaker Spark SDK will automatically handle conversion between DataFrames and protobuf. XGBoost is an open-source algorithm, which is why it’s a bit different in this respect.

Then, we set up our training and prediction infrastructure in a single API call: this is even simpler than the regular SageMaker SDK.

Finally, we set up hyperparameters and start training.
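Putting those three steps together, the configuration looks roughly like this. This is a sketch, not runnable on its own: it needs a Spark cluster with the sagemaker_pyspark package installed and an IAM role allowed to call SageMaker, and the role ARN, instance types and hyperparameter values below are placeholders:

```python
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

# One call sets up both training and prediction infrastructure
estimator = XGBoostSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/MySageMakerRole"),
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",     # instance backing the prediction endpoint
    endpointInitialInstanceCount=1,
)

# Hyperparameters for binary classification
estimator.setObjective("binary:logistic")
estimator.setNumRound(25)

# fit() starts the SageMaker training job, then deploys the model to an endpoint
model = estimator.fit(training_df)  # DataFrame with 'label' and 'features' columns
```

Training on 100 instances would just mean changing `trainingInstanceCount`.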

This fires up our training instance from Spark and after a few minutes, voila! Our model has been trained…

… and deployed.

Isn’t this the simplest integration ever? :) If you needed to train on 100 instances, things wouldn’t be any different.

Predicting with XGBoost

Predicting is simple too: we just need to convert the test data set to libsvm and predict.

Here’s what the predicted data and the accuracy look like.
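In Spark you'd compare the prediction column against the label column; stripped down to plain Python, the accuracy computation is just:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # → 0.75
```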

What about the EMR cluster? Well, it didn’t even blink. It takes more than this tiny data set to put four m4.2xlarge instances to work :)

Still, you could be processing huge data sets in exactly the same way.

You don’t usually get the best of both worlds in life (most of us would settle for the best of *one* world, right?). However, when it comes to analytics at scale, this Spark-SageMaker combo has a lot going for it.

Very curious to see what you will build with this, so go and shake the universe, my friends!

That’s it for today. As always, thank you for reading. Happy to answer questions here or on Twitter.


I guess this advice also stands for analytics at scale \m/