My First Foray into Spark MLlib

Dan Piltzer
Red Ventures Data Science & Engineering
8 min read · Mar 19, 2018

As a data science team that’s always working to increase efficiency, especially when it comes to pushing a predictive model into production, we are constantly looking to improve our tech stack and methodology. Our team has begun the move to Spark in an effort to improve the distribution of load on our machines when training models and the speed and ease of serializing those models in a way that allows real-time scoring. For someone who has worked exclusively within the confines of R and Python, and feels much more comfortable using packages such as scikit-learn for modeling and dplyr for data transformations, the learning curve to jump over to Spark seemed dauntingly steep. On top of all that, for serialization purposes (see our lead engineer’s post here) we also made the decision to move to Scala, a language I had no experience working with. Piece of cake, right?

In this blog post I share the five essential concepts I wish someone had shared with me before my first Spark project, in the hope that they help with your first attempt at leveraging Spark.

1. Scala can be Object Oriented

While this may not be new to some, and wasn’t even new to me when I moved over to Scala, I found it helpful to take a refresher on what it means to work with an object-oriented programming language. Much like Python, Scala lets you build programs out of class objects that have variables and methods. While it can also be used as a functional language, when using MLlib you interact with data by instantiating objects and calling their methods to produce the desired effect. Here’s a visual I found helpful:

So what would creating and then using this class look like in practice? Check out below:

class Student {
  var name: String = _
  var rollNumber: Int = 0

  // Each setter returns this so the calls can be chained
  def setName(value: String): this.type = { name = value; this }
  def setRollNumber(num: Int): this.type = { rollNumber = num; this }
}

val newStudent = new Student().setName("John").setRollNumber(102)

Because each setter returns the object itself, the calls can be chained. This looks a bit different from functional R code, which requires the object to be passed as an argument to a standalone function.

I don’t want to dwell on this too long though, so for more information to help you in your pursuit of mastering the object-oriented nature of Scala check out this helpful documentation here.

2. Spark SQL

“Spark SQL is the real deal. Dplyr tries to be SQL but Spark SQL actually is”. These were the words a coworker told me at the onset of my project, and I don’t think I appreciated them nearly enough at the time. I was busy feeling lost without my beloved mutate, filter, select, group_by, and summarise functions. What I soon found out, though, was that every single chain of piped dplyr functions I had ever used could be represented in an even simpler query. Yes, it takes a little time to reset the way your mind wants to think about cleaning data, but, if anything, SQL gives you even more functionality than you had before.

import org.apache.spark.sql.DataFrame
yourDataFrame.createOrReplaceTempView("yourTemporaryView")

Taking a DataFrame and creating a temporary view that you can treat like a Hive or SQL table in Spark SQL is as easy as the statement above. Once you’ve created your temporary view, you can run any query statement against it to create a new DataFrame (as seen below).

import org.apache.spark.sql.functions._

val newDataFrame = sqlContext.sql("select * from yourTemporaryView where 1=1")

The syntax may differ slightly from what you’re used to in some instances. For example, I had used Transact-SQL for the previous two years, so it caught me by surprise that Spark SQL uses an if() function in place of T-SQL’s iif().
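To make that concrete, here’s a sketch of the kind of grouped query that replaced my dplyr group_by/summarise chains. The view and column names (customers, state, revenue, is_active) are invented for illustration, and I’m reusing the same sqlContext from above:

// Hypothetical view and column names, reusing the sqlContext from above
val stateSummary = sqlContext.sql("""
  select
    state,
    count(*)                 as customer_count,
    avg(revenue)             as avg_revenue,
    sum(if(is_active, 1, 0)) as active_customers
  from customers
  group by state
""")
stateSummary.show()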

To understand the nuances and other capabilities of Spark SQL I highly recommend you explore it more for yourself here.

3. RDDs Are Tricky

… and need to be fully understood before they are used. An RDD, or Resilient Distributed Dataset, is one of the ways Spark stores data. What makes RDDs great is that the memory needed to store them can be distributed across many machines, making big data much more manageable to work with.

So why are they so difficult to work with? Because storing information in many places can make it much more challenging to manipulate. If you’re like me and enjoy understanding your data set before modeling, you’re most likely performing cleaning and grouping functions on it. Operations like aggregation, or computing an average, become really hard to code against a raw RDD, and you’ll also find run times to be much longer for operations like count.
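As a quick sketch of why this hurts, here’s a per-group average computed by hand on an RDD versus the same thing on a DataFrame. The (group, value) pairs are made up, and I’m assuming the usual spark-shell handles sc and spark:

import org.apache.spark.sql.functions.avg
import spark.implicits._   // assumes a SparkSession named spark, as in spark-shell

// Invented example data: (group, value) pairs
val pairs = Seq(("a", 1.0), ("a", 3.0), ("b", 2.0))

// RDD route: track sums and counts by hand, then divide
val rddAverages = sc.parallelize(pairs)
  .mapValues(v => (v, 1L))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }

// DataFrame route: the intent fits in one groupBy/agg
val dfAverages = pairs.toDF("group", "value").groupBy("group").agg(avg("value"))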

This blog post does a good job explaining the functionality of an RDD, as well as other objects in Spark used to store data, in a way that compares the tradeoffs of each.

All the learnings I’ve brought up thus far came about while navigating Spark in general. Now let’s take a closer look at what makes MLlib unique.

4. Pipeline Stages: Transformers and Estimators

To preface what I really want to get to here, I think it’s important to quickly visit the idea of a machine learning Pipeline. At the risk of oversimplifying, you can think of it as an object that takes in a raw data set and outputs a Pipeline Model, which we’ll ultimately use to make predictions (I’ll get into the differences between Pipelines and Pipeline Models a little later).

It’s also important to note that a Pipeline is made up of individual steps, or stages, that determine the order of operations on data as it flows through. These stages can be either Transformers or Estimators. So how do you know which to choose when manipulating your data?

To decide, you must first understand the difference between the two. A Transformer is a class object that wraps a function: it takes in a data set and returns a new data set with one or more columns appended containing that function’s output. Hopefully the visual below helps illustrate this.

A visualization of a basic transformer

So what is an Estimator? You can think of an Estimator as an object that needs to learn additional information about your data set in order to become a Transformer. A great example of this is imputation with the mean on a numeric column. In order to properly impute missing values, your Estimator must first learn the mean of the values that exist. Once this value is learned, the Estimator stores it away and becomes a Transformer that replaces missing values in your column with that stored mean.

A visualization of a basic estimator
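That mean-imputation example maps directly onto one of Spark ML’s built-in Estimators, the Imputer (available since Spark 2.2). Here’s a minimal sketch, where trainingData and the income column are stand-ins for your own data:

import org.apache.spark.ml.feature.Imputer

// Estimator: it doesn't know the mean yet
val imputer = new Imputer()
  .setInputCols(Array("income"))            // assumed numeric column
  .setOutputCols(Array("income_imputed"))
  .setStrategy("mean")

// fit() learns the mean and hands back a Transformer (an ImputerModel)
val imputerModel = imputer.fit(trainingData)
val imputedData = imputerModel.transform(trainingData)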

Some great examples of Estimators are one-hot-encoders (which need to learn the levels of your categorical column) and logistic regression (which needs to learn the coefficients of your feature set). Examples of Transformers include tokenizers and vector assemblers.
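For contrast, a Tokenizer is a plain Transformer: it has nothing to learn, so there’s no fit step. A quick sketch, where rawData and its text column are assumptions:

import org.apache.spark.ml.feature.Tokenizer

// Transformer: configure it and call transform() directly, no fitting required
val tokenizer = new Tokenizer()
  .setInputCol("text")      // assumed string column
  .setOutputCol("words")

val tokenizedData = tokenizer.transform(rawData)   // appends a "words" array column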

So how do you “magically” teach Estimators the information they need to become Transformers? It’s as simple as using their fit method.

import org.apache.spark.ml.classification.LogisticRegression

val lrEstimator = new LogisticRegression()
val lrTransformer = lrEstimator.fit(trainingData)

We don’t want to have to train all our Estimators individually, so fortunately we can assemble all our Estimators and Transformers in a pipeline and train them at the same time. For more information check out this well laid out documentation.

5. Pipelines vs Pipeline Models

I’ve brought up Pipelines already in talking about the stages that define them, but what are they really? This is a concept I thought I understood going into my first Spark project but found I was sorely mistaken when the topic of Pipeline Models was brought up. My first thought was that these are interchangeable ways of describing the same object, but they aren’t.

Pipelines and Pipeline Models have a similar relationship to Transformers and Estimators. We now know that an Estimator is an object that needs to learn your data’s context in order to provide you with a Transformer. A Pipeline needs to similarly learn your data’s context, or be “fit” on your training data, to provide you with a Pipeline Model.

The Pipeline object above can be turned into a pipeline model using the fit method, just like an Estimator

import org.apache.spark.ml.{Pipeline, PipelineModel}

// tokenizer and hashingTF are feature-stage Transformers assumed to be defined elsewhere
val ourPipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lrEstimator))
val ourPipelineModel: PipelineModel = ourPipeline.fit(trainingData)

It’s really that simple. But why do we want a Pipeline Model instead of a Pipeline? Simply put, a Pipeline Model is the object with the transform method that lets us apply the same algorithm to new data. You can think of it like using the predict method in scikit-learn.

In order to predict on additional data you need only do the following:

val transformedTestingData = ourPipelineModel.transform(testingData)

This transformed data should have all the additional columns your Transformers and Estimators were set to create.
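For the logistic regression pipeline sketched above, that means columns like rawPrediction, probability, and prediction (the LogisticRegression defaults) now sit alongside your original ones. A quick way to eyeball them:

transformedTestingData.select("probability", "prediction").show(5)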

For additional learning, I have found the code examples in this blog post very helpful in making these concepts more concrete.

Wrapping Up

There is far more to learn about Spark before you feel comfortable as a developer with MLlib. I understand I haven’t begun to scratch the surface of further concepts such as creating your own custom transformers, serializing Pipeline Models for production purposes, or working with various data source connections to make your data available. But that’s ok! All I hope is that this post has given you a jump start in getting off the ground with your first Spark adventure.

I now encourage you all to go out there, pick a project (maybe even one you’ve already written in R or Python), and dive in. You should now have a general understanding of what you’ll use to clean your data (Spark SQL), transform your data (Transformers), model your data (Estimators), and fit your model (the Pipeline fit method). Happy coding!

Interested in building machine learning pipelines in Apache Spark with a fast-paced, outcome oriented team? Check us out at redventures.com.
