MLeap: Quickly Release Spark ML Pipelines

Published in Driven by Code · 5 min read · Mar 21, 2016

MLeap allows you to quickly deploy your Spark-trained ML pipelines to production. Learn how to go from a raw CSV file to serving a random forest price-prediction model from a JSON API server in just a few minutes.

If you work at a company that uses machine learning, this problem may be all too familiar: data scientists use a myriad of tools to analyze, clean, and build offline ML models, only to throw the work over to an engineering team when it comes time to deploy. The engineers take the data scientists’ work, completely reproduce it in a different environment, validate it against the original models, and, after many hours, productionize the ML model. That’s a lot of work.

We can do better than this. There are already several solutions out there for productionizing ML pipelines, such as prediction.io, H2O.ai, and y-hat. MLeap offers a unique solution to the problem by allowing you to directly export Spark models built with Spark ML to the production-ready MLeap runtime. Let’s take a look at an example of how we can build a production-ready pipeline in Spark and deploy it to a RESTful API server within minutes. For our example we will analyze a dataset from Airbnb and try to predict the price of a listing based on several features.

The Dataset
Let’s take a look at the data in our Airbnb dataset and the variables we will be using to generate our ML pipelines. This is a set of all Airbnb listings taken from Inside Airbnb. Our dependent variable will be price, which is the price per night for that listing.

Here are the features we will use:
Continuous features
• bathrooms
• bedrooms
• security_deposit
• cleaning_fee
• extra_people
• number_of_reviews
• review_scores_rating

Categorical features
• room_type
• host_is_super_host
• cancellation_policy
• instant_bookable

The Pipeline
We are going to train a random forest regression model to predict the price of the listing. We will do this in two steps:

1. Train the feature pipeline against our complete dataset
2. Train the random forest regression against a randomly-split training dataset
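As a rough sketch of these two steps (not the actual TrainDemoPipeline code), the flow might look like this in Spark ML. The object and method names, the 70/30 split, the seed, and the `features` column name are assumptions made for illustration; `price` is the label described above and is assumed to already be a numeric column.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not part of the MLeap demo.
object TwoStepTrainingSketch {
  // Random forest regressor reading the assembled feature vector and the price label.
  // The "features" column name is an assumption; it must match the feature pipeline's output.
  def forestEstimator(): RandomForestRegressor =
    new RandomForestRegressor()
      .setFeaturesCol("features")
      .setLabelCol("price")

  def train(dataset: DataFrame,
            featurePipeline: Pipeline): (PipelineModel, RandomForestRegressionModel) = {
    // Step 1: fit the feature pipeline against the complete dataset.
    val featureModel = featurePipeline.fit(dataset)

    // Step 2: fit the random forest on a random split of the transformed data.
    // The 70/30 weights and the seed are assumed values, not taken from the demo.
    val Array(training, _) =
      featureModel.transform(dataset).randomSplit(Array(0.7, 0.3), seed = 42L)

    (featureModel, forestEstimator().fit(training))
  }
}
```

Fitting the feature pipeline on the complete dataset lets the scaler and string indexers see every value, while the forest itself only sees the training split.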

The creation of our feature pipeline will be broken into the following steps:

1. Assemble the continuous features into a vector
2. Scale the continuous feature vector
3. String index our categorical features
4. Assemble the categorical feature indices into a vector
5. Assemble the scaled continuous features and categorical indices vector into one final output feature vector used for model training
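The five steps above can be sketched in Spark ML roughly as follows. This is an illustration rather than the demo's actual code: the object name and the intermediate column names are made up, and it assumes the continuous columns are already numeric and the categorical columns are strings.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StandardScaler, StringIndexer, VectorAssembler}

// Hypothetical helper, not part of the MLeap demo.
object FeaturePipelineSketch {
  val continuousFeatures: Seq[String] = Seq(
    "bathrooms", "bedrooms", "security_deposit", "cleaning_fee",
    "extra_people", "number_of_reviews", "review_scores_rating")

  val categoricalFeatures: Seq[String] = Seq(
    "room_type", "host_is_super_host", "cancellation_policy", "instant_bookable")

  def build(): Pipeline = {
    // 1. Assemble the continuous features into a vector.
    val continuousAssembler = new VectorAssembler()
      .setInputCols(continuousFeatures.toArray)
      .setOutputCol("unscaled_continuous_features")

    // 2. Scale the continuous feature vector.
    val scaler = new StandardScaler()
      .setInputCol("unscaled_continuous_features")
      .setOutputCol("scaled_continuous_features")

    // 3. String index each categorical feature.
    val indexers: Seq[PipelineStage] = categoricalFeatures.map { name =>
      new StringIndexer().setInputCol(name).setOutputCol(s"${name}_index")
    }

    // 4. Assemble the categorical indices into a vector.
    val categoricalAssembler = new VectorAssembler()
      .setInputCols(categoricalFeatures.map(name => s"${name}_index").toArray)
      .setOutputCol("categorical_features")

    // 5. Combine both vectors into the final feature vector used for training.
    val featureAssembler = new VectorAssembler()
      .setInputCols(Array("scaled_continuous_features", "categorical_features"))
      .setOutputCol("features")

    val stages: Seq[PipelineStage] =
      Seq[PipelineStage](continuousAssembler, scaler) ++ indexers ++
        Seq[PipelineStage](categoricalAssembler, featureAssembler)

    new Pipeline().setStages(stages.toArray)
  }
}
```

Calling `FeaturePipelineSketch.build().fit(dataset)` on a DataFrame containing these columns would yield a fitted pipeline whose output includes the `features` vector column.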

Train the Model
Train the ML pipeline using the TrainDemoPipeline driver included in the demo code (see below).

Install SBT
If you haven’t done so already, install SBT, a build tool similar to Maven or Ant that is primarily used with Scala projects.
If you use Homebrew, run the command below; otherwise, install SBT by following its official download instructions.

brew install sbt

Get the Demo Source Code

git clone https://github.com/TrueCar/mleap-demo.git
cd mleap-demo

Download the Training Dataset

curl https://s3-us-west-1.amazonaws.com/mleap/blog-2016-3-5/airbnb.csv.gz \
-o /tmp/airbnb.csv.gz && gunzip /tmp/airbnb.csv.gz

Build the Demo Assembly and Train the Model

If you downloaded the sample dataset to /tmp/airbnb.csv, then run the command with these values:

sbt "demo/run /tmp/airbnb.csv /tmp/transformer.mleap"

(If you downloaded it elsewhere, replace /tmp/airbnb.csv with the path to your copy.)

Deploy a JSON API Server
Next we will take the MLeap model and deploy it using our demo API server.

Start the Demo API Server

sbt "server/run /tmp/transformer.mleap"

This will fire up a local server running on port 8080 that is ready to transform incoming LeapFrames. Let’s try a sample curl command. Download our sample frame.json file; we will send it to our server to transform and return the results.

Download the Sample LeapFrame

curl https://s3-us-west-1.amazonaws.com/mleap/blog-2016-3-5/frame.json -o /tmp/frame.json

Predict a Listing Price

curl -v -XPOST \
-H "content-type: application/json" \
-d @/tmp/frame.json http://localhost:8080/transform

And voilà! We have our transformed LeapFrame with our price prediction as the last value in the array. This transformation did not use a SparkContext, and we did not have to include any Spark libraries to make it happen. On average, the actual transformation takes about 0.011 ms, with serialization and deserialization taking up the majority of the time in our API server. If we were to take the “older” approach of serializing our Spark pipeline with Kryo and then running it with a local SparkContext on our API server, the average transformation time would increase to 22 ms. In other words, MLeap transformations currently execute around 2000x faster than out-of-the-box Spark transformations for one-off requests.
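To make the arithmetic behind that comparison explicit, here is a small check using only the two averages quoted above. The object name and the derived per-second figures are illustrative; they assume a single thread and ignore the serialization and HTTP overhead mentioned in the text.

```scala
// Hypothetical helper that restates the averages quoted in the article.
object SpeedupCheck {
  // Average time per transformation, in milliseconds.
  val mleapMs = 0.011
  val sparkMs = 22.0

  // Ratio of the Spark average to the MLeap average.
  def speedup: Double = sparkMs / mleapMs

  // Transformations per second implied by an average time in milliseconds.
  def perSecond(avgMs: Double): Double = 1000.0 / avgMs

  def main(args: Array[String]): Unit = {
    println(f"speedup: ${speedup}%.0fx")
    println(f"MLeap: ${perSecond(mleapMs)}%.0f transforms/s")
    println(f"Spark: ${perSecond(sparkMs)}%.0f transforms/s")
  }
}
```

The ratio works out to 22 / 0.011 = 2000, which is where the 2000x figure comes from.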

Benchmarks

Benchmarks were performed on a MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB of RAM.
Java info (Oracle JDK):
• java version "1.8.0_25"
• Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
• Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
Scala version: 2.11.7

MLeap Runtime Benchmark
This benchmark takes the transformer we built and transforms the sample dataframe over and over again. As expected, total transformation time increases linearly with the number of transformations performed. For this transformation, the average time is about 0.011 ms.
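A repeated-transform measurement of this kind can be sketched as below. This is not the benchmark code used for the numbers in this post; the object and method names are hypothetical, and the `transform` argument stands in for whatever call applies the transformer to the sample frame.

```scala
// Hypothetical timing harness, not the benchmark used in the article.
object BenchmarkSketch {
  // Runs `transform` the given number of times and returns the average time per call in milliseconds.
  def averageMillis(iterations: Int)(transform: () => Unit): Double = {
    require(iterations > 0, "iterations must be positive")

    // Warm-up pass so JIT compilation does not dominate the measurement.
    var i = 0
    while (i < iterations) { transform(); i += 1 }

    // Timed pass.
    val start = System.nanoTime()
    i = 0
    while (i < iterations) { transform(); i += 1 }
    val elapsedNanos = System.nanoTime() - start

    elapsedNanos / 1e6 / iterations
  }
}
```

For example, `BenchmarkSketch.averageMillis(100000)(() => runTransform())` would report the average for a `runTransform` function of your own.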

[Figure: MLeap Runtime Benchmark]

Spark Benchmark
This benchmark takes the Spark transformer we built and transforms the sample dataframe over and over again using a SparkContext. We are trying to simulate what it would be like to serialize our Spark pipelines with Kryo, load them into an API server, and execute them for real-time, one-off requests using a local SparkContext with master set to “local[2]”. Every single request requires the creation of a plan for the dataframe. This is an expensive operation, which is why performance is so much lower. The average time for a transform pipeline to execute is about 22 ms.

[Figure: Spark Model Benchmark for transforming dataframes]

Get the Code/Data
Get the code for the MLeap demo at https://github.com/TrueCar/mleap-demo

git clone https://github.com/TrueCar/mleap-demo.git

Get the source code for MLeap at https://github.com/TrueCar/mleap

git clone https://github.com/TrueCar/mleap.git

MLeap is published to Sonatype snapshots, so all you have to do is add the snapshots resolver and include MLeap as a dependency of your project:

// Sonatype snapshots are not among sbt's default resolvers
resolvers += Resolver.sonatypeRepo("snapshots")

// I just want to run my MLeap pipeline
libraryDependencies += "com.truecar.mleap" %% "mleap-runtime" % "0.1-SNAPSHOT"

// I want to train a Spark pipeline and export it to MLeap
libraryDependencies += "com.truecar.mleap" %% "mleap-spark" % "0.1-SNAPSHOT"

How to Contribute
1. Try using MLeap for a project at your company and tell us how it works for you.
2. File a bug report or feature request in the GitHub tracker.
3. Contribute an estimator/transformer pair to the project.
4. Create a pull request with a feature or bug fix.
The project lives at https://github.com/TrueCar/mleap

Welcome to TrueCar’s technology blog, where we write about the interesting things we’re working on. Read, engage, and come work with us!