MLeap: Providing (Near) Real-time Data Science with Apache Spark

Published in

Red Ventures Data Science & Engineering

6 min readAug 25, 2017

How MLeap allowed us to scale our existing predictive platform from our local machines to Apache Spark in the cloud with zero loss of functionality and sub-second response times.

At Red Ventures, we partner with the nation’s top brands to seamlessly connect customers with the products and services they need most using our advanced digital marketing and sales capabilities. Along a customer’s journey, each interaction with Red Ventures presents an opportunity to make an influential decision: from the website creative they see to the time they spend waiting in a queue to speak to an agent. However, those decisions aren’t meaningful if they can’t use data and make a recommendation in real time. To that end, we have developed a machine learning platform that is constantly making decisions and is constantly learning to account for new data and trends.

A simplified view of the Red Ventures marketing funnel

A key challenge facing most data scientists today is how to take a model from on their computer to production and enable decision-making within their organization using the models they have built. To facilitate this, and to remove any potential roadblocks that may exist without having to double-engineer predictive solutions, we’ve tried to utilize technologies that allow us to code a solution once for training, and then reuse the results of that training for our real-time scoring. In other words, we’re trying to avoid the problem of writing modeling code one way and then writing completely different code to do evaluation.

Historically, we’ve used PMML to accomplish this. Our data scientists code their models using R or Python, and we utilize R and Python libraries to transcode those language specific models into PMML. We have a Scala-based runtime system (labeled Databroker in the following diagram) for doing context widening (feature engineering) and real-time evaluation based on the JPMML evaluator. This has worked well for us — our data-scientists code a solution and we rebuild their solutions on a nightly basis using traditional CI/CD; the end result goes into our real-time scoring solution, and we can score models using simple REST calls to our service. Life is good.

It’s not all unicorns and rainbows though; some modeling techniques cannot be represented as PMML (or at very least haven’t been implemented in the encoder/evaluator) and there isn’t huge community support for PMML. Extending the technology to add our own functionality has proved to be problematic. This has lead us to doing arcane things to our PMML on the fly — XML modifications, node additions/removals, things that we’d rather not do. Plus, as we continue to try to leverage newer technologies and frameworks (Apache Spark), we face the problem of needing to re-solve all the things we’d previously hacked our way through. While Spark does have PMML support, PMML still has limited functionality and would still require direct editing of XML trees to achieve our goals. PMML has become the limiting factor for data science at Red Ventures, and we need to find an alternative solution.

Enter MLeap, which purports to fill the exact need we have in our ecosystem — serialize any PySpark/SparkML/TensorFlow pipeline into a bundle that can be loaded at runtime and scored against in real-time. We came across MLeap as we were researching PMML alternatives. The criteria for a potential alternative was straight-forward: something compatible with Apache Spark, actively developed, easy-to-understand, and extensible. Naturally, when I encountered the following slide from the MLeap presentation at Spark Summit 2016, my interest was piqued.

You had my curiosity, but now you have my attention.

We set about exploring what it would take to prototype usage of MLeap inside our existing environment. MLeap is written in Scala so it works fine inside our existing Scala-based application, which also means with a bit of refactoring we could serve MLeap models alongside our existing PMML models, so nothing downstream of our service would need to be aware of the backing technology for the predictions we are serving — provided we conform to the existing data contracts. This would give us the power to leverage new technologies for new projects as well as the flexibility to upgrade existing projects to those newer technologies while maintaining backwards compatibility. The MLeap docs also indicated it’s possible (and encouraged) to roll your own transformers/estimators that can be used inside your scoring pipelines that can fully take advantage of their serialization/deserialization/evaluation process.

Armed with this promising information, we started our our mission to make it real. A colleague of mine took on the task of solving one of our existing problems with Apache Spark MLlib — specifically one that had been problematic to solve with PMML in the past. To solve this problem, we had to roll a couple of custom Spark transformers and I’m pleased to say that the process was relatively painless. The MLeap documentation on the topic is really straightforward and pattern-matching off of that and the MLeap source code, we were able to quickly create our own pipeline stages. Once we got the Spark side working, we needed to refactor our existing Scala app so that it could speak MLeap, which in the end didn’t take very long. We had to do some workarounds with version 0.7.0 specific to exposing the expected pipeline schema; however the developers have been open to collaboration and there will be a proper fix for that in 0.8.0. I can personally speak to the reality that all software has unanticipated challenges, but the MLeap team has been excited to use our real-world use case to improve the product. After our scoring app was updated, we just needed to define the workflow that would run our Spark jobs on some regular basis, publish the resulting MLeap bundle to S3, and automatically sync that down to our scoring hosts.

The results are very compelling — our initial pipeline contained 40 stages (give or take a couple) and had a logistic regression step at the very end. If we filter out the extra overhead our Scala app introduces with widening context and adding features, we can evaluate the resulting MLeap bundle in 4ms. Adding all of our feature widening back in brings us to around 30–50ms, which is well within our tolerances for how a real-time prediction solution should behave. These early results give us the confidence that regardless of the depth of our pipelines or the modeling techniques being leveraged we’re going to get execution times that are manageable and allow us to predict at scale with tremendous speed.

To recap, we’ve hit a pivotal point in the development of our data science platform at Red Ventures:

We need to scale beyond the scope of what we can accomplish with what can be represented as PMML
We want to continue to serve real-time predictions in sub-100ms times
We want to take advantage of newer modeling technologies like Apache Spark and Tensorflow
And we need a serialization library that will allow us to do all of this.

Our experiences with MLeap have been extremely positive, and we’re looking forward to what the possibilities here allow us to achieve in the future.

MLeap: Providing (Near) Real-time Data Science with Apache Spark

Written by Cory Locklear