Build End-to-End AI Pipelines Using Ray and Apache Spark

Jason Dai
Distributed Computing with Ray
Sep 2, 2020 · 6 min read

In my previous blog for our CVPR 2020 tutorial, I shared our efforts in scaling AI for distributed Big Data, a key challenge that arises as AI moves from experimentation to production. In addition to scaling, another key challenge in practice is how to seamlessly integrate data processing and AI programs into a single unified pipeline. Conventional approaches set up two separate clusters, one for big data processing and the other dedicated to deep learning and AI. This not only introduces significant data transfer overhead, but also requires additional effort to manage separate workflows and systems in production. In this post, I will share our efforts in building end-to-end big data and AI pipelines using Ray* and Apache Spark* (on a single Xeon cluster with Analytics Zoo).

RayOnSpark: Seamlessly Integrate Ray into a Big Data Platform

Ray is a distributed framework for emerging AI applications open-sourced by UC Berkeley RISELab. It implements a unified interface, distributed scheduler, and distributed and fault-tolerant store to address the new and demanding systems requirements for advanced AI technologies. Ray allows users to easily and efficiently run many emerging AI applications, such as deep reinforcement learning using RLlib, scalable hyperparameter search using Ray Tune, automatic program synthesis using AutoPandas, etc.
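As a minimal illustration of Ray's core programming model (independent of the libraries above), remote functions turn ordinary Python functions into tasks that Ray schedules across the cluster; the snippet below is a generic sketch and is not taken from any of the applications mentioned.

```python
import ray

ray.init()  # connect to (or start) a Ray cluster

@ray.remote
def square(x):
    # An ordinary Python function turned into a Ray task; calls are scheduled
    # asynchronously on any available worker in the cluster.
    return x * x

futures = [square.remote(i) for i in range(4)]  # returns futures immediately
print(ray.get(futures))                         # [0, 1, 4, 9]
```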

Analytics Zoo seamlessly integrates Ray into the Big Data platform through its RayOnSpark [1] support, which runs Ray programs directly on Apache Hadoop* or Kubernetes*, so that users can easily try various emerging AI applications on their existing Big Data clusters. In addition, it allows Ray applications to seamlessly integrate into the Big Data processing pipeline and run directly on in-memory Spark RDDs or DataFrames.

Figure 1. RayOnSpark Architecture

Figure 1 illustrates the architecture of RayOnSpark. In the Spark implementation, a Spark program runs on the driver node and creates a SparkContext object, which is responsible for launching multiple Spark executors on the cluster to run Spark jobs. In RayOnSpark, the Spark driver program can also create a RayContext object, which automatically launches Ray processes alongside each Spark executor; it also creates a RayManager inside each Spark executor to manage the Ray processes (e.g., automatically shutting them down when the program exits).
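Concretely, launching RayOnSpark amounts to creating the RayContext from the Spark driver. The snippet below is a minimal sketch assuming the Analytics Zoo RayOnSpark API of that period (init_spark_on_yarn and zoo.ray.RayContext); the module paths, parameter names, and all values are illustrative and may differ across versions.

```python
from zoo import init_spark_on_yarn
from zoo.ray import RayContext

# Launch Spark executors on an existing YARN cluster (the conda environment is
# shipped to the executors); all values below are placeholders.
sc = init_spark_on_yarn(
    hadoop_conf="/path/to/hadoop/conf",
    conda_name="my_env",
    num_executors=4,
    executor_cores=8)

# Launch Ray processes alongside each Spark executor and start the Ray cluster.
ray_ctx = RayContext(sc=sc, object_store_memory="8g")
ray_ctx.init()
```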

Figure 2. Write Ray code inside the Spark program

As a result, one can directly write Ray code inside the Spark program (as illustrated in Figure 2), and we have built several advanced end-to-end AI applications (such as AutoML) using RayOnSpark for real-world use cases.
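For illustration, the sketch below uses a plain Ray actor; once RayContext.init() has been called in the Spark driver (or ray.init() when running locally), such actors are scheduled onto the Ray processes running next to the Spark executors. The actor itself is a generic example, not the code shown in Figure 2.

```python
import ray

@ray.remote
class Counter:
    # A plain Ray actor: a stateful worker process managed by Ray. With RayOnSpark,
    # actors like this run on the Ray processes launched alongside the Spark executors.
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

counters = [Counter.remote() for _ in range(4)]            # create four actor handles
print(ray.get([c.increment.remote() for c in counters]))   # [1, 1, 1, 1]
```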

Scalable AutoML for Time Series Forecasting

Time series forecasting is widely used in many real-world applications, such as network quality analysis for telcos, log analysis for data center operations, and predictive maintenance for high-value equipment. Deep learning methods often treat time series forecasting as a sequence modeling problem and have recently been applied to it with considerable success.

On the other hand, building machine learning applications for time series forecasting can be a laborious and knowledge-intensive process. To provide an easy-to-use time series forecasting toolkit, we have applied Automated Machine Learning (AutoML) to time series prediction [2], built on top of Ray Tune and RayOnSpark.

Figure 3 illustrates the architecture of the AutoML framework. The framework uses Ray Tune for hyper-parameter search (running on top of RayOnSpark), covering both feature engineering and modeling. For feature engineering, the search engine selects the best subset of features from a set of features automatically generated by feature generation tools (e.g., featuretools*). For modeling, the search engine searches for hyper-parameters such as the number of nodes per layer, the learning rate, etc.

Figure 3. Architecture of the AutoML Framework
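To make the search concrete, here is an illustrative Ray Tune sketch in the same spirit: a search space that mixes a feature-subset choice with model hyper-parameters, and a trainable that reports a metric for each trial. The feature names, the dummy metric, and the trainable body are hypothetical; the actual search spaces and trainables are defined inside the Analytics Zoo AutoML framework.

```python
from ray import tune

# Hypothetical search space: a feature-subset choice plus model hyper-parameters.
search_space = {
    "selected_features": tune.choice([
        ["hour", "day_of_week"],
        ["hour", "day_of_week", "is_weekend"],
    ]),
    "lstm_units": tune.choice([16, 32, 64]),
    "lr": tune.loguniform(1e-4, 1e-2),
}

def train_forecaster(config):
    # Hypothetical trainable: the real framework builds the selected features,
    # trains the model, and reports a validation metric here.
    mse = (config["lr"] - 1e-3) ** 2 + 1.0 / config["lstm_units"]
    tune.report(mean_squared_error=mse)

# Each run spawns multiple trials that Ray Tune distributes across the Ray cluster.
analysis = tune.run(train_forecaster, config=search_space, num_samples=20)
best_config = analysis.get_best_config(metric="mean_squared_error", mode="min")
```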

Figure 4 illustrates the training program of the time series forecasting pipeline using the simple AutoML APIs. Under the hood, the AutoML framework first instantiates a FeatureTransformer and a Model; a SearchEngine is then instantiated and configured with search presets (which specify how the hyper-parameters are searched, the reward metric, etc.). The SearchEngine runs the search procedure on top of Ray Tune; each run generates several trials at a time (each with a different combination of hyper-parameters) and distributes them across the Ray cluster. After all trials complete, the best set of hyper-parameters and the optimized model are retrieved according to the target metric and used to compose the resulting Pipeline.

Figure 4. Time sequence forecasting using AutoML

The AutoML framework in Analytics Zoo has been used by many real-world users. For instance, Tencent* Cloud has integrated the AutoML framework into its TI-ONE* ML Platform [3]; as a result, TI-ONE can effectively automate feature generation, model selection, and hyperparameter tuning for users' time series forecasting applications (leveraging Ray Tune and RayOnSpark). Figure 5 shows the predicted taxi passenger volume (based on the NYC taxi passengers dataset from the Numenta Anomaly Benchmark) using AutoML on the TI-ONE Platform.

Figure 5. Number of taxi passengers of the next time step predicted by using AutoML

Context-Aware Recommendation at Burger King

In addition to AutoML, we have also built an end-to-end recommendation system using RayOnSpark at Burger King* [4]. It integrates data processing (with Spark) and distributed training (with Apache MXNet* and Ray) into a unified analysis and AI pipeline. Figure 6 illustrates the overall architecture of our system. In our recommendation system, we first launch Spark tasks to extract our restaurant transaction data stored on distributed file systems, followed by data cleaning, ETL and preprocessing steps using Spark. After the Spark tasks complete, the processed in-memory Spark RDDs are directly fed into the Ray cluster through Plasma for distributed training.

Figure 6. Overview of the recommendation system based on RayOnSpark
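As a rough sketch of the Spark side of this pipeline, the preprocessing might look like the following. The file path, column names, and transformations are made up for illustration, and the hand-off of the processed, still-distributed data to the Ray cluster (through Plasma) is performed by the Analytics Zoo estimator rather than by collecting data to the driver.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical transaction data and features; in the real pipeline these Spark
# stages clean and preprocess the restaurant transaction records.
txns = (spark.read.parquet("hdfs:///path/to/transactions")
        .dropna(subset=["order_items"])
        .withColumn("hour", F.hour("timestamp"))
        .withColumn("is_weekend",
                    F.dayofweek("timestamp").isin(1, 7).cast("int")))

# The processed data stays distributed in memory; it is then fed to the Ray
# cluster (via the Plasma object store) for distributed MXNet training.
train_rdd = txns.rdd
```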

We have built a Transformer Cross Transformer (TxT) model in MXNet, which uses a Transformer architecture to encode both order sequence behavior and context features (such as weather, time, and location), and then applies latent cross for joint recommendations. Inspired by the design of RaySGD, we have implemented an MXNet Estimator that provides a lightweight shim layer to automatically deploy distributed MXNet training on Ray. Both MXNet workers and parameter servers run as Ray actors, and they communicate with each other via the distributed key-value store provided by MXNet; each MXNet worker takes its local data partition in Plasma to train the model.
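The sketch below illustrates the idea of running MXNet processes as Ray actors. It is schematic rather than the actual MXNet Estimator code: the DMLC_* variables are MXNet's standard environment settings for parameter-server training, while the actor class, its methods, and the values passed to it are hypothetical; the real shim layer also wires up the scheduler process and the Plasma-based data feeding automatically.

```python
import os
import ray

@ray.remote
class MXNetRunner:
    # Schematic actor holding one MXNet process (either a worker or a parameter server).
    def setup(self, role, num_workers, num_servers, root_uri, root_port):
        # Standard environment variables used by MXNet's distributed KVStore.
        os.environ.update({
            "DMLC_ROLE": role,                  # "worker" or "server"
            "DMLC_NUM_WORKER": str(num_workers),
            "DMLC_NUM_SERVER": str(num_servers),
            "DMLC_PS_ROOT_URI": root_uri,       # address of the scheduler process
            "DMLC_PS_ROOT_PORT": str(root_port),
        })
        import mxnet as mx
        if role == "worker":
            # Workers join the distributed key-value store provided by MXNet.
            self.kv = mx.kv.create("dist_sync")

    def train(self, data_partition):
        # Each worker would train on its local data partition fetched from Plasma.
        ...
```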

This unified architecture integrates Spark data processing and distributed MXNet training into an end-to-end, in-memory pipeline, which runs on exactly the same cluster where our big data is stored. Consequently, we only need to maintain a single cluster for the entire AI pipeline, with no extra data transfer across clusters and no extra cluster maintenance effort. We have successfully deployed the recommendation system at Burger King, and our solution achieves superior results in production (outperforming the Google* Cloud Platform Recommendation AI service in online A/B testing, with +100% conversion gain and +73% add-on sales gain [4]).

Summary

By leveraging the RayOnSpark support in Analytics Zoo, our users (e.g., Tencent Cloud and Burger King) can easily build end-to-end big data and AI pipelines, such as AutoML for time series forecasting and context-aware recommendation. For more details, you may refer to the links in the References section.

References

[1] RayOnSpark: Running Emerging AI Applications on Big Data Clusters with Ray and Analytics Zoo
[2] Scalable AutoML for Time Series Prediction Using Ray and Analytics Zoo
[3] Tencent Cloud Leverages Analytics Zoo to Improve Performance of TI-ONE ML Platform
[4] Context-Aware Fast Food Recommendation at Burger King with RayOnSpark
[5] Powered-By page for Analytics Zoo

*Other names and brands may be claimed as the property of others
