Seamlessly Scaling AI for Distributed Big Data

Jason Dai
Published in The Startup · Jul 8, 2020

Originally published at LinkedIn Pulse.

Early last month, I presented a half-day tutorial at this year’s virtual CVPR 2020. It was a unique experience, and I would like to share some of the highlights of the tutorial.

The Problem: AI for Big Data

The tutorial focused on a critical problem that arises as AI moves from experimentation to production: how to seamlessly scale AI to distributed Big Data. Today, AI researchers and data scientists must go through a great deal of pain to apply AI models to production datasets stored in distributed Big Data clusters.

Conventional approaches usually set up two separate clusters, one dedicated to Big Data processing and the other dedicated to deep learning (e.g., a GPU cluster), with a “connector” (or glue code) deployed in between. Unfortunately, this “connector approach” not only introduces a lot of overhead (e.g., data copying, extra cluster maintenance, fragmented workflows, etc.), but also suffers from impedance mismatches that arise from crossing the boundaries between heterogeneous components (more on this in the next section).

To address these challenges, we have developed open source technologies that directly support new AI algorithms on Big Data platforms. As shown in the slide below, these include BigDL (a distributed deep learning framework for Apache Spark) and Analytics Zoo (distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray).

A Motivating Example: JD.com

Before diving into the technical details of BigDL and Analytics Zoo, I shared a motivating example in the tutorial. JD is one of the largest online shopping websites in China; it has stored hundreds of millions of merchandise pictures in HBase, and built an end-to-end object feature extraction application to process these pictures (for image-similarity search, picture deduplication, etc.). While object detection and feature extraction are standard computer vision algorithms, this turns out to be a fairly complex data analysis pipeline when scaled to hundreds of millions of pictures in production, as shown in the slide below.

Previously, JD engineers had built the solution on a 5-node GPU cluster following the “connector approach”: reading data from HBase, partitioning and processing the data across the cluster, and then running the deep learning models in Caffe. This turned out to be very complex and error-prone, because the data partitioning, load balancing, fault tolerance, etc., had to be managed manually. It also revealed an impedance mismatch of the “connector approach” (HBase + Caffe in this case): reading data from HBase took about half of the total time, because the task parallelism was tied to the number of GPU cards in the system, which was far too low for efficiently reading the data out of HBase.

To overcome these problems, JD engineers used BigDL to implement the end-to-end solution (including data loading, partitioning, pre-processing, DL model inference, etc.) as one unified pipeline, running on a single Spark cluster in a distributed fashion. This not only greatly improved development productivity, but also delivered about a 3.83x speedup compared to the GPU solution. You may refer to [1] and [2] for more details of this particular application.
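To make the shape of such a unified pipeline concrete, here is a minimal PySpark sketch (illustrative only, not JD’s actual code; their production solution reads from HBase and uses BigDL’s APIs). The helpers load_detector and preprocess are hypothetical placeholders for model loading and image preprocessing.

```python
# A minimal sketch of a unified inference pipeline on Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-extraction").getOrCreate()
sc = spark.sparkContext

# Load raw image bytes as a distributed dataset.
images = sc.binaryFiles("hdfs:///images")

def extract_features(partition):
    # Load the model once per partition (not per record) to amortize
    # model-initialization cost.
    model = load_detector("hdfs:///models/object.model")  # hypothetical helper
    for path, raw_bytes in partition:
        tensor = preprocess(raw_bytes)                    # hypothetical helper
        yield path, model.predict(tensor)

# Preprocessing and inference run inside the same Spark job, so data
# partitioning, load balancing, and fault tolerance all come from Spark.
features = images.mapPartitions(extract_features)
features.saveAsPickleFile("hdfs:///features")
```

The key design point is that there is no hand-written glue between the storage layer and the deep learning runtime; Spark schedules, balances and recovers the whole pipeline.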

The Technology: BigDL Framework

In the last several years, we have been driving open source technologies that seamlessly scale AI for distributed Big Data. In 2016, we open sourced BigDL, a distributed deep learning framework for Apache Spark. It is implemented as a standard library on Spark, and provides an expressive, “data-analytics integrated” deep learning programming model. As a result, users can build new deep learning applications as standard Spark programs that run on existing Big Data clusters without any change, as shown in the slide below.
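As a flavor of this programming model, here is a minimal sketch using the BigDL 0.x-style Python API (module and parameter names may differ across versions, so treat them as illustrative). The training job is just a Spark program over an RDD of Samples; raw_rdd is assumed to be produced by ordinary Spark transformations.

```python
# A minimal sketch of a BigDL training job as a standard Spark program.
from bigdl.util.common import init_engine, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

init_engine()  # initialize BigDL on top of the existing SparkContext

# Assume `raw_rdd` is an RDD of (feature ndarray, label) pairs produced by
# ordinary Spark transformations over data already in the cluster.
train_rdd = raw_rdd.map(lambda p: Sample.from_ndarray(p[0], p[1]))

# Define the model with BigDL's layer API.
model = Sequential()
model.add(Linear(784, 128)).add(ReLU()).add(Linear(128, 10)).add(LogSoftMax())

# Distributed training runs as regular Spark jobs on the same cluster.
optimizer = Optimizer(
    model=model,
    training_rdd=train_rdd,
    criterion=ClassNLLCriterion(),
    optim_method=SGD(learningrate=0.01),
    end_trigger=MaxEpoch(5),
    batch_size=256,
)
trained_model = optimizer.optimize()
```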

Contrary to the conventional wisdom of the machine learning community (that fine-grained data access and in-place updates are critical for efficient distributed training), BigDL provides scalable distributed training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. It implements an efficient AllReduce-like operation using existing primitives in Spark (e.g., shuffle, broadcast, in-memory cache, etc.), which is shown to have similar performance characteristics to Ring AllReduce. You may refer to our SoCC 2019 paper [2] for more details.
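For intuition, the sketch below expresses the general reduce-scatter/all-gather pattern with plain PySpark primitives. This is a conceptual illustration only, not BigDL’s actual implementation (which works directly against Spark’s block manager for efficiency); sc, train_rdd and the local compute_gradient function are assumed to exist.

```python
# A conceptual PySpark sketch of an AllReduce-like step built from Spark's
# coarse-grained primitives (illustration only).
import numpy as np

NUM_SHARDS = 4  # number of parameter shards (one aggregation task each)

def sharded_gradients(partition):
    grad = compute_gradient(list(partition))  # hypothetical: returns shape (D,)
    # Slice the local gradient into shards, keyed by shard id.
    for shard_id, shard in enumerate(np.array_split(grad, NUM_SHARDS)):
        yield shard_id, shard

# "Reduce-scatter": the shuffle in reduceByKey routes all copies of a given
# shard to one reducer, so shard aggregation is spread across the cluster.
shard_sums = (train_rdd.mapPartitions(sharded_gradients)
                       .reduceByKey(lambda a, b: a + b)
                       .collect())

# "All-gather": reassemble the aggregated gradient and broadcast it so every
# worker sees the updated parameters for the next iteration.
full_grad = np.concatenate([shard for _, shard in sorted(shard_sums)])
grad_broadcast = sc.broadcast(full_grad)
```

Because each shard id lands on a single reducer, the aggregation work is spread evenly across nodes, which is the same load-balancing idea behind Ring AllReduce.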

The Technology: Analytics Zoo Platform

While BigDL provides a Spark-native framework for users to build deep learning applications, Analytics Zoo tries to solve a more general problem: how to seamlessly apply any AI model (developed using TensorFlow, PyTorch, Keras, Caffe, etc.) to production data stored in Big Data clusters, in a distributed and scalable fashion.

As shown in the slide above, Analytics Zoo is implemented as a higher-level platform on top of DL frameworks and distributed data analytics systems. In particular, it provides an “end-to-end pipeline layer” that seamlessly unites TensorFlow, Keras, PyTorch, Spark and Ray programs into an integrated pipeline, which can then transparently scale out to large (Big Data or K8s) clusters for distributed training and inference.

As a concrete example, the slide below shows how Analytics Zoo users can write TensorFlow or PyTorch code directly inline with a Spark program; as a result, the program can first process the Big Data (stored in Hive, HBase, Kafka, Parquet, etc.) using Spark, and then feed the in-memory Spark RDDs or DataFrames directly to the TensorFlow/PyTorch model for distributed training or inference. Under the hood, Analytics Zoo automatically handles the data partitioning, model replication, data format transformation, distributed parameter synchronization, etc., so that the TensorFlow/PyTorch model can be seamlessly applied to the distributed Big Data.
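As a flavor of this API, here is a minimal sketch following Analytics Zoo’s TFPark-style interface at the time of writing (names such as TFDataset and TFOptimizer follow the project’s documentation, but exact signatures may vary by version; parse_line is a hypothetical parser returning a (feature ndarray, label) pair).

```python
# A minimal sketch of inline TensorFlow on Spark with Analytics Zoo.
import tensorflow as tf
from zoo.common.nncontext import init_nncontext
from zoo.tfpark import TFDataset, TFOptimizer
from bigdl.optim.optimizer import Adam, MaxEpoch

sc = init_nncontext()  # SparkContext with Analytics Zoo initialized

# 1. Process the data with plain Spark (in production this could come from
#    Hive, HBase, Kafka, Parquet, etc.).
train_rdd = sc.textFile("hdfs:///data/train.csv").map(parse_line)

# 2. Wrap the in-memory RDD as a TFDataset that feeds TensorFlow directly.
dataset = TFDataset.from_rdd(train_rdd,
                             features=(tf.float32, [784]),
                             labels=(tf.int32, []),
                             batch_size=256)

# 3. Define the model inline with ordinary TensorFlow code.
feature, label = dataset.tensors
logits = tf.layers.dense(feature, 10)
loss = tf.reduce_mean(
    tf.losses.sparse_softmax_cross_entropy(labels=label, logits=logits))

# 4. Distributed training; Analytics Zoo handles data partitioning, model
#    replication and parameter synchronization under the hood.
optimizer = TFOptimizer.from_loss(loss, Adam(1e-3))
optimizer.optimize(end_trigger=MaxEpoch(5))
```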

Summary

In the tutorial, I also shared more details on how to build scalable AI pipelines for Big Data using Analytics Zoo, including advanced features (such as RayOnSpark and AutoML for time series) and real-world use cases (such as Mastercard, Azure, CERN and SK Telecom). For additional information, please see the references below.

References

  1. “Building Large-Scale Image Feature Extraction with BigDL at JD.com”, https://software.intel.com/en-us/articles/building-large-scale-image-feature-extraction-with-bigdl-at-jdcom
  2. “BigDL: A Distributed Deep Learning Framework for Big Data”, ACM Symposium on Cloud Computing (SoCC) 2019, https://arxiv.org/abs/1804.05839

Jason Dai is an Intel Fellow and Chief Architect of Big Data AI.