5 Key Topics of Spark Summit Europe 2016

BigData Republic · Nov 2, 2016

One of the tools BigData Republic uses for its large-scale data crunching is Apache Spark. While Spark was initially focused on processing large amounts of data efficiently, its support for major machine learning algorithms keeps growing. This way, it is becoming the de facto standard for data science and big data engineering.

Together with my colleagues Gerben Oostra and Alexander Backus, I attended the Spark Summit in Brussels last week. In this blog post, I will discuss what are, in my opinion, the five main topics to keep you updated on state-of-the-art developments.

Note that all of the talks from this Spark Summit will become freely available on the official site, expected on the 4th of November.

1. Deep Learning support

This is one of the coolest additions to Spark. Using TensorFrames, which is essentially a wrapper around the TensorFlow framework, it is possible to utilize the GPUs that might be available on worker nodes to train deep neural networks. In fact, all machine learning algorithms implemented in TensorFlow should become available this way.

The idea is to use Spark’s distributed features to (horizontally) distribute relevant input parameters over the worker nodes, and then use TensorFlow to (vertically) optimise computations for the specific hardware of each node, including GPUs. This is especially handy if you want to run compute-intensive parameter tuning jobs for your machine learning model.
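To make this concrete, here is a minimal sketch along the lines of the example in the TensorFrames README, which applies a TensorFlow graph block-wise to a Spark DataFrame. It assumes a PySpark shell where sqlContext and the tensorframes package are available; the column name and the constant being added are purely illustrative:

```python
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

# A toy DataFrame with a single numeric column 'x'.
df = sqlContext.createDataFrame([Row(x=float(i)) for i in range(10)])

with tf.Graph().as_default():
    # A TensorFlow placeholder bound to column 'x' of the DataFrame.
    x = tfs.block(df, "x")
    # The TensorFlow computation to run on each block of rows.
    z = tf.add(x, 3, name="z")
    # Execute the graph on the workers; the result comes back as a new column.
    df2 = tfs.map_blocks(z, df)
```

The same pattern scales up: replace the toy graph with a neural network, and the TensorFlow part of each block runs on whatever hardware the worker has, GPUs included.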

One of the key features that enables efficient data passing between TensorFlow and Spark is the new Spark columnar in-memory storage format. This eases interoperability with TensorFlow as well as other machine learning frameworks that operate on vectorized (column-wise) data structures, as opposed to Spark’s default row-based format.

Deep Learning on Spark with TensorFrames

Related recommended talk to watch: TensorFrames: Deep Learning with TensorFlow on Apache Spark

2. Spark Performance improvements

Since Spark was first released, many new features have been introduced, but run-time performance had received comparatively little attention. Most of the code responsible for actually executing computations in a Spark job consisted of many virtual function calls (a by-product of object-oriented programming in Scala). Many standard operations in Spark are highly generic and therefore hard to optimise, both for the Scala compiler and at run time by the CPU (branch prediction is hard, for example). To solve these problems, Tungsten was developed: a whole-stage code-generation component that first inspects a Spark job end-to-end at run time, and then generates code optimised for that particular job instead of executing the generic operator code.
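You can see whole-stage code generation at work yourself. In the physical plan printed by explain(), operators that Tungsten fuses into a single generated function are prefixed with an asterisk. A small sketch, assuming Spark 2.0 and an active SparkSession:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A simple job: project and filter over a generated range of numbers.
df = spark.range(1000).selectExpr("id * 2 AS doubled").filter("doubled > 10")

# Operators fused by whole-stage code generation show up with a '*'
# prefix (e.g. *Filter, *Project, *Range) in the physical plan.
df.explain()
```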

Tungsten CodeGen Spark 2.0

Specifically for Spark SQL, Spark 2.0 also improves the component that decides which operations actually become part of the job: the Catalyst query plan optimizer, which defines strategies for choosing between operations. For example, to join two tables, multiple types of JOIN operations could be used, ranging from a map-side join using broadcasts to a fully distributed Cartesian product join. Tungsten and Catalyst are complementary components: Catalyst first optimises the query plan for a job, and Tungsten then generates highly optimised code for each of the operations inside that job.
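As an example of steering Catalyst’s join strategy, Spark’s broadcast hint tells the optimizer to prefer a map-side (broadcast hash) join. A minimal sketch with toy tables:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["key", "value"])
dim = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "label"])

# The broadcast() hint makes Catalyst plan a map-side join: the small
# 'dim' table is shipped to every worker instead of shuffling both sides.
joined = facts.join(broadcast(dim), "key")
joined.explain()  # the physical plan should show a BroadcastHashJoin
```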

Catalyst Optimizer Spark 2.0

Related recommended talks to watch: Spark’s Performance: The Past, Present, and Future (Tungsten), A Deep Dive into the Catalyst Optimizer (Catalyst)

3. Productionizing your model

As more companies start implementing data science for their business, they realise they need to integrate their machine learning models with their existing workflows and infrastructure to get a return on their investment. However, most of the models data scientists create are not suitable to be put into production directly. Software libraries like MLeap or H2O allow data scientists to export their machine learning models to a standardised format that can be easily integrated into (existing) big data pipelines. In addition, the companies backing these libraries try to provide users with a platform that makes deployment to production easier, even going as far as providing a one-click production deployment solution.
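With MLeap, exporting a fitted Spark ML pipeline comes down to serializing it to a bundle file that a lightweight runtime can load outside of Spark. A sketch along the lines of the MLeap documentation; the path and the trivial pipeline are illustrative:

```python
import mleap.pyspark  # activates the serializeToBundle extension on Spark models
from mleap.pyspark.spark_support import SimpleSparkSerializer
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["category"])

# A trivial pipeline; in practice this would be your full model.
pipeline = Pipeline(stages=[
    StringIndexer(inputCol="category", outputCol="categoryIndex")
])
model = pipeline.fit(df)

# Export the fitted pipeline to an MLeap bundle.
model.serializeToBundle("jar:file:/tmp/example-pipeline.zip", model.transform(df))
```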

MLeap Vision

Related recommended talks to watch: MLeap and Combust.ML: Deploying Machine Learning Models to Production (MLeap), Sparkling Water 2.0: The Next Generation of Machine Learning on Apache Spark (H2O)

4. Online machine learning with structured streaming

Sadly, I was not fast enough to take a screenshot for this topic, but it was nevertheless one of the nicest talks I attended at the Spark Summit. Most machine learning libraries focus on training models in batch. This limits models that are already embedded in real-time big data processing pipelines, because they cannot adapt to fresh data in near real-time.

However, more effort is being put into making online training possible using Spark Streaming. This means a model can be updated in near real-time whenever the big data pipeline feeds it new data.

The technical implementation on top of Spark Structured Streaming is roughly as follows:

  1. A distributed cache (such as Redis or Glint, see point 5) is used to store the most recent model.
  2. Whenever new data comes in, this model is distributed to all Spark Streaming workers.
  3. Each of the workers performs one training step using the model and its own input data, resulting in N (number of workers) new models.
  4. In the reduce step, these N new models are aggregated (using a user-defined function) into one new model that is stored in the cache again.

This looks very promising, but note that the devil is in the details: choosing a good aggregation function is the hard part.
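A minimal sketch of steps 2–4 for a linear least-squares model trained with SGD, where the aggregation function is plain model averaging. All names here are illustrative, and this is not the talk’s actual implementation:

```python
import numpy as np

def local_sgd_step(w_current, rows, lr=0.01):
    """Step 3: one gradient step on a worker's partition of the micro-batch,
    starting from the shared model w_current (least-squares regression)."""
    if not rows:
        return w_current
    X = np.array([x for x, _ in rows])  # rows are (features, label) pairs
    y = np.array([y for _, y in rows])
    grad = X.T @ (X @ w_current - y) / len(y)
    return w_current - lr * grad

def update_model(microbatch_rdd, w_current):
    # Step 2: w_current (fetched from the cache) is shipped to every worker
    # via the closure of the mapPartitions function.
    per_worker_models = microbatch_rdd.mapPartitions(
        lambda it: [local_sgd_step(w_current, list(it))]
    ).collect()
    # Step 4: aggregate the N worker models into one. Averaging is the
    # simplest choice, and exactly the part that needs careful thought.
    return np.mean(per_worker_models, axis=0)
```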

Related recommended talk to watch: Online Learning with Structured Streaming

5. Factorization Machines and Parameter servers

One of the more recent developments in machine learning is the Factorization Machine. This machine learning algorithm is especially suitable for very sparse datasets, but no distributed implementation was available yet. One of the talks introduced such an implementation, together with the Glint parameter server.

The Glint parameter server is a Spark-optimised distributed cache for storing common machine learning parameter data structures. This component should prove useful for making more embedded distributed machine learning implementations possible in the future.
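For reference, a factorization machine predicts with a bias, a linear term, and pairwise feature interactions that are factorized through low-rank latent vectors. A small sketch of the prediction function using Rendle’s linear-time reformulation of the pairwise term (not Glint’s actual code):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine prediction.
    x: feature vector (n,), w0: bias, w: linear weights (n,),
    V: factor matrix (n, k), one k-dimensional latent vector per feature."""
    linear = w0 + w @ x
    # Pairwise term sum_{i<j} <v_i, v_j> x_i x_j, computed in O(n*k) as
    # 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ].
    xv = x @ V               # shape (k,)
    x2v2 = (x ** 2) @ (V ** 2)  # shape (k,)
    return linear + 0.5 * np.sum(xv ** 2 - x2v2)
```

Because every feature only contributes through its latent vector, the model can generalise to feature pairs that never co-occur in the training data, which is why it works so well on very sparse datasets.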

Factorization Machines on Spark with GlintFM

Related recommended talks to watch: Scaling Factorization Machines on Spark Using Parameter Servers, Glint: An Asynchronous Parameter Server for Spark
