Spark AI Summit 2019 Overview

Alexandre Bergere
Published in datalex · 8 min read · Oct 24, 2019

After two great days in Amsterdam for the Spark + AI Summit 2019 Europe, I would like to share with you how the summit went, the different updates announced during the event, and the main trends I followed.

How the summit went

The little brother of the Spark + AI Summit held in San Francisco took place from the 15th to the 17th of October in Amsterdam. This year, more than 2,300 people attended the conference for its 4th edition.

Sessions ran from 9:00 AM to 6:00 PM: ten concurrent talks, each lasting 40 minutes, with a 10-minute break between them.

Sessions were grouped under different tracks: AI Use Cases; Developer; Data Engineering; Data Science, Machine Learning & Deep Learning; Data & ML Research; Data and ML Industry Use Cases; Sponsor Session & Tutorials.

Keynotes:

Every morning started with a keynote and, as at every other summit or tech event, roadmaps, new features, demos and clients’ projects were presented.

Three interesting presentations were given by the key figures behind Databricks & Spark:

Main keynote speakers — https://databricks.com/sparkaisummit/europe

I found two other talks really interesting:

  • The presentation of AlphaStar, the StarCraft II program and the first artificial intelligence to defeat a top professional player, by Oriol Vinyals, Principal Scientist at Google DeepMind.
  • The passionate presentation by Katie Bouman on the algorithmic methodology used to produce an image of a black hole.

Training:

On the 15th of October, one-day training workshops took place, including a mix of instruction and hands-on exercises to help you improve your Apache Spark skills.

This day is an add-on to the Conference Pass and cost €795 at launch pricing.

Different themes were proposed:

  • Data Science with Apache Spark™
  • Hands on Deep Learning with Keras, Tensorflow, and Apache Spark™
  • Apache Spark™ Tuning and Best Practices
  • Apache Spark™ Programming
  • Building Data Pipelines for Apache Spark™ with Delta Lake
  • Machine Learning in Production: MLflow and Model Deployment
  • Half-Day Prep Course + Databricks Certification Exam

Certification:

A testing room was available during all three days of the Spark Summit to take the Databricks Certified Associate for Apache Spark 2.4 exam.

The Databricks Certified Associate for Apache Spark 2.4 validates your knowledge of the core components of the DataFrames API, as well as a rudimentary knowledge of the Spark architecture.

During the one-day training, you had the opportunity to choose the Half-Day Prep Course + Databricks Certification Exam for €450 (which included two attempts at the certification exam).

What’s new?

The summit is an opportunity to share new updates or features. Last week, a few were announced:

  • Delta Lake + Linux Foundation: the Linux Foundation will host the open-source project Delta Lake, allowing an open governance model that encourages participation and technical contribution, and providing a framework for long-term stewardship: https://dbricks.co/wp191016a.
  • Databricks announced an investment of 100 million euros in its European Development Center in Amsterdam.
  • Databricks announced “Model Registry”, a new capability within MLflow that enables a comprehensive model management process by providing data scientists and engineers with a repository to track, share and collaborate on machine learning models (a short registration sketch follows below). You can check it out on GitHub. For more information, read these two posts: managed-mlflow & introducing-the-mlflow-model-registry.
  • Databricks’ Growth Draws $400 Million Series F Investment and $6.2 Billion Valuation (this news was released later, on the 22nd of October).
MLflow Model Registry — src: https://databricks.com/blog/2019/10/17/introducing-the-mlflow-model-registry.html
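
A rough sketch of the Model Registry workflow with the MLflow Python API: log a model during a run, then register it under a name so new versions accumulate in the registry. This is a minimal illustration only; the scikit-learn model and the "iris-classifier" name are hypothetical, it assumes a tracking server with the registry enabled (e.g. managed MLflow on Databricks), and the exact behaviour depends on your MLflow version.

```python
# Minimal sketch: log a model, then register it in the MLflow Model Registry.
# Assumes scikit-learn is installed and MLflow points at a tracking server
# with the registry enabled; the model name is hypothetical.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.sklearn.log_model(model, artifact_path="model")

# Each call to register_model creates a new version under the same name,
# which teams can then review, stage and promote from the registry UI or API.
result = mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="iris-classifier",
)
print(result.name, result.version)
```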

It was also an opportunity to discover or rediscover other news:

  • Microsoft Machine Learning for Apache Spark (MMLSpark): an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. MMLSpark adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK), LightGBM and OpenCV (a short LightGBM-on-Spark sketch follows below).
MMLSpark’s projects — src: https://github.com/azure/mmlspark
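
To give an idea of how MMLSpark plugs into standard Spark ML pipelines, here is a minimal LightGBM-on-Spark sketch. It assumes the mmlspark package is attached to the cluster (for example via its Maven coordinate), the toy DataFrame is made up, and the import path has moved between releases, so adjust it to your version.

```python
# Minimal sketch of MMLSpark's LightGBM estimator used like any Spark ML stage.
# Assumes the mmlspark library is available on the cluster; the data is made up.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from mmlspark.lightgbm import LightGBMClassifier  # path may differ by mmlspark version

spark = SparkSession.builder.getOrCreate()

# Tiny hypothetical dataset: two numeric features and a binary label.
df = spark.createDataFrame(
    [(1.0, 2.0, 0), (2.0, 1.0, 1), (3.0, 4.0, 0), (4.0, 3.0, 1)],
    ["f1", "f2", "label"],
)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

# LightGBMClassifier follows the usual Spark ML Estimator/Transformer API.
model = LightGBMClassifier(labelCol="label", featuresCol="features").fit(features)
model.transform(features).select("label", "prediction").show()
```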

A series of sessions

Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions, on top of the Parquet format, to Apache Spark™ and big data workloads.

Delta Lake uses versioned Parquet files to store your data in your cloud storage. Alongside these versions, Delta Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory, which is what provides the ACID transactions (a short example follows below).

Open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads — https://delta.io/
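
To make this concrete, here is a minimal PySpark sketch, assuming a cluster with the delta-core package available (it is built in on Databricks) and a hypothetical path: every write becomes a commit in the transaction log, and time travel lets you read an earlier version.

```python
# Minimal sketch of Delta Lake basics: versioned writes, time travel, history.
# Assumes the delta-core package is on the classpath; the path is hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
path = "/tmp/events_delta"

# Each write is recorded as a commit in the _delta_log transaction log.
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)   # version 0
spark.range(5, 10).write.format("delta").mode("append").save(path)     # version 1

# Time travel: read the table as it was at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()

# Inspect the commit history kept in the transaction log.
DeltaTable.forPath(spark, path).history().show(truncate=False)
```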

Characteristics of the Delta Architecture:

  • Adopt a continuous data flow model to unify batch and streaming (sketched after this list).
  • Use intermediate hops to improve reliability and troubleshooting.
  • Make the cost vs latency tradeoff based on your use cases and business needs.
  • Optimize the storage layout based on the access patterns.
  • Reprocess the historical data as needed by simply clearing the result table and restarting the stream.
  • Incrementally improve the quality of your data until it is ready for consumption, with schema on read/write and data expectations.
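
As a minimal sketch of one hop of such a pipeline, the snippet below streams a hypothetical bronze Delta table into a silver one with Structured Streaming; the paths, columns and quality filter are illustrative only.

```python
# Minimal sketch of one Delta Architecture hop: bronze -> silver, streaming.
# Paths and columns are hypothetical; the checkpoint makes the hop restartable,
# and reprocessing means clearing the silver table and restarting the stream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

bronze = spark.readStream.format("delta").load("/delta/events_bronze")

silver_query = (
    bronze
    .where(F.col("event_id").isNotNull())            # basic data expectation
    .withColumn("ingest_date", F.to_date("event_ts"))
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/events_silver")
    .outputMode("append")
    .start("/delta/events_silver")
)
```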

A lot of sessions during the Summit were about Delta; here are just some of them that you should watch:

A great Notebook was published during the event, to understand the principle and advantages of Delta: https://github.com/delta-io/delta/tree/master/examples/tutorials/saiseu19.

For more information, don’t hesitate to join the different communities:

Koalas

Koalas builds the bridge between Spark and pandas.

https://github.com/databricks/koalas

The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.

Pandas is the de facto standard (single-node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing. With Koalas, you can:

  • Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
  • Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets), as in the sketch below.
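
Here is a minimal sketch of what that looks like in practice, assuming the koalas package is installed; the data is made up, but the point is that the syntax is pandas while the execution is Spark.

```python
# Minimal sketch of the Koalas API: pandas syntax, Spark execution underneath.
# Assumes the koalas package is installed; the data is hypothetical.
import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"city": ["Amsterdam", "Paris", "Amsterdam"],
                    "amount": [10, 20, 30]})

# Same code shape as pandas, but the DataFrame is distributed by Spark.
kdf = ks.from_pandas(pdf)
print(kdf.groupby("city")["amount"].sum().sort_index())

# Move between the two worlds when needed.
sdf = kdf.to_spark()     # Koalas DataFrame -> Spark DataFrame
kdf2 = sdf.to_koalas()   # Spark DataFrame  -> Koalas DataFrame
```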

To date, the most common functions have been implemented:

  • 60% of the DataFrame / Series API
  • 60% of the DataFrameGroupBy / SeriesGroupBy API
  • 15% of the Index / MultiIndex API
  • to_datetime, get_dummies ….

Koalas has a very active community, with daily changes and bi-weekly releases.

I had the chance to follow the 1h30 tutorial entitled “Koalas: Pandas on Apache Spark”, given by Tim Hunter, Brooke Wenig & Niall Turbitt. The notebooks from the tutorial are available here:

Integration with other environments

Spark & AI Summit was also an opportunity to discover use cases and projects that implement Spark in real architectures, and to see how it interacts with other environments and tools.

Here are some useful architecture and technology implementations:

Here is a selection of the use cases presented:

Deep dive

What would a Spark Summit be without sessions about optimization?

I followed different sessions about it, all explained with passion:

During these talks, I also discovered two interesting technologies regarding the optimization process:

  • Waimak: an open-source framework that makes it easier to create complex data flows in Apache Spark. Waimak aims to abstract the more complex parts of Spark application development (such as orchestration) away from the business logic, allowing users to get their business logic in a production-ready state much faster. By using a framework written by Data Engineers, the teams defining the business logic can write and own their production code.
  • Datamechanics: The hassle-free Spark platform deployed on Kubernetes. By automatically tuning infrastructure and Spark configurations dynamically and continuously for each of your workloads, their platform makes your applications 2x as fast and stable.
Run autoscaled Jupyter kernel with Spark — https://www.datamechanics.co/

Here are some recommendations, given by Daniel Tomes, that you should follow when you’re working on a Spark project:

All videos are already available, and the presentation materials will be released on the 31st of October.

If you want to learn more about Spark, Delta or MLflow, I invite you to check the documentation, the Databricks blog or Databricks Academy for deep-dive training.

Databricks Academy — https://academy.databricks.com/

Looking forward to the Spark & AI Summit 2020 in Berlin.
