Spark AI Summit 2019 Overview

Alexandre Bergere
Published in datalex · 8 min read · Oct 24, 2019

After two great days in Amsterdam for the Spark + AI Summit 2019 Europe, I would like to share with you how the summit went, the different updates announced during the event, and the main trends I followed.

How the summit went

The little brother of the Spark + AI Summit held in San Francisco took place from the 15th to the 17th of October in Amsterdam. This year, more than 2,300 people attended the conference for its 4th edition.

Sessions ran from 9:00 AM to 6:00 PM: ten concurrent talks, each lasting 40 minutes, with a 10-minute break between them.

Sessions were grouped under different tracks: AI Use Cases; Developer; Data Engineering; Data Science, Machine Learning & Deep Learning; Data & ML Research; Data and ML Industry Use Cases; Sponsor Session & Tutorials.

Keynotes:

Every morning started with a keynote and, as at every other summit or tech event, roadmaps, new features, demos and clients’ projects were presented.

Three interesting presentations were given by the key figures behind Databricks & Spark:

Main keynote speakers — https://databricks.com/sparkaisummit/europe

I found two other talks really interesting:

  • The presentation of AlphaStar, the StarCraft II program and the first artificial intelligence to defeat a top professional player, by Oriol Vinyals, Principal Scientist at Google DeepMind.
  • The passionate presentation by Katie Bouman on the algorithmic methodology used to produce an image of a black hole.

Training:

On the 15th of October, one-day training workshops took place, including a mix of instruction and hands-on exercises to help you improve your Apache Spark skills.

This day is an add-on to the Conference Pass and cost €795 at launch pricing.

Different themes were proposed:

  • Data Science with Apache Spark™
  • Hands on Deep Learning with Keras, Tensorflow, and Apache Spark™
  • Apache Spark™ Tuning and Best Practices
  • Apache Spark™ Programming
  • Building Data Pipelines for Apache Spark™ with Delta Lake
  • Machine Learning in Production: MLflow and Model Deployment
  • Half-Day Prep Course + Databricks Certification Exam

Certification:

A testing room was available during all three days of the Spark Summit to take the Databricks Certified Associate for Apache Spark 2.4 exam.

The Databricks Certified Associate for Apache Spark 2.4 validates your knowledge of the core components of the DataFrames API, as well as a rudimentary knowledge of the Spark architecture.

During the one-day training, you had the opportunity to choose the Half-Day Prep Course + Databricks Certification Exam for €450 (which included two attempts at the certification exam).

What’s new?

The summit is an opportunity to share new updates or features. Last week, a few were announced:

  • Delta Lake + Linux Foundation: the Linux Foundation will host the open-source project Delta Lake, allowing an open governance model that encourages participation and technical contribution, and providing a framework for long-term stewardship: https://dbricks.co/wp191016a.
  • Databricks announced an investment of 100 million euros in its European Development Center in Amsterdam.
  • Databricks announced “Model Registry”, a new capability within MLflow that enables a comprehensive model management process by providing data scientists and engineers with a repository to track, share and collaborate on machine learning models (a short registration sketch follows below). You can check it out on GitHub. For more information, read these two posts: managed-mlflow & introducing-the-mlflow-model-registry.
  • Databricks’ Growth Draws $400 Million Series F Investment and $6.2 Billion Valuation (this news was released later, on the 22nd of October).
MLflow Model Registry — src: https://databricks.com/blog/2019/10/17/introducing-the-mlflow-model-registry.html
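
A rough sketch of the Model Registry workflow with the MLflow Python API: log a model during a run, then register it under a name so new versions accumulate in the registry. This is a minimal illustration only; the scikit-learn model and the "iris-classifier" name are hypothetical, it assumes a tracking server with the registry enabled (e.g. managed MLflow on Databricks), and the exact behaviour depends on your MLflow version.

```python
# Minimal sketch: log a model, then register it in the MLflow Model Registry.
# Assumes scikit-learn is installed and MLflow points at a tracking server
# with the registry enabled; the model name is hypothetical.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.sklearn.log_model(model, artifact_path="model")

# Each call to register_model creates a new version under the same name,
# which teams can then review, stage and promote from the registry UI or API.
result = mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="iris-classifier",
)
print(result.name, result.version)
```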

It was also an opportunity to discover or rediscover other news:

  • Microsoft Machine Learning for Apache Spark (MMLSpark): an ecosystem of tools aimed at expanding the distributed computing framework Apache Spark in several new directions. MMLSpark adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK), LightGBM and OpenCV (a short LightGBM-on-Spark sketch follows below).
MMLSpark’s projects — src: https://github.com/azure/mmlspark
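
To give an idea of how MMLSpark plugs into standard Spark ML pipelines, here is a minimal LightGBM-on-Spark sketch. It assumes the mmlspark package is attached to the cluster (for example via its Maven coordinate), the toy DataFrame is made up, and the import path has moved between releases, so adjust it to your version.

```python
# Minimal sketch of MMLSpark's LightGBM estimator used like any Spark ML stage.
# Assumes the mmlspark library is available on the cluster; the data is made up.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from mmlspark.lightgbm import LightGBMClassifier  # path may differ by mmlspark version

spark = SparkSession.builder.getOrCreate()

# Tiny hypothetical dataset: two numeric features and a binary label.
df = spark.createDataFrame(
    [(1.0, 2.0, 0), (2.0, 1.0, 1), (3.0, 4.0, 0), (4.0, 3.0, 1)],
    ["f1", "f2", "label"],
)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

# LightGBMClassifier follows the usual Spark ML Estimator/Transformer API.
model = LightGBMClassifier(labelCol="label", featuresCol="features").fit(features)
model.transform(features).select("label", "prediction").show()
```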

A series of sessions

Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions, on top of the Parquet format, to Apache Spark™ and big data workloads.

Delta Lake uses versioned Parquet files to store your data in your cloud storage. Alongside these versions, Delta Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory, which is what provides the ACID transactions (a short example follows below).

Open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads — https://delta.io/
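
To make this concrete, here is a minimal PySpark sketch, assuming a cluster with the delta-core package available (it is built in on Databricks) and a hypothetical path: every write becomes a commit in the transaction log, and time travel lets you read an earlier version.

```python
# Minimal sketch of Delta Lake basics: versioned writes, time travel, history.
# Assumes the delta-core package is on the classpath; the path is hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
path = "/tmp/events_delta"

# Each write is recorded as a commit in the _delta_log transaction log.
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)   # version 0
spark.range(5, 10).write.format("delta").mode("append").save(path)     # version 1

# Time travel: read the table as it was at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()

# Inspect the commit history kept in the transaction log.
DeltaTable.forPath(spark, path).history().show(truncate=False)
```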

Characteristics of the Delta Architecture:

  • Adopt a continuous data flow model to unify batch and streaming (sketched after this list).
  • Use intermediate hops to improve reliability and troubleshooting.
  • Make the cost vs latency tradeoff based on your use cases and business needs.
  • Optimize the storage layout based on the access patterns.
  • Reprocess the historical data as needed by simply clearing the result table and restarting the stream.
  • Incrementally improve the quality of your data until it is ready for consumption, with schema on read/write and data expectations.
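
As a minimal sketch of one hop of such a pipeline, the snippet below streams a hypothetical bronze Delta table into a silver one with Structured Streaming; the paths, columns and quality filter are illustrative only.

```python
# Minimal sketch of one Delta Architecture hop: bronze -> silver, streaming.
# Paths and columns are hypothetical; the checkpoint makes the hop restartable,
# and reprocessing means clearing the silver table and restarting the stream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

bronze = spark.readStream.format("delta").load("/delta/events_bronze")

silver_query = (
    bronze
    .where(F.col("event_id").isNotNull())            # basic data expectation
    .withColumn("ingest_date", F.to_date("event_ts"))
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/events_silver")
    .outputMode("append")
    .start("/delta/events_silver")
)
```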

A lot of sessions during the Summit were about Delta; here are just some of them that you should watch:

A great Notebook was published during the event, to understand the principle and advantages of Delta: https://github.com/delta-io/delta/tree/master/examples/tutorials/saiseu19.

For more information, don’t hesitate to join the different communities:

Koalas

Koalas builds the bridge between Spark and pandas.

https://github.com/databricks/koalas

The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.

Pandas is the de facto standard (single-node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing. With Koalas, you can:

  • Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
  • Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets), as in the sketch below.
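
Here is a minimal sketch of what that looks like in practice, assuming the koalas package is installed; the data is made up, but the point is that the syntax is pandas while the execution is Spark.

```python
# Minimal sketch of the Koalas API: pandas syntax, Spark execution underneath.
# Assumes the koalas package is installed; the data is hypothetical.
import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"city": ["Amsterdam", "Paris", "Amsterdam"],
                    "amount": [10, 20, 30]})

# Same code shape as pandas, but the DataFrame is distributed by Spark.
kdf = ks.from_pandas(pdf)
print(kdf.groupby("city")["amount"].sum().sort_index())

# Move between the two worlds when needed.
sdf = kdf.to_spark()     # Koalas DataFrame -> Spark DataFrame
kdf2 = sdf.to_koalas()   # Spark DataFrame  -> Koalas DataFrame
```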

To date, the most common functions have been implemented:

  • 60% of the DataFrame / Series API
  • 60% of the DataFrameGroupBy / SeriesGroupBy API
  • 15% of the Index / MultiIndex API
  • to_datetime, get_dummies ….

Koalas has a very active community, with daily changes and bi-weekly releases.

I had the chance to follow the 1h30 tutorial entitled “Koalas: Pandas on Apache Spark”, given by Tim Hunter, Brooke Wenig & Niall Turbitt. The notebooks from the tutorial are available here:

Integration with other environments

Spark & AI Summit was also an opportunity to discover use cases and projects that implement Spark in real architectures, and to see how it interacts with other environments and tools.

Here are some useful architecture and technology implementations:

Here is a selection of the use cases presented:

Deep dive

What would a Spark Summit be without sessions about optimization?

I followed different sessions about it, all explained with passion:

During these talks, I also discovered two interesting technologies regarding the optimization process:

  • Waimak: an open-source framework that makes it easier to create complex data flows in Apache Spark. Waimak aims to abstract the more complex parts of Spark application development (such as orchestration) away from the business logic, allowing users to get their business logic in a production-ready state much faster. By using a framework written by Data Engineers, the teams defining the business logic can write and own their production code.
  • Datamechanics: The hassle-free Spark platform deployed on Kubernetes. By automatically tuning infrastructure and Spark configurations dynamically and continuously for each of your workloads, their platform makes your applications 2x as fast and stable.
Run autoscaled Jupyter kernel with Spark — https://www.datamechanics.co/

Here are some recommendations, given by Daniel Tomes, that you should follow when you’re working on a Spark project:

All videos are already available, and the presentation materials will be released on the 31st of October.

If you want to learn more about Spark, Delta or MLflow, I invite you to check the documentation, the Databricks blog or Databricks Academy for deep-dive training.

Databricks Academy — https://academy.databricks.com/

Looking forward to the Spark & AI Summit 2020 in Berlin.
