Spark+AI Summit 2020 Overview

Alexandre Bergere
Published in datalex · 7 min read · Aug 25, 2020

This year, due to exceptional circumstances (COVID-19), Spark+AI Summit 2020 had to adapt and plan a different kind of summit. For the first time, the summit was entirely online and free. Anyone who registered could watch the sessions, chat with the speakers during the presentations (as they were pre-recorded) and ask questions of Databricks team members. The virtual summit was a real opportunity to immerse oneself for a week in the world of Spark and the Databricks ecosystem.

The data and AI community continues to grow each year, bringing new innovations around data engineering, data science and machine learning. This year the virtual Spark + AI Summit 2020 was the largest community event ever with almost 70,000 registrants!

I would like to share with you how the summit went, the different updates announced during the event, and the main trends I followed.

This year has also been a special one for the Spark community, for two reasons: the release of Spark 3.0 and Spark’s 10th anniversary.

Matei Zaharia & Ali Ghodsi

Keynotes:

Every morning and afternoon started with a keynote and, as at any other summit or tech event, roadmaps, new features, demos and client projects were presented.

Three interesting presentations were given by the main actors of Databricks & Spark:

Main keynotes’ speakers — https://databricks.com/sparkaisummit/north-america-2020

All Keynotes are available here.

What’s new?

The summit is an opportunity to share new updates or features. A few were announced during the summit:

  • Introduction of Delta Engine: accelerates the performance of Delta Lake for SQL and data frame workloads to make it easier for customers to adopt and scale a lakehouse architecture.
  • Acquisition of Redash, the company behind the open source project of the same name: Redash is a collaborative visualization and dashboarding platform designed to enable anyone, regardless of their level of technical sophistication, to share insights within and across teams. SQL users leverage Redash to explore, query, visualize, and share data from any data source.
  • Introducing Koalas 1.0: It now implements the most commonly used pandas APIs, with 80% coverage of all the pandas APIs. In addition, Koalas supports Apache Spark 3.0, Python 3.8, Spark accessor, new type hints, and better in-place operations.
  • Introducing the Next-Generation Data Science Workspace: the next-generation Data Science Workspace on Databricks provides an open and unified experience for modern data teams.
  • MLflow + Linux Foundation: after Delta last year, the Linux Foundation will host the open source project MLflow, in order to make MLflow the open standard for machine learning platforms.
Next-Generation Data Science Workspace
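The Koalas announcement above is easiest to appreciate in code. Here is a minimal sketch using plain pandas (the city/sales data is my own illustrative example): because Koalas mirrors the pandas API, in many cases swapping the import is the only change needed to run the same code distributed on Spark.

```python
import pandas as pd  # with Koalas on Spark: import databricks.koalas as ks

# The same DataFrame code works in both libraries; with Koalas it would
# execute on a Spark cluster instead of in local memory.
df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris"], "sales": [10, 5, 7]})
totals = df.groupby("city")["sales"].sum()
print(int(totals["Paris"]))  # 17
```

This "swap the import" design is exactly what the 80% API coverage claim refers to: the most commonly used pandas operations (indexing, groupby, aggregations) behave identically.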

Training:

From June 22nd to 23rd, training workshops were held that included a mix of instruction and practical exercises to help you improve your skills on Apache Spark.

Different themes were proposed, with each session lasting for three hours:

  • Spark: Introduction to Apache Spark Programming; Apache Spark Tuning and Best Practices
  • Data science: Introduction to Reinforcement Learning; Scaling Deep Learning with TensorFlow and Apache Spark; Apache Spark for Machine Learning and Data Science; MLflow: Managing the Machine Learning Lifecycle
  • Delta: Building Better Data Pipelines for Apache Spark with Delta Lake
  • SQL on Databricks
  • Stream: Structured Streaming with Databricks
  • Introduction to Unified Data Analytics for Managers
  • Databricks Administration

Certification:

An online session was available on June 23rd to take the Databricks Certified Associate for Apache Spark 2.4 exam; you also had the opportunity to choose the half-day Databricks Certification Exam prep course for $200.

The Databricks Certified Associate for Apache Spark 2.4 validates your knowledge of the core components of the DataFrames API, as well as a rudimentary knowledge of the Spark architecture.

With the new release of Spark 3.0, you can now also take the Databricks Certified Associate Developer for Apache Spark 3.0 exam.

A series of sessions

As usual, hundreds of sessions were available, split into different topics: Analytics, Apache Spark Use Cases, Architecture, Databricks Tech Talks, Deep Learning, Hands-on Tutorials, Machine Learning, Python, Sponsored Sessions, Technical Deep Dives, & Technical vs Non-Technical Techniques.

Spark 3.0

The summit was the occasion to present the new big release of Apache Spark — Spark 3.0.0.

Here are the biggest new features in Spark 3.0:

https://databricks.com/fr/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html

You can find the full list of new features in this article, or deep dive into Spark 3.0 by watching the following session:

The Apache Spark 3.0.0 release is also available on Databricks as part of Databricks Runtime 7.0.

Apache Spark & Big Data community

Spark+AI Summit is the main event for getting started with Apache Spark™ and its community.

Beyond Databricks, the event gathers sessions about Apache Spark and about other Apache projects that interact with it.

Apache Spark — Lightning-fast unified analytics engine

Delta

Delta Lake is an open-source storage layer that brings ACID transactions, on top of the Parquet format, to Apache Spark™ and big data workloads.

Delta Lake uses versioned Parquet files to store your data in your cloud storage. In addition to the versions, Delta Lake stores a transaction log to keep track of all the commits made to the table or blob store directory, which is what provides the ACID transactions.

Open-source storage layer that brings ACID
transactions to Apache Spark™ and big data workloads — https://delta.io/
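To make the "versioned data files plus transaction log" idea concrete, here is a toy sketch in plain Python. It is not Delta Lake's actual implementation (real Delta writes Parquet data files and JSON commit entries under a `_delta_log` directory, with a far richer protocol); the file layout and helper functions here are simplified stand-ins to show how replaying the log enables time travel.

```python
import json, os, tempfile

root = tempfile.mkdtemp()
log_dir = os.path.join(root, "_delta_log")
os.makedirs(log_dir)

def commit(version, added_rows):
    # Write an immutable data file, then record it in the transaction log.
    data_file = os.path.join(root, f"part-{version}.json")
    with open(data_file, "w") as f:
        json.dump(added_rows, f)
    with open(os.path.join(log_dir, f"{version:020d}.json"), "w") as f:
        json.dump({"add": data_file}, f)

def read(as_of_version=None):
    # Replay commits in order, stopping at the requested version
    # ("time travel"); the latest state is just the full replay.
    rows = []
    for name in sorted(os.listdir(log_dir)):
        version = int(name.split(".")[0])
        if as_of_version is not None and version > as_of_version:
            break
        with open(os.path.join(log_dir, name)) as f:
            entry = json.load(f)
        with open(entry["add"]) as f:
            rows.extend(json.load(f))
    return rows

commit(0, [{"id": 1}])
commit(1, [{"id": 2}])
print(len(read()))                 # 2 (latest version)
print(len(read(as_of_version=0)))  # 1 (time travel to version 0)
```

Because data files are immutable and only the log determines table state, a commit either appears in the log (and is fully visible) or it does not, which is the essence of the atomicity guarantee.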

Many sessions during the summit covered Delta; here are just a few that you should watch:

For more information, don’t hesitate to join the different communities:

Lakehouse architecture

https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html

Just before the summit, Databricks announced Delta Engine, which ties together a 100% Apache Spark-compatible vectorized query engine to take advantage of modern CPU architecture with optimizations to Spark 3.0’s query optimizer and caching capabilities that were launched as part of Databricks Runtime 7.0. Together, these features significantly accelerate query performance on data lakes, especially those enabled by Delta Lake, to make it easier for customers to adopt and scale a lakehouse architecture.

The research paper on the inner workings of the Lakehouse was accepted and published at VLDB’2020.

Integration with other environments

Spark+AI Summit was also the opportunity to discover use cases and projects that implement Spark in real architectures, and to see how it interacts with other environments and tools.

Here are some useful architecture and technology implementations:

All videos and supporting materials are already available.

If you want to learn more about Spark, Delta or MLflow, I invite you to check the documentation, the Databricks blog or the Databricks Academy for deep-dive training.

Databricks Academy — https://academy.databricks.com/


Data Architect & Solution Architect independent ☁️ Delta & openLineage lover.