Spark+AI Summit 2020 Overview

Alexandre Bergere
Published in datalex · 7 min read · Aug 25, 2020

This year, due to exceptional circumstances (COVID-19), Spark+AI Summit 2020 had to adapt and plan a different kind of summit. For the first time, the summit was entirely online and free. Anyone who registered could watch the sessions, chat with the speakers during the presentations (as they were pre-recorded) and ask questions of Databricks team members. The virtual summit was a real opportunity to immerse oneself for a week in the world of Spark and the Databricks ecosystem.

The data and AI community continues to grow each year, bringing new innovations around data engineering, data science and machine learning. This year the virtual Spark + AI Summit 2020 was the largest community event ever with almost 70,000 registrants!

I would like to share with you how the summit went, the different updates announced during the event, and the main trends I followed.

This year has also been a special one for the Spark community, for two reasons: the release of Spark 3.0 and Spark’s 10th anniversary.

Matei Zaharia & Ali Ghodsi

Keynotes:

Every morning and afternoon started with a keynote and, as at any other summit or tech event, roadmaps, new features, demos and client projects were presented.

Three interesting presentations were given by the main actors of Databricks & Spark:

Main keynotes’ speakers — https://databricks.com/sparkaisummit/north-america-2020

All Keynotes are available here.

What’s new?

The summit is an opportunity to share new updates or features. A few were announced during the summit:

  • Introduction of Delta Engine: accelerates the performance of Delta Lake for SQL and data frame workloads to make it easier for customers to adopt and scale a lakehouse architecture.
  • Acquisition of Redash, the company behind the open source project of the same name: Redash is a collaborative visualization and dashboarding platform designed to enable anyone, regardless of their level of technical sophistication, to share insights within and across teams. SQL users leverage Redash to explore, query, visualize, and share data from any data source.
  • Introducing Koalas 1.0: It now implements the most commonly used pandas APIs, with 80% coverage of all the pandas APIs. In addition, Koalas supports Apache Spark 3.0, Python 3.8, Spark accessor, new type hints, and better in-place operations.
  • Introducing the Next-Generation Data Science Workspace: the next-generation Data Science Workspace on Databricks provides an open and unified experience for modern data teams.
  • MLflow + Linux Foundation: after Delta last year, the Linux Foundation will host the open source project MLflow, in order to make MLflow the open standard for machine learning platforms.
Next-Generation Data Science Workspace
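The Koalas announcement above is easiest to appreciate in code. Here is a minimal sketch using plain pandas (the city/sales data is my own illustrative example): because Koalas mirrors the pandas API, in many cases swapping the import is the only change needed to run the same code distributed on Spark.

```python
import pandas as pd  # with Koalas on Spark: import databricks.koalas as ks

# The same DataFrame code works in both libraries; with Koalas it would
# execute on a Spark cluster instead of in local memory.
df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris"], "sales": [10, 5, 7]})
totals = df.groupby("city")["sales"].sum()
print(int(totals["Paris"]))  # 17
```

This "swap the import" design is exactly what the 80% API coverage claim refers to: the most commonly used pandas operations (indexing, groupby, aggregations) behave identically.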

Training:

From June 22nd to 23rd, training workshops were held that included a mix of instruction and practical exercises to help you improve your skills on Apache Spark.

Different themes were proposed, with each session lasting for three hours:

  • Spark: Introduction to Apache Spark Programming; Apache Spark Tuning and Best Practices
  • Data science: Introduction to Reinforcement Learning; Scaling Deep Learning with TensorFlow and Apache Spark; Apache Spark for Machine Learning and Data Science; MLflow: Managing the Machine Learning Lifecycle
  • Delta: Building Better Data Pipelines for Apache Spark with Delta Lake
  • SQL on Databricks
  • Stream: Structured Streaming with Databricks
  • Introduction to Unified Data Analytics for Managers
  • Databricks Administration

Certification:

An online session was available on June 23rd to take the Databricks Certified Associate for Apache Spark 2.4 exam; you also had the opportunity to choose the half-day Databricks Certification Exam prep course for $200.

The Databricks Certified Associate for Apache Spark 2.4 validates your knowledge of the core components of the DataFrames API, as well as a rudimentary knowledge of the Spark architecture.

With the new release of Spark 3.0, you can now also take the Databricks Certified Associate Developer for Apache Spark 3.0 exam.

A series of sessions

As usual, hundreds of sessions were available, split into different topics: Analytics, Apache Spark Use Cases, Architecture, Databricks Tech Talks, Deep Learning, Hands-on Tutorials, Machine Learning, Python, Sponsored Sessions, Technical Deep Dives, & Technical vs Non-Technical Techniques.

Spark 3.0

The summit was the occasion to present the new big release of Apache Spark — Spark 3.0.0.

Here are the biggest new features in Spark 3.0:

https://databricks.com/fr/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html

You can find the full list of new features in this article, or deep dive into Spark 3.0 by watching the following session:

The Apache Spark 3.0.0 release is also available on Databricks as part of Databricks Runtime 7.0.

Apache Spark & Big Data community

Spark+AI Summit is the main event for getting started with Apache Spark™ and its community.

Beyond Databricks, the event gathers sessions about Apache Spark and about other Apache projects that interact with it.

Apache Spark — Lightning-fast unified analytics engine

Delta

Delta Lake is an open-source storage layer that brings ACID transactions, on top of the Parquet format, to Apache Spark™ and big data workloads.

Delta Lake uses versioned Parquet files to store your data in your cloud storage. In addition to the versions, Delta Lake stores a transaction log to keep track of all the commits made to the table or blob store directory, which is what provides the ACID transactions.

Open-source storage layer that brings ACID
transactions to Apache Spark™ and big data workloads — https://delta.io/
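To make the "versioned data files plus transaction log" idea concrete, here is a toy sketch in plain Python. It is not Delta Lake's actual implementation (real Delta writes Parquet data files and JSON commit entries under a `_delta_log` directory, with a far richer protocol); the file layout and helper functions here are simplified stand-ins to show how replaying the log enables time travel.

```python
import json, os, tempfile

root = tempfile.mkdtemp()
log_dir = os.path.join(root, "_delta_log")
os.makedirs(log_dir)

def commit(version, added_rows):
    # Write an immutable data file, then record it in the transaction log.
    data_file = os.path.join(root, f"part-{version}.json")
    with open(data_file, "w") as f:
        json.dump(added_rows, f)
    with open(os.path.join(log_dir, f"{version:020d}.json"), "w") as f:
        json.dump({"add": data_file}, f)

def read(as_of_version=None):
    # Replay commits in order, stopping at the requested version
    # ("time travel"); the latest state is just the full replay.
    rows = []
    for name in sorted(os.listdir(log_dir)):
        version = int(name.split(".")[0])
        if as_of_version is not None and version > as_of_version:
            break
        with open(os.path.join(log_dir, name)) as f:
            entry = json.load(f)
        with open(entry["add"]) as f:
            rows.extend(json.load(f))
    return rows

commit(0, [{"id": 1}])
commit(1, [{"id": 2}])
print(len(read()))                 # 2 (latest version)
print(len(read(as_of_version=0)))  # 1 (time travel to version 0)
```

Because data files are immutable and only the log determines table state, a commit either appears in the log (and is fully visible) or it does not, which is the essence of the atomicity guarantee.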

Many sessions during the summit covered Delta; here are just a few that you should watch:

For more information, don’t hesitate to join the different communities:

Lakehouse architecture

https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html

Just before the summit, Databricks announced Delta Engine, which ties together a 100% Apache Spark-compatible vectorized query engine to take advantage of modern CPU architecture with optimizations to Spark 3.0’s query optimizer and caching capabilities that were launched as part of Databricks Runtime 7.0. Together, these features significantly accelerate query performance on data lakes, especially those enabled by Delta Lake, to make it easier for customers to adopt and scale a lakehouse architecture.

The research paper on the inner workings of the Lakehouse was accepted and published at VLDB’2020.

Integration with other environments

Spark+AI Summit was also the opportunity to discover use cases and projects that implement Spark in real architectures, and to see how it interacts with other environments and tools.

Here are some useful architecture and technology implementations:

All videos and supporting materials are already available.

If you want to learn more about Spark, Delta or MLflow, I invite you to check the documentation, the Databricks blog or the Databricks Academy for deep-dive training.

Databricks Academy — https://academy.databricks.com/


Data Architect & Solution Architect independent ☁️ Delta & openLineage lover.