Data+AI Summit Europe 2020 Overview
The 2020 European edition of the Spark+AI Summit was held online and free of charge, like its North American counterpart earlier in the year, due to the exceptional circumstances imposed by the COVID-19 pandemic.
Furthermore, this twice-yearly conference organized by Databricks was rebranded Data + AI Summit Europe 2020.
I would like to share with you how the summit went, the various updates announced during the event, and the main trends I followed and recommend.
My first Data+AI Summit as a speaker:
It was a great honor to be selected as a speaker at the Data+AI Summit, with Kaoula Ghribi, to talk about one of our projects at SNCF:
“Building a Streaming Data Pipeline for Train Delays Processing”
Don’t hesitate to watch our session: https://databricks.com/session_eu20/building-a-streaming-data-pipeline-for-trains-delays-processing.
Keynotes:
Every morning and afternoon started with a keynote where, as at most other summits and tech events, roadmaps, new features, demos, and client projects were presented.
Three interesting presentations were given by the main actors of Databricks & Spark:
- Matei Zaharia and Sue Ann Hong: Simplifying Model Development and Management with MLflow
- Ali Ghodsi: Realizing the Vision of the Data Lakehouse
- Reynold Xin: Delta Engine: High Performance Query Engine for Delta Lake
Another talk I found really interesting:
- The passionate presentation by Dr. Mae Jemison on her mission to pursue an extraordinary future, and how it can be achieved. Check out her project: https://100yss.org/.
What’s new?
The summit is an opportunity to share new updates or features. A few were announced during the summit:
- The launch of SQL Analytics after the acquisition of Redash: SQL Analytics provides a simple experience for SQL users who want to run quick ad-hoc queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards.
- Spark on Kubernetes will officially be declared Generally Available and Production-Ready with the upcoming version of Spark (3.1) to be released in December 2020 or January 2021.
- Introducing Koalas 1.4: it now implements the most commonly used pandas APIs, covering almost 85% of all pandas APIs. In addition, Koalas supports Apache Spark 3.0, Python 3.8, the Spark accessor, auto-completion and static error checking, new type hints, and better in-place operations.
- Python Autocomplete Improvements: Notebook enhancements for Python autocomplete, docstrings, and Koalas library.
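As a side note on the Spark-on-Kubernetes announcement above: this mode is driven entirely by spark-submit configuration against a Kubernetes API endpoint. A minimal sketch, where the cluster address, namespace, and container image are placeholders, not real values:

```shell
# Submit the bundled SparkPi example to a Kubernetes cluster.
# "my-cluster", "spark-jobs" and "my-repo/spark:3.0.1" are placeholders.
spark-submit \
  --master k8s://https://my-cluster:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=my-repo/spark:3.0.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar
```

The `local://` scheme tells Spark the jar already lives inside the container image, which is the usual pattern for cluster-mode submissions on Kubernetes.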
Training:
On 17 November, training workshops were held that mixed instruction with practical exercises to help you improve your Apache Spark skills.
Different themes were proposed, with each session lasting half a day or a full day.
This year I decided to follow the ‘Performance Tuning on Apache Spark’ training, a one-day deep dive into some of the most significant performance problems associated with developing Spark applications.
So much good input in just one day.
A series of sessions
As usual, hundreds of sessions were available, split into different topics: Analytics, Apache Spark Use Cases, Architecture, Databricks Tech Talks, Deep Learning, Hands-on Tutorials, Machine Learning, Python, Sponsored Sessions, and Technical Deep Dives, covering both technical and non-technical techniques.
Spark 3.0 & APIs
You can find the full list of new features in this article, or deep dive into Spark 3.0 by watching the following sessions:
- Extending Apache Spark — Beyond Spark Session Extensions by Bartosz Konieczny (data engineer at Octo Technology)
- What is New with Apache Spark Performance Monitoring in Spark 3.0 by Luca Canali (data engineer at CERN)
- Comprehensive View on Date-time APIs of Apache Spark 3.0 by Maxim Gekk (software engineer at Databricks)
Spark SQL:
- Spark SQL Beyond Official Documentation by David Vrba (senior machine learning engineer at Socialbakers)
- Spark SQL Join Improvement at Facebook by Cheng Su (software engineer at Facebook)
- Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and Parquet Reader by Ke Sun and Jun Guo (senior engineer & data engine team lead at Bytedance)
Improving Pyspark:
- Project Zen: Improving Apache Spark for Python Users by Hyukjin Kwon (software engineer at Databricks) — participate in the project or check out more information.
- Optimizing Apache Spark UDFs by Shivangi Srivastava (senior engineering leader at Informatica)
You can find the full Apache Spark 3.0 documentation here.
The Apache Spark 3.0.0 release is also available on Databricks as part of Databricks Runtime 7.0.
Lake House Architecture
New systems are beginning to emerge that address the limitations of data lakes. A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses:
- Make the most recent data available quickly and with the least amount of effort and cost
- Store data in a format that is reliable and performant
- Join internal and external data sources to derive deeper trends and insights
- Enable data lake data to be accessed directly via SQL or Python
- Tune data to be performant for the most common use cases
- Avoid having to move the data to other locations unless absolutely necessary
- Achieving Lakehouse Models with Spark 3.0 by Simon Whiteley (cloud solution architect at Advancing Analytics) — one of my favorite talks of the summit (approaching Kimball-style modelling problems and how to use SCD and SQL MERGE)
- Radical Speed for your SQL Queries with Delta Engine by Todd Greenstein (software engineer at Databricks)
- Designing and Implementing a Real-time Data Lake with Dynamically Changing Schema by Mate Gulyas and Shasidhar Eranti (practice lead and solution architect at Databricks)
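To give an intuition for the SCD pattern discussed in Simon Whiteley's talk, here is a plain-Python sketch of a Type 2 slowly-changing-dimension merge. The field names (`id`, `city`, `valid_from`, `valid_to`) are hypothetical, and in a real lakehouse this logic would be expressed with Delta Lake's SQL MERGE rather than hand-rolled:

```python
from datetime import date

def scd2_merge(current, updates, today):
    """Toy Type 2 SCD merge: when a tracked attribute changes, the old
    row is closed out (valid_to set) and a fresh active row is appended,
    mimicking what a Delta Lake MERGE statement would do."""
    result = [dict(row) for row in current]
    # Index the currently active row (valid_to is None) for each key.
    active = {row["id"]: row for row in result if row["valid_to"] is None}
    for upd in updates:
        row = active.get(upd["id"])
        if row is not None and row["city"] == upd["city"]:
            continue  # no change: nothing to do
        if row is not None:
            row["valid_to"] = today  # close the old version
        result.append({"id": upd["id"], "city": upd["city"],
                       "valid_from": today, "valid_to": None})
    return result

history = [{"id": 1, "city": "Paris",
            "valid_from": date(2019, 1, 1), "valid_to": None}]
merged = scd2_merge(history, [{"id": 1, "city": "Lyon"}],
                    date(2020, 11, 18))
# The Paris row is closed out and a new active Lyon row is appended.
```

The point of the pattern is that history is never overwritten: every version of a record stays queryable through its validity interval.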
Don’t hesitate to check the official documentation here,
and even the research paper — Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics.
Delta
Delta Lake is an open-source storage layer that brings ACID transactions, on top of the Parquet format, to Apache Spark™ and big data workloads.
Delta Lake 0.8.0 was released a few weeks ago.
The following presentations provide an in-depth look at the technology:
- Designing and Implementing a Real-time Data Lake with Dynamically Changing Schema by Mate Gulyas & Shasidhar Eranti (practice lead and solution architect at Databricks)
- Delta Lake: Optimizing Merge by Justin Breese (solutions architect at Databricks)
- Delta: Building Merge on Read by Justin Breese (solutions architect at Databricks)
- Diving into Delta Lake: Unpacking the Transaction Log by Denny Lee and Burak Yavuz (developer advocate & software engineer at Databricks)
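To build intuition for the transaction-log idea behind that last talk, here is a toy append-only commit log in plain Python. This is a deliberate simplification for illustration, not Delta Lake's actual format or API: real Delta commits carry richer actions (metadata, schema, statistics) and use optimistic concurrency control.

```python
import json
import os
import tempfile

class ToyTransactionLog:
    """Minimal append-only commit log, loosely inspired by Delta Lake's
    _delta_log directory: each commit is a numbered JSON file, and the
    table snapshot is reconstructed by replaying commits in order."""

    def __init__(self, log_dir):
        self.log_dir = log_dir
        os.makedirs(log_dir, exist_ok=True)

    def commit(self, actions):
        version = len(os.listdir(self.log_dir))  # next commit number
        path = os.path.join(self.log_dir, f"{version:020d}.json")
        with open(path, "w") as f:
            json.dump(actions, f)
        return version

    def active_files(self):
        # Replay every commit in order to compute the current snapshot.
        files = set()
        for name in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, name)) as f:
                for action in json.load(f):
                    if action["op"] == "add":
                        files.add(action["file"])
                    elif action["op"] == "remove":
                        files.discard(action["file"])
        return files

log = ToyTransactionLog(tempfile.mkdtemp())
log.commit([{"op": "add", "file": "part-0.parquet"}])
log.commit([{"op": "add", "file": "part-1.parquet"},
            {"op": "remove", "file": "part-0.parquet"}])
# Replaying the log yields the current snapshot: only part-1.parquet.
```

Because data files are never mutated in place and each commit is an atomic file creation, readers always see a consistent snapshot — the essence of how Delta brings ACID semantics to a data lake.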
Integration with other environments
The summit was also an opportunity to discover use cases and projects that implement Spark in real architectures, and to see how it interacts with other environments and tools.
Here are some useful architecture and technology implementations:
- The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analytics Engineer by Jeremy Cohen (product manager at Fishtown Analytics)
- Building a Cloud Data Lake with Databricks and AWS by Denis Dubeau & Igor Alekseev (solution architects at Databricks and AWS)
- Building Identity Graph at Scale for Programmatic Media Buying Using Apache Spark and Delta Lake by Sneha Chokshi & Bikash Singh (both data engineers at MIQ Digital)
- Getting Started with Apache Spark on Kubernetes by Jean-Yves Stephan & Julien Dumazert (both Co-Founder of Data Mechanics)
- Using Tableau to Analyze Your Data Lake by Blair Hutchinson (product manager at Tableau)
Meetup
The summit was also the occasion to follow two 90-minute meetups:
- Wednesday — Apache Spark™ 3.0 Deep Dives Meetup: a recap of the keynote highlights and personal session picks, presented by Jules Damji and Denny Lee (developer advocates at Databricks). Jacek Laskowski (full stack engineer at Twilio) spoke about Spark 3.0 internals, and Scott Haines (independent consultant) discussed structured streaming microservice architectures.
- Thursday — MLflow, Delta Lake and Lakehouse Use Cases Meetup: a recap of the top announcements and favorite sessions, followed by talks about MLflow, Delta Lake and the Lakehouse from Andre Mesarovic, Oliver Koernig, Denny Lee and Jules Damji (developer advocates & solutions architects at Databricks)
Don’t hesitate to follow the Data Brew podcast.
All videos and slides are already available.
If you want to learn more about Spark, Delta or MLflow, I invite you to check the documentation, the Databricks blog, or Databricks Academy for deep-dive training.