Data+AI Summit Europe 2020 Overview

Alexandre Bergere
Published in datalex · 7 min read · Feb 19, 2021

The European edition of Spark+AI Summit 2020 was held online and free of charge, just like its North American counterpart earlier in the year, due to the exceptional circumstances imposed by the COVID-19 pandemic.

Furthermore, this biannual conference held by Databricks was rebranded Data + AI Summit Europe 2020.

I would like to share with you how the summit went, the different updates announced during the event, and the main trends I followed and recommend.

My first Data+AI Summit as a speaker:

It was a great honor to be selected as a speaker at the Data+AI Summit, with Kaoula Ghribi, to talk about one of our projects at SNCF:

“Building a Streaming Data Pipeline for Train Delays Processing”

A Little journey to submit a talk to Data&AI Summit

Don’t hesitate to watch our session: https://databricks.com/session_eu20/building-a-streaming-data-pipeline-for-trains-delays-processing.

Keynotes:

Every morning and afternoon started with a keynote, and as at any other summit or tech event, roadmaps, new features, demos, and customer projects were presented.

Three interesting presentations were given by the key figures behind Databricks & Spark:

I found another talk really interesting:

Van Gogh — The Starry Night (La Nuit étoilée)

What’s new?

The summit is an opportunity to share new updates and features. A few were announced during the event:

  • The launch of SQL Analytics after the acquisition of Redash: SQL Analytics provides a simple experience for SQL users who want to run quick ad-hoc queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards.
  • Spark on Kubernetes will officially be declared Generally Available and Production-Ready with the upcoming version of Spark (3.1) to be released in December 2020 or January 2021.
Spark on Kubernetes improvements
  • Introducing Koalas 1.4: it now implements the most commonly used pandas APIs, with almost 85% coverage of all pandas APIs. In addition, Koalas supports Apache Spark 3.0, Python 3.8, the Spark accessor, auto-completion and static error checking, new type hints, and better in-place operations (see the sketch after this list).
  • Python Autocomplete Improvements: Notebook enhancements for Python autocomplete, docstrings, and Koalas library.
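To give a feel for Koalas, here is a minimal sketch of its pandas-like API running on Spark. It assumes Koalas is installed (`pip install koalas`) in a Spark 3.0 environment; the DataFrame contents are made up for illustration.

```python
# Minimal Koalas sketch: the pandas API, executed on Spark.
import databricks.koalas as ks

# Create a Koalas DataFrame exactly as you would with pandas
kdf = ks.DataFrame({
    "train": ["TGV-1", "TGV-2", "TER-1"],
    "delay_min": [5, 12, 0],
})

# Familiar pandas operations, distributed by Spark under the hood
print(kdf[kdf["delay_min"] > 0].sort_values("delay_min", ascending=False))
```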

Training:

On the 17th of November, training workshops were held that mixed instruction with practical exercises to help attendees improve their skills on Apache Spark.

Different themes were proposed, with each session lasting half a day or a full day.

This year I decided to follow the ‘Performance Tuning on Apache Spark’ training — a one-day deep dive into some of the most significant performance problems associated with developing Spark applications.

So much good input in just one day.

A series of sessions

As usual, hundreds of sessions were available, split into different topics: Analytics, Apache Spark Use Cases, Architecture, Databricks Tech Talks, Deep Learning, Hands-on Tutorials, Machine Learning, Python, Sponsored Sessions, Technical Deep Dives, and Technical vs Non-Technical Techniques.

Spark 3.0 & APIs

Spark 3.0 performance
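A large share of those gains comes from Adaptive Query Execution (AQE), which re-optimizes query plans at runtime. Here is a minimal sketch of enabling it, assuming PySpark 3.0 is installed locally; the aggregation is a made-up example that just exercises a shuffle.

```python
# Minimal sketch: enabling Adaptive Query Execution in Spark 3.0.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark3-aqe-demo")
    # AQE re-optimizes plans at runtime, e.g. coalescing shuffle
    # partitions and switching join strategies from real statistics
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

# A simple aggregation whose shuffle AQE can optimize
df = spark.range(1_000_000)
df.groupBy((df.id % 10).alias("bucket")).count().show()
```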

You can find the full list of new features in this article, or deep dive into Spark 3.0 by watching the following session:

Spark SQL:

Improving PySpark:

68% of notebook commands on Databricks are in Python
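One concrete PySpark improvement in Spark 3.0 is the redesigned pandas UDF API based on Python type hints. A minimal sketch, assuming the `spark` session from the previous snippet and pandas/PyArrow installed; the delay-to-hours conversion is a made-up example.

```python
# Minimal sketch of a Spark 3.0 pandas UDF declared with type hints.
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def to_hours(delay_min: pd.Series) -> pd.Series:
    # Vectorized computation on a pandas Series, exchanged via Arrow
    return delay_min / 60.0

df = spark.createDataFrame([(90,), (45,)], ["delay_min"])
df.select(to_hours("delay_min").alias("delay_hours")).show()
```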

You can find the full documentation on Apache Spark 3.0 over here.

The Apache Spark 3.0.0 release is also available on Databricks as part of Databricks Runtime 7.0.

Lakehouse Architecture

New systems are beginning to emerge that address the limitations of data lakes. A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses:

  • Make the most recent data available quickly and with the least amount of effort/cost
  • Store data in a format that is reliable and performant
  • Join internal and external data sources to derive deeper trends and insights
  • Enable data lake data to be accessible directly via SQL or Python (illustrated in the sketch below)
  • Tune data to be performant for the most common use cases
  • Avoid having to move the data to other locations unless absolutely necessary
What is a Lakehouse?
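To make the direct-access point concrete, here is a minimal sketch of the same Delta files on the data lake being queried from both Python and SQL. The path and records are hypothetical, and it assumes a Spark session with Delta Lake configured.

```python
# Hypothetical data-lake path, for illustration only
path = "/tmp/datalake/train_delays"

# Land some made-up delay records as Delta files on the lake
spark.createDataFrame(
    [("TGV-1", 5), ("TGV-2", 12), ("TER-1", 25)],
    ["train", "delay_min"],
).write.format("delta").mode("overwrite").save(path)

# Python access: read the files directly, no warehouse copy involved
df = spark.read.format("delta").load(path)
df.filter(df.delay_min > 10).show()

# SQL access over the very same storage
spark.sql(
    f"SELECT train, delay_min FROM delta.`{path}` WHERE delay_min > 10"
).show()
```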

Don’t hesitate to check the official documentation over here:

And even the research paper — Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics.

Delta

Delta Lake is an open-source storage layer that brings ACID transactions, on top of the Parquet format, to Apache Spark™ and big data workloads.

Delta Lake 0.8.0 was released a few weeks ago.
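As a quick illustration of what that storage layer gives you, here is a minimal sketch of versioned, transactional writes and time travel. The path is hypothetical, and it assumes a Spark session with the delta-core package (e.g. io.delta:delta-core_2.12:0.8.0) configured.

```python
# Hypothetical path, for illustration only
path = "/tmp/delta/delays"

# Each write is an atomic, versioned transaction on top of Parquet
spark.range(5).write.format("delta").mode("overwrite").save(path)
spark.range(10).write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as of an earlier committed version
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 5 rows: the first committed version
```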

The following presentations provide an in-depth look at the technology:

Integration with other environments

The summit was also an opportunity to discover use cases and projects that implement Spark in real architectures, and to see how Spark interacts with other environments and tools.

Here are some useful architecture and technology implementations:

Meetups

The Data+AI Summit was also an occasion to follow two 90-minute meetups:

https://databricks.buzzsprout.com/1370119
  • Wednesday — Apache Spark™ 3.0 Deep Dives Meetup: a recap of the summit keynote highlights and personal session picks, presented by Jules Damji and Denny Lee (developer advocates at Databricks). Jacek Laskowski (full stack engineer at Twilio) spoke about Spark 3.0 internals, and Scott Haines (independent consultant) discussed structured streaming microservice architectures.
  • Thursday — MLflow, Delta Lake and Lakehouse Use Cases Meetup: a recap of the top announcements and favorite sessions, followed by talks about MLflow, Delta Lake and the Lakehouse from Andre Mesarovic, Oliver Koernig, Denny Lee and Jules Damji (developer advocates & solutions architects at Databricks).

Don’t hesitate to follow the Data Brew podcast.

All videos and supporting materials are already available.

If you want to learn more about Spark, Delta, or MLflow, I invite you to check the documentation, the Databricks blog, or Databricks Academy for deep-dive training.

Databricks Academy — https://academy.databricks.com/

Alexandre Bergere, independent Data Architect & Solution Architect ☁️ Delta & openLineage lover.