Data+AI Summit 2023 Overview

Alexandre Bergere
Published in datalex
Jul 30, 2023

Having spent three fantastic days in San Francisco attending the 2023 Data & AI Summit (often likened to the Super Bowl of tech), I’m excited to share with you a comprehensive overview of the summit’s highlights, the array of updates unveiled during the event, and the sessions that captured my attention.

How did the summit go?

Data & AI Summit (formerly known as Spark + AI Summit) took place from the 26th to the 29th of June at the Moscone Center in San Francisco, and virtually from the 28th to the 29th. This year, more than 12,000 people attended the conference.

The schedule included training sessions on the 26th and 27th, with the 27th specifically designated as the Partner Summit. The 28th and 29th were dedicated to the main conference.

You could follow conferences from 8:00 AM to 5:00 PM: around twenty ran concurrently, each lasting 40 minutes, with a 20-minute break between them.

Conferences were grouped under different classifications: Data Engineering, Data Governance, Data Lakehouse Architecture, Data Sharing, Data Strategy, Data Streaming, Data Warehousing — Analytics — and BI, Databricks Experience (DBX), DSML: ML Use Cases / Technologies, DSML: Production ML / MLOps.

Training:

On the 26th & 27th, training workshops took place that included a mix of instruction and hands-on exercises to help you improve your skills in Apache Spark, Delta, Unity Catalog or Databricks architecture.

Different themes were proposed:

  • AWS Databricks Platform Architecture (half-day session)
  • Azure Databricks Platform Architecture (half-day session)
  • Introduction to Databricks SQL (half-day session — Repeat)
  • Scalable Machine Learning with Apache Spark™ (2-day session)
  • Performance Tuning on Apache Spark™ (1-day session — Repeat)
  • Machine Learning in Production (1-day session — Repeat)
  • Introduction to Apache Spark™ (1-day session — Repeat)
  • Deep Learning with Databricks (2-day session)
  • Data Engineering with Databricks (2-day session)
  • Data Analysis with Databricks SQL (1-day session — Repeat)
  • Building and Deploying Large Language Models on Databricks (1-day session — Repeat)
  • Advanced Data Engineering with Databricks (2-day session)
  • Databricks Platform Administrator (1-day session)

Certification:

A testing room was available on all four days of the Summit to take any kind of Databricks certification.

During the event, participants were rewarded with a complimentary sweater upon passing their certification.

Keynotes:

Each day commenced with an impactful keynote, following the tradition of other prominent summits and tech events, where attendees were treated to roadmaps, introductions to new features, engaging demos, and real-life client success stories.

Both keynotes are available for viewing; the Wednesday one left the stronger impression on me.

You can also follow the great summary of the keynote written by Michael Segner: Databricks Data + AI Summit 2023 Keynote Recap: LakehouseIQ, Delta Lake 3.0, and More!.

Sponsors booths:

The booths serve as central hubs for networking, collaboration, and knowledge-sharing during the summit. Attendees have the opportunity to engage with representatives from various companies, ask questions, and witness live demonstrations of cutting-edge technologies. They can also access marketing materials, brochures, goodies, and samples to gain deeper insights into the products and services being offered.

Partner summit:

This year, I had the opportunity to participate in the Partner Summit (Datalex being a partner, which makes me feel proud). The summit featured several insightful sessions, recent announcements, in-depth discussions, and awards.

What’s new:

The summit is an opportunity to share new updates or features. This year big changes were announced:

Lakehouse AI:

Reinventing the data warehouse using Lakehouse AI by making it easy, fast & cheap.

  • Lakehouse AI: a unique data-centric approach that empowers customers to develop and deploy AI models with speed, reliability, and full governance.
  • LakehouseIQ: a knowledge engine that learns the unique nuances of your business and data to power natural language access to it for a wide range of use cases. Any employee in your organisation can use LakehouseIQ to search, understand, and query data in natural language.

Unity Catalog:

  • Lakehouse Federation: Discover, query and govern your data wherever it lives — access external systems & unify access federation — connect directly to other ecosystems.
  • Governance for AI: expanding the governance model within Unity Catalog to provide comprehensive management of both AI assets and data in a unified experience. This consolidation simplifies DataOps and MLOps processes, and prepares organisations for AI compliance, by bringing together all the necessary capabilities in one centralised location. Key enhancements include: MLflow Models, Features Tables & Volumes for unstructured file data.
  • AI for governance: Lakehouse Monitoring (sign up for preview) and Lakehouse Observability make governance simpler with AI.
  • New assets in Unity Catalog: Delta Live Tables, Volumes, Row & column filtering (preview), Table auto-tuning (preview), Tags (preview), Catalog workspace binding, Hive Metastore sync, UNDROP, Hive Metastore API (preview), Materialized views (preview), Primary & foreign keys, Table insights, Clean Rooms (preview)
  • Apache Hive Interface for Unity Catalog: preview of a Hive Metastore (HMS) interface for Databricks Unity Catalog, which allows any software compatible with Apache Hive to connect to Unity Catalog.
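Most of these features build on Unity Catalog's three-level namespace (catalog.schema.table) and its SQL-based permission model. Here is a minimal sketch, assuming a Databricks cluster or SQL warehouse attached to a Unity Catalog metastore; the catalog, schema, table and group names are hypothetical:

```python
# Minimal Unity Catalog sketch; the catalog/schema/table and the `analysts` group are hypothetical.
# Assumes `spark` is a SparkSession on a Databricks cluster attached to a Unity Catalog metastore.
spark.sql("CREATE CATALOG IF NOT EXISTS main")
spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders (
        order_id BIGINT,
        country  STRING,
        amount   DOUBLE
    ) USING DELTA
""")

# Governance through standard SQL grants on the three-level namespace.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Querying with the catalog.schema.table namespace.
spark.sql("SELECT country, SUM(amount) FROM main.sales.orders GROUP BY country").show()
```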

Delta Lake 3.0:

  • Delta Universal Format (UniForm) enables reading Delta in the format needed by the application, improving compatibility and expanding the ecosystem: Delta automatically generates the metadata needed for Apache Iceberg or Apache Hudi, so users don’t have to choose between formats or run manual conversions. With UniForm, Delta is the universal format that works across ecosystems, allowing Delta tables to be read with Iceberg or Hudi reader clients.
  • Delta Kernel simplifies building Delta connectors by providing simple, narrow programmatic APIs that hide all the complex details of the Delta protocol specification. The project is a set of Java libraries for building connectors that can read (and soon, write to) Delta tables without the need to understand the protocol details.
  • Liquid Clustering simplifies getting the best query performance with cost-efficient clustering as the data grows.
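To make these concrete, here is a minimal sketch of enabling UniForm and Liquid Clustering on new tables. The table names are hypothetical, and the property and clause names follow the previews announced at the summit, so they may differ across Delta and Databricks Runtime versions:

```python
# Hypothetical tables; syntax follows the Delta Lake 3.0 / Databricks previews shown at the summit
# and may require additional compatibility properties depending on the runtime version.

# UniForm: ask Delta to also generate Apache Iceberg metadata alongside the Delta log.
spark.sql("""
    CREATE TABLE main.sales.events (id BIGINT, ts TIMESTAMP)
    USING DELTA
    TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg')
""")

# Liquid Clustering: declare clustering columns instead of partitions or ZORDER.
spark.sql("""
    CREATE TABLE main.sales.clicks (user_id BIGINT, ts TIMESTAMP, url STRING)
    USING DELTA
    CLUSTER BY (user_id)
""")
```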

Delta Sharing:

  • New features available: Sharing notebooks (GA), View Sharing (Public Preview), Sharing Schemas, Volumes for files (Private Preview), Sharing AI
  • New Partners to Accelerate Data Sharing: Expanding its data sharing ecosystem, Databricks announces partnerships with Cloudflare, Dell, Oracle, and Twilio.

Databricks Marketplace:

  • General Availability: an open marketplace for data solutions, so any client that can read Delta Shares can access the marketplace. The Databricks Marketplace enables you to share and exchange data assets, including data sets and notebooks, in the public marketplace or private exchanges. Databricks Marketplace is open to non-Databricks users as well. Coming soon: AI Model-Sharing Capabilities and Lakehouse Apps.

Spark:

  • English as the New Programming Language for Apache Spark: the English SDK for Apache Spark is a transformative tool designed to enrich your Spark experience. With the innovative application of Generative AI, the English SDK seeks to expand this vibrant community by making Spark more user-friendly and approachable than ever. — “The hottest new programming language is English” — A. Karpathy
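A minimal sketch of what this looks like with the open-source pyspark-ai package: the DataFrame and the English instruction below are hypothetical, and the SDK needs an LLM backend (for example an OpenAI API key) configured to work.

```python
# Requires: pip install pyspark-ai, plus an LLM backend (e.g. an OPENAI_API_KEY in the environment).
from pyspark.sql import SparkSession
from pyspark_ai import SparkAI

spark = SparkSession.builder.getOrCreate()

spark_ai = SparkAI()   # picks up the active SparkSession
spark_ai.activate()    # enables the df.ai.* helpers

# Hypothetical DataFrame, transformed with a plain-English instruction.
df = spark.createDataFrame(
    [("FR", 120.0), ("US", 340.0), ("FR", 80.0)],
    ["country", "amount"],
)
df.ai.transform("total amount per country, sorted in descending order").show()
```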

Tools (announced or highlighted):

  • Demo Center: Bite-size overviews. Interactive product tours. Hands-on tutorials. Explore all demos.
  • DiscoverX: a groundbreaking open source library that empowers data professionals with automated data classification and cross-table query capabilities within the Databricks Lakehouse.
  • Databricks SDK for Go (beta)
  • Databricks SDK for Python (beta)
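As an illustration of the Python SDK, here is a minimal sketch that lists the clusters of a workspace. It assumes authentication is already configured (for example via DATABRICKS_HOST and DATABRICKS_TOKEN environment variables); nothing else in it is specific to this article.

```python
# Requires: pip install databricks-sdk, with DATABRICKS_HOST / DATABRICKS_TOKEN
# (or another supported authentication method) configured in the environment.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List the clusters of the workspace and print their names and states.
for cluster in w.clusters.list():
    print(cluster.cluster_name, cluster.state)
```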

Acquisitions:

In the past few weeks, Databricks has made several acquisitions to accelerate its vision.

  • MosaicML — Enable any company to build, own and secure best-in-class generative AI models: MosaicML is known for its state-of-the-art MPT large language models (LLMs). With over 3.3 million downloads of MPT-7B and the recent release of MPT-30B, MosaicML has showcased how organizations can quickly build and train their own state-of-the-art models using their data in a cost-effective way.
  • Okera — Adopting an AI-centric approach to governance: Okera solves data privacy and governance challenges across the spectrum of data and AI. It simplifies data visibility and transparency, helping organizations understand their data, which is essential in the age of LLMs and to address concerns about their biases.
  • Bit.io — Investing in the Developer Experience: bit.io was “the fastest way to get a Postgres database”. To get started, you just had to send your data and the database was already set up. Looking at the press release, the Databricks acquisition is a team acquisition to improve their own developer experience.

A bunch of sessions:

You could follow breakout sessions from 11:00 AM to 5:00 PM: around twenty ran concurrently, each lasting 40 minutes, with a 20-minute break between them.

Sessions were grouped under different classifications: Data Engineering, Data Governance, Data Lakehouse Architecture, Data Sharing, Data Strategy, Data Streaming, Data Warehousing — Analytics — and BI, Databricks Experience (DBX), DSML: ML Use Cases / Technologies, DSML: Production ML / MLOps.

Here are the sessions I attended during the event.

All sessions are available here: https://www.databricks.com/dataaisummit/sessions/.

I’ll be concentrating on the main sessions that I attended during the summit (or watched afterward), providing deep links and suggestions.

Databricks features deep dive:

For a more in-depth exploration of some of the latest features announced, make sure to check out the following breakout sessions:

  • Delta Live Tables A to Z: Best Practices for Modern Data Pipelines: technical deep dive into how Delta Live Tables (DLT) reduces the complexity of data transformation and ETL. Learn what’s new, what’s coming, and how to easily master the ins and outs of DLT — by Michael Armbrust (Databricks). A minimal DLT sketch follows after this list.
  • What’s New in Databricks Workflows: Databricks Workflows provides unified orchestration for the Lakehouse. Since it was first announced last year, thousands of organizations have been leveraging Workflows for orchestrating lakehouse workloads such as ETL, BI dashboard refresh and ML model training — by Muhammad Bilal Aslam (Databricks).
  • Deep Dive into the Latest Lakehouse AI Capabilities: By breaking down the silos between the data stack, ML stack and DevOps stack, Databricks offers a simplified, faster, and better-governed way to do ML, including integrated feature engineering and governance tooling, end-to-end tracking and lineage of models and data, automatic monitoring, and root cause analysis — by Ankit Mathur & Nicolas Pelaez (Databricks).
  • Building Apps on the Lakehouse with Databricks SQL: We’ve heard from customers that they experience an increasing demand to provide access to data in their lakehouse platforms from external applications beyond BI, such as e-commerce platforms, CRM systems, SaaS applications, or custom data applications developed in-house. These applications require an “always on” experience, which makes Databricks SQL Serverless a great fit — by Adriana Ispas & Chris Stevens (Databricks).
  • Databricks Marketplace: Going Beyond Data and Applications: The Databricks Marketplace is the ultimate solution for your data, AI and analytics needs, powered by open source Delta Sharing. Databricks is revolutionizing the data marketplace space — by Darshana Sivakumar & Mengxi Chen (Databricks).
  • Simplifying Lakehouse Observability: Databricks Key Design Goals and Strategies: Databricks’ vision for simplifying lakehouse observability, a critical component of any successful data, analytics, and machine learning initiative. By directly integrating observability solutions within the lakehouse, Databricks aims to provide users with the tools and insights needed to run a successful business on top of the lakehouse — by Michael Milirud (Databricks).
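To give a flavour of the Delta Live Tables session mentioned above, here is a minimal DLT pipeline sketch. It only runs inside a Databricks Delta Live Tables pipeline, and the landing path, table names and expectation are hypothetical:

```python
# Runs only inside a Databricks Delta Live Tables pipeline.
# The landing path, table names and expectation below are hypothetical.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested incrementally from cloud storage with Auto Loader.")
def orders_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/sales/landing/orders")
    )

@dlt.table(comment="Cleaned orders with a basic data quality expectation.")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def orders_clean():
    return dlt.read_stream("orders_raw").withColumn("ingested_at", F.current_timestamp())
```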

Lakehouse Platform, Unity Catalog:

The Lakehouse architecture is a data management approach that combines data lakes and data warehouses into a unified system. It addresses the limitations of each, providing better data governance, quality, and analytics performance. With the exponential growth of data, adopting the Lakehouse pattern has become essential for efficient data management and analytics implementation.

https://people.eecs.berkeley.edu/~matei/papers/2021/cidr_lakehouse.pdf

Spark:

https://www.databricks.com/blog/2023/04/14/introducing-apache-sparktm-34-databricks-runtime-130.html

Here are just a few of the numerous Spark-related sessions:

  • Deep Dive into the New Features of Apache Spark™ 3.4: With tremendous contribution from the open source community, Spark 3.4 managed to resolve in excess of 2,400 Jira tickets. The major updates are Spark Connect, numerous PySpark and SQL language features, engine performance enhancements, as well as operational improvements in Spark UX and error handling — by Xiao Li & Daniel Tenedorio (Databricks).
  • Use Apache Spark™ from Anywhere: Remote Connectivity with Spark Connect: Today, every application, from web services that run in application servers, interactive environments such as notebooks and IDEs, to phones and edge devices such as smart home devices, want to leverage the power of data. However, Spark’s driver architecture is monolithic, running client applications on top of a scheduler, optimizer and analyzer. This architecture makes it hard to address these new requirements as there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL — by Martin Grund & Stefania Leone (Databricks).
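A minimal sketch of that remote connectivity from PySpark 3.4: the sc:// endpoint is hypothetical, and it assumes a Spark Connect server is already running there.

```python
# Requires PySpark 3.4+ with the Spark Connect client dependencies
# (pip install "pyspark[connect]"); the sc:// endpoint below is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# The thin client only builds the query plan; execution happens on the remote cluster.
df = spark.range(10).selectExpr("id", "id * 2 AS doubled")
df.show()
```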

Delta lake:

Delta Lake is an open-source storage layer that brings ACID transactions, on top of the Parquet format, to Apache Spark™ and big data workloads.

Delta Lake uses versioned Parquet files to store your data in your cloud storage. Alongside the versions, Delta Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory, which is what provides ACID transactions.

Open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads — https://delta.io/
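A minimal PySpark sketch of those mechanics, with a hypothetical storage path: each write creates a new table version recorded in the _delta_log, which is what makes time travel possible. It assumes Delta Lake is installed and enabled on the SparkSession (for example via the delta-spark package).

```python
# Assumes Delta Lake is configured on the SparkSession (e.g. via the delta-spark package).
# The storage path is hypothetical.
path = "/tmp/demo/events"

spark.range(5).write.format("delta").mode("overwrite").save(path)   # version 0
spark.range(5, 10).write.format("delta").mode("append").save(path)  # version 1

# Time travel: read an earlier version recorded in the _delta_log transaction log.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
latest = spark.read.format("delta").load(path)
print(v0.count(), latest.count())  # 5, then 10
```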

A lot of sessions during the Summit were about Delta; here are just some of them that you should watch:

  • Delta Kernel: Simplifying Building Connectors for Delta: Based on lessons learned from this past year, we will introduce Project Aqueduct and how we will simplify building Delta Lake APIs from Rust and Go to Trino, Flink, and PySpark — by Denny Lee & Tathagata Das (Databricks).
  • Introducing Universal Format: Iceberg and Hudi Support in Delta Lake: In this session, we will talk about how Delta Lake plans to integrate with Iceberg and Hudi. With Universal Format (“UniForm”), Delta removes the need to make this compromise and makes Delta tables compatible with Iceberg and Hudi query engines. We will do a technical deep dive of the technology, demo it, and discuss the roadmap — by Himanshu Raja & Ryan Johnson (Databricks).
  • The Hitchhiker’s Guide to Delta Lake Streaming: How to take full advantage of Delta Lake streaming. You will be guided through Delta Lake streaming best practices, and learn to navigate the ins and outs, ups and downs, that are common to working with streaming data — by Scott Haines (Nike).

For more details, don’t hesitate to check the following link:

  • Delta Lake: The Definitive Guide: Authors Denny Lee, Prashanth Babu, Tristen Wentling, and Scott Haines explain how to harness the power of Delta Lake to increase your data productivity at scale.
  • Hitchhiker’s Guide: a collection (hopefully growing as time goes on) of tips and tricks to smooth your experience building and maintaining Streaming Delta Lake applications.

Delta Sharing:

Delta Sharing is the industry’s first open protocol for secure data sharing, making it simple to share data with other organizations regardless of which computing platforms they use.

https://delta.io/sharing/
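On the consumer side, shared tables can be read with the open-source delta-sharing Python client. A minimal sketch, assuming the provider sent you a config.share profile file; the share, schema and table names are hypothetical.

```python
# Requires: pip install delta-sharing
# The profile file and the share/schema/table coordinates below are hypothetical.
import delta_sharing

profile = "config.share"  # credentials file provided by the data provider
table_url = f"{profile}#retail_share.sales.orders"

# Load the shared table into pandas (a Spark reader, load_as_spark, is also available).
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```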

Recommended sessions are as follows:

  • Data Sharing and Beyond with Delta Sharing: Delta Sharing was the world’s first open protocol for secure and scalable real-time data sharing. Through our customer conversations, there is a lot of anticipation of how Delta Sharing can be extended to non-tabular assets, such as machine learning experiments and models. — by Milos Colic & Vuong Nguyen (Databricks).
  • Sponsored by: KPMG | Multicloud Enterprise Delta Sharing and Governance using Unity Catalog at S&P Global: Many enterprise organisations face challenges in adopting these technologies effectively, as comprehensive cloud data governance strategies and solutions are complex and evolving — particularly in hybrid or multicloud scenarios involving multiple third parties. KPMG and S&P Global have harnessed the power of Databricks Lakehouse to create a novel approach — by Dennis Tally & Niels Hanson (KPMG).

Integration with other environments

The Data & AI Summit is also a great opportunity to discover use cases and interactions with other environments and tools.

Here are some useful architecture and technology implementations; don’t hesitate to go deeper into the following projects:

  • LakeFS: an open source data version control for data lakes. It enables zero copy Dev / Test isolated environments, continuous quality validation, atomic rollback on bad data, reproducibility, and more.
  • openLineage: an open platform for collection and analysis of data lineage. It tracks metadata about datasets, jobs, and runs, giving users the information required to identify the root cause of complex issues and understand the impact of changes.
  • Project Nessie: Nessie is to Data Lakes what Git is to source code repositories. Therefore, Nessie uses many terms from both Git and data lakes.
  • Dremio: a data lakehouse management service that enables data teams to manage data-as-code with Git-like operations, optimizes tables automatically, and provides a data catalog.
  • DuckDB: an in-process SQL OLAP database management system.
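As a quick illustration of the last item, DuckDB can query lake files directly from Python with plain SQL. A minimal sketch with a hypothetical Parquet path:

```python
# Requires: pip install duckdb. The Parquet path below is hypothetical.
import duckdb

duckdb.sql("""
    SELECT country, SUM(amount) AS total
    FROM read_parquet('/tmp/demo/orders/*.parquet')
    GROUP BY country
    ORDER BY total DESC
""").show()
```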

openLineage meetup:

The summit also presents an excellent opportunity to attend side events. I had the delightful experience of participating in the openLineage meetup for the first time:

https://www.meetup.com/fr-FR/meetup-group-bnfqymxe/events/293448130/

All videos are already available, and the slides will be released soon.

If you missed the opportunity to attend the summit, fret not, as a part of it is coming to you — Join the World Tour: https://www.databricks.com/dataaisummit/worldtour


If you want to learn more about Spark, Delta or MLflow, I invite you to check the documentation, the Databricks blog or the Databricks Academy for deep-dive training.

Season 5 has already started: don’t hesitate to follow the Data Brew podcast.

While attending the summit in San Francisco, I had the chance to make the most of my trip by embarking on a train journey across the United States from the West Coast to the East Coast. I took the California Zephyr, passing through Chicago and Washington, D.C., before reaching my final destination in New York.

See you next year, Fog City!
