Summarizing a lecture from the Data+AI World Tour by Databricks: Delta Live Tables A to Z: Best Practices for Modern Data Pipelines

Varsha Hindupur
6 min read · Feb 7, 2024


Data AI Summit 2024

The Data+AI World Tour is part of the Data + AI Summit series conducted by Databricks. As Databricks is one of the most widely used products in the industry, it is quite interesting to see what it offers for making real-time data available. This post is aimed mostly at data engineers, but if you’re interested in the data world, you should definitely read on.

This lecture was delivered by Dillon Bostwick, a Lead Solution Architect at Databricks.

Let’s begin! For starters, he shared why the world needs real-time analysis…

A noteworthy session was presented by Databricks during the #dataaiworldtour on January 18, 2024, featuring insightful points from Dillon Bostwick. I became intrigued by the impact of real-time and near real-time processing, particularly when Bostwick highlighted TikTok’s use of real-time data, with just a one-second latency, to enhance model training and refine its recommendation system by feeding the newly acquired analysis back into production. This distinguishes ByteDance in the industry. The development is remarkably impressive and emblematic of the current evolution of the industry. It is also interesting how, after COVID, they started thinking seriously about real-time scoring and training models on the newest data.

Use Cases Where Real-Time Is Needed

Consider another scenario where seconds and even milliseconds are critical, such as in intrusion detection and fraud detection, where IoT sensors continuously transmit data, demanding prompt capture and analysis. Yet, in high-frequency trading, even nanoseconds and microseconds are of paramount importance. In this context, the length of cables between hedge fund servers and Stock Exchange servers is meticulously regulated. This is because a mere difference of a few feet in cable length could provide a hedge fund with a nanosecond advantage, enabling them to gain insight into the market before others and potentially make millions. This underscores the significance of distinguishing between true real-time and near real-time, emphasizing the crucial implications of timing precision in various use cases.

Batch vs Streaming Data

Transitioning from capturing data on a weekly, daily, or even minute-by-minute basis to a second-by-second frequency poses challenges for data engineers. What’s crucial to recognize is the necessity for architecting solutions that facilitate this transition seamlessly, without requiring a lengthy timeframe of 3–5 years. What CEOs and leaders require today is the ability to process data at the right time, tailored to specific business needs.

Delta Live Tables (DLT) offers a solution by providing an abstraction layer between the data engineering processes and the business logic being addressed. With DLT, querying the production pipeline becomes as simple as adding the keyword “LIVE,” streamlining the process of accessing data with a granularity ranging from weeks to minutes through a single click. This frees up time for users to concentrate on developing business logic and implementing specific business rules.

Introduction to DLT
Definition of Live Tables

Live Tables serve as materialized views within the lakehouse architecture. Delta Live Tables (DLT) provide a declarative representation of data, requiring users to specify the desired data structure purely at the business level. The engine then manages tasks such as determining refresh intervals, incremental computations, cluster requirements, and handling dependencies autonomously. Users simply declare their requirements, and the engine handles the implementation details.

In this context, tables represent persistent data, while views offer abstract representations without persistence. When querying a view, the engine dynamically manages persistence and delivers the required data.

Knowing when to use Delta Live Tables (DLT) is akin to understanding when to use materialized views. Syntactically, the traditional “CREATE OR REPLACE” statement gives the engine no way to understand what the user actually wants; the declarative LIVE syntax below does.

-- Previously
CREATE OR REPLACE TABLE report AS SELECT sum(profit) FROM prod.sales
-- DLT (materialized view)
CREATE LIVE TABLE report AS SELECT sum(profit) FROM prod.sales
-- Streaming DLT
CREATE STREAMING LIVE TABLE report AS SELECT sum(profit) FROM cloud_files(prod.sales)


The “Live” keyword in this context refers to utilizing Delta Live Tables (DLT) as a service at the backend.

A Streaming Live Table (SLT) ensures that data is processed in order and exactly once, a challenging task in distributed systems given the volume of data involved. Streaming typically implies dealing with unbounded or infinite data streams. An SLT guarantees that data is neither duplicated nor skipped and that proper ordering is maintained. Its incremental nature allows lower latencies to be achieved with the click of a button.

Despite its computational complexity, an SLT can be cost-effective: because it computes results incrementally, even a pipeline run only once a day or week can be cheaper than fully recomputing non-streaming tables. An SLT computes results over append-only streams such as Kafka, Kinesis, or Auto Loader (files on cloud storage). Importantly, streaming tables can be transformed into Live tables, offering flexibility in data processing.
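
To make this concrete, here is a minimal Python sketch of a streaming table fed by Auto Loader (the cloud_files source). The table name, landing path, and file format are illustrative assumptions, not details from the talk; in a DLT pipeline notebook the spark session is provided by the runtime.

import dlt

# Minimal sketch of a Streaming Live Table fed by Auto Loader (cloud_files).
# The table name, landing path, and file format are placeholders for illustration.
@dlt.table(name="raw_sales", comment="Sales events ingested incrementally from cloud storage")
def raw_sales():
    return (
        spark.readStream.format("cloudFiles")      # Auto Loader: incremental file ingestion
        .option("cloudFiles.format", "json")       # assumed file format
        .load("/mnt/landing/sales")                # hypothetical landing path
    )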

What more could be done? How will this improve my ETL pipeline?

DLT SQL Pipeline

Regular SQL queries often fall short in providing comprehensive visibility into data flows between tables, discovering metadata, assessing the quality of each table, accessing historical updates, and controlling operations. Delta Live Tables (DLT) offers a robust solution by enabling users to dive deep into events and efficiently debug issues.

DLT’s capabilities surpass those of traditional SQL queries, providing enhanced visibility into data flows and operations. It facilitates the discovery of bugs at precise locations within the data pipeline, a feature that SQL alone cannot achieve. Additionally, DLT has the potential to replace Airflow DAGs (Directed Acyclic Graphs) by offering superior visibility and control over data processing tasks. The engine underlying DLT comprehensively understands user actions, providing unparalleled visibility into data operations.
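
As a rough illustration of how DLT infers this graph on its own, the sketch below (reusing the hypothetical raw_sales table from earlier) declares a downstream table with a data-quality expectation. From declarations like these, DLT derives the DAG, lineage, and per-table quality metrics, rather than requiring a hand-written Airflow DAG; the column names here are assumptions.

import dlt

# Hypothetical downstream table; DLT sees dlt.read_stream("raw_sales") and infers
# the dependency, building the pipeline DAG and lineage automatically.
@dlt.table(name="clean_sales", comment="Sales rows that passed basic quality checks")
@dlt.expect_or_drop("positive_profit", "profit > 0")   # violating rows are dropped and counted in metrics
def clean_sales():
    # "order_id" and "profit" are placeholder column names for this sketch
    return dlt.read_stream("raw_sales").select("order_id", "profit")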

TPC-DI comparison: Spark and Delta Lake vs. Delta Live Tables

In a TPC-DI benchmark comparison between a baseline Databricks pipeline built with Spark and Delta Lake and the same workload on Delta Live Tables (DLT), DLT showcased its efficiency by effectively saturating CPU load across an equivalent number of workers. This was observed through the Ganglia UI, where darker boxes indicate heavier load, highlighting DLT’s ability to optimize resource utilization.

DLT’s efficiency stems from its ability to comprehensively visualize Directed Acyclic Graphs (DAGs) and pipeline objects, allowing for more intelligent cluster management. This efficiency translated into a 2x Total Cost of Ownership (TCO) improvement, enhanced performance, and reduced costs. Moreover, DLT offers built-in observability via event logs, automatically recording pipeline operations and enabling the creation of KPI dashboards. Additionally, DLT enhances data quality and integrates seamlessly with other logging software.
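
As one hedged example of what this can look like: for pipelines configured with a storage location, the event log is persisted as a Delta table under a system/events folder and can be queried like any other table. The storage path below is a placeholder, and the aggregation is just one way to feed a KPI dashboard.

# Sketch: reading a DLT pipeline's event log (the storage path is a placeholder).
events = spark.read.format("delta").load("/pipelines/my_pipeline/system/events")

# Summarize recorded pipeline operations by event type, e.g. as input to a KPI dashboard
events.groupBy("event_type").count().orderBy("count", ascending=False).show()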

One of the standout features of DLT is its support for dynamic pipeline creation by programmatically generating tables in Python, overcoming the limitations of traditional SQL. Furthermore, DLT allows stored-procedure-style logic to be implemented in Python, eliminating the need to learn additional languages and leveraging the familiarity and versatility of Python for pipeline development.
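
A rough sketch of what such programmatic generation can look like, assuming a handful of hypothetical source folders: the loop below emits one bronze table per source, something a static SQL script cannot express. All names, paths, and the file format are assumptions for illustration.

import dlt

# Hypothetical sources; names and paths are placeholders.
sources = {
    "orders": "/mnt/landing/orders",
    "customers": "/mnt/landing/customers",
}

def make_bronze_table(name, path):
    # Factory function so each generated table closes over its own name and path.
    @dlt.table(name=f"bronze_{name}", comment=f"Raw {name} data loaded via Auto Loader")
    def bronze():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")   # assumed file format
            .load(path)
        )

for name, path in sources.items():
    make_bronze_table(name, path)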

DLT’s Efficiency over regular ETL pipelines

Enzyme is a tool that automates and optimizes the incrementalization process in ETL (Extract, Transform, Load) pipelines. By analyzing various strategies and running a cost model, Enzyme can provide recommendations on how to efficiently create incremental ETL pipelines while minimizing costs. This automation allows users to streamline their ETL processes and make informed decisions on optimizing their data workflows for cost efficiency.

Introducing Enzyme

Hopefully, if you liked this, I can summarize more talks; this was a fantastic approach built by the Databricks team.

Thanks for reading.


Varsha Hindupur

Hi, I'm a data aficionado, & I'm delighted to share my cumulative learning experience. If you've found it valuable, kindly share with your friends. Thank you!