Delta Live Tables: Simplify the ETL Process

Riya Khandelwal
6 min read · Apr 13, 2023


Databricks Delta Live Tables is a key solution for building and managing reliable, robust data engineering pipelines that can load both streaming and batch data and deliver high-quality data on the Lakehouse Platform.

DLT not only helps engineers simplify the ETL development process by building maintenance-free pipelines, but also provides reliable automatic data testing, with deep visibility for monitoring and recovering data.

In this article:

  • Medallion Lakehouse Architecture
  • Complexity in Data Delivery
  • What is Delta Live Tables
  • Why DLT stands out as a differentiator

Medallion Lakehouse Architecture:

In the Medallion Lakehouse Architecture, source data passes through multiple layers before it is processed and made available to downstream teams (a minimal PySpark sketch of this flow follows the diagram below). The three layers are:

  • Bronze Layer: This layer stores the data in its original format (raw data) as received from multiple sources. Organizations use it for audit purposes and to trace records back to their source.
  • Silver Layer: The silver layer cleans and filters the bronze data. It handles missing values, performs data type conversions where required, and flattens nested objects into simple structures for easier querying. Columns are also renamed here to match business needs.
  • Gold Layer: This layer contains transformed, aggregated, and modeled data that is ready for SQL queries. It holds business-specific models and aggregates built on dimensions and facts.
Lakehouse General Architecture
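
To make the layering concrete, here is a minimal sketch of the bronze/silver/gold flow using plain PySpark and Delta Lake. The source path, table names, and columns (orders, order_id, amt, customer_id) are hypothetical placeholders, not taken from any real pipeline.

```python
# A minimal sketch of the medallion flow with plain PySpark + Delta Lake.
# The source path, table names, and columns are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw data as-is so we can always trace back to the source.
raw = spark.read.json("/mnt/raw/orders/")  # hypothetical landing path
raw.write.format("delta").mode("append").saveAsTable("bronze_orders")

# Silver: clean, handle missing values, cast types, and rename for business use.
silver = (
    spark.table("bronze_orders")
    .dropna(subset=["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumnRenamed("amt", "order_amount")
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")

# Gold: business-specific aggregate ready for SQL/BI consumption.
gold = (
    spark.table("silver_orders")
    .groupBy("customer_id")
    .agg(F.sum("order_amount").alias("total_spend"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold_customer_spend")
```

With plain Spark jobs like this, the wiring, scheduling, and error handling between the steps is entirely on us; that is the gap DLT targets.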

To know more about the Databricks Lakehouse, check out my previous blog Lakehouse — Integration of Data Warehouses and Data Lakes.

Complexity in Data Delivery:

Data is critical to any business outcome, and organizations rely on data analytics to predict those outcomes. Whether we work as data engineers, data scientists, or data analysts, we all interact with data in one way or another.

Data scientists build their predictions on the data supplied by data engineers. But data engineers have to go through a fairly complex delivery process to make that data available to consumers. They have to weigh multiple factors, including handling data from multiple sources and working on data transformation and standardization, and they often juggle multiple tools to achieve this.

Complexity of Data Delivery

The following problems can arise while maintaining data quality and consistency in large-scale data pipelines:

  • It is difficult to maintain table dependencies and to switch between streaming and batch loads.
  • Error handling and recovery are time-consuming.
  • A pipeline failure may impact downstream systems and the teams relying on them.
  • Data engineers have to focus on tooling instead of development because operational complexity dominates.

What is Delta Live Tables (DLT)?

Delta Live Tables, or DLT, is one of the best ways to run ETL in the Lakehouse. It creates, maintains, and tests data flow pipelines using a declarative approach, which helps data engineers focus on getting value out of the data rather than on the tools.

Delta Live Tables performs the data transformations and also manages task orchestration, cluster management, monitoring, data quality, and error handling.

DLT Performing Data Engineering Activities
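
As a hedged illustration of the declarative approach, the sketch below rewrites the earlier medallion flow with the DLT Python API. The table names, source path, and columns are assumptions carried over from the previous sketch; the point is that we only declare what each table should contain, and DLT infers the dependency graph, orchestration, and retries from the dlt.read / dlt.read_stream references.

```python
# A minimal declarative sketch using the Delta Live Tables Python API.
# Table names, the source path, and columns are hypothetical; in a DLT
# pipeline `spark` is provided and the DAG is inferred from dlt.read*.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested incrementally with Auto Loader.")
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/orders/")  # hypothetical landing path
    )

@dlt.table(comment="Cleaned, typed, and renamed orders.")
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .dropna(subset=["order_id"])
        .withColumn("order_ts", F.to_timestamp("order_ts"))
        .withColumnRenamed("amt", "order_amount")
    )

@dlt.table(comment="Business-level aggregate for SQL and BI consumption.")
def gold_customer_spend():
    return (
        dlt.read("silver_orders")
        .groupBy("customer_id")
        .agg(F.sum("order_amount").alias("total_spend"))
    )
```

There is no job wiring here: because silver_orders reads bronze_orders, DLT schedules and retries the steps in the right order on its own.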

Key benefits of working with Delta Live tables are:

  • Accelerates the ETL process: Declare your transformations in SQL or Python, and DLT automatically orchestrates the DAG, handles retries, and processes changing data.
  • Automatically manages your infrastructure: DLT takes care of complex, tedious activities like recovery, auto-scaling, and performance optimization for the workflow.
  • Ensures high data quality: Deliver reliable data with built-in quality controls, testing, monitoring, and enforcement (see the expectations sketch after this list).
  • Unifies batch and streaming: Get the simplicity of SQL with the freshness of streaming in one unified API.
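
For example, quality rules are declared as "expectations" directly on a table definition. This is a hedged sketch: the rule names, columns, and the silver_orders table are assumptions carried over from the earlier example, not from this article.

```python
# A hedged sketch of DLT expectations on a hypothetical table.
import dlt

@dlt.table(comment="Orders that passed basic quality checks.")
@dlt.expect("non_negative_amount", "order_amount >= 0")        # record violations, keep rows
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop violating rows
@dlt.expect_or_fail("known_currency", "currency IN ('USD', 'EUR', 'INR')")  # fail the update
def silver_orders_validated():
    return dlt.read("silver_orders")
```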

Why DLT stands out as a differentiator

With Delta Live Tables we can easily define end-to-end data pipelines in SQL or Python. We declare the data source, the transformation logic, and the destination state of the data, instead of manually stitching together siloed data processing jobs.

DLT also automatically maintains all data dependencies across the pipeline and lets us reuse ETL pipelines. Let's discuss why DLT stands out as a differentiator in the data engineering world:

  • Effective continuous or scheduled data ingestion: DLT's Auto Loader feature incrementally and efficiently loads data into Delta Lake as files arrive in cloud storage. It also infers the schema of incoming files and evolves it dynamically as the input changes.
  • Simple declarative SQL & Python APIs: DLT lets data engineers focus on data transformation and business logic rather than on error handling and data recovery. It automatically generates lineage based on table dependencies across the data pipeline, and it checks for errors, missing dependencies, and syntax issues.
  • Easy data quality validation and monitoring: Delta Live Tables defines data quality and integrity controls within the pipeline through data expectations. It handles data quality errors with flexible policies such as fail, drop, alert, or quarantine (future). All pipeline runs and quality metrics are captured, tracked, and reported, which helps with ground-level monitoring and data validation.
  • Data pipeline observability: DLT generates a high-quality, high-fidelity lineage diagram that provides visibility into how data flows for impact analysis. It also provides logging for operational, governance, quality, and status information about the pipeline down to the row level, which helps continuously monitor the data pipelines and generate notifications using Databricks SQL.
  • Automated ETL operations: DLT reduces downtime with automatic error handling and easy replay capability. Its auto-scaling adds more resources automatically when needed, and automatic optimization of all Delta Live Tables eliminates the need for manual maintenance.
  • Workflow orchestration: Most importantly, DLT simplifies the orchestration and management of data pipelines, including workloads in multi-cloud environments. We can run a Delta Live Tables pipeline as part of a data processing workflow with Databricks Jobs, Apache Airflow, or Azure Data Factory, as the requirements dictate (a hedged Airflow sketch follows this list).
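
As one hedged example of that orchestration, an existing DLT pipeline can be triggered from Apache Airflow using the Databricks provider's submit-run operator with a pipeline task. The connection id, pipeline id, and schedule below are placeholders.

```python
# A hedged sketch: triggering an existing DLT pipeline from Apache Airflow 2.x.
# Assumes the apache-airflow-providers-databricks package and a configured
# Databricks connection; the pipeline id and schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="run_dlt_orders_pipeline",
    start_date=datetime(2023, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_dlt = DatabricksSubmitRunOperator(
        task_id="trigger_dlt_pipeline",
        databricks_conn_id="databricks_default",
        json={"pipeline_task": {"pipeline_id": "<your-pipeline-id>"}},
    )
```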

The best part is that this is a fully integrated solution in the Databricks platform, which makes inspecting results and debugging faster :)

Summary:

This brings us to the end of the article, where we discussed the traditional Medallion Lakehouse Architecture followed by most organizations. We also pointed out the problems a data engineer currently faces in making data available to consumers, looked at what Delta Live Tables is, and saw how working with Delta Live Tables can overcome today's data delivery complexity.

Follow my Medium handle for more articles on hot topics in cloud offerings.

About Me:

I work as a Cloud Data Engineer at IBM and love to write about Snowflake and other cloud offerings. I have been working on Microsoft Azure Cloud as part of my job and also have knowledge of IBM Cloud and Amazon Web Services. I am an Azure Certified Data Engineer, an AWS Certified Developer, and recently completed the SnowPro Core Certification.

Follow me on Medium to catch new articles on various cloud offerings. Feel free to connect with me on LinkedIn!

