The Modern Cloud Data Platform war — DataBricks (Part 1)

LAKSHMI VENKATESH
Data Arena
Jul 25, 2021

This article is part of the multi-part series Modern Cloud Data Platform War (parent article). Next part: Modern Cloud Data Platform War — DataBricks (Part 2) — Data Fluctuations.

Why do I cover DataBricks as the first platform in the Modern Cloud Data Platform series?

I genuinely think Delta Lake will be adopted by more and more organizations and that it is the future, especially with the Lakehouse architecture, which lets an organization build a unified platform for all of its Data, Big Data Analytics, and AI workloads.

For every organization, big or small, it is a good idea to start with a Data Lake first, even a baby Data Lake if needed, taking advantage of the near-infinite storage capacity offered by most cloud providers, especially AWS S3. This Data Lake can be the center for all your data needs. The Data Lake serves as a single version of the truth, where you can build “Bronze, Silver and Gold” data sources.

Image by the author

What are these Bronze, Silver, and Gold layers, and why are they needed?

From operational databases, upstream systems, streams, files, etc., the data first lands in Bronze, which holds the raw data. You then clean, harmonize, and normalize the data based on identifiers and move it to Silver. Finally, once the data is cleaned, processed, and ready for further use by business or IT, it goes into the Gold bucket.

Image by the author
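
To make the flow concrete, here is a minimal PySpark sketch of the Bronze, Silver, and Gold layers on Delta Lake. The bucket paths, column names, and cleaning rules are hypothetical and only illustrate the layering.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw data as-is (hypothetical landing path and format).
raw = spark.read.json("s3://company-x-landing/orders/")
raw.write.format("delta").mode("append").save("s3://company-x-lake/bronze/orders")

# Silver: clean and normalize on an identifier (hypothetical rules).
bronze = spark.read.format("delta").load("s3://company-x-lake/bronze/orders")
silver = (bronze
          .dropDuplicates(["order_id"])
          .filter(F.col("order_id").isNotNull())
          .withColumn("order_ts", F.to_timestamp("order_ts")))
silver.write.format("delta").mode("overwrite").save("s3://company-x-lake/silver/orders")

# Gold: business-level aggregates ready for BI and analytics.
gold = silver.groupBy("region").agg(F.sum("amount").alias("total_amount"))
gold.write.format("delta").mode("overwrite").save("s3://company-x-lake/gold/orders_by_region")
```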

Now, there should be a single version of the truth, and all users should be able to source information from these buckets as applicable. However, there is a challenge: how do you control how this data is shared across different users? This is how a Data Lake turns into a Data Swamp. It can be avoided in multiple ways; one of the ways DataBricks addresses it is the Unity Catalog, which will be discussed later in this article.

Challenge 1: Massive data input:

Data is sourced and ingested from hundreds of places, of which the three regions listed below load petabytes of data, as they are central partner locations.

Image by the author

Company X has many different data sources and ingests data from hundreds of places. How does Delta Lake handle data pipelining?

Company X is a strongly data-driven enterprise and gets data from multiple upstream systems as well as heavy online data. Its current data pipeline and ETL are convoluted and do not scale to its needs. The main pain point Company X wants to solve is efficient, quick processing of data at scale, mainly from the three sources it has today and more in the future.

Image by the author. Company X’s current design

Solution 1: DLT — Delta Live Tables:

Use Case: Data Ingest and ETL

DLT — Delta Live Tables by DataBricks makes it easy to build and manage reliable Data Pipelines. You can consider this as ETL (Extract, Transform, and Load) with declarative pipeline development, automatic data testing, and deep visibility for monitoring and recovery.

Image by the author

As we saw earlier, the foundation of the Lakehouse architecture is having Bronze (raw data), Silver (filtered, cleaned, augmented data), and Gold (business-level aggregates). This is the simplest form. But in reality, as producers and consumers multiply, and if we do not adopt modern features such as the Unity Catalog, we may end up with multiple Bronze, Silver, and Gold buckets. This makes it difficult to maintain a reliable version of the data, and the Data Lake will soon end up as a Data Swamp. To preserve the single version of the truth and the reliability of the data, Databricks announced “Delta Live Tables”, reliable ETL made easy on Delta Lake.

What is a Delta Live Table?

A Delta Live Table, as the name suggests, serves live data as and when changes happen to the underlying dataset. It is as simple as running “CREATE LIVE TABLE” on the source.

Image by the author

How to build a Data Pipeline using Delta Live Tables?

You use “CREATE LIVE TABLE” and provide the data source, the transformation logic, and the destination state of the data. This solves the problem of moving data from multiple upstream sources via SFTP, streaming and queues, external tables, database links, and so on; instead, DLT queries the source data directly and writes it to the destination. It also lets you reuse the same ETL pipeline several times and avoids complicated stitching of siloed data processing workloads. To incrementally load each of these live tables, we can run batch or streaming jobs.
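
As a rough sketch, the same idea can be expressed with the DLT Python API instead of the SQL “CREATE LIVE TABLE” form. The table names and source path below are hypothetical; the point is that each live table declares its source and transformation, and DLT works out the dependencies.

```python
import dlt
from pyspark.sql import functions as F

# `spark` is provided by the DLT runtime inside a Databricks pipeline.

# Bronze live table: read the raw source directly (hypothetical landing path).
@dlt.table(comment="Raw orders ingested from the landing zone")
def orders_bronze():
    return spark.read.json("s3://company-x-landing/orders/")

# Silver live table: DLT infers the dependency on orders_bronze and keeps
# this table up to date whenever the pipeline runs.
@dlt.table(comment="Cleaned and normalized orders")
def orders_silver():
    return (dlt.read("orders_bronze")
            .dropDuplicates(["order_id"])
            .withColumn("order_ts", F.to_timestamp("order_ts")))
```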

Building the Bronze, Silver, and Gold layers of the Data Lake can follow the Delta Live Tables approach. But how do we ensure that the data going into these tables is tested and of good quality? Automatic testing is the answer: it ensures that bad data is filtered out and that only useful, accurate data lands in each of these buckets, to be used further by different types of users for query, BI, data science, and machine learning purposes. This can be the foundation stone for building the Lakehouse architecture. DLT also provides the ability to monitor and understand the quality of the data and enables recovery.

Source from Data Bricks
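
Delta Live Tables expresses these automatic tests as “expectations” attached to a table. A minimal sketch, with a hypothetical constraint that drops bad rows before they reach the Silver layer:

```python
import dlt

# Rows failing this (hypothetical) quality rule are dropped, and the violation
# counts surface in the pipeline's monitoring view for visibility and recovery.
@dlt.expect_or_drop("valid_order", "order_id IS NOT NULL AND amount >= 0")
@dlt.table(comment="Orders that passed the data quality checks")
def orders_silver_checked():
    return dlt.read("orders_bronze")
```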

Key Features:

1. Delta Live Tables makes it easy to build and maintain your organization’s data pipelines

2. Live tables understand your dependencies

3. Supports automatic testing

4. Performs automatic monitoring and recovery

5. Enables automatic, environment-independent data management

a. Different copies of data can be isolated and updated using the same code base.

6. Treat your data as code

a. Enables automatic testing

b. Single source of truth for more than just the transformation logic

7. Provides live updates.

How does it work:

Delta Live Tables enables you to create live tables directly on the underlying file or table, i.e., the source data. If there are any modifications to the underlying data source, they are reflected in the live tables.

What problem does it solve:

Even though going from query to production is a simple and easy job with Delta Sharing and the Unity Catalog (discussed later), sharing data and gathering analytics on terabytes or exabytes of data that must reflect live updates whenever the underlying data changes (something like materialized views) is not an easy task; it comes with many operational challenges and leads to performance bottlenecks.

Source: Databricks Data + AI Summit 2021

Key challenges it solves:

1. Enables data teams to innovate rapidly

2. Ensures useful, accurate analytics and BI with top-notch data quality

3. Ensures a single version of the truth

4. Adapts to organizational growth and the addition of new data.


Solution 2: Spark scalable pipelines:

Similar to starting with a baby Data Lake, starting with nimble, baby Spark processes and growing them as the data grows is a good approach for processing both batch and streaming data.

Image by the author
  • Simple architecture
  • Robust data pipelines
  • Reduced compute times
  • Elastic cloud resources with auto-scale up and down based on workloads
  • Modern data platform and Engineering

Key Features:

  1. Ability to keep scaling as your data grows. Spark is the de facto standard for processing speed and gives you the ability to use the Spark ecosystem for Big Data, Machine Learning, and everything else Spark supports.
  2. Improves data readability and use with the Data Lake.
  3. Can be used for both batch and streaming (see the sketch after this list).
  4. Continuous refinement: the same data flows and workloads work from small data to massively huge data.
  5. Use of fully managed Spark clusters.
  6. Full support for CI/CD using GitHub, Jenkins, etc.
  7. Use of data by business users, Data Analysts, Data Engineers, and Data Scientists.
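
A minimal sketch of such a pipeline, assuming Databricks Auto Loader for incremental file ingestion; the paths, schema location, and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Incrementally pick up new files as they arrive (Databricks Auto Loader).
orders_stream = (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .option("cloudFiles.schemaLocation", "s3://company-x-lake/_schemas/orders")
                 .load("s3://company-x-landing/orders/"))

# The same transformation logic applies to batch and streaming DataFrames.
cleaned = orders_stream.filter(F.col("order_id").isNotNull())

# Write continuously to the Bronze Delta table; an auto-scaling cluster can
# grow and shrink with the incoming data volume.
(cleaned.writeStream
 .format("delta")
 .option("checkpointLocation", "s3://company-x-lake/_checkpoints/orders_bronze")
 .outputMode("append")
 .start("s3://company-x-lake/bronze/orders"))
```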

Spark and ACID Transactions:

How about having ACID transactions on top of Spark? Spark addresses the problems of scale, performance, data pipelines, and so on, and ACID solves the problem of consistency. That is the promise of Databricks. Along with ACID consistency, the Lakehouse architecture also supports partitioning, indexing, schema validation, and handling large volumes of metadata.

Source: Data Bricks
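
As a small sketch of what ACID on a Delta table buys you, the MERGE below applies a change set as a single atomic transaction: concurrent readers see either the old or the new version of the table, never a partial write. The table path, join key, and sample rows are hypothetical.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical incoming change set.
updates = spark.createDataFrame(
    [(1, "EMEA", 250.0), (2, "APAC", 90.0)],
    ["order_id", "region", "amount"])

target = DeltaTable.forPath(spark, "s3://company-x-lake/silver/orders")

# MERGE commits as one ACID transaction: it either applies fully or not at all.
(target.alias("t")
 .merge(updates.alias("u"), "t.order_id = u.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```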

The single most important transition that has come with the advent of the Data Lake is that organizations now create a massive staging area and operate out of the data lake. This essential shift has added one more layer on top of it, the Lakehouse architecture, which enables an open format and common storage of data at varying quality and processing points (Bronze, Silver, and Gold) and provides a clean ability to build BI, streaming analytics, data science, or ML solutions on top of it. For the next level of processing, take the data from whichever layer is suitable, be it Bronze, Silver, or Gold. The thought process of reusing the streaming pipeline to build applications has shifted to using the Bronze, Silver, and Gold data sources.

Summary:

DataBricks is trying to provide a simple answer for all of your organization's data needs. To handle massive data input, either (1) Delta Live Tables or (2) Spark scalable pipelines can be used to address the challenge.

LAKSHMI VENKATESH
Data Arena

I learn by writing: Data, AI, Cloud, and Technology. All the views expressed here are my own and do not represent the views of the firm I work for.