Delta Paradise for Data Engineers

Prakash Satyanaga
3 min read · Feb 11, 2022


Being a consultant, I had to embrace new platforms as per the customers’ requirements and the level of expertise they have in their environment. So I was FORCED to use Databricks. Learning Databricks was one big learning experience for me, and that is how I started using it.

Was it tough to learn? Yes, it was, and it was not easy (for me) to digest Databricks as I come from a different processing background.

Whenever you feel uncomfortable, that usually means you are going to enjoy learning it. Initially, all I thought of Databricks was just another Spark cluster with Delta as the star in the middle, but my assumptions turned out to be wrong and immature given additional features like Autoloader, Delta Streaming, Table ACL, Lakehouse, and now Unity Catalog. (There is still more on offer, but I am leaving that out for the moment.)

With all these awesome features, a data engineer’s life becomes easier, with one more tool to learn and enjoy.

My blog is not a deeply technical one; rather, it’s about the features offered by Databricks in the data engineering space and how much I enjoyed using them.

1) Autoloader:

What is it? As Databricks defines it, Autoloader is an incremental way to process new files as they arrive in cloud object storage.

What’s so special about it? It offers a cloudFiles source that works in a streaming fashion, so whenever there are new or modified files, you will get them. It can reduce your listing calls on object storage, and it can also read already existing files. It can identify the schema by inferring it based on certain conditions.

Is that it? It can also handle schema changes from upstream and provide you a cleaner and more efficient way to load the data to the next stage. Isn’t that cool?
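To make that concrete, here is a minimal Autoloader sketch in PySpark. It assumes a Databricks runtime where the cloudFiles source is available; the source path, schema and checkpoint locations, and the bronze_events table name are placeholders I made up for illustration.

```python
from pyspark.sql import SparkSession

# On Databricks a `spark` session already exists; this keeps the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# Incrementally pick up new files from cloud object storage as a stream.
raw_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                       # format of the landing files (assumed)
    .option("cloudFiles.schemaLocation", "/mnt/raw/_schema")   # where the inferred schema is tracked
    .load("/mnt/raw/events")                                   # placeholder landing path
)

# Write into a Delta table; the checkpoint makes the load resumable,
# and mergeSchema lets upstream schema changes flow into the target.
(
    raw_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/raw/_checkpoint")
    .option("mergeSchema", "true")
    .toTable("bronze_events")
)
```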

2) Delta Streaming:

With Delta Lake in play and people following the Medallion architecture to move their data from the raw layer to the presentation layer, we don’t need much work to move data from one table to another, with all changes being pushed seamlessly.

Delta Lake can stream data idempotently, pushing it forward with checkpoints; this has been a big game changer in moving data into the intermediate and presentation layers.
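As a rough sketch of that table-to-table movement, the snippet below streams from an assumed bronze Delta table into a silver one with a checkpoint. The table names, the event_id/event_ts columns, and the checkpoint path are all illustrative, not from any real pipeline.

```python
# Read an existing Delta table as a stream (bronze layer, name assumed).
bronze = spark.readStream.format("delta").table("bronze_events")

# A light clean-up on the way from bronze to silver (columns assumed).
silver = bronze.dropDuplicates(["event_id"]).filter("event_ts IS NOT NULL")

# Append into the silver Delta table; the checkpoint keeps the write idempotent and resumable.
(
    silver.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/silver/_checkpoint")
    .outputMode("append")
    .toTable("silver_events")
)
```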

3) Table ACL:

Table ACL is primarily used for fine-grained access to your tables and follows general SQL-based permissions. It also allows some of the PySpark DataFrame API, with limitations, and can be integrated with your data platform so that analytics engineers need not worry about data access. It does have its own limits, but it will eventually mature with Unity Catalog.
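A small, hedged example of what those SQL-style permissions look like when issued from a notebook; the table name and the analytics_engineers group are made up for illustration.

```python
# Grant read access on a table to a workspace group (names are placeholders).
spark.sql("GRANT SELECT ON TABLE silver_events TO `analytics_engineers`")

# Explicitly block the same group from a more sensitive table.
spark.sql("DENY SELECT ON TABLE customer_pii TO `analytics_engineers`")
```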

4) Lakehouse:

The whole concept of mixing a data lake and a data warehouse and getting the best of both in a single place is the way forward. It does involve a broader understanding of business outcomes and platform outcomes so that consumers can move at a quick pace to get what they want and deliver on their goals.

5) Unity Catalog:

Managing data lakes from a security and governance standpoint has been a big headache, with different data providers and cloud vendors making it harder to tackle each service on its own. Using Unity Catalog, we get to audit and manage data across the data lakes (the Lakehouse). Again, Delta is the star in this particular place, as we can leverage Delta Sharing and reap many benefits from it.
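As a hedged sketch of how that centralized governance looks, the snippet below creates a catalog and schema once and grants access through the metastore, so consumers can query with the three-level catalog.schema.table name. All catalog, schema, table, and group names here are placeholders I invented for the example.

```python
# Create governed containers in the metastore (names assumed).
spark.sql("CREATE CATALOG IF NOT EXISTS analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.sales")

# Grants live in Unity Catalog, not in an individual workspace.
spark.sql("GRANT USE CATALOG ON CATALOG analytics TO `data_consumers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA analytics.sales TO `data_consumers`")
spark.sql("GRANT SELECT ON SCHEMA analytics.sales TO `data_consumers`")

# Consumers then address data with the three-level namespace (table assumed to exist).
spark.sql("SELECT * FROM analytics.sales.orders LIMIT 10").show()
```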

Credits: All the images are from various Databricks blogs which I used during my initial journey with Databricks.
