Data Engineering with Databricks: What, Why, How?

Capax Global
Hitachi Solutions Braintrust
4 min read · Aug 20, 2019


Everyone is talking about “data engineering” these days. What is it? And how is it different from what you’ve been doing as a database architect or developer? Read on and find out!

What is Data Engineering?

Most of us database architects and developers have been performing data engineering our entire careers. Only in the last few years has the term “data engineering” become the de facto way to describe moving large amounts of data from multiple types of sources, then loading and transforming it for analysis by business users and data scientists.

The term is also used to distinguish the work database developers do in the data pipeline from the data science work that happens after data engineering. All the tools now available for doing data science at scale have prompted companies to start big machine learning projects. However, many of these projects are failing because the data is not clean, conformed, or usable. Hence, the emphasis has shifted to data engineering: making the data more accurate and more usable so it can support the data science and reporting analyses companies need.

Simply put: data engineering is data wrangling.

ETL vs ELT

Having all the processing power available in the cloud, and a data lake to store all corporate data in one place, has changed the data movement pipeline we have used for years from extract-transform-load (ETL) to extract-load-transform (ELT). And the transform part has become much more complex. Data can now come from anywhere, in any format, and may require complex transforms that tools like SSIS or Informatica may not support. Set-based transforms are not always the best way to achieve this work.

In ETL, the pipeline engine does the heavy lifting. This is where SSIS excels, but you will eventually run into the resource limits of a single server. In ELT, the transform work is done in the cloud, where compute scales out and all the tools are available.
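To make the ELT pattern concrete, here is a toy sketch (not Databricks code) that lands raw, messy rows first and then transforms them with SQL inside the engine. Python's built-in sqlite3 stands in for the cloud warehouse; the table and column names are made up for illustration.

```python
import sqlite3

# Toy ELT sketch: sqlite3 stands in for a cloud engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")

# Extract + Load: land the data as-is, messy formatting and all.
raw_rows = [("east ", "100"), ("WEST", "250"), ("east", "50")]
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)

# Transform: cleaning and aggregation happen AFTER the load,
# using the engine's own compute rather than a pipeline server.
cleaned = conn.execute("""
    SELECT LOWER(TRIM(region)) AS region,
           SUM(CAST(amount AS INTEGER)) AS total
    FROM raw_sales
    GROUP BY LOWER(TRIM(region))
    ORDER BY region
""").fetchall()

print(cleaned)  # [('east', 150), ('west', 250)]
```

The point is the ordering: raw data is loaded untouched, and the cleanup runs where the data already lives, so the transform can scale with the engine instead of with one pipeline box.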

Time to look at a new toolset.

Your new favorite data engineering tool — Databricks

Databricks is a product created by the team that created Apache Spark. On the Microsoft Azure platform it hides all the complex work required to create clusters of multiple machines with distributed data and queries. It provides a unified processing platform for large amounts of data in a performant and scalable manner.

But the killer feature for data engineering is the support for multiple languages and data pipelines. You can use SQL, Python, or Scala all in the same process. It can also support streaming and graph data and comes with connectors for many different sources.

Because we now have multiple types of data, in all kinds of formats, we need a toolset that covers all those different needs. You use the appropriate language and its features for the task at hand. If you are manipulating relational data, you use SQL. If you need to do some JSON parsing or string manipulation, you might use Python, or Scala if you want object-oriented and functional support.
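As an illustration of the kind of per-record work that is awkward in set-based SQL but natural in Python, here is a small sketch of tolerant JSON parsing. The input lines and field names are invented for the example; in a Databricks notebook, logic like this could sit in a Python cell alongside SQL cells, or be wrapped as a user-defined function.

```python
import json

# Row-level JSON parsing: per-record logic that set-based SQL
# handles poorly but Python handles naturally.
raw_events = [
    '{"user": "alice", "tags": ["a", "b"]}',
    'not json at all',           # malformed input is common
    '{"user": "bob"}',           # missing optional field
]

def parse_event(line):
    """Parse one JSON event, tolerating bad rows and missing keys."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return {"user": record.get("user"),
            "tag_count": len(record.get("tags", []))}

parsed = [p for p in (parse_event(l) for l in raw_events) if p]
print(parsed)
# [{'user': 'alice', 'tag_count': 2}, {'user': 'bob', 'tag_count': 0}]
```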

The Databricks framework allows you to create code using any of the above languages, as well as others, in the same process. This is something we have never been able to do before.

We do many modern data analytics projects with Azure Data Warehouse, where enterprise data warehouses are built from data in many formats, not just relational databases. Before we can get the data into Azure Data Warehouse, a lot of processing needs to happen, especially around ensuring the data is correctly delimited, contains no in-text line feeds, and avoids other common data migration problems. Databricks transforms can be built using Python (for non-set-based string parsing) and SQL (for relational queries) inside the same piece of transform code, making it your Swiss Army data engineering tool.
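The in-text line feed problem above can be sketched in a few lines of plain Python (the sample data is invented). A field containing an embedded newline breaks naive line-by-line loaders, but a proper CSV parser handles quoted newlines correctly, so the rows can be re-emitted in a clean, load-ready form.

```python
import csv
import io

# One field contains an embedded line feed inside quotes, which
# would break a naive split-on-newline loader.
raw = 'id,comment\n1,"fine"\n2,"has a\nline feed inside"\n'

# csv understands quoted newlines; we then flatten them to spaces
# so every output row is a single clean line.
reader = csv.reader(io.StringIO(raw))
rows = [[field.replace("\n", " ") for field in row] for row in reader]

print(rows)
# [['id', 'comment'], ['1', 'fine'], ['2', 'has a line feed inside']]
```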

As you can see, although data engineering is not new, it has become more complex and now includes non-relational data, requiring that we add non-relational tools to our toolset. It has also become the critical first step for enterprise data warehousing and for many machine learning and AI initiatives.

Because of all this, Databricks has become my new favorite data engineering tool.

For more information on Azure Databricks, see the Azure Databricks Documentation.
