Intelligent Integrated Data Platforms Series: Databricks Intro

Luiz
Blue Orange Digital
9 min read · May 16, 2024

Databricks is an all-in-one data platform built around the Apache Spark engine, specifically designed for large-scale data handling. It goes beyond just being a processing tool. Databricks qualifies as an intelligent integrated data platform because it streamlines the entire data lifecycle, from ingesting information from various sources to analyzing it and deploying machine learning models. This unified environment allows data engineers, data scientists, and business analysts to work together seamlessly on a single platform.

The benefits of using Databricks are numerous. It simplifies complex data tasks with a user-friendly interface and fosters team collaboration. Databricks also boasts scalability, allowing businesses to handle massive datasets efficiently. Furthermore, it leverages the power of Apache Spark for faster processing and real-time analytics. With its secure and reliable architecture, Databricks empowers organizations to unlock the true potential of their data.

Origins of Databricks

Hadoop and HDFS are foundational technologies that underpin Databricks’ capabilities. Apache Hadoop is an open-source framework for managing big data. It excels at distributing large datasets across clusters of commodity hardware, enabling parallel processing for faster analysis. HDFS, the Hadoop Distributed File System, acts as the storage layer for Hadoop. It efficiently stores massive amounts of data across multiple machines, ensuring reliability through data replication in case of hardware failures.

Databricks integrates seamlessly with Hadoop and HDFS. It can directly access data stored in HDFS, allowing users to leverage existing data infrastructure. This eliminates the need to move data around, streamlining workflows.

Databricks is built around Apache Spark, another open-source framework that emerged from the Hadoop ecosystem and can run on top of Hadoop clusters. Spark excels at fast, in-memory data processing, making it ideal for complex analytics on HDFS data. This combined approach of HDFS storage, Spark processing, and Databricks’ user-friendly environment empowers businesses to analyze big data efficiently and gain valuable insights.

One final important concept supported by Databricks is Hive, which acts as a data warehouse built on top of Hadoop. It provides a SQL-like interface for querying data stored in HDFS. This is particularly helpful for data analysts and business users who may not be familiar with programming languages like Java or Python, commonly used for manipulating data in Hadoop. Databricks integrates with Hive, allowing users to leverage familiar SQL syntax to query massive datasets stored in HDFS. This simplifies data exploration and analysis within the Databricks environment.
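As a quick illustration of that point, a notebook cell can run Hive-style SQL through Spark and get a DataFrame back. This is only a minimal sketch: the sales table and its columns are hypothetical placeholders, not objects used elsewhere in this article.

# Minimal sketch: query a metastore-registered table with familiar SQL syntax.
# The "sales" table and its columns are hypothetical placeholders.
df = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
""")
display(df)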

Lastly, Databricks shines in data analysis due to its comprehensive toolkit. It allows for data wrangling (which converts raw data into more useful formats for analysis) and exploration through notebooks that support SQL and popular data science languages like Python and R. These notebooks can be used for data cleaning, transformation, and visualization. This combination of tools empowers businesses to uncover valuable insights from their data, regardless of technical expertise. It also offers built-in machine-learning libraries and tools that allow data scientists to develop, train, and deploy ML models directly within the platform. This eliminates the need to switch between separate tools for data processing and model building. Furthermore, Databricks integrates with popular open-source machine learning frameworks like TensorFlow and PyTorch, providing flexibility in model development.
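To make the machine-learning point concrete, here is a minimal sketch of training and tracking a model inside a notebook with MLflow, the experiment-tracking tool bundled with Databricks. The feature table and column names are hypothetical placeholders.

# Minimal sketch: train a simple model on a Spark table and track it with MLflow.
# The table "cloud_services_features" and its columns are hypothetical.
import mlflow
from sklearn.linear_model import LinearRegression

pdf = spark.table("cloud_services_features").toPandas()  # small feature set, so pandas is fine
X, y = pdf[["feature_a", "feature_b"]], pdf["target"]

with mlflow.start_run():
    model = LinearRegression().fit(X, y)
    mlflow.log_metric("r2", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # logged artifact that can be deployed later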

Best Practices with a Practical Example

To show how well Databricks enables data engineering tasks, this next example will illustrate a standard ELT (Extract, Load, Transform) process.

  • Databricks ingests the raw CSV file from blob storage into the Staging Lakehouse area.
  • Within Databricks, data transformations are performed on the data in the Staging Lakehouse.
  • The processed data is stored as Delta tables in the Analytical Delta Tables area.
  • Analytical visualizations are created based on the data stored in the Analytical Delta Tables.

Let’s start by creating a cluster, the set of machines Databricks will use to process the underlying data. Once the cluster is up and running, let’s create a notebook that will hold our data transformation and analysis code.

Next, we can check any existing databases with the following SQL command. Make sure the notebook cell you are executing is set to SQL.

SHOW DATABASES

At this point, there is only the default database. Let’s create a new one named db.

CREATE DATABASE IF NOT EXISTS db

After that, we can switch to the newly created database with:

USE db

Row-based vs. Columnar Data Formats

The idea behind the Lakehouse paradigm adopted by Databricks is that the underlying data can live in files of multiple types, with an abstraction layer that lets you access it as if it were a regular warehouse with tables and schemas. In that sense, Databricks can work both with storage systems and file types optimized for transactional operational data and with formats optimized for analytical querying.

Row-based data formats


Ingestion from operational data sources typically involves row-based formats. These formats prioritize fast writes and appends, making them suitable for scenarios where data is constantly being updated. Standard row-based formats in Databricks include CSV, a simple and human-readable format, and JSON, which stores data in a structured way using key-value pairs.
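Both formats can also be read directly with the Spark DataFrame reader, without creating a table first. The sketch below reuses the CSV path from the example that follows; the JSON path is a hypothetical placeholder.

# Minimal sketch: read row-based files straight from storage.
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/FileStore/tables/cloud_service.csv"))

# Hypothetical JSON source, shown only to illustrate the reader API
json_df = spark.read.json("/FileStore/tables/cloud_service_events.json")
display(csv_df)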

With that in mind, we can create a table pointing to the row-based ingestion file. For those familiar with Snowflake, this works much like a stage in that product: it creates a table abstraction that points to the file without materializing a physical table in Databricks. In the statement below, note that you don’t have to declare the table schema, since the platform infers it from the file structure.

CREATE TABLE cloud_services_staging
USING CSV
OPTIONS (
path "/FileStore/tables/cloud_service.csv",
header "true",
inferSchema "true",
delimiter ","
);

The staging table can be read using SQL statements or PySpark code as follows:

df = spark.table("cloud_services_staging")
display(df)

The PySpark library also supports Hive-style SQL queries:

df = spark.sql('''
SELECT * FROM cloud_services_staging
''')
display(df)

Columnar-based formats

Columnar data formats excel at handling large datasets for analytical workloads. They store data by column rather than by row, which makes retrieving only the columns needed for an analysis faster. Common columnar formats in Databricks include Parquet, a widely adopted format known for its efficiency and compression, which makes it well suited to complex data exploration and visualization; ORC, which offers similar compression and fast-querying benefits; and, most importantly here, Delta, which combines the querying efficiency of a columnar format with additional features for data management and operational use cases. Databricks reads and writes all of these formats seamlessly, allowing you to leverage existing data stored in cloud storage or HDFS for efficient analysis.
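As a rough sketch of how this looks in practice, the same DataFrame API reads Parquet and writes Delta with nothing more than the format name changing. The paths below are illustrative placeholders.

# Minimal sketch: read a Parquet dataset and persist it as Delta; paths are illustrative.
parquet_df = spark.read.parquet("/FileStore/tables/cloud_service_parquet/")

(parquet_df.write
 .format("delta")
 .mode("overwrite")
 .save("/FileStore/tables/delta/cloud_service_copy/"))

delta_df = spark.read.format("delta").load("/FileStore/tables/delta/cloud_service_copy/")
display(delta_df)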

Separately from the ingestion staging tables, we can create a physical analytical table using the Delta format. For that, we have to specify the table schema and the Delta storage location when we create the table.

CREATE TABLE IF NOT EXISTS cloud_services_analyticals (
ID int,
ServiceName string,
Vendor string,
ReleaseDate date,
DaysRelease long
)
USING DELTA LOCATION "/FileStore/tables/delta/cloud_services_analyticals"

Data Transformations

The possible transformations that can be done in Databricks are virtually endless. In our example, we will do something as simple as computing the number of days between today and the release date of each cloud service registered in our operational table.

This can be done using SQL syntax or a programming language such as Python or Scala. In all cases, Databricks’ engine enables optimized execution and parallel processing. The SQL option can be seen here:

INSERT INTO cloud_services_analyticals
SELECT *, datediff(current_date(), ReleaseDate) AS DaysRelease FROM cloud_services_staging

And the PySpark version is next:

from pyspark.sql.functions import datediff, current_date

stg = spark.table("cloud_services_staging")
stg = stg.withColumn("DaysRelease", datediff(current_date(), stg["ReleaseDate"]))
# display(stg)
# stg.printSchema()
stg.write.format("delta").mode("append").saveAsTable("cloud_services_analyticals_")

The PySpark code above uses the standard DataFrame API implemented by Spark. Its syntax is somewhat different from the pandas DataFrames many programmers are used to, so Databricks released a pandas-like API, originally named Koalas and later folded into Spark itself as the pandas API on Spark (pyspark.pandas). Even though its syntax is very similar to regular pandas, behind the scenes the API converts the calls into regular Spark operations and, more importantly, leverages Spark’s parallel processing power. See the same example, but now with the pandas API syntax below:

import pyspark.pandas as ks
from datetime import datetime

stg = ks.read_table("cloud_services_staging")
# Timestamp subtraction in the pandas API on Spark returns the difference in seconds,
# so divide by the number of seconds in a day to get whole days since the release date
stg["DaysRelease"] = ((datetime.now() - stg["ReleaseDate"]) // 86400).astype("int")
# stg.info()
stg.to_delta("/FileStore/tables/delta/cloud_services_analyticals_", mode="append")
display(stg)

Data Analysis and Visualization

Raw data often contains a lot of noise that can obscure the bigger picture. Consolidation and summarization are crucial tools for cutting through it: consolidation merges similar data points, while summarization condenses information into key metrics. Large datasets can be overwhelming, and these techniques present a clearer picture that lets you identify patterns and trends quickly. As a rule of thumb, use consolidation when you have redundant data points, such as multiple entries for the same customer, and summarization when you need to understand overall trends, such as average sales figures across different regions. Applied strategically throughout the analysis process, they turn overwhelming data into a clear and actionable story.

To illustrate this, consider something simple, like counting how many distinct services each cloud vendor offers:

ana = ks.sql('''
SELECT DISTINCT Vendor, ServiceName FROM cloud_services_analyticals
''')
agg = ana.groupby('Vendor').count()['ServiceName']
print(agg)

With Databricks and our summarized data in hand, we can understand it better with a suitable visualization. For the scenario where we are comparing the total number of services by vendor, a pie chart is ideal, and it can be produced with a single line of code:

agg.plot(kind="pie")

Databricks empowers both programmatic and visual data enthusiasts. Programmers can leverage familiar languages like Python and R to create highly customized data visualizations using libraries like Matplotlib or Seaborn. This granular control allows for tailored charts and graphs that suit the specific analysis perfectly. However, Databricks also recognizes the need for user-friendly communication of insights, and that’s where the dashboard feature comes in. It allows users to create interactive dashboards that combine charts, graphs, and text elements into a single, shareable interface. Business users who may not be familiar with coding can easily understand trends and patterns within the data. This makes Databricks dashboards ideal for scenarios like communicating sales performance to executives, tracking marketing campaign effectiveness, or providing customer support representatives with real-time customer insights, all within a user-friendly and visually compelling format. To create a dashboard in Databricks using our data, just click File -> New dashboard and arrange the charts into a shareable layout.
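For the programmatic route mentioned above, a minimal Matplotlib sketch could look like the following. It assumes the agg series computed earlier and converts the small aggregated result to plain pandas before plotting.

# Minimal sketch: convert the small aggregated result to pandas and customize the chart.
import matplotlib.pyplot as plt

counts = agg.to_pandas()  # 'agg' is the per-vendor service count computed above
ax = counts.plot(kind="bar", color="steelblue", title="Distinct services per vendor")
ax.set_xlabel("Vendor")
ax.set_ylabel("Services")
plt.tight_layout()
plt.show()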

Conclusion

Databricks is a powerful and versatile data platform that can help businesses of all sizes manage and analyze their data more efficiently. It is built on top of Apache Spark, a fast and scalable processing engine, and supports various data formats, including CSV, JSON, Parquet, ORC, and Delta. Databricks also integrates with popular open-source machine learning frameworks, which makes it an excellent choice for businesses that want to build and deploy machine learning models on their data. In addition to its powerful data processing and analytics capabilities, Databricks offers a user-friendly interface that makes it easy for users of all levels of technical expertise to get started. Overall, Databricks is a comprehensive and robust data platform that can help businesses of all sizes unlock their data’s value.
