Introduction to Databricks: A Beginner’s Guide

Mariusz Kujawski
Feb 26, 2024

In this guide, I’ll walk you through everything you need to know to get started with Databricks, a powerful platform for data engineering, data science, and machine learning. From setting up your environment to understanding key features like data processing and orchestration, this guide has got you covered. Let’s jump right in!

Databricks Environment

In Databricks environments, we have four major components:

  • Workspace: A Databricks deployment in the cloud that functions as an environment for your Databricks assets. Here, you can create notebooks, jobs, clusters, and manage access. You can create multiple workspaces for your teams.
  • Notebooks: Web-based interfaces where you can write code in Python, Scala, or R, execute it, check its results, and visualize it.
  • Cluster: Your compute resources and configuration, on which you run your notebooks and jobs. You can create different types of clusters depending on your needs, selecting the Databricks Runtime version and the memory and CPU for the driver and worker nodes. This is also where you monitor your code and job executions and install the libraries required by your code.
  • Metastore: A place where you can find metadata information about tables that were created. You can find all databases and tables here. Every Databricks deployment has a central Hive metastore accessible by all clusters for table metadata.

Apache Spark

Apache Spark is an open-source, unified computing engine and a set of libraries for parallel data processing on computer clusters. Spark supports multiple programming languages (Python, Java, Scala, and R) and includes libraries for diverse tasks ranging from SQL to streaming and machine learning. You can run Spark on your local machine or scale up to thousands of computers, making it suitable for big data processing.

Below is an example of the Spark application architecture. It's important to note that operations on data are distributed across worker nodes and their executors to speed up data processing. In a Spark cluster, each node is a separate machine. For test purposes or small datasets, we can use the single-node option, which runs everything on one machine.

Spark Application Architecture

DataFrame

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

Lazy Evaluation

This is where Spark's lazy evaluation comes into play: transformations only build a logical execution plan, which is not executed immediately; Spark defers the computation until an action is called.

Transformations

Transformations are the instructions you use to modify the DataFrame in the way you want and are lazily executed. There are two types of transformations: narrow and wide.

  • Narrow Transformations: These are transformations for which each input partition will contribute to only one output partition. Examples include map, filter, and select.
  • Wide Transformations: These transformations will have input partitions contributing to many output partitions. Spark will perform a data shuffle operation, redistributing partitions across the cluster. Examples include group by and join.

Actions

Actions are operations that trigger data processing and return results or write data to storage. Examples of actions include count, collect, write, and show. When you call an action, Spark takes the entire logical plan built through transformations, optimizes it, and then executes it, as sketched below.
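To make the distinction concrete, here is a minimal sketch, assuming a DataFrame df with branch and amount columns (these names are only for illustration): the two transformations only build the plan, and nothing runs until the action at the end.

from pyspark.sql.functions import col, sum

# Narrow transformation: filter processes each input partition independently
filtered = df.where(col("amount") > 100)

# Wide transformation: groupBy shuffles data across the cluster
aggregated = filtered.groupBy("branch").agg(sum("amount").alias("total"))

# Nothing has executed yet; the action below triggers evaluation of the whole plan
result = aggregated.count()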

Databricks Clusters

To create a cluster in a Databricks workspace, we need to go to the Compute tab. We can choose from a few types of clusters with different configurations and access modes. Each configuration has its benefits and limitations; I'll discuss them briefly below.

Cluster Types

  • All Purpose Cluster: These clusters are primarily used for interactive data analysis in Databricks notebooks. Multiple users can share them for collaborative analysis. Idle clusters are automatically terminated after the period specified in the auto-termination setting of the cluster configuration.
  • Job Cluster: Job clusters in Databricks are transient clusters created specifically for running jobs. They are preferred for non-interactive workloads because of their cost efficiency: the compute is only used for the duration of the job, after which the cluster is terminated.
  • SQL Warehouse: This type of cluster is for executing SQL queries against tables. Access is provided through a SQL editor, and there are predefined clusters available for configuration.

Access Mode

Cluster access mode is a security feature that determines who can use a cluster and what data they can access through it. When creating any cluster in Azure Databricks, you must select an access mode. Considerations include how you want to use the cluster, which languages it must support, and whether you need DBFS mounts or Unity Catalog integration.

  • Single-user Mode: Supports Unity Catalog and is assigned and used by only a single user. Supports Scala, Python, SQL, and R, as well as RDD API, DBFS mounts, init scripts, and external libraries. Databricks Runtime ML is also supported.
  • Shared Mode: Can be used by multiple users with data isolation among them, requiring a premium plan. Supports Unity Catalog and SQL, Python, and Scala, but does not support RDD API and DBFS mounts. When used with credential pass-through, Unity Catalog features are disabled. Databricks Runtime ML is not supported.
  • No Isolation Shared Mode: This can be used by multiple users with no data isolation among them. Does not support Unity Catalog but can be used with SQL, Scala, Python, and R. Supports RDD API, DBFS mounts, init scripts, libraries, and Databricks Runtime ML.

Basics of Data Processing

Databricks allows us to use Scala, Python, and Spark SQL. In this post, I’ll focus on Python and Spark SQL. PySpark is the Python API for Apache Spark, enabling real-time and large-scale data processing in a distributed environment. Spark SQL is Apache Spark’s module for working with structured data, seamlessly integrating SQL queries with Spark programs. With PySpark DataFrames, you can efficiently read, write, transform, and analyze data using Python and SQL. Whether you choose Python or SQL, the same underlying execution engine is used, ensuring that you can always leverage the full power of Spark.

Basic DataFrame Using Python

In Databricks, we can access the Spark session using the spark object.

data = [[1, "VW"],  
[2, "BMW"]]

columns = ["ID", "Car"]

dataframe = spark.createDataFrame(data, columns)

# show data frame
display(dataframe)

Reading and Writing from and to Azure Data Lake Gen2

To start working with Databricks, we need to configure external storage. In my case, I'll be using an Azure storage account. This storage will be used for reading and writing data. While we can create tables in Databricks, it's important to note that we still operate on files, making storage a crucial aspect of our setup. Databricks supports several options for integrating with storage, including mounting the storage account in the Databricks File System (DBFS) or accessing it directly by URL. To connect to the storage account, we need to create a service principal and assign it the "Storage Blob Data Contributor" role. Alternatively, we can use the access key. It's considered a best practice to store sensitive information in a secret scope.

Note: This method should be used for training purposes or specific use cases. If you use Unity Catalog (described later in this post), it is not the recommended way of accessing storage.
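As the paragraph above suggests, credentials are best pulled from a secret scope instead of being hard-coded. A minimal sketch, assuming a secret scope and key names that you would have created yourself:

# Retrieve the service principal credentials from a secret scope
# (the scope and key names below are placeholders)
clientID = dbutils.secrets.get(scope="my-scope", key="sp-client-id")
clientSecret = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")
tenantID = dbutils.secrets.get(scope="my-scope", key="tenant-id")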

Mounting Storage:

storageAccount = "account"
mountpoint = "/mnt/bronze"
storageEndPoint = f"abfss://bronze@{storageAccount}.dfs.core.windows.net/"

# ClientId, TenantId, and Secret are for the application (ADLSGen2App) we created as part of this setup
clientID = "xxx-xx-xxx-xxx"
tenantID = "xx-xx-xx-xxx"
clientSecret = "xxx-xx-xxxxx"
oauth2Endpoint = f"https://login.microsoftonline.com/{tenantID}/oauth2/token"

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": clientID,
    "fs.azure.account.oauth2.client.secret": clientSecret,
    "fs.azure.account.oauth2.client.endpoint": oauth2Endpoint,
}

dbutils.fs.mount(
    source=storageEndPoint,
    mount_point=mountpoint,
    extra_configs=configs,
)

To test the new mount, we can list objects from a directory:

display(dbutils.fs.ls("dbfs:/mnt/bronze/Customer/"))

Accessing files using a URI

clientID = "xxx-xx-xxx-xxx"
tenantID = "xx-xx-xx-xxx"
clientSecret = "xxx-xx-xxxxx"
oauth2Endpoint = f"https://login.microsoftonline.com/{tenantID}/oauth2/token"

spark.conf.set("fs.azure.account.auth.type.cookbookadlsgen2storage.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.cookbookadlsgen2storage.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.cookbookadlsgen2storage.dfs.core.windows.net", clientID)
spark.conf.set("fs.azure.account.oauth2.client.secret.cookbookadlsgen2storage.dfs.core.windows.net", clientSecret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.cookbookadlsgen2storage.dfs.core.windows.net", oauth2Endpoint)

df_direct = (
    spark.read
    .format("csv")
    .option("header", True)
    .load("abfss://bronze@cookbookadlsgen2storage.dfs.core.windows.net/Customer")
)

Reading and Saving Data from Various Sources

Spark enables us to effortlessly read and save data from various sources, including CSV files, Parquet, Avro, JSON, and more.

The read method in Spark allows us to import data from other databases such as Oracle, PostgreSQL, and more.

CSV

df = (
    spark.read
    .option("delimiter", ",")
    .option("header", True)
    .csv(path_to_file)
)

SQL

SELECT * FROM read_files(
'/mnt/bronze/orders/orders.csv',
format => 'csv',
header => true,
mode => 'FAILFAST')

-- OR

SELECT *
FROM
csv.`/mnt/path/to/file`

While Spark can automatically detect schema for us, there is also an option to declare it manually.

from pyspark.sql.types import *

cust_schema = StructType([
    StructField("id", IntegerType()),
    StructField("car", StringType()),
])

df = (
    spark.read
    .option("delimiter", ",")
    .option("header", True)
    .schema(cust_schema)
    .csv(path_to_file)
)

Oracle

query = "select 1 a from dual"

df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:user/pass@//address:port/instance")
    .option("query", query)
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .load()
)

Saving data

We can save files directly to storage using the following code:

(
    df.write
    .format("csv")
    .option("header", "true")
    .save(path_to_save)
)

Databricks tables

When we have data loaded in our DataFrame, we can transform it using transformations, save the DataFrame to storage, or create a table in a schema (database). To save data in a table, we first need to create a database, which we can do using Spark SQL:
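A minimal example of that step (the database name test_db matches the tables used later in this post):

# Create the database (schema) if it does not exist yet
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")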

Then, we can create and populate a table using PySpark.

(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(table_name, path=destination)
)

As a result of execution, we should see a table in the Workspace in the Catalog tab.

Spark SQL provides the possibility to create a table using SQL DDL. Creating tables with SQL will feel familiar to people who have just switched from SQL Server, PostgreSQL, or Oracle. As mentioned earlier, tables in Databricks are only metadata descriptions of files stored in the table location. Optionally, we can specify the file format and location; if we don't, Databricks will use the default location and format. Using the default settings results in the creation of a Spark managed table.

-- Creates a Delta table
CREATE TABLE test_db.cars (
  id INT,
  model STRING,
  year INT
);

-- Use data from another table
CREATE TABLE test_db.cars
AS SELECT * FROM load_cars;

-- Use data from another table using its path
CREATE TABLE IF NOT EXISTS test_db.cars
AS SELECT * FROM delta.`raw/cars`;

If we specify a location, it will result in the creation of a Spark unmanaged table (external table).

CREATE TABLE test_db.cars (
  id INT,
  model STRING,
  year INT
)
USING DELTA
LOCATION '/mnt/silver/cars';

The main difference between managed and external tables is that, for an external table, Databricks manages only the metadata; when you drop the table, the underlying data files are not removed. Dropping a managed table removes both the metadata and the data.

Read Databricks tables

With Spark and SQL, we can execute a query against a table registered in the Databricks metastore to retrieve data from it.

df = spark.sql("""
    select *
    from test_db.cars
""")

df = spark.read.table("test_db.cars")

Databricks notebooks allow us to query tables using Spark SQL directly. To use SQL, we need to switch the language to SQL or use the magic command %sql. The result will be displayed under the cell with the SQL command.

%sql

select
*
from
test_db.cars

Delta Tables

Delta Lake is the default storage format for all operations on Databricks. Unless otherwise specified, all tables on Databricks are Delta tables. Delta Lake serves as the optimized storage layer that forms the foundation for storing data and tables in the Databricks lakehouse. It is an open-source technology that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Delta Lake seamlessly integrates with Apache Spark, offering a range of important features:

  • Additional Operations: Delta Lake provides additional operations such as inserts, updates, and deletes. You can update data in a target using the SQL Merge operation, improving load strategies for a lakehouse and supporting operations like Slowly Changing Dimensions (SCD).
  • ACID Transactions: Delta Lake ensures ACID (Atomicity, Consistency, Isolation, Durability) transactions, guaranteeing data integrity and reliability.
  • Time Travel: Delta Lake offers time travel capabilities by providing a transaction log, enabling users to access and revert to earlier versions of data, facilitating data auditing and versioning.
  • Schema Evolution: Delta Lake supports schema evolution, allowing changes to a table schema that can be applied automatically without the need for recreating a file, ensuring flexibility and ease of maintenance.
  • Optimization: Delta Lake provides optimization features such as OPTIMIZE and ZORDER BY to reduce the number of files and physically organize data within a file, improving query performance and efficiency.
  • Cache in Databricks: Delta Lake allows caching files on worker nodes and caching the result of queries in memory, reducing the need for reading from files and enhancing query performance.

After saving a table, you should observe the same structure reflected in storage.
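As a quick, hedged illustration of the time travel and optimization features listed above (using the test_db.cars table from the previous section; the ZORDER column is only an example):

# Inspect the table's transaction history
display(spark.sql("DESCRIBE HISTORY test_db.cars"))

# Time travel: query an earlier version of the table
display(spark.sql("SELECT * FROM test_db.cars VERSION AS OF 0"))

# Compact small files and co-locate data by a frequently filtered column
spark.sql("OPTIMIZE test_db.cars ZORDER BY (year)")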

DataFrame Transformations

Transformation is one of the most important aspects of data processing. Spark is very powerful in this regard, with its ability to integrate data from different sources and the flexibility to use Python, Scala, and SQL. In this section, I'll demonstrate how to transform a DataFrame in the most common scenarios.

To start working with a DataFrame, we need to read data from a file, table, or we can use a list, as demonstrated in the example below to create a DataFrame.

data = [
[1,"Alicia Kaiser","WA","001-677-774-4370","13714","Stacy Summit","54799","Lake Joseborough","Delaware","1939-04-07","2656508165701512", "1000.00"],
[2,"Donna Ellis","BR","001-631-995-8008","43599","Adam Trail","07204","Port April","Montana","1966-07-28","5527331190171381", "1000.00"],
[3,"Kenneth Smith","WA","001-592-849-6009x4173","649","Sherri Grove","14527","North Miranda","Washington","1998-09-25","5366314062583069", "1000.00"],
[4,"Danny Clark","WA","574.419.0221x348","285","Timothy Drive","41106","West Erica","Maryland","1948-08-29","5488489084734990", "2000.00"],
[5,"Nicholas Thompson","CA","(259)268-8760x061","998","Russell Shoals","65647","South Todd","South Carolina","1985-03-08","2720282150775392", "3000.00"],
[6,"Frances Griffith","WA","7535316823","9559","Emily Branch","71422","Mcdanielhaven","New York","1984-07-12","2248334835679706", "4000.00"],
[7,"Trevor Harrington","CA","742.224.9375","5960","Lisa Port","73881","Loganbury","New York","1979-03-10","5130498353342015", "2000.00"],
[8,"Seth Mitchell","AA","(386)517-7589x04440","47352","Stafford Loop","01347","South Alexander","North Dakota","1944-07-03","2280935548220544", "3000.00"],
[9,"Patrick Caldwell","BR","001-307-225-9094","0170","Amanda Dam","24885","Port Mollyhaven","Connecticut","1973-05-04","5557276831443314", "2000.00"],
[10,"Laura Hopkins","CA","9095819755","143","Lee Brook","23623","Jarvisland","Hawaii","1971-11-13","2720224762678291", "100.00"]
]
schema = "client_number int, name string, branch string, phone_number string, bulding_number string, street_name string, postcode string, city string, state string, birth_date string, credit_card_number string, amout string"
df = spark.createDataFrame(data, schema)

The display command will present the content of the DataFrame.

Selecting columns

display(df.select("client_number","name", "amout"))

Spark allows us to use SQL to transform data. To utilize a DataFrame created or imported within a SQL context, we need to create a temporary view:

df.createTempView("client")

There are various options to access columns in a DataFrame. Below, you can find a few examples:

from pyspark.sql.functions import expr, col, column

display(
    df.select(
        expr("client_number as client_num"),
        col("client_number").alias("client_num"),
        column("client_number"),
    ).limit(2)
)

During data transformation, common scenarios include changing data types, renaming columns, adding new columns, and deriving new columns based on values from others. Let’s explore how we can achieve these tasks.

from pyspark.sql.functions import expr, col, column, lit, exp, current_date

age_exp = "extract(year from current_date) - extract(year from birth_date)"

df1 = (
    df.withColumn("amout", col("amout").cast("decimal(10,2)"))
    .withColumn("birth_date", col("birth_date").cast("date"))
    .withColumn("age", expr(age_exp))
    .withColumn("load_date", lit(current_date()))
)

df1 = df1.drop("credit_card_number", "birth_date")

display(df1)

In this example, we utilize the withColumn function to calculate age, add load date columns, and change the amount to decimal data type. Additionally, we can drop columns with unnecessary data. The same result can be achieved using SQL.

df.createOrReplaceTempView("client")

df1 = spark.sql("""
    select
        client_number,
        name,
        bulding_number,
        street_name,
        postcode,
        city,
        state,
        cast(amout as decimal(10,2)) amount,
        extract(year from current_date) - extract(year from birth_date) age,
        current_date() load_date
    from
        client
""")
display(df1)

Filtering Rows in a DataFrame

Filtering rows in a DataFrame involves creating a condition that separates the data you want to keep from the data you don’t. This condition can be written as a simple expression or built from multiple comparisons. DataFrames offer two methods, where and filter, to achieve this filtering based on your chosen condition.

display(df1.where("age >= 85"))
display(df1.where(col("age") >= 85))

It’s possible to build more complex expressions in DataFrame filtering using AND or OR conditions, allowing for greater flexibility in specifying conditions for row selection.

display(df1.where("age >= 85 or age <=30"))
display(df1.filter( (col("age") >= 85) | (col("age") <= 30)))
display(df1.where("age >=30 and branch != 'WA' "))
display(df1.filter( (col("age") >= 30) & (col("branch") != 'WA')))

Grouping and Sorting

Group by is a transformation operation in PySpark used to group the data in a Spark DataFrame based on specified columns. This operation is often followed by aggregating functions such as count(), sum(), avg(), etc., allowing for the summarization of grouped data.

Order by, on the other hand, sorts records in a DataFrame based on specified sort conditions, allowing for the arrangement of data in ascending or descending order.

from pyspark.sql.functions import desc, count, sum

# count by branch
display(df1.groupBy("branch").count().orderBy(desc("count")))

# sum and count by branch, ordered by client count
display(
    df1.groupBy("branch")
    .agg(sum("amout").alias("amout"), count("client_number").alias("qt"))
    .orderBy(desc("qt"))
)

The SQL version below mirrors the PySpark example above:

df.createOrReplaceTempView("client")

df1 = spark.sql("""
    select
        branch,
        sum(amout) amout,
        count(client_number) client_number
    from
        client
    group by branch
""")

df1.show()

+------+------+-------------+
|branch| amout|client_number|
+------+------+-------------+
|    WA|8000.0|            4|
|    BR|3000.0|            2|
|    CA|5100.0|            3|
|    AA|3000.0|            1|
+------+------+-------------+

As shown above, instead of display you can also use the show command in PySpark to print the results of these operations directly in the notebook or console.

Joining DataFrames

Joining operations are crucial for various data processing tasks such as data normalization, data modeling, and ensuring data quality. Spark supports joins using DataFrame join and SQL joins.

To demonstrate the join operation, we need an additional DataFrame. I’ll create it similarly so you can easily replicate my steps, or you can load data from a file or table.

import uuid

trans = [
    [str(uuid.uuid4()), 1, 100.00, "2024-02-01"],
    [str(uuid.uuid4()), 1, 200.00, "2024-02-03"],
    [str(uuid.uuid4()), 1, 130.00, "2024-02-04"],
    [str(uuid.uuid4()), 2, 110.00, "2024-02-05"],
    [str(uuid.uuid4()), 3, 200.00, "2024-02-01"],
    [str(uuid.uuid4()), 2, 300.00, "2024-02-02"],
    [str(uuid.uuid4()), 2, 50.00, "2024-02-03"],
]

schema = "id string, client_id int, value double, tran_date string"
df_tran = spark.createDataFrame(trans, schema)
display(df_tran)

The code below illustrates how to join two DataFrames. Spark supports various join types, including:

  • Inner Join
  • Left / Left Outer Join
  • Right / Right Outer Join
  • Outer / Full Join
  • Cross Join
  • Left Anti Join
  • Left Semi Join

# column name index style
display(df.join(df_tran, df['client_number'] == df_tran['client_id'], 'inner'))

# column name property style
display(df.join(df_tran, df.client_number == df_tran.client_id, 'inner'))

These joins function similarly to their counterparts in SQL. Additionally, Spark provides a few new types of joins that you may not have seen in SQL:

  • Left Anti Join: Retrieves records from the left DataFrame that do not exist in the right DataFrame.

display(df.join(df_tran, df['client_number'] == df_tran['client_id'], 'left_anti').select("client_number"))

  • Left Semi Join: Retrieves records and columns from the left DataFrame where records match records in the right DataFrame.

display(df.join(df_tran, df['client_number'] == df_tran['client_id'], 'leftsemi'))

Naturally, Spark SQL allows us to use SQL-like joins in our code. We can join existing tables or create views and join these views using SQL syntax.

df.createOrReplaceTempView("clients")
df_tran.createOrReplaceTempView("tran")

display(spark.sql("""
    select *
    from
        clients a
    inner join
        tran b on a.client_number = b.client_id
"""))

Union DataFrames

Spark facilitates union operations on DataFrames in several ways. We can union DataFrames using column names, as shown in example 1. Alternatively, we can union DataFrames without checking column names, as demonstrated in example 2. Additionally, Spark allows for merging two DataFrames while allowing missing columns.

df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])

df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])

# example 1
display(df1.unionByName(df2))

# example 2
display(df1.union(df2))

df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col3"])

# example 3
display(df1.unionByName(df2, allowMissingColumns=True))

When function

A useful function in PySpark is when, which is employed in cases where we need to translate or map a value to another value based on specified conditions.

from pyspark.sql.functions import when

data = [("Robert", "Smith", "M", 40000), ("Ana", "Novak", "M", 60000),
        ("Carl", "Xyz", None, 500000), ("Maria", "Free", "F", 500000),
        ("Johm", "Az", "", None), ("Steve", "Smith", "", 1000)]

columns = ["name", "surname", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)

df2 = df.withColumn(
    "new_gender",
    when(df.gender == "M", "Male")
    .when(df.gender == "F", "Female")
    .when(df.gender.isNull(), "")
    .otherwise(df.gender),
)

Databricks Auto loader

Auto Loader is a feature that enables the incremental and efficient processing of new files as they arrive in cloud storage. It can load data files from Azure storage accounts, AWS S3, and Google Cloud Storage (GCS).

Delta Live Tables is a great extension of Auto Loader for implementing ETL (Extract, Transform, Load) processes. Auto Loader automatically stores information about processed files, eliminating the need for additional maintenance steps. In case of failure, it resumes processing from the last successful step.
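To give a rough idea of what this looks like, below is a minimal sketch of an Auto Loader stream; the paths, schema location, and target table name are placeholders, not a complete configuration.

# Incrementally pick up new CSV files landing in the source directory
df_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/orders")
    .load("/mnt/bronze/orders/")
)

# The checkpoint location tracks which files have already been processed
(
    df_stream.writeStream
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("test_db.orders_raw")
)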

For instructions on how to configure Auto Loader, you can refer to another post.

Python Functions and Modules

In Databricks, we can organize our code using either notebooks or Python modules. When using notebooks, the process is straightforward. We create a notebook, for example, named “Utils,” where we define all the functions that we commonly use in our solution. Then, we can call the %run command to include the defined functions in other notebooks.

%run ./Utils

print(add(1, 2))

Alternatively, we can use a module by creating a folder named “utils” with the following structure:

.
├── utils
│   ├── __init__.py
│   └── fun.py
└── test_notebook

# fun.py
def add(a, b):
    return a + b

We can import this module using well-known Python syntax:

from utils.fun import add


print(add(1,2))

You can install custom .whl files onto a cluster and then import them into a notebook. For code that is frequently updated, this process might be inconvenient and error-prone.
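For completeness, a wheel that is already available on DBFS or in the workspace can also be installed at the notebook scope with the %pip magic command; the path below is only a placeholder.

%pip install /dbfs/libraries/my_package-0.1.0-py3-none-any.whl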

User-defined functions

If we can’t find a function for our use case, it’s possible to create our custom function. However, to use it with DataFrames or SQL, we need to register it.

from pyspark.sql.types import LongType

def squared(s):
    return s * s

spark.udf.register("squaredWithPython", squared, LongType())

df1.createOrReplaceTempView("df1")

SQL

%sql
select col1, squaredWithPython(col1) as col1_squared from df1

PySpark

from pyspark.sql.functions import udf

@udf("long")
def squared_udf(s):
    return s * s

display(df1.select("col1", squared_udf("col1").alias("col1_squared")))

Docstring and Typing

Using docstrings and typing in Python is crucial for well-documented code. Docstrings and typing provide the ability to document our classes, functions, etc., improving the readability and usability of the code. Information provided in docstrings can be utilized by code intelligence tools, the help() function, or accessed via the __doc__ attribute.

def example(param0: type, param1: type) -> type:
    """
    Does what.

    Args:
        param0:
        param1:

    Returns:
        Describe the return value.
    """

For example, let's create a function that returns a date based on the number of days since 1900-01-01. To ensure clarity and correctness, we can write the function with a docstring and type hints as shown above.

import datetime


def to_date(number_of_days: int) -> str:
    """
    Converts days from 1900-01-01 to a date.

    Args:
        number_of_days: number of days

    Returns:
        date: date in string format.
    """
    date_ = datetime.date(1900, 1, 1)
    end_date = date_ + datetime.timedelta(days=number_of_days)
    return end_date.strftime("%Y-%m-%d")

To access the documentation of this function, use the help(to_date) command or IntelliSense.

Unity Catalog

Databricks Unity Catalog is a unified governance solution that centralizes access control, auditing, lineage, and data discovery capabilities across Databricks workspaces. Key features of Unity Catalog include security, data discovery, lineage, and data sharing.

  • Security: Unity Catalog offers a single place to administer data access policies that apply across all workspaces.
  • Data Discovery: Unity Catalog lets you tag and document data assets and provides a search interface to help data consumers find data.
  • Data Lineage: Data lineage supports use cases such as tracking and monitoring jobs, debugging failures, understanding complex workflows, and tracing transformation rules. It presents the flow of data in user-friendly diagrams.
  • Data Sharing: Unity Catalog gives businesses more control over how, why, and what data is being shared with whom.

The Unity Catalog Object Model

The Unity Catalog object model uses a three-level namespace to address the various types of data assets in the catalog. The metastore is the top-level container: it holds catalogs, which contain schemas, which in turn contain your data entities such as tables, views, models, and functions.
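In practice, this means data assets are addressed with a three-part name in the form catalog.schema.table. A small sketch (the catalog, schema, and table names below are placeholders):

# Read a table through the three-level Unity Catalog namespace
df = spark.table("main.sales.customers")

# The same addressing works in SQL
display(spark.sql("SELECT * FROM main.sales.customers LIMIT 10"))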

Interesting Functions in Preview

In the current Public Preview, several noteworthy features are worth mentioning:

  • Volumes: Provide access to non-tabular data stored in cloud object storage.
  • Lakehouse Federation: Provides direct access to relational database engines such as MySQL, PostgreSQL, Amazon Redshift, Snowflake, Microsoft SQL Server, Azure Synapse (SQL Data Warehouse), and Google BigQuery via Databricks.
  • Declaration of Primary Key and Foreign Key Relationships.

Orchestration

For process orchestration, several tools are available, including Azure Data Factory, Synapse pipelines, Databricks Workflows, and Airflow. If you are working with Azure services, Azure Data Factory or Synapse pipelines are the recommended choice.

Azure Synapse and Data Factory

Azure Data Factory or Synapse pipelines allow you to incorporate Notebook activities, configure parameters, and establish dependencies between other activities. Below, you can see an example of how to configure a pipeline. It requires a linked service configuration for the Databricks cluster. While I won’t delve into detailed explanations here, you can find more information in the documentation.

In the settings tab, you need to provide a path to the notebook and parameters as shown in the screenshot below.
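On the notebook side, parameters passed from a Data Factory or Synapse Notebook activity are typically read with widgets. A minimal sketch, assuming a parameter named load_date:

# Declare the widget with a default value so the notebook also runs standalone
dbutils.widgets.text("load_date", "")

# Read the value passed by the pipeline's Notebook activity
load_date = dbutils.widgets.get("load_date")
print(f"Processing data for {load_date}")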

To test it, you can click the “Debug” button and check the execution progress.

Databricks Workflows

Databricks Workspace provides “Workflows” functionality that supports job orchestration and scheduling. If you prefer to work exclusively within the Databricks environment, or with cloud providers such as AWS or GCP, this option is suitable. Workflows let you create tasks and orchestrate the steps of your data processing pipelines. For detailed configuration instructions, refer to the Databricks documentation.

Summary

In this post, I have compiled the most important information required to start working with Databricks. While I have covered key aspects such as environment setup, data processing, orchestration, and more, it's important to note that Databricks offers powerful features beyond the scope of this post. These include Delta Live Tables, MLflow, streaming processing, maintenance, tuning, and monitoring.

I strongly recommend familiarizing yourself with these topics or feel free to let me know in the comments which ones you would like me to cover in my next post. Your feedback is valuable in shaping future content.

If you found this article insightful, I invite you to express your appreciation by liking it on LinkedIn and clicking the ‘clap’ button. Your support is greatly valued. For any questions or advice, feel free to reach out to me on LinkedIn.
