Databricks for Data Warehouse Architects

James Dey
10 min read · Oct 16, 2023


I’ve been working as a data architect on a delivery project for the last five months. A lot of material about Databricks is pitched at data engineers, so I thought I’d add some information that is more relevant to data architects.

https://www.linkedin.com/in/dataarchitectuk/

What is Databricks?

Databricks evolved from the work done on Apache Spark; the company was founded by the original creators of Spark.

Nowadays, however, you can query tables in Databricks without realising that, under the hood, those tables are backed by files that sit within the Azure Delta Lake.

The system architecture provided by Microsoft shows an example of how data can be ingested into the Azure Delta Lake (via Azure Data Factory) or Databricks can pull data directly from Azure Event Hubs (or Confluent Kafka). From a data visualisation perspective, Power BI can pull data from Databricks via a connector.

Databricks in the context of an Azure data analytics stack

What is the Azure Delta Lake?

It’s a data lake with a logical filesystem (dbfs:/) that stores files in a folder-like structure.

From a Databricks workspace, when you create the following in the SQL editor:-

  • A schema, then it automatically creates a directory dbfs:/user/hive/warehouse/<schema>.db
  • A table, then it automatically creates a directory dbfs:/user/hive/warehouse/<schema>.db/<table_name>. Within that directory there are data files containing the table’s partitions and a transaction log that tracks changes. Each data file has self-contained indexes which allow the data being searched for to be pinpointed quickly (see the example after this list).
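For example (the schema and table names below are illustrative, and this assumes the default Hive metastore with no custom LOCATION clause), you can confirm where a table’s files live with DESCRIBE DETAIL:-

CREATE SCHEMA staging;

CREATE TABLE staging.customers (
customer_id INT,
customer_name STRING);

-- The 'location' column in the output should point at
-- dbfs:/user/hive/warehouse/staging.db/customers
DESCRIBE DETAIL staging.customers;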

Why is Databricks any different from the more traditional relational data warehouses e.g. Oracle, SQL Server, Postgres, MySQL?

Databricks is a modern cloud-based data lake (Azure Delta Lake) with a SQL engine (Databricks SQL) over it. Within a notebook, you interact with the delta lake using Spark SQL (or PySpark), which in turn utilises the Hive metastore for table metadata.

Its performance is phenomenal due to:-

i. Its storage being in a data lake, with files partitioned for highly scalable read/write (see the example after this list)

ii. The ability to separately scale out the compute engine.
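For instance, a Delta table can be explicitly partitioned when it is created and its files compacted later (the table and column names below are illustrative):-

CREATE TABLE sales_events (
event_id BIGINT,
event_date DATE,
amount DECIMAL(10,2))
USING DELTA
PARTITIONED BY (event_date);

-- Compact small files and co-locate rows with similar event_id values
OPTIMIZE sales_events ZORDER BY (event_id);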

Unlike traditional data warehouses, you can do practically everything from a web browser, rather than having to download lots of desktop tools.

How does Databricks store table metadata?

Databricks stores table metadata either in the Hive metastore, which is created per workspace, or, more recently, in Unity Catalog (which can store metadata for multiple workspaces).
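Either way, you can inspect that metadata with standard SQL commands. A quick sketch (the schema and table names are illustrative):-

-- With Unity Catalog enabled, you can also run SHOW CATALOGS to list catalogs
SHOW SCHEMAS;

SHOW TABLES IN staging;

-- Shows column definitions plus metastore details such as location, provider and owner
DESCRIBE TABLE EXTENDED staging.customers;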

Are there data modelling tools for Databricks?

Yes. sqldbm.com allows you to physically model metadata securely in a web browser and forward/reverse engineer to/from Databricks, with version control in GitHub. This makes it much easier to work collaboratively with data engineering.

If your company requires that you use more traditional data modelling tools, the best approach I’ve found is to store all of your metadata in an Excel data dictionary. My recommendation is based on the following:-

i) Business users want to see data dictionaries and they’re very comfortable with Excel

ii) Excel is easy to share with users and you can secure access

iii) It’s easy to view and adjust datatypes, primary/foreign keys and column definitions in Excel to ensure that they are standardised.

iv) To produce entity relationship diagrams, it’s fairly easy to then import into legacy data modelling tools such as Erwin or PowerDesigner (they simply require Excel to be in a specific format).

v) To produce initial DDL, it’s relatively straightforward to read the metadata from Excel and generate the DDL

vi) To produce alter-schema DDL, you can create one schema in Databricks with your original schema definitions and another schema with your new schema definitions, then write some Python code to generate the differences (see the SQL sketch after this list for the basic idea). This is quite straightforward to do if you have reasonable Python coding skills. If you don’t, reach out to me and I can assist.

vii) Having a method that is data modelling tool agnostic means that you’re not tied to a particular data modelling tool and can be productive quickly. As a contractor, this is useful.
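As a sketch of the schema-comparison idea in point vi, assuming a Unity Catalog-enabled workspace (which exposes an information_schema per catalog) and two illustrative schemas called schema_v1 and schema_v2 in a catalog called my_catalog:-

-- Columns that exist in the new schema definition but not in the original
SELECT table_name, column_name, data_type
FROM my_catalog.information_schema.columns
WHERE table_schema = 'schema_v2'
EXCEPT
SELECT table_name, column_name, data_type
FROM my_catalog.information_schema.columns
WHERE table_schema = 'schema_v1';

Reversing the EXCEPT gives you the dropped columns, and the combined output can then drive the ALTER TABLE statements you need.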

How do you create DDL in Databricks?

You can run all of your ANSI SQL DDL and CRUD commands from the SQL Editor. You will find “SQL Editor” in the SQL section of your Databricks workspace.

Databricks SQL Editor
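For example, you might run something like the following in the SQL Editor (the schema, table and data are purely illustrative):-

CREATE SCHEMA IF NOT EXISTS reporting;

CREATE TABLE reporting.daily_sales (
sale_date DATE,
region STRING,
amount DECIMAL(12,2));

INSERT INTO reporting.daily_sales VALUES ('2023-10-01', 'EMEA', 1250.00);

SELECT region, SUM(amount) AS total_amount
FROM reporting.daily_sales
GROUP BY region;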

How do you import data into Databricks?

There are a number of methods:-

  1. The simplest way is to upload a CSV file.

i) From your Databricks workspace, click on “Catalog”

ii) Then click on “+ Add” button in top right

iii) Then “Add data”

iv) If you then click on “Create or modify table”, you can simply drag & drop a CSV file and follow the instructions; Databricks will add the file to dbfs:/ and create a table definition for you.

2. As above, but use one of the Native integrations, Fivetran data sources or sources from the Data Marketplace

3. You can also download the Databricks CLI, upload a file into your desired location in dbfs and then create a table on top of it (see the sketch below). This method is more useful if you have lots of files to import.
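As a rough sketch of option 3 (the file path, schema, table and column names are all illustrative, and the exact CLI syntax may differ between CLI versions):-

-- From a local terminal, copy the file into DBFS with the CLI, e.g.
-- databricks fs cp ./trades.csv dbfs:/tmp/landing/trades/trades.csv

-- Then, from the SQL editor, create a table over the uploaded file(s)
CREATE TABLE staging.trades_csv (
trade_id BIGINT,
trade_date DATE,
amount DECIMAL(12,2))
USING CSV
OPTIONS (header = 'true')
LOCATION 'dbfs:/tmp/landing/trades/';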

Does Databricks only allow SQL?

No. You can interact with the Databricks platform using Files/Notebooks written in SQL, Python, Scala or R.

Note: The nice thing about Databricks notebooks is that, unlike an IDE, you don’t need to supply connection information to interact with the data held within the same Databricks cluster as your notebook.

To write a Notebook:-

1. From the Databricks workspace, click on Workspace>Users

2. Click on your username

3. Click on “Add” and then Notebook

4. Then, to the right of the notebook title, there is a dropdown list with “Python” as the default. Click on it to select SQL, Scala or R

Can you schedule jobs in Databricks?

Yes. If you go to any notebook, the “Schedule” button is at the top right. You can also specify an email address to be notified when the job fails to complete successfully. This automatically creates a job in a Databricks workflow. A Databricks workflow is used to orchestrate jobs, which in turn can have multiple tasks and dependencies associated with them. You can view the workflows by going to Workflows from your Databricks workspace; in there, you can also see whether your job runs failed or not.

Can you version control the notebooks?

Yes. Databricks has its own repos, which can be standalone or cloned from a GitHub repo and synced accordingly.

To create a Databricks repo:-

1. From the Databricks workspace, click on Workspace>Repos

2. Click on Add>Folder

3. Click on Add Repo

By default, the expectation is that you will create a Databricks repo by cloning a GitHub repo, but if you untick “Create repo by cloning a Git repository” then you can create a standalone Databricks repo.

What is neat about this is that you can run and schedule code within the Databricks Repo

If you have cloned a GitHub repo, then to push commits back to it or to pull new updates, simply right-click on the repo and choose “Git”.

Is there a simpler way to pipeline data in Databricks than using Notebooks & Workflows?

Databricks has you covered there as well. Delta Live Tables (DLT), despite its name, is really a combination of a table and a pipeline. DLT caters for both batch-loaded and streamed data. You can add data quality constraints to the table definition and decide what happens if a constraint is violated. Similar to dbt, you can have one DLT fed from one or more other DLTs, which allows you to move your raw source data (bronze) to an integrated and cleaned layer (silver) and on to user-facing tables (gold), as sketched below.
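A minimal sketch of that bronze-to-silver-to-gold pattern in Databricks SQL (the table and column names are illustrative, and bronze_trades is assumed to be defined elsewhere in the same pipeline):-

CREATE OR REFRESH LIVE TABLE silver_trades
COMMENT 'Cleaned and typed trades'
AS SELECT trade_id, CAST(trade_date AS DATE) AS trade_date, amount
FROM LIVE.bronze_trades
WHERE trade_id IS NOT NULL;

CREATE OR REFRESH LIVE TABLE gold_daily_trade_totals
COMMENT 'User-facing daily totals'
AS SELECT trade_date, SUM(amount) AS total_amount
FROM LIVE.silver_trades
GROUP BY trade_date;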

Does Databricks have any Data Visualisation capability?

Yes. Within a notebook or the SQL Editor, whenever you create a result set, you can click on the “+” next to the result set and select “Visualisation”. You then get to choose whether you want to display it as a table, bar, area, line, scatter, pie chart etc.

You can also create a dashboard by:-

  1. From your Databricks workspace homepage, select “Dashboards” in the SQL section and then Lakeview dashboards.

You can then choose visualisations to add to the canvas and add datasets based on queries against your Databricks local data using the data tab.

Data and Canvas tabs in a Databricks Lakeview dashboard

How about Security?

Similar to the Metadata section, you can apply fine-grained security to any object (e.g. table, Delta Lake file, dashboard) in your individual workspace, or, if you want to implement consistent security across all workspaces, you can use Unity Catalog. By default, Databricks is set up to talk to Microsoft Entra ID (formerly Azure Active Directory). Other user/group directories can be added easily.
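For example, object-level privileges are granted with standard SQL. A sketch, assuming Unity Catalog (the catalog, schema, table and group names are illustrative, and exact privilege names differ slightly between the Hive metastore and Unity Catalog):-

GRANT USE CATALOG ON CATALOG main TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA main.reporting TO `data_analysts`;
GRANT SELECT ON TABLE main.reporting.daily_sales TO `data_analysts`;

REVOKE SELECT ON TABLE main.reporting.daily_sales FROM `contractors`;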

If you want to mask sensitive data, Databricks provides column masks to prevent unauthorised users from viewing values such as bank account details.

Column masks are expressions written as SQL UDFs. Each table column can optionally have one masking function applied to it. The masking function takes the unmasked value of the column as input and returns the masked value as its result.

The syntax for creating a function and applying it to a table is:-

CREATE FUNCTION <function_name> (<parameter_name> <parameter_type>, …)
RETURN {expression with the same type as the first parameter};

ALTER TABLE <table_name> ALTER COLUMN <col_name> SET MASK <mask_func_name> [USING COLUMNS <additional_columns>];

For example, the following will ensure that only a user that is in the ‘HumanResourceDept’ AD group can see social security numbers, whereas other users will see a masked version:-

CREATE FUNCTION ssn_mask(ssn STRING)
RETURN CASE WHEN is_member('HumanResourceDept') THEN ssn ELSE '***-**-****' END;

-- Create the `users` table and apply the column mask in a single step:

CREATE TABLE users (
name STRING,
ssn STRING MASK ssn_mask);

Databricks also allows you to restrict access to rows within a table using a similar approach.

Row filters allow you to apply a filter to a table so that subsequent queries only return rows for which the filter predicate evaluates to true. A row filter is implemented as a SQL user-defined function (UDF).

To create a row filter, first write a SQL UDF to define the filter policy and then apply it to a table with an ALTER TABLE statement. Alternatively, you can specify a row filter for a table in the initial CREATE TABLE statement. Each table can have only one row filter. A row filter accepts zero or more input parameters where each input parameter binds to one column of the corresponding table.

The syntax for creating a function and applying it to a table is:-

CREATE FUNCTION <function_name> (<parameter_name> <parameter_type>, …)
RETURN {filter clause whose output must be a boolean};

ALTER TABLE <table_name> SET ROW FILTER <function_name> ON (<column_name>, …);

For example, the following will ensure that only a user that is in the ‘admin’ AD group can see all sales information, whereas other users will only be able to see US sales:-

CREATE FUNCTION us_filter(region STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('admin'), true, region = 'US');

CREATE TABLE sales (region STRING, id INT);

ALTER TABLE sales SET ROW FILTER us_filter ON (region);

Ok, but it’s not got anything for Data Quality, right?

Wrong. You can add constraints to Delta Live Tables to achieve this.

The following Databricks SQL snippet provides an example of how a Delta Live Table can be created with DQ rules (expectations) applied to ensure that the data is valid and complete. Expectations are declared as named constraints within the CREATE statement and are evaluated per row, so a uniqueness check needs a different approach (for example, an aggregate check in a downstream table or view):-

CREATE OR REFRESH LIVE TABLE trades (
CONSTRAINT trade_id_is_not_null EXPECT (trade_id IS NOT NULL),
CONSTRAINT trade_date_is_valid EXPECT (trade_date >= '2000-01-01')
)
LOCATION '/mnt/delta/trades'
AS SELECT * FROM bronze.trades;

When a constraint is violated in Delta Live Tables, the system reacts based on the action defined for that constraint:

Expect: If the constraint is violated, the record is logged as a data quality issue, but it is still added to the target table and the pipeline continues to function as normal.

Expect or Drop: If a record violates this constraint, it is not added to the target table, a data quality issue is recorded by DLT, and the pipeline continues to function as normal. The rest of the rows satisfying the constraint will be successfully written to the Databricks Delta table.

Expect or Fail: If a record fails validation, the system throws an InvariantViolationException and the operation fails. Manual intervention is required before re-processing.
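In Databricks SQL, these three behaviours map to an optional ON VIOLATION clause on each expectation. A sketch, reusing the illustrative trades example (the amount column is assumed for illustration):-

CREATE OR REFRESH LIVE TABLE trades_clean (
-- Default behaviour: log the violation but keep the row
CONSTRAINT trade_id_is_not_null EXPECT (trade_id IS NOT NULL),
-- Drop offending rows but keep the pipeline running
CONSTRAINT trade_date_is_valid EXPECT (trade_date >= '2000-01-01') ON VIOLATION DROP ROW,
-- Fail the update entirely if any row violates the constraint
CONSTRAINT amount_is_positive EXPECT (amount > 0) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM LIVE.bronze_trades;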

You can view data quality metrics such as the number of records that violate an expectation by querying the Delta Live Tables event log. This allows you to track and manage data quality issues effectively. Remember, the choice of action depends on your specific data quality requirements and tolerance for invalid data.
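For example, assuming a Unity Catalog-enabled pipeline, the event log can be queried with the event_log table-valued function (the catalog, schema and table names are illustrative; check the Delta Live Tables documentation for the event log access options available in your workspace):-

-- Expectation pass/fail counts are recorded inside the 'details' column of flow_progress events
SELECT timestamp, details
FROM event_log(TABLE(my_catalog.my_schema.trades))
WHERE event_type = 'flow_progress';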

Competitors

At the moment, there is a fierce rivalry between Databricks and Snowflake, which have almost identical functionality. Check out my other article, Snowflake — the nuts & bolts in 2023, to understand more.

A key difference is philosophical: Databricks aims to provide every feature itself, whereas Snowflake encourages a partner tool vendor ecosystem. There are pros & cons to both approaches. What I like about Snowflake’s approach is that you can pick and choose, so you can get ‘best of breed’.

The other emerging competitor is Microsoft Fabric, which is integrating Azure’s data & analytics tools with Power BI.

