Star Schema Data Modeling Best Practices on Databricks SQL

Databricks SQL SME
DBSQL SME Engineering
11 min read · Jul 4, 2024
Output of a well-modeled schema!

Author

Shyam Rao, Sr. Specialist Solutions Architect @ Databricks

Introduction

The Star Schema is a widely used database design in data warehousing where transactional data is extracted, transformed, and loaded into schemas consisting of a central fact table surrounded by conformed dimension tables.

Read on to understand how to optimally model your dimensions and facts using Databricks SQL. DBSQL is a powerful feature of the Databricks Data Intelligence platform designed to enable data analysts, data engineers, and business intelligence professionals to query and visualize data in a highly scalable and performant environment.

As a fun bonus, we will use the Databricks AI assistant at the end to generate an entire data model with these best practices and view the model in Unity Catalog. With AI being embedded into Databricks via AI/BI, Genie, and many of the platform's other capabilities, strong data modeling becomes even more critical to allow AI to understand the context of the data.

The better you model your data upfront, the more easily you can leverage AI on top of it out of the box.

Dimension tables

In the Star Schema architecture, dimensions provide context about the business entities or events being analyzed.

Characteristics of dimension tables

  • Descriptive Attributes: Contains descriptive data.
  • Hierarchical Structure: May include hierarchies for drilling down.
  • Unique Identifier: Typically a Surrogate Key.
  • Denormalized Design: Flattened for performance.
  • Static or Slowly Changing Attributes: SCD1 and SCD2 type dimensions.

Modeling dimension tables in Databricks SQL

Surrogate key:

  • Use GENERATED AS IDENTITY columns or hash values (a hash-based sketch follows this list).
  • Prefer integer surrogate keys over strings for better performance.
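
For the hash-based option, here is a minimal sketch; the staging table stg_patient and its patient_mrn/source_system columns are hypothetical, and xxhash64 returns a BIGINT, so the key stays numeric:

-- Derive a deterministic integer surrogate key by hashing the natural key columns
SELECT
xxhash64(patient_mrn, source_system) AS patient_sk,
*
FROM stg_patient;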

Slowly Changing Dimension (SCD):

  • Delta Live Tables (AWS|Azure|GCP) is well suited for developing declarative ETL for dimensions, supporting both SCD1 and SCD2 types.
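
As a minimal sketch of what this can look like inside a Delta Live Tables pipeline (the change-feed source stg_patient_changes and the change_ts sequencing column are assumptions), APPLY CHANGES INTO handles the SCD bookkeeping for you:

-- Declare the target streaming table, then apply CDC records as an SCD2 dimension
CREATE OR REFRESH STREAMING TABLE g_patient_d;

APPLY CHANGES INTO live.g_patient_d
FROM STREAM(live.stg_patient_changes)
KEYS (patient_mrn)
SEQUENCE BY change_ts
STORED AS SCD TYPE 2;

With SCD TYPE 2, Delta Live Tables maintains the version validity columns (__START_AT/__END_AT) for you.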

Extensibility:

  • Use the MAP data type to store custom or arbitrary key:value pairs, e.g., user-defined attributes such as {"birthPlace": "Northbridge, Massachusetts, US"}.
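
A quick query sketch against the uda column defined in the example table below (the attribute keys are whatever your pipeline loads):

-- Look up a user-defined attribute by key; try_element_at returns NULL when the key is absent
SELECT
patient_sk,
try_element_at(uda, 'birthPlace') AS birth_place
FROM g_patient_d
WHERE try_element_at(uda, 'birthPlace') IS NOT NULL;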

Example dimension table definition:

CREATE TABLE g_patient_d (
patient_sk BIGINT GENERATED ALWAYS AS IDENTITY COMMENT 'Primary Key (ID)',
last_name STRING NOT NULL COMMENT 'Last name of the person',
gender STRUCT<cd:STRING, desc:STRING> COMMENT 'Patient gender',
birth_date TIMESTAMP COMMENT 'Birth date and time',
ethnicity STRUCT<cd:STRING, desc:STRING> COMMENT 'Code for ethnicity',
-- using array when it can map easily to source
languages_spoken ARRAY<STRUCT<cd:STRING, desc:STRING>> COMMENT 'Ordered list of known languages (first = preferred)',
patient_contact ARRAY<STRUCT<contact_info_type:STRING COMMENT 'Contact type (Phone/Email/Fax)', contact_info:STRING COMMENT 'Contact information (email address, phone number)', preferred_flag:BOOLEAN COMMENT 'preferred contact (Y/N)'>> COMMENT 'Contact information',
patient_mrn STRING COMMENT 'Patient medical record number',
other_identifiers MAP<STRING, STRING> COMMENT 'Identifier type (passport number, license number except mrn, ssn) and value',
uda MAP<STRING, STRING> COMMENT 'User Defined Attributes',
source_ref STRING COMMENT 'Unique reference to the source record',
effective_start_date TIMESTAMP COMMENT 'SCD2 effective start date for version',
effective_end_date TIMESTAMP COMMENT 'SCD2 effective end date for version',
g_process_id STRING COMMENT 'Process ID for record inserted',

CONSTRAINT g_patient_d_pk PRIMARY KEY(patient_sk)
)
COMMENT 'Patient dimension'

CLUSTER BY (last_name, gender.cd, birth_date, patient_mrn)

-- Optional helpful table properties
TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days',
'delta.columnMapping.mode' = 'name', -- Allows you to drop/move/rename columns by name
'delta.dataSkippingStatsColumns' = 'patient_sk,last_name,gender', -- Allows you to control the exact columns on which stats are calculated
'delta.enableChangeDataFeed' = 'false' -- Allows you to enable (or disable) change data feed on the table
)
;

Table properties and optimizations:

  • Use managed Delta Lake tables to benefit from new features, better storage practices, and enhanced access control.
  • Use Liquid Clustering (AWS|Azure|GCP) based on frequently filtered attributes (up to 4 clustering keys).
  • Utilize predictive optimization (AWS|Azure) to maximize performance and minimize cost.
  • Define a PRIMARY KEY constraint on the surrogate key.
  • Ensure that column-level statistics (AWS|Azure|GCP) are collected to enable data skipping. Note: ARRAY and MAP data types lack column-level statistics, so avoid using them for frequently filtered columns. You can use the 'delta.dataSkippingStatsColumns' table property (shown above) to specify exactly which columns you want statistics calculated on, avoiding unneeded stats collection and improving performance.

Document tables and columns using COMMENTs and TAGS to bake data governance into the development activities. Tags can be used to consistently characterize certain columns (e.g., a PII tag) or tables (e.g., a gold tag). Tags can also be used to define data access control.
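
For example, a minimal sketch of tagging and commenting the patient dimension from above (the tag names and values are illustrative, not a required convention):

-- Table-level tag and comment
ALTER TABLE g_patient_d SET TAGS ('layer' = 'gold');
COMMENT ON TABLE g_patient_d IS 'Patient dimension (SCD2)';

-- Column-level tag for a sensitive attribute
ALTER TABLE g_patient_d ALTER COLUMN patient_mrn SET TAGS ('pii' = 'true');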

A typical BI query joins a fact table with one or more dimension tables. The query aggregates one or more facts across dimension attributes. The dimensions may be filtered based on attribute values of interest. For example:

SELECT
d.month,
p.category,
SUM(f.sales_amount) as total_sales
FROM sales_fact f
INNER JOIN dim_date d ON f.date_key = d.date_key
INNER JOIN dim_product p ON f.product_key = p.product_key
WHERE d.year = 2023 AND p.category IN ('Electronics', 'Computers')
GROUP BY d.month, p.category
ORDER BY total_sales DESC;

The process of evaluation would be: a) filter the dimension rows b) identify the corresponding dimension keys c) query the fact table with these keys.

Fact tables

Fact tables store quantitative, numeric data representing business transactions or events.

Characteristics of fact tables

  • Granular Data: Captures individual transactional or event-level records.
  • Foreign Keys: References dimension tables.
  • Aggregated Measures: May contain precomputed aggregates.
  • Large Volume of Data: Typically stores large datasets.

Modeling fact tables in Databricks SQL

Measures:

  • Use Numeric data types, avoiding floating-point numbers (use DECIMAL instead).

Example fact table definition:

CREATE TABLE g_claim_tf (
-- dimension FKs
patient_sk BIGINT NOT NULL COMMENT 'Foreign key to patient dimension',
...
facility_sk BIGINT NOT NULL COMMENT 'Foreign key to facility dimension',
-- measures
length_of_stay INT COMMENT 'Patient length of stay',
billed_amount DECIMAL(10, 2) COMMENT 'Claim billed amount',

claim_source_ref INT NOT NULL COMMENT 'Degenerate key for Claim Id from source',
g_insert_date TIMESTAMP COMMENT 'Date record inserted',
g_process_id STRING COMMENT 'Process ID for record inserted',

CONSTRAINT patient_d_fk FOREIGN KEY (patient_sk) REFERENCES g_patient_d(patient_sk),
CONSTRAINT facility_d_fk FOREIGN KEY (facility_sk) REFERENCES g_facility_d(facility_sk)
)
COMMENT 'Claims Transaction Fact'
CLUSTER BY (patient_sk, ..., facility_sk)
;

Table properties and optimizations:

  • Use managed Delta Lake tables.
  • Use Liquid Clustering (AWS|Azure|GCP) based on the dimension foreign keys (for frequently filtered and joined dimensions). This will aid the quick retrieval of the fact records for the specific dimension keys that are being joined.
  • Use predictive optimization (AWS|Azure) for performance and cost efficiency.
  • Define FOREIGN KEY constraints on the dimension key columns for query optimization.
  • Run ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS <dimension keys> to aid query planning and adaptive query execution (an example follows this list).
  • Document tables and columns using COMMENTs and tags. This can supercharge features such as AI/BI: Intelligent Analytics for Real-World Data.
  • Partitioning tends to be beneficial only for very large tables (hundreds of terabytes); in most cases, Liquid Clustering alone is sufficient.
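
As referenced in the ANALYZE TABLE bullet above, a minimal example against the claim fact defined earlier (pick the dimension keys you actually join and filter on):

-- Collect column-level statistics on the most commonly joined dimension keys
ANALYZE TABLE g_claim_tf COMPUTE STATISTICS FOR COLUMNS patient_sk, facility_sk;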

Thinking About Updates and Versioning:

Transactional fact tables are typically not updated or versioned. Delta Lake’s Time Travel feature allows access to historical data, which can be useful for analyzing edge cases within the configured delta log retention period.
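
A quick sketch of Time Travel against the claim fact (the timestamp and version number are illustrative, and both must fall within your delta log retention window):

-- Query the fact table as of a point in time
SELECT * FROM g_claim_tf TIMESTAMP AS OF '2024-06-01';

-- Or as of a specific table version
SELECT * FROM g_claim_tf VERSION AS OF 42;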

Presentation Layer

Views

The Star Schema is often created in the Gold Layer of the Medallion Lakehouse Architecture. Enhance your dashboarding and reporting processes by creating and exposing views on top of dimension and fact tables.

Views can be used to:

  1. Tailor the presentation: Expose only essential attributes (e.g., the primary address).
  2. Simplify attribute representation: ARRAY and STRUCT (as well as VARIANT) types may not be compatible with all BI tools or easy to build analytics on top of.
  3. Create role-playing dimensions: E.g., shipping facility vs. receiving facility (a sketch follows the example view below).
  4. Use consistent formulas: Easily include new calculated measures based on user request.

Example dimension view definition:

CREATE VIEW g_patient_d_v (
patient_sk, last_name,
gender_code COMMENT 'Code for the patient''s gender', gender_desc COMMENT 'Patient gender',
date_of_birth COMMENT 'Date of birth',
spoken_language1 COMMENT 'Preferred language', spoken_language2 COMMENT 'Other spoken language',
preferred_contact_type, preferred_contact, patient_mrn, birth_place COMMENT 'Place of birth',
source_ref, effective_start_date, effective_end_date
)
AS
SELECT
patient_sk,
last_name,
gender.cd AS gender_code,
gender.desc AS gender_desc,
cast(birth_date AS DATE) AS date_of_birth,
try_element_at(languages_spoken, 1).desc AS spoken_language1,
try_element_at(languages_spoken, 2).desc AS spoken_language2,
try_element_at(filter(patient_contact, x -> x.preferred_flag), 1).contact_info_type AS preferred_contact_type,
try_element_at(filter(patient_contact, x -> x.preferred_flag), 1).contact_info AS preferred_contact,
patient_mrn,
try_element_at(uda, 'birthPlace') AS birth_place,
source_ref,
effective_start_date,
effective_end_date
FROM g_patient_d
;
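
For role-playing dimensions (item 3 above), a minimal sketch over the facility dimension referenced by the claim fact (the facility_name column is an assumption about g_facility_d):

-- Shipping and receiving facility views over the same physical dimension
CREATE VIEW shipping_facility_d_v AS
SELECT
facility_sk AS shipping_facility_sk,
facility_name AS shipping_facility_name
FROM g_facility_d;

CREATE VIEW receiving_facility_d_v AS
SELECT
facility_sk AS receiving_facility_sk,
facility_name AS receiving_facility_name
FROM g_facility_d;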

Using Databricks SQL Warehouse

Here are a few extra tips on leveraging the ecosystem around your data model to build a high caliber production system.

Accelerate ELT

Data profiling:

  • Use Lakehouse Federation (AWS|Azure|GCP) to connect to various data sources during the exploratory phase.

Data ingestion:

  1. COPY INTO command — Utilize the COPY INTO command (AWS|Azure|GCP) for batch file ingestion into Delta Lake. This command scalably ingests files with exactly-once semantics, much like streaming tables in simple batch mode. It has many powerful options, such as regex-based file filtering for ingest safety, the ability to start over from scratch, and more (a sketch follows the streaming-table example below).
  2. STREAMING Tables — Create streaming tables using the CREATE STREAMING TABLE command (AWS|Azure|GCP) for incremental data processing. The added benefit of streaming tables is that they can easily be configured for batch or real-time updates, and anything in between. This is real streaming from either files or Kafka/Kinesis/EventHub queues, all in SQL! With streaming tables, you can define the update frequency in the table definition like so:
CREATE STREAMING TABLE firehose_bronze
SCHEDULE CRON '0 0 * * * ? *'
AS SELECT
from_json(raw_data, 'schema_string') data,
* EXCEPT (raw_data)
FROM STREAM firehose_raw;
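
And for the COPY INTO option in item 1, a minimal sketch (the bucket path, target table, and options are illustrative, and the target table is assumed to already exist):

-- Incrementally load only new CSV files matching the pattern into a bronze table
COPY INTO main.bronze.claims_raw
FROM 's3://my-bucket/claims/'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');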

Data quality:

Data Quality Expectations — Expectations allow you to define specific criteria or rules that the data must adhere to. You can leverage a few different types of data quality expectations on tables.

  1. EXPECT constraints on streaming tables — You can apply streaming table constraints that function just like Delta Live Tables Expectations. These are a bit more dynamic and offer different actions to take if data does not pass your rules, such as WARN, DROP, or FAIL. You can define a streaming table expectation like this:
CREATE OR REFRESH STREAMING TABLE bronze_data (
CONSTRAINT date_parsing EXPECT (to_date(dt) >= '2000-01-01') ON VIOLATION FAIL UPDATE
)
AS SELECT *
FROM STREAM read_files('s3://my-bucket/my_data');

  2. CHECK Constraints — These are rules applied to data within any Delta table, not just streaming tables, making them a broader, more generic data quality mechanism. You can define constraints when creating a table, or add CHECK constraints to an existing table to ensure data quality and integrity, like this:

-- Add Constraint to new table
CREATE TABLE persons(first_name STRING NOT NULL,
last_name STRING NOT NULL,
nickname STRING,
CONSTRAINT persons_pk PRIMARY KEY(first_name, last_name));

-- Add constraint to existing table
ALTER TABLE people10m ADD CONSTRAINT dateWithinRange
CHECK (birthDate > '1900-01-01');

When you add a constraint to a table, you can easily see it in the table properties (for example, in the output of DESCRIBE TABLE EXTENDED).

Data transformation:

  1. Streaming Tables + Materialized Views: Create materialized views for precomputing results based on the latest data in source tables. Materialized views provide a simple, declarative way to process data for compliance, corrections, aggregations, or general change data capture (CDC). A sketch follows this list.
  2. Queries and SQL pipelines: This is the more imperative, old-school approach, but it is often required for more complex ETL pipelines. Run SQL Notebooks or dbt tasks on your SQL Warehouse, or use SQL tasks within a Workflows Job to run Databricks SQL queries on your SQL Warehouse. Use features such as temporary views, MERGE, functions, SQL variables, EXECUTE IMMEDIATE, job parameters, sharing information between tasks, running tasks conditionally, etc. The Databricks SQL dialect is growing fast, so be on the lookout for much more functionality and blogs on SQL programmability in the near future!
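
As referenced in item 1, a materialized view sketch over the earlier BI query (the daily midnight schedule is an arbitrary choice):

-- Precompute the sales aggregate and refresh it daily at midnight
CREATE MATERIALIZED VIEW daily_sales_mv
SCHEDULE CRON '0 0 0 * * ? *'
AS
SELECT
d.month,
p.category,
SUM(f.sales_amount) AS total_sales
FROM sales_fact f
INNER JOIN dim_date d ON f.date_key = d.date_key
INNER JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.month, p.category;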

Orchestration:

  • Use Databricks Workflows (AWS|Azure|GCP) for orchestrating your tasks and jobs.
  • You can refresh materialized views and streaming tables on a schedule defined as a part of the table DDL.

BI and Dashboarding:

  • Use a Serverless DBSQL Warehouse for querying the star schema.
  • Benefit from Intelligent Workload Management (Serverless only), Photon Engine, Adaptive Query Execution, and Predictive Optimization.

Applied Modeling Example Generated by the Databricks AI Assistant

Let's take some of these concepts and quickly build an example data model to visualize in Unity Catalog. To build our example, we can simply ask the Databricks AI assistant to generate a data model using some of our best practices with the following input prompt:

Write a SQL script to create a new schema called `main.model_tpch` and 
copy all tables from the `samples.tpch` schema into it.
Ensure that the tables are created only if they do not already exist.
Ensure to create proper primary and foreign keys based on the column name matches across tables.
To avoid creating foreign key relationships to tables before they are created, use the ALTER TABLE ADD CONSTRAINT command after all the table CREATE statements.
Remember that each ALTER TABLE ADD CONSTRAINT command must be a separate command for each constraint, even within the same table.
Optimize the tables by using `CLUSTER BY` on the primary and foreign keys for each table.
After creating the tables, insert data from the source tables into the new tables.

This will pull the sample data that comes with all Databricks deployments and use it to build a well-modeled version in Unity Catalog with all the constraints and CLUSTER KEYs defined for you. The AI assistant then provides us with the following output that we can run and visualize in Unity Catalog:


-- Step 1: Create the new schema
--DROP SCHEMA IF EXISTS main.model_tpch CASCADE;

CREATE SCHEMA IF NOT EXISTS main.model_tpch;

-- Step 2: Generate DDL statements with primary keys and CLUSTER BY
-- Customer table
CREATE TABLE IF NOT EXISTS main.model_tpch.customer (
c_custkey INT PRIMARY KEY,
c_name STRING,
c_address STRING,
c_nationkey INT,
c_phone STRING,
c_acctbal DECIMAL(18, 2),
c_mktsegment STRING,
c_comment STRING
)
CLUSTER BY (c_custkey);

-- Orders table
CREATE TABLE IF NOT EXISTS main.model_tpch.orders (
o_orderkey INT PRIMARY KEY,
o_custkey INT,
o_orderstatus STRING,
o_totalprice DECIMAL(18, 2),
o_orderdate DATE,
o_orderpriority STRING,
o_clerk STRING,
o_shippriority INT,
o_comment STRING
)
CLUSTER BY (o_orderkey, o_custkey);

-- Lineitem table
CREATE TABLE IF NOT EXISTS main.model_tpch.lineitem (
l_orderkey INT,
l_partkey INT,
l_suppkey INT,
l_linenumber INT,
l_quantity DECIMAL(18, 2),
l_extendedprice DECIMAL(18, 2),
l_discount DECIMAL(18, 2),
l_tax DECIMAL(18, 2),
l_returnflag STRING,
l_linestatus STRING,
l_shipdate DATE,
l_commitdate DATE,
l_receiptdate DATE,
l_shipinstruct STRING,
l_shipmode STRING,
l_comment STRING,
PRIMARY KEY (l_orderkey, l_linenumber)
)
CLUSTER BY (l_orderkey, l_linenumber);

-- Part table
CREATE TABLE IF NOT EXISTS main.model_tpch.part (
p_partkey INT PRIMARY KEY,
p_name STRING,
p_mfgr STRING,
p_brand STRING,
p_type STRING,
p_size INT,
p_container STRING,
p_retailprice DECIMAL(18, 2),
p_comment STRING
)
CLUSTER BY (p_partkey);

-- Partsupp table
CREATE TABLE IF NOT EXISTS main.model_tpch.partsupp (
ps_partkey INT,
ps_suppkey INT,
ps_availqty INT,
ps_supplycost DECIMAL(18, 2),
ps_comment STRING,
PRIMARY KEY (ps_partkey, ps_suppkey)
)
CLUSTER BY (ps_partkey, ps_suppkey);

-- Supplier table
CREATE TABLE IF NOT EXISTS main.model_tpch.supplier (
s_suppkey INT PRIMARY KEY,
s_name STRING,
s_address STRING,
s_nationkey INT,
s_phone STRING,
s_acctbal DECIMAL(18, 2),
s_comment STRING
)
CLUSTER BY (s_suppkey);

-- Nation table
CREATE TABLE IF NOT EXISTS main.model_tpch.nation (
n_nationkey INT PRIMARY KEY,
n_name STRING,
n_regionkey INT,
n_comment STRING
)
CLUSTER BY (n_nationkey);

-- Region table
CREATE TABLE IF NOT EXISTS main.model_tpch.region (
r_regionkey INT PRIMARY KEY,
r_name STRING,
r_comment STRING
)
CLUSTER BY (r_regionkey);

-- Step 3: Add foreign key constraints
ALTER TABLE main.model_tpch.customer
ADD CONSTRAINT fk_customer_nation FOREIGN KEY (c_nationkey) REFERENCES main.model_tpch.nation(n_nationkey);

ALTER TABLE main.model_tpch.orders
ADD CONSTRAINT fk_orders_customer FOREIGN KEY (o_custkey) REFERENCES main.model_tpch.customer(c_custkey);

ALTER TABLE main.model_tpch.lineitem
ADD CONSTRAINT fk_lineitem_orders FOREIGN KEY (l_orderkey) REFERENCES main.model_tpch.orders(o_orderkey);

ALTER TABLE main.model_tpch.lineitem
ADD CONSTRAINT fk_lineitem_part FOREIGN KEY (l_partkey) REFERENCES main.model_tpch.part(p_partkey);

ALTER TABLE main.model_tpch.lineitem
ADD CONSTRAINT fk_lineitem_supplier FOREIGN KEY (l_suppkey) REFERENCES main.model_tpch.supplier(s_suppkey);

ALTER TABLE main.model_tpch.partsupp
ADD CONSTRAINT fk_partsupp_part FOREIGN KEY (ps_partkey) REFERENCES main.model_tpch.part(p_partkey);

ALTER TABLE main.model_tpch.partsupp
ADD CONSTRAINT fk_partsupp_supplier FOREIGN KEY (ps_suppkey) REFERENCES main.model_tpch.supplier(s_suppkey);

ALTER TABLE main.model_tpch.supplier
ADD CONSTRAINT fk_supplier_nation FOREIGN KEY (s_nationkey) REFERENCES main.model_tpch.nation(n_nationkey);

ALTER TABLE main.model_tpch.nation
ADD CONSTRAINT fk_nation_region FOREIGN KEY (n_regionkey) REFERENCES main.model_tpch.region(r_regionkey);

-- Step 4: Insert data from source tables into the new tables
INSERT INTO main.model_tpch.customer
SELECT * FROM samples.tpch.customer;

INSERT INTO main.model_tpch.orders
SELECT * FROM samples.tpch.orders;

INSERT INTO main.model_tpch.lineitem
SELECT * FROM samples.tpch.lineitem;

INSERT INTO main.model_tpch.part
SELECT * FROM samples.tpch.part;

INSERT INTO main.model_tpch.partsupp
SELECT * FROM samples.tpch.partsupp;

INSERT INTO main.model_tpch.supplier
SELECT * FROM samples.tpch.supplier;

INSERT INTO main.model_tpch.nation
SELECT * FROM samples.tpch.nation;

INSERT INTO main.model_tpch.region
SELECT * FROM samples.tpch.region;

Now let's look at the Entity Relationship Diagram for our model! To do this, we click into our new schema, select a table, and click the "View Relationships" button:

Output of a well-modeled schema!

Conclusion

By following these guidelines, you can effectively model and optimize your Star Schema in Databricks SQL, leveraging advanced features and best practices to ensure high performance and scalability. This approach allows you to build a robust data warehouse that supports efficient querying and analytics, so Get Started!
