Introduction to Databricks Lakehouse monitoring

Antonio Aliaga
Dec 20, 2023


Databricks Lakehouse Monitoring, currently in preview, stands out as one of the tools organizations can use to add statistics and quality metrics on top of their Unity Catalog tables without investing a huge amount of effort.

This article aims to demonstrate the application of this tool with one example. Additionally, it makes use of dbldatagen to automate the generation of data that can be used to effectively test our pipelines.

Why use Databricks Lakehouse Monitoring?

According to Databricks, it helps you answer questions such as:

  • What does data integrity look like, and how does it change over time? For example, what is the fraction of null or zero values in the current data, and has it increased?
  • What does the statistical distribution of the data look like, and how does it change over time? For example, what is the 90th percentile of a numerical column? Or, what is the distribution of values in a categorical column, and how does it differ from yesterday?
  • Is there drift between the current data and a known baseline, or between successive time windows of the data?
  • What does the statistical distribution or drift of a subset or slice of the data look like?
  • How are ML model inputs and predictions shifting over time?
  • How is model performance trending over time? Is model version A performing better than version B?

There are three types of profiles (Snapshot, Time series, and Inference) depending on your use case; in this post, I’m going to focus on the Snapshot one.

Requirements

The following are required to use Databricks Lakehouse Monitoring:

  • Your workspace must be enabled for Unity Catalog and you must have access to Databricks SQL.
  • Only Delta managed tables, external tables, views, and materialized views are supported for monitoring.

Setting up Environment and Generating Sample Data

First of all, we need to set up our catalog and two schemas that we will use for all our tests:

  • Bronze Schema: Where our tables will be created
  • Quality Schema: The location for the lakehouse monitoring tables.

Create Catalog

catalog_name = 'aliaga_catalog'
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name }")
spark.sql(f"USE CATALOG {catalog_name}")
spark.sql("CREATE SCHEMA IF NOT EXISTS bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS quality")

Generate Data using dbldatagen

When testing new features or trying something out, acquiring data is always the first step. A while ago, I read an article about dbldatagen and thought this was a good opportunity to use it and create a few iterations of messy data to test Lakehouse Monitoring.

We’re using the example that can be found here: Generating Change Data Capture Data — Databricks Labs Data Generator 0.3.5 documentation

Let’s start by defining a function that generates some random values that contain nulls:

import dbldatagen as dg
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, StringType

def generate_data(data_rows=1000, begin=0, partitions_requested=32, seed=-1):
    # percentNulls controls the fraction of null values generated for a column
    dataspec = (
        dg.DataGenerator(spark, rows=data_rows, partitions=partitions_requested, randomSeed=seed)
        .withColumn("customer_id", "long", uniqueValues=data_rows, minValue=begin)
        .withColumn("name", percentNulls=0.10, template=r'\\w \\w|\\w a. \\w', random=True)
        .withColumn("alias", percentNulls=0.10, template=r'\\w \\w|\\w a. \\w', random=True)
        .withColumn("payment_instrument_type", values=['paypal', 'Visa', 'Mastercard',
                    'American Express', 'discover', 'branded visa', 'branded mastercard'],
                    random=True, distribution="normal")
        .withColumn("int_payment_instrument", "int", minValue=0000, maxValue=9999,
                    baseColumn="customer_id", baseColumnType="hash", omit=True)
        .withColumn("payment_instrument",
                    expr="format_number(int_payment_instrument, '**** ****** *####')",
                    baseColumn="int_payment_instrument")
        .withColumn("email", template=r'\\w.\\w@\\w.com|\\w-\\w@\\w', random=True)
        .withColumn("email2", template=r'\\w.\\w@\\w.com', random=True, percentNulls=0.40)
        .withColumn("ip_address", template=r'\\n.\\n.\\n.\\n', random=True)
        .withColumn("md5_payment_instrument",
                    expr="md5(concat(payment_instrument_type, ':', payment_instrument))",
                    base_column=['payment_instrument_type', 'payment_instrument'])
        .withColumn("customer_notes", text=dg.ILText(words=(1, 8)), random=True, percentNulls=0.30)
        .withColumn("created_ts", "timestamp", expr="DATEADD(day, 0, now())")
        .withColumn("modified_ts", "timestamp", expr="DATEADD(day, 0, now())")
        .withColumn("memo", expr="'original data'")
    )
    df = dataspec.build()
    return df

As we can see in the code, including “percentNulls=0.30” will generate null values in roughly 30% of the rows for that column. In future iterations, I will modify that value to a lower or higher number to test the capabilities of the monitoring.
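
Before creating the monitor, we need the table itself with an initial load. A minimal sketch, following the same write pattern we will use later for the new batches (the overwrite mode here is my assumption for the very first load):

# Generate an initial batch and persist it as the Delta table we will monitor
df = generate_data(data_rows=2000)

df.write.saveAsTable(
    name="aliaga_catalog.bronze.sample_data",
    mode='overwrite'  # assumption: first load overwrites; later iterations append
)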

Creating the Monitor

Now that we have a table with data, let’s create a monitor using the Python library instead of the UI, since I prefer an approach that can be automated later if needed.

To create the monitor without using the UI, the docs point to a wheel; hopefully this will change in the future.

%pip install "https://ml-team-public-read.s3.amazonaws.com/wheels/data-monitoring/a4050ef7-b183-47a1-a145-e614628e3146/databricks_lakehouse_monitoring-0.4.4-py3-none-any.whl"
dbutils.library.restartPython()

Once it’s installed, we can use it to create a monitor for a specific table. It will ask for:

  • Table Name: The table to monitor.
  • Profile type: Snapshot, Time series, Inference.
  • Schema: The schema where it will create the tables that contain the metrics.

from databricks import lakehouse_monitoring as lm

lm.create_monitor(
    table_name=f"{catalog_name}.bronze.sample_data",
    profile_type=lm.Snapshot(),
    # schedule=lm.MonitorCronSchedule(
    #     quartz_cron_expression="0 0 12 * * ?",  # schedules a refresh every day at 12 noon
    #     timezone_id="PST",
    # ),
    output_schema_name=f"{catalog_name}.quality"
)
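
We can also fetch the monitor back to verify it was created correctly. get_monitor is part of the same library; as a minimal sketch, I simply print the returned object instead of assuming its exact fields:

# Retrieve the monitor configuration for the table we just set up
info = lm.get_monitor(table_name=f"{catalog_name}.bronze.sample_data")
print(info)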

After executing the previous cell, it will create two tables inside the quality schema and an automatic dashboard with pre-defined visuals:

  • sample_data_profile_metrics: contains summary statistics
  • sample_data_drift_metrics: contains statistics related to the data’s drift over time.

The metric tables and dashboard created by a monitor are owned by the user who created the monitor. You can use Unity Catalog privileges to control access to metric tables.

What is key here are the two tables it creates, since they are the pieces from which we can start customizing our own dashboards or alerts.

Schemas and docs for both tables: Monitor metric tables — Azure Databricks | Microsoft Learn
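
As a starting point for a custom dashboard or alert, we can query the profile metrics table directly. This is only a minimal sketch: the percent_null metric name is an assumption on my side, so check the metric table schemas linked above before relying on it:

# Peek at the null statistics for one of the generated columns
profile_metrics = spark.table(f"{catalog_name}.quality.sample_data_profile_metrics")

(profile_metrics
    .filter("column_name = 'email2'")                  # per-column statistics are keyed by column_name
    .select("window", "column_name", "percent_null")   # percent_null is an assumed metric name
    .orderBy("window")
    .show(truncate=False))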

What can we find in the Dashboard?

There is a lot of information populating the dashboard that comes from the tables previously mentioned, from summary statistics to more specific ones depending on whether the column is numeric or categorical.

We can also define our own metrics: Use custom metrics with Databricks Lakehouse Monitoring — Azure Databricks | Microsoft Learn

Visuals from the Dashboard.

Let’s insert data over a couple of days and see how it changes over time. I have changed the number of nulls in the following iteration to get better results.

import random

df = generate_data(random.randint(2000, 4000), 2000, 1)

# Write the new batch to the Delta table
df.write.saveAsTable(
    name="aliaga_catalog.bronze.sample_data",
    mode='append'
)

Let’s execute a refresh to get the metrics for the newly inserted data.

lm.run_refresh(
    table_name="aliaga_catalog.bronze.sample_data"
)

We can also get the list of all the refreshes.

lm.list_refreshes(
    table_name="aliaga_catalog.bronze.sample_data"
)
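
Since the call returns the refresh history as a list, we can capture the result and loop over it. A minimal sketch, printing each entry whole instead of assuming its exact fields:

# Inspect every refresh run of the monitor
refreshes = lm.list_refreshes(table_name="aliaga_catalog.bronze.sample_data")

for refresh in refreshes:
    print(refresh)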

After a few iterations, this is what the default dashboard looks like.

On the last ingestion I increased the number of nulls to 100%, and we can see in the Integrity over time chart how that’s represented as a spike at the end of the graph.

Conclusion

While my initial impression was that it is mostly useful for data science exploration given the statistical analysis, it can also be really useful in other scenarios to proactively monitor your data and track your data quality without investing much effort in custom development.

It’s still in preview, so what is left is to see where all of this is going, but it will definitely change the way organizations use Databricks and monitor their data.

Introduction to Databricks Lakehouse Monitoring — Azure Databricks | Microsoft Learn

If you need to build something more custom around data quality, I highly recommend the following post from a good friend that goes more in depth on the data quality dimensions:

  • Consistency — Data values should not conflict with other values across data sets.
  • Accuracy — There should be no errors in the data.
  • Validity — The data should conform to a certain format.
  • Completeness — There should be no missing data.
  • Timeliness — The data should be up to date.
  • Uniqueness — There should be no duplicates.

Implementing data quality with Databricks | by Josemanuelgarciagimenez | Dec, 2023 | Medium
