Data Profiling of External Data Source Using Databricks Unity Catalog

A step-by-step guide

Deepti Dubey
Slalom Data & AI
4 min read · Nov 17, 2023


Photo by Andrea Piacquadio via Pexels

By Deepti Dubey and Joshua Garcia

Data profiling is the process of analyzing and examining data to gain a better understanding of its structure, quality, content, and metadata. Investigating data quality helps us understand what we are pulling in and ensure it is formatted and organized appropriately, allowing us to make better decisions. Data structure, data quality, data patterns, data relationships, and statistics are all outcomes of data profiling that help data teams gain insights, and even discover data, before finally loading it into a data store.

Unity Catalog

Data profiling can often be expensive and time-consuming, requiring lengthy queries to perform data-quality checks. Databricks has a great feature called Unity Catalog that provides centralized access control, auditing, lineage, and data-discovery capabilities across Databricks workspaces. The Unity Catalog object model includes the metastore, catalogs, schemas, and tables/views/volumes. This lets teams efficiently manage and collaborate on their data assets.

Foreign Catalogs in Unity Catalog

A foreign catalog is part of the Lakehouse Federation feature announced by Databricks at the 2023 Data + AI Summit. It is a securable object in Unity Catalog that mirrors a database in an external data system, enabling users to run read-only queries against that system from a Databricks workspace while managing access through Unity Catalog. Multiple external sources are available under a foreign catalog, such as Snowflake, MySQL, Azure Synapse, PostgreSQL, Google BigQuery, SQL Server, and Amazon Redshift. This way, you can simply connect your external system and get the benefits of Unity Catalog, such as data discovery, lineage, and data profiling.

Below are the steps for using a foreign catalog to connect to a Snowflake account as an external data source and to perform data profiling.

Prerequisites:

  • Ensure your workspace is enabled for Unity Catalog.
  • There should be network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems.
  • Databricks clusters must use Databricks Runtime 13.1 or above and the shared cluster access mode.
  • SQL warehouses must be pro or serverless.
  • To create a foreign catalog, you must have the CREATE CATALOG permission on the metastore and be either the owner of the connection or have the CREATE FOREIGN CATALOG privilege on the connection.
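If those privileges are not yet in place, a metastore admin can grant them in SQL. A minimal sketch, assuming a hypothetical group `data_engineers` and a connection named `snowflake_conn` (both placeholders):

```sql
-- Allow the group to create catalogs in the metastore
GRANT CREATE CATALOG ON METASTORE TO `data_engineers`;

-- Allow the group to create foreign catalogs on an existing connection
GRANT CREATE FOREIGN CATALOG ON CONNECTION snowflake_conn TO `data_engineers`;
```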

Step 1: Create Connection

A connection specifies a path and credentials for accessing an external database system.
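Connections can be created in the Catalog Explorer UI (shown below) or with SQL. A sketch of the SQL route for Snowflake, where the host, warehouse, and secret-scope names are all placeholders you would replace with your own:

```sql
-- Create a connection to a Snowflake account; credentials are pulled
-- from a Databricks secret scope rather than hard-coded
CREATE CONNECTION IF NOT EXISTS snowflake_conn TYPE snowflake
OPTIONS (
  host '<account-identifier>.snowflakecomputing.com',
  port '443',
  sfWarehouse '<warehouse-name>',
  user secret('<secret-scope>', '<username-key>'),
  password secret('<secret-scope>', '<password-key>')
);
```

Storing the credentials in a secret scope keeps them out of notebooks and version control.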

Create external connection

Step 2: Create a Foreign Catalog

To create a foreign catalog, enter a name in the “Catalog name” box, set Type to Foreign, and select the connection you created in Step 1 from the Connection drop-down for your data source.
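The same step can be done in SQL. A sketch assuming the connection from Step 1 is named `snowflake_conn` and the Snowflake database is `MY_DB` (both placeholders):

```sql
-- Mirror the external Snowflake database as a catalog in Unity Catalog
CREATE FOREIGN CATALOG IF NOT EXISTS snowflake_catalog
USING CONNECTION snowflake_conn
OPTIONS (database 'MY_DB');
```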

Creating a foreign catalog

Features available

By using Unity Catalog for external sources, you can also leverage its built-in features for analysis. You can manage the permissions, look at the history, and find out the lineage and insights of your underlying data.
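Permissions on the mirrored data, for instance, are managed with standard Unity Catalog grants. A sketch using the placeholder catalog from Step 2, a hypothetical `sales` schema, and a hypothetical `analysts` group:

```sql
-- Let analysts browse the foreign catalog and query one schema
GRANT USE CATALOG ON CATALOG snowflake_catalog TO `analysts`;
GRANT USE SCHEMA ON SCHEMA snowflake_catalog.sales TO `analysts`;
GRANT SELECT ON SCHEMA snowflake_catalog.sales TO `analysts`;
```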

Different built-in features of Unity Catalog

Step 3: Generate Data Profiles

Now that the connection and catalog are created, it’s time to generate data profiles. Databricks Unity Catalog helps to generate the data profiles in an efficient way as it takes care of all queries on the back end and gives you all the metadata details in numerical and visual format.

Click “New” in the left sidebar and select “Notebook.” Run a SELECT query on the table you want to profile (“select * from <catalog>.<schema_name>.<table_name>”).
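With the placeholder names from the earlier steps, the query might look like this (the LIMIT is an assumption on my part to keep the profiled result set manageable, not a requirement):

```sql
-- Query the federated Snowflake table through the foreign catalog
SELECT *
FROM snowflake_catalog.sales.orders
LIMIT 10000;
```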

Run select query on Notebook

You will see a “Data Profile” button and a “+” sign.

Access Data Profile

Click on “Data Profile” to generate the numeric and graph representation for profiling.

You can add the outcome to a dashboard or generate the visualization for your data.

Data profiling has always been an important aspect of data engineering, and with Databricks Unity Catalog it is now easier than ever to achieve. You can quickly view a profile at any point in time from any notebook. This way, you do not have to bring data from external sources into Databricks (no need to duplicate the data), and you save the time otherwise needed to write and maintain complicated profiling queries.

Note: Lakehouse Federation is in public preview as of the date of publication.

Slalom is a global consulting firm that helps people and organizations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.
