Lakehouse Federation — Single Pane of Glass for all Your Data

Databricks SQL SME
DBSQL SME Engineering
8 min read · Nov 28, 2023

How to set up and govern federated query engines on DBSQL.

Lakehouse Federation — Source: https://www.databricks.com/blog/introducing-lakehouse-federation-capabilities-unity-catalog

Author: Megan Fogal

Intro:

In today’s data-driven world, organizations often face challenges managing data spread across various systems. Multiple databases, various data warehouses, and disparate storage solutions may all hold different, but crucial information needed for informed decision making. As organizations aim to extract insights from this wealth of data, the complexities of integration and seamless querying become apparent.

Enter query federation — a promising solution within the domain of data integration. In general, query federation is the process of integrating and querying data from multiple data sources as if they were a single, unified source. This is useful when these sources need to be used together for real-time analytics, business intelligence, reporting, and data exploration.

Databricks now has what’s called Lakehouse Federation — a query federation platform that allows you to run queries across multiple external data sources, right within Databricks. This is done by setting up read-only connections to popular database solutions. Now, you may be wondering, why would a platform that stresses a lakehouse architecture with central storage of data introduce a product that allows for many disparate systems? The reality is that most organizations have many data systems in production that are tailor-made for a specific need. However, users ideally want one place to actually use all this data.

To unify governance across all these data sources, Unity Catalog (UC) is used to manage federation. You can use UC’s governance and lineage tools on federated objects the same way you would with other UC objects. This data is within UC under a feature called a foreign catalog, which mirrors the database in the external system it’s connected to.

In this blog, we are going to set up query federation using PostgreSQL and MySQL. We will explore the performance of using federated sources, focusing on improvements driven by the use of materialized views and predicate pushdown features. Ultimately our data flow will look like this:

Author Screenshot: End to End UC Lineage on Federated Sources

Let’s get started!

Setup Overview

Two objects need to be created to use federation:

  1. A connection — specifies the path and credentials for accessing the external database system
  2. A foreign catalog — a Unity Catalog object that mirrors the database in the external database system

Step 1 — Create Connection

Lakehouse Federation currently supports connections to MySQL, PostgreSQL, Redshift, Snowflake, Azure Synapse, Google BigQuery, and Databricks. We'll create a PostgreSQL and a MySQL connection below.

A connection can be created either programmatically or via the Databricks UI. To create one programmatically, run the following query:

CREATE CONNECTION megan_postgresql TYPE postgresql
OPTIONS (
host '<hostname>',
port '<port>',
user '<user>',
password '<password>'
);

If you prefer to use Databricks secrets for your credentials, you can set it up like this:

CREATE CONNECTION megan_postgresql TYPE postgresql
OPTIONS (
host '<hostname>',
port '<port>',
user secret('<secret-scope>','<secret-key-user>'),
password secret('<secret-scope>','<secret-key-password>')
);

Since these connections are securable objects in Unity Catalog, permissions can be granted and revoked on them using the USE CONNECTION privilege. This is what allows all data sources to be governed under one “pane of glass”, making user access more seamless and administration simpler.

GRANT USE CONNECTION ON CONNECTION megan_postgresql TO <user-name>;
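Since the text above notes that permissions can be revoked as well as granted, here is a sketch of the reverse operation plus an audit query (the user name is a placeholder):

```sql
-- Revoke access to the connection when it's no longer needed (hypothetical user):
REVOKE USE CONNECTION ON CONNECTION megan_postgresql FROM `analyst@example.com`;

-- Check who currently holds privileges on the connection:
SHOW GRANTS ON CONNECTION megan_postgresql;
```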

Now let’s make the MySQL connection:

CREATE CONNECTION megan_mysql TYPE mysql
OPTIONS (
host '<hostname>',
port '<port>',
user '<user>',
password '<password>'
);

Step 2 — Create Foreign Catalog

Now that we have the connection to PostgreSQL and MySQL set up, let’s make foreign catalogs for these databases so we can query and manage access to the data.

Again, this can be done programmatically or via the UI. Both are done in an almost identical way to creating a normal UC catalog.

PostgreSQL:

CREATE FOREIGN CATALOG megan_postgres_catalog
USING CONNECTION megan_postgresql
OPTIONS (database 'qf_postgres_demo_database');

Here, I gave my foreign catalog the name megan_postgres_catalog using the megan_postgresql connection I created in the last step. In this example, qf_postgres_demo_database is the name of the database in postgres that I want to mirror in Databricks.

MySQL:

CREATE FOREIGN CATALOG megan_mysql_catalog
USING CONNECTION megan_mysql;

You’ll notice that for the MySQL catalog, we didn’t specify a database like we did for PostgreSQL. This is simply because MySQL uses a two-layer namespace, so it doesn’t require a database name.
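To see how the two namespaces map into Unity Catalog's three-level naming, you can list the schemas inside each foreign catalog (these are sketches using the catalog and table names from this post):

```sql
-- PostgreSQL has a three-layer namespace (database > schema > table),
-- so the mirrored database's schemas appear under the foreign catalog:
SHOW SCHEMAS IN megan_postgres_catalog;

-- MySQL has a two-layer namespace (database > table),
-- so each MySQL database appears as a schema in the foreign catalog:
SHOW SCHEMAS IN megan_mysql_catalog;
SELECT * FROM megan_mysql_catalog.qf_mysql_demo_database.external LIMIT 10;
```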

Now, we will see our foreign catalogs pop up in the Catalog Explorer, just as any other catalog would.

Author Screenshot: Catalog Explorer of Foreign Catalogs

Now that we have these, we can query the data as normal. Note, however, that these queries are read-only; we aren't actually ingesting any data.
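As a quick illustration of the read-only behavior, reads against a foreign table work as usual, while writes are rejected (the exact error message varies by source):

```sql
-- Reads work like any other UC table:
SELECT COUNT(*) FROM megan_postgres_catalog.public.loyalty;

-- Writes fail, because foreign catalogs are read-only:
-- INSERT INTO megan_postgres_catalog.public.loyalty VALUES (...);  -- errors
```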

For example, we could join a table from PostgreSQL with a table from MySQL:

SELECT pg.platform, pg.date, pg.action, pg.session_id, pg.url FROM megan_postgres_catalog.public.loyalty pg
LEFT JOIN megan_mysql_catalog.qf_mysql_demo_database.external ms
ON pg.user_id = ms.user_id

We can now seamlessly utilize these operational database tables directly in conjunction with any Delta Tables in Databricks, unlocking many additional use cases!

Step 3 — Make a materialized view

In a SQL query, federated tables can be referenced the same way as any other UC table. It’s often beneficial to make a materialized view when loading external data to offload access to your external databases and to avoid high/concurrent loads on those databases. These can also be used for cross-source joins and complicated transformations.

Let’s make a materialized view of the join we ran in step 2:

CREATE MATERIALIZED VIEW website_analysis AS
SELECT pg.platform, pg.date, pg.action, pg.session_id, pg.url FROM megan_postgres_catalog.public.loyalty pg
LEFT JOIN megan_mysql_catalog.qf_mysql_demo_database.external ms
ON pg.user_id = ms.user_id

Now, when we query the materialized view, we don't have to hit our PostgreSQL and MySQL databases to get the data, which speeds up results. If changes are made to a source database, we can refresh the materialized view, which goes back to the source databases and pulls the latest data. For more information, check out the Databricks documentation on materialized views.
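The refresh mentioned above is a single statement, using the view from step 3:

```sql
-- Pull the latest data from the source databases into the materialized view:
REFRESH MATERIALIZED VIEW website_analysis;
```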

Performance Recommendations

When it comes to any sort of query federation, performance is a key consideration. The performance you get depends on several factors, including the data source type, query optimization, network, and caching. We'll explore two ways to speed up queries: materialized views and predicate pushdown.

  1. Materialized views

Let’s take a look at the query performance of the join between our two foreign catalogs we performed in step 2. We can see below from the query profile that this query took about 5.7 seconds total on a 2-XS Serverless DBSQL Warehouse. From this we can also see where that time was spent, such as scheduling, running, and optimizing. Check out this blog for a deep dive on the query profile.

Author Screenshot: Query Profile Stats Page
Author Screenshot: Query Profile Tree

Realistically, 5.7 seconds is quite fast. But let’s say we have an SLA that requires this query to be faster, and this joined table is frequently queried by our analysts. This is where we should consider a materialized view.

Now to compare, let’s perform a SELECT * on our materialized view from step 3 that consists of the same join, to see how well that performs.

SELECT * FROM website_analysis
Author Screenshot: Query Profile with Materialized View

We can see this ran in about 1.8 seconds, which is already faster. The materialized view really shines when we query it again: results are returned from cache, avoiding the need to recompute or fetch the same data. Below you can see that when I run the query a second time, it completes in 744 ms. So when our analysts query this frequently, they can expect fast results.

Author Screenshot: Querying Cached Federated Data with MVs

From this we can see that materialized views on federated sources are often beneficial when there are complex transformations and/or you are joining data from multiple external data sources.

  2. Predicate pushdown

The goal of predicate pushdown is to ‘push’ filtering down to the data source, so unnecessary data is never scanned or returned to Databricks in the first place. When filtering happens at the source, queries speed up because the entire data set doesn’t need to be loaded into the DBSQL Warehouse.

In Databricks, all you need to do is add filters when reading from your source system. Lakehouse Federation natively supports predicate pushdown. To learn more about which types of pushdowns are supported for each external source, check out the federation docs.

Back to our example, we’ll create a table that joins our PostgreSQL and MySQL tables and use a filter to specify the platform type we care about. To utilize predicate pushdown, we can add a filter to this table creation for the platform.

CREATE OR REPLACE TABLE ios_analysis AS
SELECT pg.platform, pg.date, pg.action, pg.session_id, pg.url FROM megan_postgres_catalog.public.loyalty pg
LEFT JOIN megan_mysql_catalog.qf_mysql_demo_database.external ms
ON pg.user_id = ms.user_id
WHERE pg.platform = 'ios';

If we look at the query profile, we can see that 273,405 rows were read and the query took about 4.5 seconds.

Author Screenshot: With Predicate Pushdown

Now let’s compare to doing the same thing, but without the platform filter.

CREATE OR REPLACE TABLE all_analysis AS
SELECT pg.platform, pg.date, pg.action, pg.session_id, pg.url FROM megan_postgres_catalog.public.loyalty pg
LEFT JOIN megan_mysql_catalog.qf_mysql_demo_database.external ms
ON pg.user_id = ms.user_id;

In the query profile, this time 352,722 rows were read and the query took closer to 5.5 seconds.

Author Screenshot: No Predicate Pushdown

By specifying a filter in the first query, the platform condition is processed during data retrieval, so less data is transmitted to the engine. Filters that take advantage of predicate pushdown are particularly useful when you’re dealing with a large dataset where not all the data is needed, or when queries are highly selective, which is common in OLTP-based data sources like MySQL. In this case, if a team only cares about records for the ios platform, it makes sense to include the filter up front to improve performance.
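One way to check whether a filter was actually pushed down is to inspect the query plan; the scan node for the external source typically lists the pushed filters. A sketch using the filter from this example:

```sql
-- Inspect the plan; look for the platform filter in the external scan node:
EXPLAIN FORMATTED
SELECT * FROM megan_postgres_catalog.public.loyalty
WHERE platform = 'ios';
```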

Lineage and governance

One of the most powerful features of utilizing Lakehouse Federation is the ability to govern and track lineage across ALL data sources via Databricks. As mentioned before, query federation is governed by Unity Catalog. Now that we’ve built out additional tables and views, let’s take a look at our catalogs. If we look at our PostgreSQL foreign catalog for example, we can see it looks the same as any other catalog, including setting permissions at the catalog, schema, and table level.
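Permission grants on a foreign catalog use the same SQL as any other UC catalog. A sketch at each of the three levels, with a placeholder group name:

```sql
-- Catalog level: let a group discover and use the foreign catalog
GRANT USE CATALOG ON CATALOG megan_postgres_catalog TO `data-analysts`;

-- Schema level: allow use of a specific schema
GRANT USE SCHEMA ON SCHEMA megan_postgres_catalog.public TO `data-analysts`;

-- Table level: allow reads on a single federated table
GRANT SELECT ON TABLE megan_postgres_catalog.public.loyalty TO `data-analysts`;
```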

Author Screenshot: External Catalog Governance in UC

We can also get the lineage diagram to view all upstream and downstream tables:

Author Screenshot: End to End Lineage in UC and External Data Sources

Conclusion

Thank you for reading, I hope you found this article helpful. Be on the lookout for more how-to articles and reach out if there is something specific you would like to see from the DBSQL SME team.
