BI on the Lakehouse

How to Connect Google Looker to Databricks Delta Lake

Frank Munz
Google Cloud - Community
Apr 17, 2021


Databricks, with its lakehouse architecture, is now available on Google Cloud Platform (GCP). That means you can use your data science notebooks, the optimized Apache Spark engine, SQL Analytics, and Delta Lake with its open formats on all major clouds: you can now run your analytics where your data lives.

Databricks also integrates tightly with GCP services. For those with a BI or analytics background, the combination of Google’s Looker and the Databricks lakehouse is particularly interesting. Looker is Google’s cloud-based enterprise BI platform that lets you easily create stunning dashboards.

With Databricks on GCP, you can now directly build Looker dashboards on a lakehouse. Delta Lake is an open, reliable, performant, and secure data storage and management layer for your data lake — for both streaming and batch operations.

With the architecture from this tutorial, no data has to be copied to or from your data lake. A separate data warehouse is not needed, and your lakehouse becomes the single source of truth for your data.

To share my lessons learned while building Looker dashboards on Delta Lake, I created the following how-to video tutorial.

The sections below provide additional resources that support the video tutorial; they are not meant to be a step-by-step guide.

First Things First

Google Looker

When I kicked off this project, I had no prior experience with Looker. This video helped me get started. To improve your Looker experience, I suggest learning the core Looker resources and concepts before you start building a dashboard. Also, make sure you understand how Looker projects work. If videos are not your thing, have a look at the documentation here for an overview (which is where the screenshot below is taken from).

Google Looker Projects, Models, Views, Dimensions, Measures, Explores

Databricks Lakehouse

A lakehouse is a scalable, low-cost architecture that unifies data, analytics, and AI. This post on the Databricks blog is a great way to learn about the lakehouse architecture.

Delta Lake is an open format, transactional storage layer that forms the foundation of a lakehouse. Delta Lake delivers reliability, security, and performance on your data lake — for both streaming and batch operations — and eliminates data silos by providing a single home for structured, semi-structured, and unstructured data.

Lakehouse: evolution from DWH and Data Lake

Get a Looker Account

Looker is not part of the free GCP trial. You can apply for Looker at the Google Marketplace. Note that I am not Google :-) so I cannot help with this part. Once you have your account, make sure you can log in.

Create a Databricks Workspace and Cluster

Creating a workspace is described in detail in the Databricks documentation, which also lists the prerequisites, such as configuring resource quotas and enabling APIs on Google Cloud.

A Databricks workspace is linked to a GCP project. When creating a workspace, you have to provide the Google Cloud project ID, which can be found in the Google Cloud console by selecting the project name under Project Info.

Create Databricks Workspace for BI with Looker on GCP

In the Databricks workspace, create a Databricks cluster.

Create or Use Delta Tables

Our goal is to build a Looker dashboard that accesses your data directly from Delta tables. Delta tables can be stored on Google Cloud Storage as a scalable, highly available, and inexpensive storage layer. Use your existing tables, or follow along with the video tutorial to create your own sample data in a Delta table.

Create a Databricks Delta table for Looker

To create a Delta table like the one used in the video tutorial, run a Scala snippet such as the sketch below in a notebook cell.
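
A minimal sketch follows; the video tutorial uses its own dataset, so the table and column names here are illustrative, and spark is the session that a Databricks notebook predefines:

    // Build a small sample DataFrame; the data is purely illustrative.
    import spark.implicits._

    val readings = Seq(
      ("sensor-01", "1.7", 21.5),
      ("sensor-02", "1.7", 22.1),
      ("sensor-03", "1.8", 19.8)
    ).toDF("device_id", "firmware_version", "temperature")

    // Save it as a managed Delta table registered in the metastore,
    // so Looker can later discover it through the database connection.
    readings.write
      .format("delta")
      .mode("overwrite")
      .saveAsTable("sensor_readings")

    // Quick check: Delta tables are queryable with standard Spark SQL.
    display(spark.sql("SELECT * FROM sensor_readings"))

Instead of a managed table, you can also write to a Google Cloud Storage path with .save("gs://<bucket>/<path>") and create an external table on top of it.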

Create a Spark Database Connection

In the Looker console, create a connection to your Databricks workspace on GCP where your cluster is running. The video tutorial shows how to fill in all the connection settings.

Update: As of version 21.6, Looker supports a native Databricks dialect in the database settings. Use it instead of the Spark 2.x setting that I showed in the hands-on video tutorial.
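
For reference, the connection settings look roughly like this. All values are placeholders, field names can differ slightly between Looker versions, and you should copy the exact server hostname and HTTP path from your cluster’s JDBC/ODBC tab in the Databricks workspace:

    Dialect:    Databricks (Looker 21.6 and later)
    Host:       <your-workspace>.gcp.databricks.com
    Port:       443
    Username:   token
    Password:   <Databricks personal access token>
    HTTP path:  sql/protocolv1/o/<workspace-id>/<cluster-id>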

Create a LookML Model with Views

With the Looker database connection pointing to your Databricks cluster, generate a LookML model of your data. Within the model, tables are represented as views; a minimal sketch follows after the screenshot below.

Looker model based on Delta Lake tables
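
Looker can generate the basic view files from the connection; measures beyond a simple count you typically add yourself. A minimal LookML sketch for the sample table created earlier (connection name and fields are illustrative):

    # Model file, e.g. delta_lakehouse.model.lkml
    connection: "databricks_delta"
    include: "/views/*.view.lkml"

    explore: sensor_readings {}

    # View file, e.g. views/sensor_readings.view.lkml
    view: sensor_readings {
      sql_table_name: default.sensor_readings ;;

      # Dimensions slice and group the data ...
      dimension: device_id {
        type: string
        sql: ${TABLE}.device_id ;;
      }

      dimension: firmware_version {
        type: string
        sql: ${TABLE}.firmware_version ;;
      }

      # ... while measures aggregate over rows.
      measure: average_temperature {
        type: average
        sql: ${TABLE}.temperature ;;
      }
    }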

Create a Looker Dashboard

Be creative :-) Drill into your data. Build your Looker dashboards. My advice is to learn a bit more about Looker before you dive deeper, e.g. the basics of LookML and the difference between dimensions and measures in dashboards.

Looker Dashboard based on Delta Lake

Looker Development vs. Production Mode

So far, you’ve created dashboards in Development Mode, which allows you to make changes to projects without affecting anyone else. This mode accesses a completely separate version of your project files that only you can see and edit.

Everyone using a Looker instance in Production Mode accesses its projects in the same state. Project files are read-only in this mode.

Looker uses Git for versioning. Once you have configured a Git connection, you can commit your changes and push them to production.

Mixed Delta Lake and BigQuery Dashboards in Looker

To add charts based on BigQuery tables to the dashboard, simply configure a BigQuery database connection in the Looker console and create a Looker model with views based on the new connection.

Mixed Looker Dashboard with Delta Lake and BigQuery

A 40-Second Looker Quick Spin

Based on the steps listed above, I created a Looker dashboard when Databricks on GCP was released. The video below captures a quick spin through this demo.

How about YOU?

I am curious to see what you have built. Feel free to share a screenshot of your Looker and Databricks dashboards in the comments section below.


Please connect on Medium and clap for this article if you enjoyed reading it as much as I enjoyed writing it. I spend way too much time on Twitter; follow me for more data science, data engineering, and AI/ML news: @frankmunz.
