Exposing Databricks to External Consumers

Henrik Thomsen
8 min read · Sep 9, 2020

Introduction

So you have built a Delta Lake using Databricks, and your corporate data now resides in a modern, scalable platform. Congratulations!

Next up: how do you share that data with stakeholders outside your organization? For example, external stakeholders may want to connect to your data from a visualization tool such as Redash (or Power BI or Grafana).

Well, setting up a Databricks data connection in Redash is pretty straightforward (see here). But you may not want to expose all data in your corporate data lake to all consumers. This article demonstrates how to expose your data lake while filtering data so that consumers get access on a least-privilege basis.

The methodology presented here is as follows:

  • create a Databricks User Group (for each consumer)
  • enable Table Access Control for your Databricks account
  • create a High-Concurrency cluster. Enable Table Access Control for the cluster. Grant the user group permission to attach to the cluster
  • create an external master database in which you create tables for your data lake files and optionally create views to transform data into a consumer-friendly schema
  • create a database with filtered data for each consumer, and grant the user group access to objects in that database
  • connect to the consumer database from Redash

Creating a Databricks User Group

To manage consumer access to your data lake, set up a Databricks user group for each consumer organization, and add users to it.

To create a user group, go to Admin console | Groups

Click Create Group. Let’s create a group corresponding to an external organization: Ext_HeartCode.

Create Databricks User Group

You can add users to the group at a later stage.

Let’s script the creation of a user group as a way of ensuring that the user group exists. We will need this later when we grant groups permissions to database objects.

%python
import json
import requests

BASE_URL = 'https://%s/api/2.0/' % (dbutils.widgets.get("domain"))

def api(action, body):
    """A helper function to make a Databricks API request; request/response is encoded/decoded as JSON."""
    response = requests.post(
        BASE_URL + action,
        headers={'Authorization': 'Bearer %s' % dbutils.widgets.get("access_token")},
        json=body
    )
    print(BASE_URL + action, body, ':', response.json())
    return

def addGroup(name):
    """A function to create a Databricks user group."""
    b = {
        "group_name": "%s" % name
    }
    api('groups/create', b)
    return

The listing above shows how to create a function called addGroup, which creates a user group using the Databricks REST API.

The listing is written for a Databricks notebook, and the first line instructs the notebook to use the Python language. In this article we will only be using Python and SQL. Those are the only languages supported by high-concurrency clusters with Table Access Control enabled.

The input argument name is the name of the group to be created. The widgets used can be created as shown in the listing below. Widgets are parameters for Databricks notebooks, see here.

%sql
create widget text access_token default '<access token>';
create widget text domain default '<domain>.cloud.databricks.com';

You can get the access token from the admin page under User Settings | Access Tokens | Generate New Token.
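
With the helper functions and widgets in place, creating the group for our example consumer is then a single call (Ext_HeartCode is the example group used throughout this article):

%python
# Create the example consumer user group via the REST API helper defined above
addGroup('Ext_HeartCode')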

Enabling Table Access Control

Databricks does not support row-level access control the way SQL database systems do, which is why we need to create a separate database for every consumer.

Databricks supports table access control (see here), but it is disabled by default. To enable it, go to Admin Console | Access Control.

  • “Cluster, Pool and Jobs Access Control” must be enabled
  • “Table Access Control” must be enabled
Enabling Table Access Control

Creating a High-Concurrency Cluster (with Table Access Control Enabled)

Table Access Control is only supported on high-concurrency clusters. To set up a high-concurrency cluster with Table Access Control enabled, go to Clusters and create a new cluster.

Set Cluster Mode = High Concurrency

Create High-Concurrency cluster

Expand Advanced Options. Check “Enable table access control and only allow Python and SQL commands”. Grant “Can Attach To” access to the consumer user group.

Enabling Table Access Control for a High-Concurrency cluster and granting access to a user group

Creating an External Master Database

In order to expose data from Databricks to an external consumer you must create a database with tables that connect to your data lake files.

Creating a table in Databricks does not involve copying data from Parquet files to some other data store; the Databricks database is a proxy, or a view of the data, if you will. So, just to be clear: data is fetched from its master source, but the Databricks database construct allows the consumer to query the data using SQL and supports access control on database objects.

I suggest not creating a separate master schema for each consumer, but rather one master database with one schema for all consumers. But that depends on your business requirements.

As a minimum, the master database must link tables to data lake files. In addition to that, I suggest exposing the tables through views in order to transform the schema of the data lake files into a consumer-friendly schema. Some of the transformations you may consider are:

  • Data type conversion; as an example, Redash does not understand the boolean data type, so you may want to convert it to integer
  • Flattening; data lake files such as Parquet or JSON files may contain nested data structures. This may not be the friendliest format for your consumers (again, it depends on your business needs), so you may want to flatten the data into a tabular format

Here is a listing of a SQL script to create the master database which you can run from a Databricks notebook:

%sql
drop database if exists ext_master cascade;
create database ext_master;
use ext_master;

create table agentTable using DELTA location '$storage_location/Agent/';
create table lessonTable using DELTA location '$storage_location/Lesson/';

create view lesson as
select
  l.object.id as id,                    -- Flatten fields
  l.agent_id,
  a.org_id,
  l.name,
  cast(overdue as integer) as overdue,  -- Convert to friendly type
  completed_timestamp
from
  lessonTable l
  join agentTable a
    on a.id = l.agent_id

The listing above creates the master database with tables pointing to Delta files and a view that transforms data into a consumer-friendly schema. The variable $storage_location is a notebook widget that can be created like this:

%sql
create widget text storage_location default 's3a://<path to bucket>'
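
If you want to convince yourself that the tables are just metadata on top of the data lake files (and not copies of the data), you can inspect one of them. A minimal check, assuming the ext_master database from the listing above has been created:

%python
# The "Location" row should point straight at the Delta files in the storage bucket,
# confirming that the table is only a reference to the data lake files
display(spark.sql("describe extended ext_master.lessonTable"))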

Creating a Database with Filtered Data for each Consumer

The master database can be exposed to consumers, but they would have access to all data. In order to filter data on a least-privilege basis, we need to create a database for each consumer. The consumer database will contain views that query the views in the master database while filtering data, like this:

%sql
create view lesson as
select *
from ext_master.lesson
where org_id = '56999'

And then we will grant access to the consumer user group:

%sql
grant select on Ext_HeartCode.lesson to Ext_HeartCode

We could create the consumer database with an SQL script, but that will not scale very well with the number of consumers. So, let’s create a function that creates a consumer database. We will use the Python language since the only languages supported by High-Concurrency clusters at the time of writing are Python and SQL.

%python
def createDatabase(db_name, org_id, user_group):
    """Create a consumer database with a filtered view and grant access to a user group."""
    print('Create database %s for org id %s and grant access to %s' % (db_name, org_id, user_group))

    sql = """
    drop database if exists %s cascade
    """ % (db_name)
    print(sql)
    spark.sql(sql)

    sql = """
    create database %s
    """ % (db_name)
    spark.sql(sql)

    sql = """
    use %s
    """ % (db_name)
    spark.sql(sql)

    sql = """
    create view lesson as
    select *
    from ext_master.lesson
    where
      org_id = '%s'
    """ % (org_id)
    spark.sql(sql)

    sql = """
    grant select on %s.lesson to %s
    """ % (db_name, user_group)
    spark.sql(sql)

The function createDatabase takes the following arguments:

  • db_name; name of the database to be created
  • org_id; id of the organization to be used as filter
  • user_group; name of the user group that must be granted access to the objects in the consumer database

Now, let’s execute the function createDatabase for every element of a list of consumers. Let’s also call the function addGroup to ensure that groups are created, otherwise the script will fail.

%python
from pyspark.sql.types import *

consumerSchema = StructType([
    StructField("db_name", StringType()),
    StructField("org_id", StringType()),
    StructField("user_group", StringType())
])

consumers = [
    ['Ext_HeartCode', '56999|prodtest-rqi1stop', 'Ext_HeartCode'],
    ['Ext_Dignity', '56972|prodtest-rqi1stop', 'Ext_Dignity']
]

consumers_df = spark.createDataFrame(consumers, schema=consumerSchema)

# Ensure the user group exists, then (re)create the consumer database for each consumer
for row in consumers_df.collect():
    addGroup(row['user_group'])
    createDatabase(db_name=row['db_name'],
                   org_id=row['org_id'],
                   user_group=row['user_group'])

Disclaimer: Careful with Ownership

It’s important that all scripts are executed by the same Databricks login. All database objects in both the master database and the consumer databases must have the same owner. Otherwise you will get permission errors when the consumers try to access the consumer database objects (even if they are granted access to them). I suggest dropping and recreating all databases in one notebook or job.
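
As a minimal sketch of what that could look like, assuming the cells that define addGroup and createDatabase (and the storage_location widget) live in the same notebook or are pulled in with %run, a single driver cell can rebuild everything in order under one login:

%python
# Rebuild everything under one login so that master and consumer objects share an owner.
# Assumes addGroup and createDatabase are defined earlier in this notebook (or via %run)
# and that the storage_location widget exists.
storage_location = dbutils.widgets.get("storage_location")

# 1. Recreate the master database (same statements as the %sql listing above)
spark.sql("drop database if exists ext_master cascade")
spark.sql("create database ext_master")
spark.sql("create table ext_master.agentTable using DELTA location '%s/Agent/'" % storage_location)
spark.sql("create table ext_master.lessonTable using DELTA location '%s/Lesson/'" % storage_location)
spark.sql("""
  create view ext_master.lesson as
  select l.object.id as id, l.agent_id, a.org_id, l.name,
         cast(overdue as integer) as overdue, completed_timestamp
  from ext_master.lessonTable l
  join ext_master.agentTable a on a.id = l.agent_id
""")

# 2. Recreate every consumer database and grant access to its user group
for db_name, org_id, user_group in [
        ['Ext_HeartCode', '56999|prodtest-rqi1stop', 'Ext_HeartCode'],
        ['Ext_Dignity', '56972|prodtest-rqi1stop', 'Ext_Dignity']]:
    addGroup(user_group)
    createDatabase(db_name=db_name, org_id=org_id, user_group=user_group)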

Setting Up a Databricks Data Connection in Redash

To set up a data connection to the consumer database in Redash, log in to your Redash account and click the settings icon:

Redash settings

Click New Data Source.

Create Data Source in Redash

Select Databricks and click Create.

Data Source settings

To get information needed for the fields Host and HTTP Path, go to Cluster | Advanced Options | JDBC/ODBC.

Cluster JDBC/ODBC settings
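
Purely as an illustration (your values will differ; the workspace and cluster ids below are placeholders), the two fields typically look something like this:

Host:      <your-workspace>.cloud.databricks.com
HTTP Path: sql/protocolv1/o/<workspace-id>/<cluster-id>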

The access token is to be generated by the external user, who must go to User Settings | Access Tokens | Generate New Token.

Create access token for user

Once the data source is created, you can query the filtered data:

Create query in Redash. The dataset should only contain data from a single organization

Conclusion

It wasn’t straightforward to figure this out, but I feel it was worth the effort to be able to share data with external stakeholders by just providing a data connection. The alternatives I considered would be more costly, but could be relevant for some business cases:

  • Expose data through APIs. For that I would probably copy data from the data lake to a SQL database or a document database such as MongoDB, and then build APIs to expose the data
  • Expose data through a storage service (such as AWS S3 or an Azure Storage Account). This would involve creating a storage bucket for every consumer, setting up a job to copy filtered data to each bucket, and managing access to them

I think all languages should be supported by high-concurrency clusters, not just SQL and Python. My team uses Scala for most tasks, and we would prefer to limit the number of platforms and languages we have to learn (but it’s not the end of the world).

Possible enhancements of the presented solution? Not sure, it looks perfect to me! You could of course script the creation of users, the addition of users to groups, and the granting of “Can Attach To” permission on clusters to groups. That is outside the scope of this article, but you can find information about the Databricks REST API here.
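
As a rough sketch of what that scripting could look like, reusing the api helper and widgets from earlier (the endpoints below are the Groups API and the Permissions API as I understand them; the user name and cluster id are hypothetical placeholders):

%python
import requests

def addUserToGroup(user_name, group_name):
    """Add a workspace user to a group via the Groups API."""
    api('groups/add-member', {"user_name": user_name, "parent_name": group_name})

def grantAttachTo(cluster_id, group_name):
    """Grant a group 'Can Attach To' on a cluster via the Permissions API."""
    response = requests.patch(
        BASE_URL + 'permissions/clusters/%s' % cluster_id,
        headers={'Authorization': 'Bearer %s' % dbutils.widgets.get("access_token")},
        json={"access_control_list": [
            {"group_name": group_name, "permission_level": "CAN_ATTACH_TO"}]}
    )
    print(response.json())

# Example usage (hypothetical values):
# addUserToGroup('analyst@heartcode.example', 'Ext_HeartCode')
# grantAttachTo('0909-123456-abcd123', 'Ext_HeartCode')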

Anyway, I hope someone finds this useful, any feedback is appreciated.
