External tables in Azure Databricks with underlying data in Azure Data Lake gen2

Praneeth Harpanahalli
Walmart Global Tech Blog
2 min read · Oct 14, 2020


There are a number of ways to create external tables in Azure Databricks. This blog covers the different approaches, the pros and cons of each, and the scenarios where each is appropriate.

Databricks accepts either SQL syntax or Hive syntax to create external tables. In this blog I will use the SQL syntax to create the tables.

Note: I’m not using the credential passthrough feature.

Method 1. Using Mount Points

The first approach is to use a mount point in the LOCATION field. Mount points are created with a service principal at the required folder level in ADLS Gen2, and authentication happens via the service principal using OAuth 2.0.
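For reference, such a mount point is typically created once with dbutils.fs.mount. Below is a minimal sketch, assuming it runs in a Databricks notebook (where dbutils is predefined); all the <...> values and the container, storage account, and mount names are placeholders:

# Minimal sketch: mount an ADLS Gen2 folder using a service principal + OAuth 2.0.
# All <...> values and the container/account/mount names are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://container_name@storage_account.dfs.core.windows.net/",
    mount_point="/mnt/mount_name",
    extra_configs=configs,
)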

Ex:

CREATE TABLE db_name.table_name(
col_name1 col_type1,
....
)
LOCATION "dbfs:/mnt/mount_name/......"

We don’t need to specify EXTERNAL in the CREATE statement; because a LOCATION is specified, Databricks automatically creates an external table.
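To confirm the table was registered as external, you can inspect its metadata. A quick check in a notebook, reusing the placeholder table name from the example above:

# The 'Type' row of DESCRIBE TABLE EXTENDED should read EXTERNAL, not MANAGED.
spark.sql("DESCRIBE TABLE EXTENDED db_name.table_name") \
    .filter("col_name = 'Type'") \
    .show()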

The advantage of this method is that it’s the easiest way to start accessing and manipulating the data. Say the data owner creates the mount point and it is available across the workspace; business users and data engineers can then use these mount points to work with the table data.

The disadvantage of this method is a high risk of unauthorized access to data: everyone who has access to the workspace can query the table. Say a business user ‘a’ is not supposed to have access to the table (the ADLS Gen2 location) but does have access to the workspace; ‘a’ can still query the table, because authentication happens via the service principal behind the mount point.

Method 2. Using the full ADLS Gen2 path

In this method we use the full abfss path in the LOCATION field. Authentication still happens via a service principal.

Ex:

CREATE TABLE db_name.table_name(
col_name1 col_type1,
...
)
USING data_source
LOCATION "abfss://container_name@storage_account/..."

or

CREATE TABLE db_name.table_name(
col_name1 col_type1,
...
)
USING data_source
OPTIONS(path "abfss://container_name@storage_account/...")

The full CREATE TABLE syntax can be found in the Databricks documentation.

The main advantage of this method is that the table is accessible only to users who know the authorization settings: the Spark configuration that supplies the service principal credentials, which must be set in the same session that queries the table. In the previous method, even without sharing those settings, an unauthorized user could still access the data through the mount’s service principal; here, the user has to supply the credentials in every session.
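For illustration, here is a minimal sketch of those session-scoped settings, using the standard Spark configuration keys for OAuth with a service principal; all <...> values are placeholders:

# Session-scoped service principal credentials. These must be set in the
# same session before the table over the abfss:// path can be queried.
# All <...> values are placeholders.
sa = "<storage-account>.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{sa}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{sa}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{sa}", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{sa}",
               dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{sa}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")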

The disadvantage is that if the service principal details are exposed, anyone who has them can access the tables.

Conclusion

In this blog we’ve covered the ways in which external tables can be created in Azure Databricks. Each has advantages and disadvantages; choose the one that suits your production scenario.
