Accessing Azure Data Lake Storage Gen2 using a secret scope in Databricks
Introduction
Data lakes hold massive amounts of data and serve Big Data workloads well. Databricks, on the other hand, is an Apache Spark-based unified data analytics platform — making big data simple.
Let’s see how we can connect to raw data dumped into a Data Lake using a Databricks secret scope. The file path is given below. This file will be accessed from the Databricks workspace, transformed, and saved back to Azure Data Lake Storage.
https://<storageaccount>.dfs.core.windows.net/<container>/<folder>/HouseData.csv
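Note that the path above is the HTTPS (dfs) endpoint, while Spark reads the same file through the abfss:// scheme used later in this article. As a rough sketch (the helper name `to_abfss_uri` and the sample account/container names are my own, not from Azure), the translation between the two forms looks like this:

```python
def to_abfss_uri(account: str, container: str, path: str) -> str:
    """Build the abfss:// URI Spark expects for an ADLS Gen2 file.

    `account` is the storage account name, `container` the filesystem
    (container) name, and `path` the file path inside the container.
    """
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

# Hypothetical values standing in for the placeholders above:
uri = to_abfss_uri("mystorageaccount", "mycontainer", "myfolder/HouseData.csv")
# → "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/myfolder/HouseData.csv"
```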
What is a secret and a secret scope? A secret is a key-value pair that stores secret material, and a secret scope is a collection of secrets identified by a name. There are two types of secret scope: Azure Key Vault-backed and Databricks-backed.
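Conceptually, a secret scope behaves like a named collection of key-value pairs. The sketch below is only an analogy — the `scopes` dictionary and `get_secret` function are made up, and real secrets live on the Databricks/Key Vault side, never in plain Python — but it mirrors the shape of the `dbutils.secrets.get(scope, key)` lookup used later:

```python
# Toy model: each scope name maps to its own set of secrets (key-value pairs).
scopes = {
    "my-scope": {
        "storage-account-key": "<redacted-access-key>",
    }
}

def get_secret(scope: str, key: str) -> str:
    """Look up a secret by scope name and key, like dbutils.secrets.get."""
    return scopes[scope][key]

get_secret("my-scope", "storage-account-key")
```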
Here we look specifically at the Azure Key Vault-backed scope — the later steps supply the Key Vault’s DNS name and Resource ID. Before creating the secret scope, we must create a secret in Azure Key Vault.
Step 1.
We use the access keys of the storage account to authenticate applications making requests to this Azure storage account, so one of them must be copied.
Go to the Azure Storage account and, in the left pane under Settings, find “Access keys”. Once you click it, you will see two auto-generated keys; copy one of them. It will be used when creating the secret in Azure Key Vault.
Step 2.
Go to Azure Key Vault and, in the resource menu, click Secrets under the Settings category. Then click the + (Generate/Import) button in the command bar; you will be prompted with a creation form.
Give a unique name for Name and paste the access key copied in Step 1 into the Value field. Leave the other fields at their defaults and click Create. You are done creating the secret for accessing the storage account.
Step 3.
Navigate to Properties under the resource menu of the Key Vault, then copy the DNS Name and Resource ID and save them in a notepad. They will be used while creating the secret scope.
Step 4.
Go to https://<your_azure_databricks_url>#secrets/createScope
Ex- https://southeastasia.azuredatabricks.net#secrets/createScope
You will be directed to the scope-creation page.
Give a scope name that uniquely identifies it in the database maintained by Databricks, leave Manage Principal as Creator, and paste the DNS Name and Resource ID values copied in Step 3 into the DNS Name and Resource ID fields. Click Create and you will see a success message. Remember the scope name or save it in a file.
Step 5.
Configure the connection string in a Python notebook in the Databricks workspace.
spark.conf.set("fs.azure.account.key.<storage_account>.dfs.core.windows.net", dbutils.secrets.get(scope = "<scope_name>", key = "<secret_name>"))
Replace storage_account, scope_name, and secret_name with the values we created (the secret name comes from Step 2), then execute it.
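The first argument to spark.conf.set is just a string keyed by your storage account name. If you connect to more than one account, a small helper keeps the pattern in one place — a minimal sketch, assuming a hypothetical function name `account_key_conf` and a made-up account name:

```python
def account_key_conf(storage_account: str) -> str:
    """Return the Spark config key holding an ADLS Gen2 account's access key."""
    return f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"

# Inside a Databricks notebook this would be used as (not runnable locally):
# spark.conf.set(account_key_conf("mystorageaccount"),
#                dbutils.secrets.get(scope="<scope_name>", key="<secret_name>"))
conf_key = account_key_conf("mystorageaccount")
# → "fs.azure.account.key.mystorageaccount.dfs.core.windows.net"
```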
Done. Congrats! You have successfully connected to Azure Data Lake Storage Gen2.
Now, we are going to create a DataFrame using the spark object, apply a simple transformation, and save it back (in a different file format) to Azure Data Lake, which can store any type of file.
df = spark.read.csv("abfss://<containername>@<storage_account>.dfs.core.windows.net/path/to/file", header=True)
df1 = df.limit(10)
df1.write.format('parquet').save("abfss://<containername>@<storage_account>.dfs.core.windows.net/output")
Once it executes successfully, you can see in Azure Storage a new folder ‘output’ created under the container you specified.
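The same read–limit–write shape can be tried locally without a cluster. The sketch below uses Python’s standard csv module and a temporary directory purely to illustrate the transformation (a sample file standing in for HouseData.csv, local paths instead of abfss URIs) — it is not the Databricks API:

```python
import csv
import os
import tempfile

# Create a small sample CSV standing in for HouseData.csv.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "HouseData.csv")
with open(src, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "price"])                    # header row
    writer.writerows([[i, 100 + i] for i in range(25)])  # 25 data rows

# Read with a header and keep only the first 10 rows, like df.limit(10).
with open(src, newline="") as f:
    rows = list(csv.DictReader(f))
first_ten = rows[:10]

# Write the result to an "output" folder, mirroring the write...save step.
out_dir = os.path.join(tmp, "output")
os.makedirs(out_dir, exist_ok=True)
with open(os.path.join(out_dir, "part-0.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "price"])
    writer.writeheader()
    writer.writerows(first_ten)
```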
Conclusion
This section covered a basic direct-access connection to a storage account using a secret scope, which lets us hide authentication details from other users. Creating a secret in Azure Key Vault and creating a secret scope in Databricks are the major steps. We then accessed a file stored in the Data Lake using the Spark connection configuration in Databricks and performed a simple ETL job as well.
That’s it! If you run into any difficulties or need any clarification, feel free to leave a comment. I will try my best to help, or other Medium users might be able to help out!