Generating BI Reports on Encrypted Data using Azure Databricks

Prosenjit Chakraborty
May 7

Data encryption in the cloud is an important consideration for enterprises, especially banking and financial companies.

Although different security features have been implemented at different levels (especially in Azure Databricks), e.g. encryption at rest, encryption in flight, role-based access control, network-level security, etc., enterprises may still want to store sensitive information in the cloud encrypted with a technology of their own choosing.

In this blog, we’ll see how we can achieve this using Azure Databricks, Azure Key Vault and Power BI.

To start with, let’s take an example of customer information where Email and Aadhar are sensitive information.

Email and Aadhar are sensitive information and shouldn’t be stored as plain text in the cloud

Encryption Method

In this example we have chosen Java Advanced Encryption Standard (AES) APIs to encrypt and decrypt. For more details on Java AES, follow this nice article.

We have used the sample encrypt and decrypt methods written by the same author.
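The referenced article’s encrypt and decrypt methods are not reproduced here, but a minimal sketch of what such an AES utility might look like follows (assuming, as in the cited article, a SHA-1-derived 128-bit key, AES/ECB/PKCS5Padding, and Base64 encoding; the class and method names are illustrative, and ECB mode is shown only to keep the example self-contained, not as a production-grade choice):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public class AES {

    // Derive a 128-bit AES key from the secret string: SHA-1 hash, truncated to 16 bytes
    private static SecretKeySpec keySpec(String secret) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-1");
        byte[] key = Arrays.copyOf(sha.digest(secret.getBytes(StandardCharsets.UTF_8)), 16);
        return new SecretKeySpec(key, "AES");
    }

    // Encrypt a plain-text string and return the result Base64-encoded
    public static String encrypt(String strToEncrypt, String secret) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, keySpec(secret));
            return Base64.getEncoder().encodeToString(
                    cipher.doFinal(strToEncrypt.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException("Error while encrypting", e);
        }
    }

    // Decrypt a Base64-encoded cipher text back to the original string
    public static String decrypt(String strToDecrypt, String secret) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, keySpec(secret));
            return new String(cipher.doFinal(Base64.getDecoder().decode(strToDecrypt)),
                    StandardCharsets.UTF_8);
        } catch (Exception e) {
            throw new RuntimeException("Error while decrypting", e);
        }
    }
}
```

With a utility of this shape, `AES.decrypt(AES.encrypt(s, key), key)` round-trips back to `s`, which is the property the rest of this post relies on.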

We have used a secret key = abcdefgh123456789 and the following Java code to encrypt each of the records inside the RED block above.

public static void main(String[] args) {
    final String secretKey = "abcdefgh123456789";
    String originalString = "<input your String>";
    String encryptedString = AES.encrypt(originalString, secretKey);
    String decryptedString = AES.decrypt(encryptedString, secretKey);
    System.out.println(originalString);
    System.out.println(encryptedString);
    System.out.println(decryptedString);
}

After encrypting on-premises, we have records like the below:

AES encrypted with the secret key

Upload Encrypted Data into Cloud

After encrypting the required columns, we have imported the records into an Azure Databricks table (we could store them in Azure SQL Database or SQL Data Warehouse as well).

Store Secret at Azure Key Vault

As we’ll need the secret key to decrypt the records, we have stored it in Azure Key Vault.

Azure Key Vault — Create a secret
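Besides the portal, the secret can also be created from the Azure CLI. A sketch, assuming a Key Vault named my-key-vault already exists (the vault and secret names here are illustrative):

```shell
# Store the AES secret key in Azure Key Vault (vault name is illustrative)
az keyvault secret set \
  --vault-name my-key-vault \
  --name secretKey \
  --value "abcdefgh123456789"
```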

A few simple steps are required to configure a secret scope in Azure Databricks. Follow my previous blog.

Once configured, check it by executing:

Execute in a Databricks notebook
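The check shown in the image above can be a notebook cell like the following (the scope name is illustrative). Note that dbutils.secrets.list returns only the secret names, never the values:

```scala
%scala
// List the configured secret scopes, then the secrets visible in our scope
dbutils.secrets.listScopes()
dbutils.secrets.list("my-key-vault-secrets")
```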

Create a Library for Decryption

Referring again to the same blog on Java AES, we have created a utility JAR (e.g. AES.jar) which contains the decryption method. The method expects two string arguments: the column value we want to decrypt and the secret key.

Create a Custom UDF

We need to create a permanent Hive user-defined function (UDF) which uses AES.jar to decrypt an input encrypted string.

We can find a sample UDF here.

To compile the UDF we also need hive-exec*.jar on the classpath. Get the latest JAR from: https://mvnrepository.com/artifact/org.apache.hive/hive-exec

package com.pro.security.udf

import org.apache.hadoop.hive.ql.exec.UDF // hive-exec*.jar
import com.pro.cryptotest.AES // AES.jar

class DecryptString extends UDF {
  def evaluate(strToDecrypt: String, secretKey: String): String = {
    AES.decrypt(strToDecrypt, secretKey)
  }
}

The UDF expects two parameters: (i) the field to be decrypted, and (ii) the secret key, which we will fetch from the Azure Key Vault we have already configured.

Once the coding is done, compile and export it as a JAR, say DecryptStringv2.jar (ensure that AES.jar and hive-exec*.jar are not packaged inside).

Upload the JARs

Let’s go to Azure Databricks, select or create a cluster, and install the JARs: AES.jar (contains the decrypt utility method) and DecryptStringv2.jar (contains the UDF).

Create a Function

We’ll create a permanent function from the UDF class we have created. Follow this for more information on the syntax.

%sql
CREATE FUNCTION fnDecryptString AS 'com.pro.security.udf.DecryptString';
The function needs to be permanent

Create a View

We’ll create a view which will use the function fnDecryptString to decrypt the encrypted fields.

%scala
val query: String =
  s"""CREATE OR REPLACE VIEW vw_ClientMasterEncrypted
     AS SELECT Name, City,
     fnDecryptString(Email,
       '${dbutils.secrets.get(
          scope = "my-key-vault-secrets",
          key = "secretKey")}') as Email,
     fnDecryptString(Aadhar,
       '${dbutils.secrets.get(
          scope = "my-key-vault-secrets",
          key = "secretKey")}') as Aadhar
     from ClientMasterEncrypted"""
spark.sql(query)

Now, if we select from the view, we can get the encrypted fields decrypted!
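For instance, a query like the following (using the view and columns defined above) should now return the decrypted Email and Aadhar values:

```sql
%sql
-- Decryption happens on the fly via fnDecryptString inside the view
SELECT Name, City, Email, Aadhar
FROM vw_ClientMasterEncrypted;
```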

Connect from Power BI

We can connect to the view from Power BI. We should restrict table/view access to the appropriate BI user only. For further details on access control from the BI tool, please check out my blog.

Points to note

  • The decryption code shouldn’t contain any debug statements, prints, or displays of the sensitive records; otherwise, these will end up in the Spark/Databricks logs.
  • Azure Key Vault access should be restricted to specific groups/users, as the stored secrets can be retrieved from there.
  • Views are executed in memory; however, if the underlying Spark SQL returns huge data, it can spill over to the worker-node disks.
  • We should avoid UDFs unless absolutely necessary, as Spark doesn’t optimize the code inside them.
  • Calling the decryption function over huge record sets may take time, so we should try to generate reports using plain-text columns only (maybe a few reports with limited record sets need column decryption).
  • Azure supports encryption at rest for different services; even column-level encryption is supported for Azure SQL Database. We can check whether those are sufficient for our use cases.

Conclusion

With the approach discussed above, we can avoid storing sensitive information as plain text and instead use views to decrypt dynamically, on the fly, to serve BI users. However, if the decryption involves more complex steps (e.g. calling a URL with encrypted values/objects and receiving decrypted values), this may not be a suitable way to follow.

Thanks for reading. Connect with me on LinkedIn.
