Protecting Sensitive PII Data in Azure SQL Databases

Krishna Golla
8 min read · Apr 14, 2020

There are many business use cases and scenarios where you need to make sure your clients' or customers' sensitive PII (Personally Identifiable Information) data is always protected, whether to prevent external attacks or for regulatory reasons. This becomes of paramount importance when you host your solutions on a cloud data platform. We can protect data from external threats with strict firewall rules, private connectivity, threat detection and so on, but none of that helps if an insider with access to the underlying data store can read and compromise your PII data, which could lead to data exfiltration.

In this article we will look at how to protect this data using client-side encryption techniques, and how to incrementally add sensitive data to your cloud data store without ever exposing your PII data in plain text.

One important distinction between server-side encryption and client-side encryption is that server-side encryption focuses on encrypting the data at the storage layer (at rest). That protects you if someone physically steals the disks and tries to read them, which is not a realistic threat in a cloud infrastructure. In a cloud storage scenario, a DBA who has access to the underlying database can still read the data in plain text even with server-side encryption enabled. Hence, we need client-side encryption alongside server-side encryption to make sure that PII data is protected at all times and is only accessible to clients that have access to the encryption keys and can decrypt the data. Anyone else, including the admins, will only ever see the encrypted data, not the plain text.

This approach has been widely adopted in the health care and financial industries, where PII data needs to be tightly protected. In Azure, most PaaS services already provide server-side encryption to protect data at rest, such as Transparent Data Encryption for Azure SQL Database. For client-side encryption, we can accomplish this natively with an encryption technology called Always Encrypted, available for Azure SQL Database and any SQL Server data store. There are also several data tokenization solutions in the marketplace that can tokenize this data, but they are only effective for small strings and integers. If your data contains large text columns with PII, these tokenization solutions are far less effective and can become a bottleneck, for example call center voice transcripts that may contain sensitive information such as an SSN or a credit card number. That is where Always Encrypted handles the problem effectively out of the box.

There are several examples online that show how to enable Always Encrypted on an existing database. Those steps are well documented, so I will not rehash them here. However, in the real world data is not always born inside the database; it originates in various systems and flows downstream into the database. This article therefore focuses on how to build a pipeline that moves data generated daily in different on-premises systems into an Azure SQL Database (or any SQL form factor on Azure) enabled with Always Encrypted, in a secure way using Azure Key Vault and Azure Data Factory (ADF). There is not much information out there on using ADF to load data into Always Encrypted tables, hence the insights shared here.

In the example below, we will move data sitting in a CSV file on-premises into an Azure SQL Database table that has Always Encrypted enabled. We will use Azure Data Factory, the managed data orchestration service, to move the data, encrypt it on the fly in the cloud, and load it into the target Azure SQL Database.

Sample CSV showing medical claims, with PII data in the last column

Here is a sample CSV file with a large text column, Rawtext, that contains sensitive PII data. This is a medical claims system where claims are filed through different source systems such as mobile, phone, and web. The file is generated every day, and we need to make sure this data is stored encrypted in our cloud data store, so that nobody can access it except the privileged application users who have access to the keys in our Key Vault, authenticated by Azure Active Directory.

Here is the target table in Azure SQL Database where we want to store the claims data in the cloud. Note that the Rawtext column is encrypted with Always Encrypted using a Column Encryption Key.

CREATE TABLE [dbo].[tblClaims](
    [Claimid] [int] NULL,
    [Claimsource] [nvarchar](13) NULL,
    [Rawtext] [nvarchar](max) COLLATE Latin1_General_BIN2
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = [CEK_Auto3],
        ENCRYPTION_TYPE = Deterministic,
        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

This Column Encryption Key is in turn encrypted with a Column Master Key, and the Column Master Key is stored in Azure Key Vault. These are the steps to enable Always Encrypted on the table columns.

CREATE COLUMN ENCRYPTION KEY [CEK_Auto3]
WITH VALUES
(
    COLUMN_MASTER_KEY = [CMK_Auto4],
    ALGORITHM = 'RSA_OAEP',
    ENCRYPTED_VALUE = 0x01A6000001680074007400700073003A002F002F007300720067006F006100650
)
GO

The Column Master Key is stored in Azure Key Vault. At this point, the user running the script below needs access to the Key Vault to create this key and to assign the right access policies, so that the appropriate users can retrieve the key and decrypt the data.

CREATE COLUMN MASTER KEY [CMK_Auto4]
WITH (
    KEY_STORE_PROVIDER_NAME = N'AZURE_KEY_VAULT',
    KEY_PATH = N'https://srgoaekv.vault.azure.net:443/keys/CMKAuto4/6df37e31807947e6861bc1619d3521aa'
)
GO
Azure Key Vault showing Column Master Key
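
As an illustrative sanity check (outside of the pipeline itself), you can confirm that a principal granted the right access policy can actually reach the Column Master Key. The following is a minimal Python sketch using the azure-identity and azure-keyvault-keys packages; the tenant ID, client ID, and secret are placeholders for the service principal created later for ADF, and the key permissions mentioned in the comment are the ones Always Encrypted clients typically need.

# Hedged sketch: verify a service principal can retrieve the Column Master Key
# from Azure Key Vault. For decrypting data, the access policy typically needs
# the get, unwrapKey and verify key permissions (plus sign/wrapKey when
# provisioning new keys).
from azure.identity import ClientSecretCredential
from azure.keyvault.keys import KeyClient

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",                    # placeholder
    client_id="<service-principal-app-id>",     # placeholder
    client_secret="<service-principal-secret>"  # placeholder
)

key_client = KeyClient(vault_url="https://srgoaekv.vault.azure.net", credential=credential)
cmk = key_client.get_key("CMKAuto4")
print(cmk.id)  # should match the KEY_PATH used in the CREATE COLUMN MASTER KEY statement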

Now we can go into Azure Data Factory and build a pipeline to load this data: a Copy Data activity reads the on-premises CSV file through the FileSystem source and writes it to the Always Encrypted table through an ODBC sink. All the magic happens within the ODBC sink connection, where you provide the connection string. To accomplish this we have a few important requirements.

Here is the link to the GitHub repo that contains all the code to deploy this ADF pipeline, along with the SQL scripts.

  1. Self-Hosted Integration Runtime (SHIR) — the compute environment that executes Data Factory workflows. It can access on-premises data stores and can also write to the sink, which is the Always Encrypted table in Azure SQL Database, via the ODBC sink. You can install this runtime on an on-premises VM or on any other VM in Azure/AWS/GCP. Please see the note here
  2. Service Principals — used to access Azure Key Vault for the keys that encrypt/decrypt data while writing to or reading from Azure SQL Database. Steps to create a service principal
  3. ODBC Sink — the only supported sink for writing to an Always Encrypted table; none of the other sinks support Always Encrypted as of this writing.

The most important part of this process is the connection string you put in the ODBC sink settings, and selecting the Self-Hosted Integration Runtime under the Connect via integration runtime option. Note the highlighted settings, which are what enable writes to and reads from an Always Encrypted table.

DRIVER={ODBC Driver 17 for SQL Server};SERVER=gosqleastus.database.windows.net;PORT=1433;DATABASE=srgoae;UID=servicesql;PWD=Password123;ColumnEncryption=Enabled;Trusted_Connection=No;KeyStoreAuthentication=KeyVaultClientSecret;KeyStorePrincipalId=8a176ca9-943c-446f-bf2c-94ecc70bb9a4;KeyStoreSecret=h0kvYr2XvREnFMxb.a8L/GmMy=nk8Kt-

KeyStorePrincipalId and KeyStoreSecret are the application (client) ID and secret of the service principal we created in Azure AD in step 2 for Azure Key Vault authentication, so we can use those credentials here. Note that you do not have to expose passwords in the connection string like this; I am showing the full connection string only for illustration. The recommended practice is to save the connection string as a secret in Azure Key Vault and reference that secret in the ODBC sink settings.
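
To illustrate that recommended practice outside of ADF, here is a minimal Python sketch that pulls the connection string from a Key Vault secret and opens a connection with the same driver. The secret name is hypothetical, and it assumes ODBC Driver 17 for SQL Server plus the azure-identity, azure-keyvault-secrets, and pyodbc packages are installed.

# Hedged sketch: read the full ODBC connection string from a Key Vault secret
# instead of embedding credentials in pipeline settings or code.
import pyodbc
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# "sql-ae-connection-string" is a hypothetical secret holding the string above
secret_client = SecretClient(
    vault_url="https://srgoaekv.vault.azure.net",
    credential=DefaultAzureCredential(),
)
conn_str = secret_client.get_secret("sql-ae-connection-string").value

# ColumnEncryption=Enabled in that string tells the driver to encrypt and
# decrypt Always Encrypted columns transparently on the client side.
conn = pyodbc.connect(conn_str)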

Once all these settings are configured correctly and the right access has been granted on the appropriate resources, you can trigger the ADF pipeline to start loading the data. Because we are using the Self-Hosted Integration Runtime, the data is encrypted by the ODBC driver inside your on-premises environment before it is loaded into the target Azure SQL Database table. In this way, the data is encrypted in transit and at rest at all times once it leaves your on-premises network.
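
Pipelines are typically triggered from the ADF portal or a schedule trigger, but as an illustration a run can also be kicked off and monitored from Python with the azure-mgmt-datafactory package. The subscription, resource group, factory, and pipeline names below are placeholders.

# Hedged sketch: trigger an ADF pipeline run programmatically and poll its status.
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf_client.pipelines.create_run(
    resource_group_name="<resource-group>",   # placeholder
    factory_name="<data-factory-name>",       # placeholder
    pipeline_name="<copy-claims-pipeline>",   # placeholder
)

# Poll until the copy activity finishes
while True:
    status = adf_client.pipeline_runs.get(
        "<resource-group>", "<data-factory-name>", run.run_id
    ).status
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(30)
print(status)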

Once the data load is complete, you can see that the data has been loaded into the Azure SQL Database table.

If a user authenticates and connects to the database normally, without specifying the column encryption setting in the connection, they will only see encrypted data in the Rawtext column, as shown below.

However, if a user provides the column encryption setting in the connection string and also has access to the Key Vault where the key is stored, they will see the decrypted (plain text) data, as shown below.
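
The difference comes down to a single connection string setting. Here is a minimal pyodbc sketch of the two behaviors, assuming ODBC Driver 17 for SQL Server is installed; the server, database, and credential values are placeholders.

# Hedged sketch: the same query returns ciphertext or plain text depending on
# whether ColumnEncryption=Enabled is present in the connection string.
import pyodbc

base = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<database>;"
    "UID=<user>;PWD=<password>;"  # placeholders
)
key_vault = (
    "KeyStoreAuthentication=KeyVaultClientSecret;"
    "KeyStorePrincipalId=<app-id>;KeyStoreSecret=<secret>;"  # placeholders
)

# Without ColumnEncryption: Rawtext comes back as varbinary ciphertext
with pyodbc.connect(base) as conn:
    print(conn.execute("SELECT TOP 1 Rawtext FROM dbo.tblClaims").fetchone())

# With ColumnEncryption=Enabled and Key Vault access: the driver decrypts transparently
with pyodbc.connect(base + "ColumnEncryption=Enabled;" + key_vault) as conn:
    print(conn.execute("SELECT TOP 1 Rawtext FROM dbo.tblClaims").fetchone())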

This applies not just to SQL Server client tools: any client (in this case, Databricks) that uses ODBC or one of the other supported drivers can access and decrypt Always Encrypted data, given the right credentials.

Among Azure's SQL PaaS offerings, Always Encrypted is supported in Azure SQL Database; Synapse SQL does not yet support Always Encrypted as of this writing. Azure SQL Database also has a Hyperscale tier that supports databases of up to 100 TB, so you can use it to store all your encrypted PII data within Azure as a managed service.

In this way, you can securely load incremental data from your on-premises environment and various other sources into an Azure SQL Database, protecting yourself from a range of security and data exfiltration risks.
