Using HKDF in BigQuery via UDFs for Improved Security Hygiene

Theodore Siu
Google Cloud - Community
4 min readJul 14, 2021

Introduction

BigQuery natively supports encryption/decryption methods which allows users to not store sensitive data/columns in plaintext. Additionally with the advent of data protection laws such as GDPR’s “right to be forgotten” clause, these functionalities allow for easy data deletion processes via cryptoshredding.

However with cryptoshredding in BigQuery, there are some challenges for encrypting each user’s data with its own key. Namely:

  1. The encryption key is stored in plaintext json or in bytes. An attacker can get access to all of user’s data by getting access to a single key and decrypting.
  2. Even with key rotations, creating two keys per user and rotating them can be difficult to organize when the number of users in BQ dataset is large.

A solution to this single key problem is the following: Create a shared key per dataset and derive the encryption key using a key derivation function which intakes two keys. In our solution we choose HKDF as our key derivation function and we compute the key using a BigQuery UDF as there is no native BigQuery functionality at this time.

While we wait for a native BigQuery solution for key derivation, we present a short term solution for key derivation in BigQuery using Javascript UDFs. In the key derivation, two separate keys are combined/encrypted in order to create a final key to perform encryption/decryption. The two separate keys can be stored in separate datasets/tables for separate IAM permissions. In this way, an additional layer of protection is added and the main key is not stored in plaintext. See below for a schematic of the described process.

Key derivation can be achieved in BigQuery using a Javascript UDF. Two keys sitting in different tables/datasets can be combined to derive a final encryption key. This adds an additional layer of protection as the encryption key is not stored in plaintext ready to be used.

BigQuery UDF Solution

In order to achieve key derivation using a Javascript UDF we leverage webpack for bundling additional Node dependencies. The method is detailed in this blog and is quite effective. The key derivation library that we test can be found here. Our package.json looks like the following:

{
“name”: “hkdf.js-bigquery”,
“version”: “1.0.0”,
“scripts”: {
“build”: “webpack — config webpack.config.js”
},
“dependencies”: {
“futoin-hkdf”: “^1.3.3”
},
“devDependencies”: {
“webpack”: “^4.41.5”,
“webpack-cli”: “^3.3.10”
}
}

By running npm install it is possible to download the dependencies.

var path = require('path');module.exports = {
entry: './node_modules/futoin-hkdf/hkdf.js',
output: {
path: path.resolve(__dirname, 'dist'),
filename: 'hkdf.js',
library: 'hkdf',
libraryTarget: 'var',
},
mode: 'production',
};

We then run npm run-script build to generate hkdf.js which can then be uploaded into a GCS bucket to be invoked in BigQuery.

Its now possible to create the key derivation UDF similar to the following below:

CREATE or replace FUNCTION theo_test.DERIVE_KEY(masterKey STRING, domainKey STRING) RETURNS STRINGLANGUAGE jsOPTIONS (library=["gs://theo-gcs/hkdf.js"])ASr"""const length = 32; //Needs to be 32 charactersconst info = 'optional-context';const hash = 'SHA-256';const encoding = 'base64';return hkdf(masterKey, length, {'salt': domainKey, 'info': info, 'hash': hash}).toString(encoding);""";

Additionally, a UDF needs to be written to create a json so that the output of the key derivation can be used as a valid AesGcmKey Tink key.

CREATE or replace FUNCTION theo_test.TO_TINK_no_json(keyId NUMERIC, key STRING ) RETURNS STRING AS (CONCAT("{\"key\": [{\"keyData\": {\"keyMaterialType\": \"SYMMETRIC\", \"typeUrl\": \"type.googleapis.com/google.crypto.tink.AesGcmKey\",\"value\": \"",key,"\"}, \"keyId\": ",keyId,",\"outputPrefixType\": \"RAW\", \"status\": \"ENABLED\"}],\"primaryKeyId\": ",keyId,"}"));

Note that the length of the output key needs to be 32 characters as the AesGcmKey in BigQuery is recognized as a 34 byte key. In order to make the key usable as a Tink key in BigQuery we need to prepend [26,32] in a bytearray to the output of the Derive Key function. Details on making a valid Tink key can be found on this Stackoverflow question. Now it possible to encrypt and decrypt values back and forth with key derivation!

Encryption:

select AEAD.ENCRYPT(KEYS.KEYSET_FROM_JSON(theo_test.TO_TINK_no_json(100,to_base64((SELECT concat(CODE_POINTS_TO_BYTES([26,32]), from_base64(theo_test.DERIVE_KEY('my_key1','key_salt1'))))))),from_base64("test"), cast("foo" as Bytes));

Decryption:

Paste your encryption result into the field Paste your encrypted value here!!!. Also note that the encrypted value is permuted so it will not always deterministically be the same!:

select AEAD.DECRYPT_BYTES(KEYS.KEYSET_FROM_JSON(theo_test.TO_TINK_no_json(100,to_base64((SELECT concat(CODE_POINTS_TO_BYTES([26,32]), from_base64(theo_test.DERIVE_KEY('my_key1','key_salt1'))))))),from_base64("Paste your encrypted value here!!!"), cast("foo" as Bytes));

Final Thoughts

We hope that this solution will help customers in the short term. While the UDF solution works for key derivation, we await a native BigQuery optimized solution as there are some drawbacks to using a third party library with UDFs. Namely:

  • Using a third party open source library is subject to community support and also audits may be required to check the robustness and security of the library. Additionally, figuring out how to use the javascript library with other products such as Dataflow may require more overhead.
  • Prepending bytes to key derivation results is hacky and may otherwise not be forwards compatible especially for future versions of Tink and BigQuery encryption.
  • UDFs are quite compute intensive. We suggest clustering by domain key to see if this will help with scaling and computing lots of key derivations.

Stay tuned for more improvements to BigQuery encryption!

Special thanks to Parviz Deyhim for the review and being an awesome resource on all things related to cryptoshredding in BQ!

--

--