Protecting your lakehouse data with format-preserving encryption (FPE)

Published in Databricks Platform SME, May 27, 2023 (5 min read)

Your data is one of your most valuable assets and always needs to be protected.

We’ve had Generation X, Y, and Z, and now we’re well into Generation Alpha. So what’s next? Generation AI. Whether you’re a humanist author or an AI researcher, one thing everyone agrees on is that the people, companies, and societies that embrace AI will outpace those that do not. But AI needs data, and data needs to be protected, now more than ever. In this blog, I’ll walk you through one method you can use to protect your lakehouse data: format-preserving encryption, or FPE.

What is format-preserving encryption (FPE)?

Format-preserving encryption is a cryptographic technique that encrypts data such that the output (the ciphertext) preserves the format of the input (the plaintext). The meaning of “format” can vary, but typically it means a fixed length and a finite character set, such as numeric, alphabetic, alphanumeric or ASCII.

FPE can be useful when you want to de-identify data but retain its original structure for downstream processing. An example might be graph analytics: you want to build a graph of entities in which each real-world value always maps to the same node, and you also want the end user to be able to easily identify the types of entities (for example IP addresses, SSNs or phone numbers). The end user doesn’t need to identify real-world people, just patterns in the graph, before passing on an action or referral to a different team for consideration and potentially re-identification.

As an example of what FPE looks like, a plaintext SSN of 055-46-6168 might be encrypted as a ciphertext like 569-83-4469, and an IP address of 76.217.83.75 might be encrypted as a ciphertext like 97.381.64.35.
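To make "preserves the format" concrete, here is a minimal sketch that compares the two plaintext/ciphertext pairs above character by character. The helper names (`format_mask`, `same_format`, `is_valid_ipv4`) are hypothetical and not part of any FPE library; they only check the shape of a string and perform no encryption.

```python
def format_mask(value: str) -> str:
    """Replace digits with 9 and letters with A/a; keep every other character."""
    mask = []
    for ch in value:
        if ch.isdigit():
            mask.append("9")
        elif ch.isalpha():
            mask.append("A" if ch.isupper() else "a")
        else:
            mask.append(ch)  # separators such as "-" and "." stay as they are
    return "".join(mask)


def same_format(plaintext: str, ciphertext: str) -> bool:
    """True when both strings have the same length and character classes."""
    return format_mask(plaintext) == format_mask(ciphertext)


def is_valid_ipv4(value: str) -> bool:
    """True when the string is four dot-separated integers in 0..255."""
    parts = value.split(".")
    return len(parts) == 4 and all(p.isdigit() and int(p) <= 255 for p in parts)


print(format_mask("055-46-6168"))                   # 999-99-9999
print(same_format("055-46-6168", "569-83-4469"))    # True
print(same_format("76.217.83.75", "97.381.64.35"))  # True
print(is_valid_ipv4("97.381.64.35"))                # False
```

The last line shows a limit of character-level format preservation: the ciphertext still looks like an IP address, but 381 is not a valid octet, so downstream code that strictly validates IPv4 addresses may reject it.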

FPE is considered to be a pseudonymisation technique rather than an anonymisation technique, because anyone with access to the encryption key and tweak can always reverse the ciphertext and recover the original plaintext.

Format-preserving encryption for your lakehouse

You can access the notebook containing the examples below by cloning the repo here.

Step 1: Generate some fake PII data using faker
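If you want to follow along without installing faker, the sketch below is a standard-library stand-in that produces similarly shaped values. It is not the notebook's code and does not use faker at all: the function names and the three columns (`ssn`, `ipv4`, `phone`) are assumptions chosen to match the examples in this post.

```python
import random


def fake_ssn(rng: random.Random) -> str:
    """Random string shaped like an SSN (AAA-GG-SSSS); not a real SSN."""
    return f"{rng.randint(1, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


def fake_ipv4(rng: random.Random) -> str:
    """Random dotted-quad IPv4 address."""
    return ".".join(str(rng.randint(0, 255)) for _ in range(4))


def fake_phone(rng: random.Random) -> str:
    """Random string shaped like a US phone number (NXX-NXX-XXXX)."""
    return f"{rng.randint(200, 999)}-{rng.randint(200, 999)}-{rng.randint(0, 9999):04d}"


def fake_rows(n: int, seed: int = 0) -> list:
    """Build n rows of fake PII; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    return [
        {"ssn": fake_ssn(rng), "ipv4": fake_ipv4(rng), "phone": fake_phone(rng)}
        for _ in range(n)
    ]


for row in fake_rows(3):
    print(row)
```

A list of dictionaries like this can be turned into a DataFrame in the usual way once you are in a notebook.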

Figure: Fake PII data generated by faker

Step 2: Set up the encryption key and tweak

The first thing you’re going to need is a key and a tweak to encrypt your data. You can use the code at this step to generate a random 256-bit key and 7-byte tweak. I would recommend storing these values as secrets, and then retrieving them at encryption / decryption time by calling the Databricks secrets APIs.
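As a minimal sketch of this step, the snippet below generates a random 256-bit key and 7-byte tweak with Python's standard `secrets` module. Returning them as hex strings is an assumption based on how the FPE library used later accepts its key and tweak; the function name, and the scope and key names in the comment, are hypothetical.

```python
import secrets

KEY_BYTES = 32   # 32 bytes = 256 bits
TWEAK_BYTES = 7  # 7 bytes = 56 bits


def generate_key_and_tweak() -> tuple:
    """Return a (key, tweak) pair as hex strings from a cryptographically secure source."""
    key = secrets.token_hex(KEY_BYTES)      # 64 hex characters
    tweak = secrets.token_hex(TWEAK_BYTES)  # 14 hex characters
    return key, tweak


key, tweak = generate_key_and_tweak()

# Avoid printing or logging the values themselves; only the lengths are shown here.
print(len(key), len(tweak))  # 64 14

# On Databricks, after storing both values in a secret scope, they could be
# read back at encryption / decryption time along these lines (names are hypothetical):
#   key = dbutils.secrets.get(scope="fpe", key="fpe_key")
#   tweak = dbutils.secrets.get(scope="fpe", key="fpe_tweak")
```

Generate the pair once and reuse it: data encrypted with one key and tweak can only be decrypted with the same pair.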

Step 3: Define the character sets and the expected behaviour for special characters

As well as defining the character sets to use for our encryption, we also need to define the behaviour we want to see when we encounter special characters in the plaintext. This is important if we want the output ciphertext to retain the same structure as the input plaintext. In this example there are 2 options:

tokenize: Tokenize the whole string, including special characters, with an ASCII charset. Note that this won’t preserve the format but will tokenize the data so that it is pseudonymised in a reversible way.
reassemble: Try to preserve the format of the input string by removing the special characters, tokenizing the alphanumeric characters and then reassembling them both afterwards. With this method a plaintext SSN of 210-42-9398 will be encrypted as a ciphertext like 816-58-5332 once FPE has been applied.
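The two options can be sketched as follows. To keep the example runnable without any dependencies, `toy_cipher` is a hypothetical stand-in that simply rotates characters within a charset; it is not FPE and is not secure, and in practice the real encrypt / decrypt call would go in its place. The function names and the exact charsets are assumptions for illustration.

```python
import string

DIGITS = string.digits
ASCII = string.digits + string.ascii_letters + string.punctuation + " "


def toy_cipher(text: str, shift: int, charset: str) -> str:
    """Stand-in for a real FPE call: rotate each character within charset. NOT secure."""
    size = len(charset)
    return "".join(charset[(charset.index(ch) + shift) % size] for ch in text)


def tokenize(value: str, shift: int) -> str:
    """Option 1: transform the whole string, special characters included."""
    return toy_cipher(value, shift, ASCII)


def reassemble(value: str, shift: int, charset: str = DIGITS) -> str:
    """Option 2: transform only the alphanumeric characters, then put the
    special characters back in their original positions."""
    core = "".join(ch for ch in value if ch.isalnum())
    transformed = iter(toy_cipher(core, shift, charset))
    return "".join(next(transformed) if ch.isalnum() else ch for ch in value)


ssn = "210-42-9398"

preserved = reassemble(ssn, 7)
print(preserved)                        # hyphens stay in positions 3 and 6
print(reassemble(preserved, -7) == ssn)  # True

scrambled = tokenize(ssn, 7)
print(scrambled)                        # hyphens are transformed too
print(tokenize(scrambled, -7) == ssn)   # True
```

Both options are reversible, but only reassemble keeps the separators where they were, which is what lets the ciphertext still look like an SSN.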

Step 4: Declare some helper functions and our Pandas UDFs

In this example we’re using python-fpe (published on PyPI as ff3) for the encryption and decryption, and wrapping this code in a pandas UDF.

Step 5: Encrypt the data with FPE

Now that we’ve done the hard work, this is as simple as applying the UDF to each column you want to encrypt.

Step 6: Decrypt the data with FPE

Again, simply apply the UDF to the columns you want to decrypt.

Additional considerations

For completeness, it’s worth noting the following considerations:

FPE cannot offer meaningful protection for very short values. Think about it: for the encryption to be reversible, the same plaintext must always produce the same ciphertext under a given key and tweak. If the set of possible values is small, anyone who can get enough known values encrypted can build a lookup table and start to reverse the rest. This is why, if you try to encrypt values that are too short, you may start to see errors like: Message length x is not within min 5 bounds. For most types of PII this isn’t a huge problem, but it’s worth bearing in mind (you won’t be able to FPE encrypt someone’s age, for example). Similarly, I wouldn’t recommend this approach for long strings. For these kinds of columns it’s much better to just AES encrypt the entire thing.
FPE only works with strings. If the data you want to encrypt is not a string, you’ll need to cast it to one in order to encrypt it. You can always cast the output value back to its original type afterwards.
As with most security-enforcing measures, there will be a performance impact. Even with vectorised UDFs, performing FPE operations is never going to be as fast or optimised as the many built-in functions supported by Apache Spark. As ever, therefore, FPE introduces a trade-off between security and speed. It’s up to you to tweak the dial towards the outcome you value most!
As with any type of encryption, FPE is only ever as secure as your key. As such, I would recommend the following best practices to help protect it:

a) Always store the encryption key and tweak as secrets
b) Only provide appropriately permissioned users (potentially even just service principals) access to the secret scope
c) Consider performing the de-identification / re-identification in a completely isolated environment from downstream processing
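To put numbers on the minimum-length consideration above, here is a small sketch. It assumes the rule, as I understand it from the FF3-1 guidance and the kind of bounds error quoted above, that the number of possible inputs (charset size raised to the message length) must be at least one million. The threshold, the function name and the charset sizes are assumptions for illustration, so check the limits of the library you actually use.

```python
DOMAIN_MIN = 1_000_000  # assumed minimum number of possible plaintexts


def min_length(radix: int, domain_min: int = DOMAIN_MIN) -> int:
    """Smallest message length whose domain (radix ** length) reaches domain_min."""
    if radix < 2:
        raise ValueError("radix must be at least 2")
    length = 1
    while radix ** length < domain_min:
        length += 1
    return length


# Charset sizes: 10 digits, 26 lower-case letters, 62 alphanumeric characters.
for name, radix in [("numeric", 10), ("alphabetic", 26), ("alphanumeric", 62)]:
    print(name, min_length(radix))
# numeric 6
# alphabetic 5
# alphanumeric 4

# A two-digit age has only 10 ** 2 possible values, far below the threshold.
print(10 ** 2)  # 100
```

With only 100 possible values, an attacker who sees a handful of known ages encrypted can map out most of the column, which is why short values like this are rejected rather than encrypted.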

Wrapping Up

Second only to people, data is your most valuable asset and always needs to be protected. FPE is a method that you can use to protect the identity of your data subjects whilst maintaining the structure of the data. Stay tuned for more blogs on best practices for handling PII data on your lakehouse, coming soon!