Have you ever had to handle PII (personally identifiable information) while building your data pipelines? Then this article is for you.
Column-level security protects sensitive data stored in a database by restricting access to specific columns or fields in a table, ensuring that only authorized users can view or modify them. It matters because it adds a layer of protection beyond standard database-level measures, helps satisfy regulatory requirements, and reduces the risk of data breaches, loss, or theft.
Understanding the Basics of Column-Level Security
Column-level security is a data security technique that protects sensitive information within a database at a finer granularity. It allows organizations to control access to specific columns or fields of a table by defining policies that limit access to authorized users or groups. It can be implemented through several methods, including encryption, access controls, and masking:
- Encryption ensures that only users holding the decryption key can read the data.
- Access controls let organizations define who can access specific columns based on user roles or other criteria.
- Masking replaces sensitive data with non-sensitive placeholders, such as asterisks or random characters, to prevent unauthorized exposure (see the sketch right after this list).
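For a concrete feel of the masking approach, here is a minimal PySpark sketch; it assumes the 'customers' table and 'email' column we create later in this article, and simply blanks out everything before the '@':
from pyspark.sql.functions import regexp_replace, col
# Replace the local part of each email with asterisks, keeping the domain
df = spark.table("customers")
masked = df.withColumn("email_masked", regexp_replace(col("email"), r"^[^@]+", "****"))
display(masked)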
Now that you’re familiar with the basics of column-level security, we will apply it in Databricks using the encryption technique, with the Fernet module from Python’s cryptography library.
Fernet guarantees that a message encrypted using it cannot be manipulated or read without the key.
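To see that guarantee in action, here is a standalone sketch (plain Python, independent of the Databricks demo that follows):
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()              # 32 random bytes, base64-encoded
f = Fernet(key)
token = f.encrypt(b"john@example.com")   # token embeds a random IV and a timestamp
assert f.decrypt(token) == b"john@example.com"   # round trip with the right key

# Decrypting with any other key fails the integrity check
try:
    Fernet(Fernet.generate_key()).decrypt(token)
except InvalidToken:
    print("token rejected without the correct key")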
Problem Statement
A new regulation came into effect last week, and as part of security compliance your manager has asked you to treat it as the highest priority. The task: encrypt all PII data by default as it enters the OLAP ecosystem.
For the demo, we will create a ‘customers’ table with PII data.
%sql
use default; -- Change this value to some other database if you do not want to use the Databricks default
drop table if exists customers;
CREATE TABLE customers (
  first_name STRING,
  last_name STRING,
  email STRING, -- PII column
  phone_number STRING -- PII column
)
USING delta;
Let’s insert a few records into the ‘customers’ table.
%sql
INSERT INTO customers (first_name, last_name, email, phone_number)
VALUES
  ('John', 'Doe', 'johndoe@example.com', '555-1234'),
  ('Jane', 'Smith', 'janesmith@example.com', '555-5678'),
  ('Bob', 'Johnson', 'bobjohnson@example.com', '555-9012');
It is important to note that Fernet requires a 32-byte (256-bit), base64-encoded master key for the encryption. You can use the following snippet to generate the key; it also works in a Databricks notebook cell via the %sh magic.
%sh
openssl rand -base64 32
(output) ~ ZTcOZJpyTf+39T6T8U6wJEr6O+O6mpICx+jCnExvc6A=
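If you would rather stay in Python, the cryptography library can generate an equivalent key. And for anything beyond a demo, avoid hard-coding the key in the notebook; a Databricks secret scope is the usual home for it (the scope and key names below are hypothetical placeholders):
from cryptography.fernet import Fernet

# Equivalent to the openssl command: 32 random bytes, base64-encoded
pii_key = Fernet.generate_key().decode()

# Production sketch: read the key from a secret scope instead
# ('pii-scope' and 'fernet-key' are placeholder names)
# pii_key = dbutils.secrets.get(scope="pii-scope", key="fernet-key")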
Now, we are ready to play with cryptography. First, let’s define functions for encryption and decryption using Fernet.
# Define encrypt user-defined function
def encrypt_val(clear_text, MASTER_KEY):
    from cryptography.fernet import Fernet
    f = Fernet(MASTER_KEY)
    encrypted_value = f.encrypt(bytes(clear_text, encoding='utf8'))
    return encrypted_value.decode()

# Define decrypt user-defined function
def decrypt_val(cipher_text, MASTER_KEY):
    from cryptography.fernet import Fernet
    f = Fernet(MASTER_KEY)
    decrypted_value = f.decrypt(bytes(cipher_text, encoding='utf8'))
    return decrypted_value.decode()
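Before wiring these into Spark, a quick sanity check as plain Python (reusing the sample key generated above) can confirm the round trip:
# Local sanity check with the sample key from the openssl step
sample_key = 'ZTcOZJpyTf+39T6T8U6wJEr6O+O6mpICx+jCnExvc6A='
cipher = encrypt_val('johndoe@example.com', sample_key)
assert decrypt_val(cipher, sample_key) == 'johndoe@example.com'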
We will use these functions as UDFs in PySpark.
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import StringType

# Master key generated earlier (in production, fetch it from a secret scope)
pii_key = 'ZTcOZJpyTf+39T6T8U6wJEr6O+O6mpICx+jCnExvc6A='

# Register UDFs
encrypt = udf(encrypt_val, StringType())
decrypt = udf(decrypt_val, StringType())

# Encrypt the data
df = spark.table("customers")
encrypted = df.withColumn("email_crypted", encrypt("email", lit(pii_key)))
display(encrypted)
This displays the table with a new email_crypted column containing the Fernet ciphertext. Note that a Fernet token embeds a random IV and a timestamp, so encrypting the same value twice yields different ciphertexts.
Now, let’s see how decryption works. Since we have already defined the decrypt function, it is quite simple to use.
decrypted = encrypted.withColumn("email_plain", decrypt("email_crypted",lit(pii_key)))
display(decrypted)
This adds an email_plain column whose values match the original email column, confirming the encrypt/decrypt round trip.
Exercise
As I have only encrypted the ‘email’ column, you can practice by encrypting the ‘phone_number’ column yourself. The above code was tested in a Databricks environment and should work as-is; if not, please highlight the error in the comments.
Conclusion
Now you’re all set to implement column-level security in your projects. You can also achieve the same goal with masking or access controls, as sketched below.
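As a taste of the access-control and masking route on Databricks, a dynamic view can reveal PII columns only to members of a privileged group. This is a sketch: ‘pii_readers’ is a hypothetical group name, while is_member() is Databricks’ built-in group-membership function.
# Non-members of 'pii_readers' see masked values instead of PII
spark.sql("""
  CREATE OR REPLACE VIEW customers_safe AS
  SELECT
    first_name,
    last_name,
    CASE WHEN is_member('pii_readers') THEN email ELSE '****' END AS email,
    CASE WHEN is_member('pii_readers') THEN phone_number ELSE '****' END AS phone_number
  FROM customers
""")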
Here is the full script:
Part 1 — SQL Code
%sql
-- Change this value to some other database if you do not want to use the Databricks default
use default;
drop table if exists customers;
CREATE TABLE customers (
  first_name STRING,
  last_name STRING,
  email STRING, -- PII column
  phone_number STRING -- PII column
)
USING delta;
INSERT INTO customers (first_name, last_name, email, phone_number)
VALUES
  ('John', 'Doe', 'johndoe@example.com', '555-1234'),
  ('Jane', 'Smith', 'janesmith@example.com', '555-5678'),
  ('Bob', 'Johnson', 'bobjohnson@example.com', '555-9012');
Part 2 — PySpark Code
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import StringType

# Define encrypt user-defined function
def encrypt_val(clear_text, MASTER_KEY):
    from cryptography.fernet import Fernet
    f = Fernet(MASTER_KEY)
    encrypted_value = f.encrypt(bytes(clear_text, encoding='utf8'))
    return encrypted_value.decode()

# Define decrypt user-defined function
def decrypt_val(cipher_text, MASTER_KEY):
    from cryptography.fernet import Fernet
    f = Fernet(MASTER_KEY)
    decrypted_value = f.decrypt(bytes(cipher_text, encoding='utf8'))
    return decrypted_value.decode()

# 32-byte (256-bit) master encryption key, generated with:
# %sh
# openssl rand -base64 32
pii_key = 'ZTcOZJpyTf+39T6T8U6wJEr6O+O6mpICx+jCnExvc6A='

# Register UDFs
encrypt = udf(encrypt_val, StringType())
decrypt = udf(decrypt_val, StringType())

# Encrypt the data
df = spark.table("customers")
encrypted = df.withColumn("email_crypted", encrypt("email", lit(pii_key)))
# display(encrypted)
decrypted = encrypted.withColumn("email_plain", decrypt("email_crypted", lit(pii_key)))
display(decrypted)
Resources:
Databricks Notebook Sample — https://www.databricks.com/notebooks/enforcing-column-level-encryption.html
Thanks for reading!
🤝🏼 You can follow me on Twitter or LinkedIn.
💬 Always welcome your thoughts or a conversation below!