DATA ARCHITECTURE

10 key points to protect your data in the cloud

A practical guide to securing your data & metadata

Gaurav Thalpati · Published in CodeX · Jul 3, 2022


Data Security is critical. We all know it.

We all want to secure our data — whether in the lake, the warehouse, or the lakehouse.

And when it comes to practically implementing data protection rules, we don't want to break the work down into theoretical categories such as

  • data security or data governance
  • data or metadata
  • data lake or data warehouse

We just want to ensure that we implement all the required controls to protect our data — All of it. Wherever it resides, whenever it moves, and whoever accesses it!

Here are 10 points to consider to protect your data.

1. Protect the data stored in your cloud data lakes.

Data lakes are the backbone of any data ecosystem. You should ensure that the data residing in your data lake is always secure.

One of the most popular and widely used approaches to secure your data in the lake (also known as "data at rest") is to use encryption.

You can use the inbuilt encryption services the cloud platform provides for encrypting your data.

E.g., in the case of AWS, you can use AWS KMS for data encryption. If you are using Azure, you can explore Azure Key Vault.
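
As a minimal sketch of what this can look like with boto3 (the bucket name and KMS key alias below are hypothetical), you can make SSE-KMS the default encryption on a lake bucket so every new object is encrypted at rest:

    import boto3

    s3 = boto3.client("s3")

    # Make SSE-KMS the default for the lake bucket: every new object is
    # encrypted at rest with the customer-managed KMS key below.
    s3.put_bucket_encryption(
        Bucket="my-data-lake-bucket",  # hypothetical bucket name
        ServerSideEncryptionConfiguration={
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": "alias/my-lake-key",  # hypothetical key alias
                    },
                    "BucketKeyEnabled": True,  # reduces KMS request costs
                }
            ]
        },
    )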

2. Protect the data stored within your cloud warehouses.

As with data lakes, you should also encrypt the data in your warehouse.

Some warehouses can encrypt the data as it is loaded; for some of them, encryption is a default or even mandatory option.

Having encryption enabled ensures data security.

If you are using Amazon Redshift, encryption is not mandatory. However, it is required for data sharing across regions or accounts, so it is best to keep encryption enabled.
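
For Redshift, encryption is essentially a flag you set at cluster creation. A minimal boto3 sketch (cluster name, KMS key ID, and credentials are placeholders):

    import boto3

    redshift = boto3.client("redshift")

    # Create the cluster with encryption enabled; data blocks and snapshots
    # are then encrypted at rest with the given customer-managed KMS key.
    redshift.create_cluster(
        ClusterIdentifier="analytics-cluster",  # hypothetical cluster name
        NodeType="ra3.xlplus",
        NumberOfNodes=2,
        MasterUsername="admin",
        MasterUserPassword="Replace-Me-123",    # fetch from a secrets manager in practice
        Encrypted=True,
        KmsKeyId="1234abcd-12ab-34cd-56ef-1234567890ab",  # hypothetical KMS key ID
    )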

For Snowflake, encryption is the default: all data loaded into Snowflake is encrypted automatically.

3. Secure the data while it moves in and out of your cloud storage (lakes + warehouses).

For AWS implementations, use Site-to-Site VPN for a secure, encrypted connection, or Direct Connect to ensure that data flows only over a dedicated private network instead of the public internet. This helps protect your data and can also speed up data movement.

For AWS Direct Connect implementations, use S3 interface endpoints to move data into S3 securely. Please refer to the AWS blog on this topic for more details. I've not implemented this myself, but it seems like a more secure approach than moving data over the internet.

If you are connecting to Snowflake from AWS, consider using AWS PrivateLink to create private endpoints between your VPC and the Snowflake VPC.
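
As a rough sketch of the S3 interface endpoint idea (the VPC, subnet, and security group IDs below are placeholders), the endpoint keeps S3 traffic on the AWS private network:

    import boto3

    ec2 = boto3.client("ec2")

    # Create an S3 interface endpoint so traffic from the VPC to S3 stays on
    # the AWS private network (used together with Direct Connect / PrivateLink).
    ec2.create_vpc_endpoint(
        VpcId="vpc-0123456789abcdef0",              # hypothetical VPC id
        VpcEndpointType="Interface",
        ServiceName="com.amazonaws.us-east-1.s3",
        SubnetIds=["subnet-0123456789abcdef0"],     # hypothetical subnet
        SecurityGroupIds=["sg-0123456789abcdef0"],  # hypothetical security group
        PrivateDnsEnabled=False,                    # S3 interface endpoints use endpoint-specific DNS names
    )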

4. Secure the sensitive columns while accessing any table.

Are you providing your users access to query data? If yes, does this data contain any PII/PHI?

In such cases, you must mask the sensitive data for non-eligible users.

You don't want all your users to see sensitive data like mobile numbers, credit card numbers, etc. You can mask such columns so that they are visible only to the eligible users who are supposed to see such sensitive data.

AWS Lake Formation & Databricks Unity Catalog support fine-grained access control to hide sensitive columns from specific users or roles.

Snowflake has Dynamic Data Masking, which can be used to mask sensitive columns.
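
As an illustrative sketch of Snowflake Dynamic Data Masking using the Python connector (the connection details, table, column, and role names are hypothetical), you create a masking policy and attach it to the sensitive column:

    import snowflake.connector

    # Connection details are placeholders.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="***",
        warehouse="SECURITY_WH", database="CRM", schema="PUBLIC",
    )
    cur = conn.cursor()

    # Masking policy: only the ANALYST_PII role sees the raw value;
    # everyone else sees a masked string.
    cur.execute("""
        CREATE MASKING POLICY IF NOT EXISTS mask_mobile AS (val STRING)
        RETURNS STRING ->
          CASE WHEN CURRENT_ROLE() IN ('ANALYST_PII') THEN val
               ELSE '***MASKED***'
          END
    """)

    # Attach the policy to the sensitive column.
    cur.execute(
        "ALTER TABLE customers MODIFY COLUMN mobile_number "
        "SET MASKING POLICY mask_mobile"
    )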

5. Secure the sensitive data used for searching or joining other tables.

There can be PII columns in some of your tables that your business teams will need access to. These attributes can be customer identifiers or credit card numbers.

If you mask these sensitive columns, business users won't be able to search on them or join them with other tables.

In such cases, you will have to use other data abstraction mechanisms like tokenization or hashing. You can create unique tokens for every record and use these to join with other tables.

Tokenization creates random but reversible tokens. After the sensitive information has been accessed, you can re-tokenize it so that the old tokens are no longer valid.
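
A minimal, self-contained Python sketch of the idea (in practice the key would live in a KMS/secret store and the vault would be a protected table, not an in-memory dict): deterministic tokens keep joins working, while the vault keeps them reversible for eligible users.

    import hmac, hashlib, secrets

    # Secret key for the keyed hash; keep it in a KMS/secret store in practice.
    TOKEN_KEY = secrets.token_bytes(32)

    # In-memory "vault" mapping token -> original value, which is what makes
    # the tokens reversible for the few users allowed to de-tokenize.
    vault: dict[str, str] = {}

    def tokenize(value: str) -> str:
        """Deterministic token: the same customer id always maps to the same
        token, so tokenized columns can still be searched and joined."""
        token = hmac.new(TOKEN_KEY, value.encode(), hashlib.sha256).hexdigest()
        vault[token] = value
        return token

    def detokenize(token: str) -> str:
        """Reverse lookup, restricted to eligible users/roles in a real system."""
        return vault[token]

    # Example: both tables carry the token instead of the raw customer id.
    orders_key = tokenize("CUST-12345")
    payments_key = tokenize("CUST-12345")
    assert orders_key == payments_key               # joins still work
    assert detokenize(orders_key) == "CUST-12345"   # reversible for eligible users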

6. Protect the data accessed by multiple personas/users/roles.

You should provide the correct levels of access based on the user roles.

Based on the roles, access should be granted to various layers of the data lake. There are different methods of controlling access.

  • Layer-based access controls — Who gets access to which layers (Bronze/Silver/Gold) of the data lake
  • Table-based access controls — Who gets access to which tables, e.g. the Marketing team should not have access to HR tables.
  • Attribute-based access controls — Only authorized users get access to sensitive data attributes (see the sketch after this list).
  • Tag-based access control — Similar to above, but here access is controlled based on tags created for tables or columns.

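As a sketch of attribute-level control with AWS Lake Formation via boto3 (the role, database, table, and column names are hypothetical), you can grant SELECT on a table while excluding its sensitive columns:

    import boto3

    lf = boto3.client("lakeformation")

    # Grant SELECT on the table but exclude the sensitive columns, so the
    # marketing role can query it without ever seeing PII attributes.
    lf.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/marketing-analyst"  # hypothetical role
        },
        Resource={
            "TableWithColumns": {
                "DatabaseName": "sales_db",   # hypothetical database
                "Name": "customers",          # hypothetical table
                "ColumnWildcard": {
                    "ExcludedColumnNames": ["mobile_number", "credit_card_no"]
                },
            }
        },
        Permissions=["SELECT"],
    )
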
7. Provide temporary time-bound access.

Do you need to provide access to some specific data to some specific users — for a temporary period?

There will be scenarios where specific users need access to certain sensitive attributes or objects in the data lake. In such cases, have a strategy for providing them time-bound, temporary access.

You can use AWS STS to issue temporary, short-lived credentials for such access.
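
A minimal boto3 sketch (the role ARN and session name are hypothetical): assume a narrowly scoped role for one hour, and the credentials expire on their own.

    import boto3

    sts = boto3.client("sts")

    # Issue short-lived credentials (1 hour) scoped to a read-only role for
    # the duration of the task; they expire automatically.
    response = sts.assume_role(
        RoleArn="arn:aws:iam::111122223333:role/temp-sensitive-data-reader",  # hypothetical role
        RoleSessionName="incident-4711-investigation",
        DurationSeconds=3600,
    )
    creds = response["Credentials"]

    # The temporary credentials can then back a scoped session for the user.
    session = boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )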

8. Ensure only valid, authenticated users can access data.

Authenticate users before granting them any level of access. For internal users, leverage the cloud provider's IAM services.

For external users, use external identity providers like Google, Amazon, etc. Avoid giving external users access through direct links to published data.
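
As a hedged sketch of federating an external user (the role ARN is hypothetical, and the token would come from the external identity provider's login flow), STS can exchange the provider's token for temporary AWS credentials instead of handing out long-lived keys:

    import boto3

    sts = boto3.client("sts")

    # Placeholder: an OIDC ID token returned by the external identity
    # provider (e.g. Google) after the user signs in.
    oidc_token = "<id-token-from-external-idp>"

    # Exchange the IdP token for short-lived AWS credentials.
    response = sts.assume_role_with_web_identity(
        RoleArn="arn:aws:iam::111122223333:role/external-reader",  # hypothetical role
        RoleSessionName="external-user-session",
        WebIdentityToken=oidc_token,
        DurationSeconds=3600,
    )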

9. Secure metadata to ensure only eligible users discover data.

Metadata is as important as your data. Treat your metadata as a first-class citizen in your data ecosystem.

Provide access to your metadata catalog only to users who should be accessing it and not to all the users in the system.

E.g., the test team should not have access to the production metastore or data catalog. When creating the catalog in the production environment, ensure that only eligible users can access it.
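
One way to sketch this on AWS (the role name, account ID, and database prefixes are hypothetical) is an IAM deny policy that blocks the test role from even reading the production Glue Data Catalog:

    import boto3, json

    iam = boto3.client("iam")

    # Deny the test role any read access to production catalog databases and
    # tables, so test users cannot even discover what exists in production.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": [
                    "glue:GetDatabase*",
                    "glue:GetTable*",
                    "glue:GetPartition*",
                    "glue:SearchTables",
                ],
                "Resource": [
                    "arn:aws:glue:us-east-1:111122223333:catalog",
                    "arn:aws:glue:us-east-1:111122223333:database/prod_*",
                    "arn:aws:glue:us-east-1:111122223333:table/prod_*/*",
                ],
            }
        ],
    }

    iam.put_role_policy(
        RoleName="test-team-role",                 # hypothetical role
        PolicyName="deny-prod-catalog-access",
        PolicyDocument=json.dumps(policy),
    )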

10. Secure data in the data lake by providing access only through query engines.

Avoid giving direct access to objects/files in data lakes.

Let everyone access data using query engines through tables created on these files. This can help implement fine-grained access control at the table and column levels.

E.g. Instead of giving direct access to the S3 objects, let your data analysts, quality engineers, and analytics teams access data through Athena using Lake Formation-based access control.

Direct access to production data lake data should be limited to the role that executes your jobs in production. Support engineers should only be given temporary access for incident investigations.
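
As a small sketch of the Athena path (the database, table, and workgroup names are hypothetical), analysts submit queries through Athena, and Lake Formation grants decide which tables and columns they can actually see:

    import boto3

    athena = boto3.client("athena")

    # Analysts query through Athena; what they can see is governed by
    # Lake Formation table/column grants, not by direct S3 object permissions.
    response = athena.start_query_execution(
        QueryString="SELECT order_id, order_total FROM sales_db.orders LIMIT 100",  # hypothetical table
        QueryExecutionContext={"Database": "sales_db"},
        WorkGroup="analyst-workgroup",  # hypothetical workgroup with its own result location
    )
    print(response["QueryExecutionId"])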

These are just guidelines to protect your data. There are multiple approaches and various tools available to protect your data.

What is essential is this: instead of spending effort categorizing data protection types, or debating whether it is the data security, data governance, or infrastructure team's responsibility, spend more effort on implementing a holistic approach that protects all of your data, wherever it is stored, however it moves, and whoever accesses it, internally or externally.

Hope this helps. Thanks for reading.
