Meeting Security Requirements for Dataflow pipelines — Part 3/3

Lorenzo Caggioni
Google Cloud - Community
3 min read · Oct 26, 2023

Apache Beam is an open-source, unified programming model for large-scale data processing. It provides a high-level model that can run on a variety of distributed processing back-ends, such as Google Cloud Dataflow.

Before Dataflow and Beam can touch your production data, you have to deploy them securely, in a way that complies with your company’s policies.

The most common requirements are:

  • Internal addressing of tenants must be private.
  • Every tenant must be isolated and dedicated to a specific system of services.
  • All data must have encryption at rest with keys managed by the company security team.

In this article, we will focus on “All data must have encryption at rest with keys managed by the company security team”.

This blog post is part of a set of articles providing an in-depth analysis of GCP’s security practices to deploy your Apache Beam pipeline on Cloud Dataflow.

Reference use case

Let’s start by describing a simple Apache Beam pipeline to use as a reference throughout the article: data stored in Cloud Storage needs to be processed and written to BigQuery.

High level use case diagram: moving data from Cloud Storage to BigQuery using Dataflow.
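As a sketch, the reference pipeline could look like the following, using the Beam Python SDK. The project, bucket, dataset, and schema names are placeholders invented for the example:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and bucket names for the example.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-data-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Read raw CSV lines from Cloud Storage.
        | "ReadFromGCS" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        # Toy processing step: turn each line into a row dictionary.
        | "ParseCsvLine" >> beam.Map(
            lambda line: dict(zip(("id", "value"), line.split(",")))
        )
        # Write the rows to a BigQuery table, creating it if needed.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-data-project:my_dataset.my_table",
            schema="id:STRING,value:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```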

All data must have encryption at rest with keys managed by the company security team

Recall that all data in transit and at rest on Google Cloud is encrypted by default. The requirement here, though, is that the customer controls the keys used to encrypt the data.

So let’s take a quick look at how encryption at rest works on GCP, using an object stored in a Cloud Storage bucket as an example:

High level diagram describing encryption at rest on Google Cloud Storage.
  1. The object is split into chunks, and each chunk is encrypted with its own key, called a data encryption key (DEK).
  2. Each DEK is then encrypted with an additional key, called the key encryption key (KEK).
  3. The chunks of data are stored with their encrypted DEKs attached to them.
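To make the pattern concrete, here is a minimal, self-contained Python sketch of the same envelope-encryption idea, using the `cryptography` library’s Fernet primitive. This illustrates the concept only; it is not how Google implements it internally:

```python
from cryptography.fernet import Fernet

# Key encryption key (KEK): on GCP this is the key Google (or Cloud KMS) manages.
kek = Fernet(Fernet.generate_key())

def encrypt_chunk(chunk: bytes) -> tuple[bytes, bytes]:
    """Encrypt one chunk with a fresh DEK, then wrap the DEK with the KEK."""
    dek = Fernet.generate_key()               # 1. fresh data encryption key per chunk
    ciphertext = Fernet(dek).encrypt(chunk)   #    encrypt the chunk with the DEK
    wrapped_dek = kek.encrypt(dek)            # 2. encrypt the DEK with the KEK
    return ciphertext, wrapped_dek            # 3. stored together

def decrypt_chunk(ciphertext: bytes, wrapped_dek: bytes) -> bytes:
    dek = kek.decrypt(wrapped_dek)            # unwrap the DEK first
    return Fernet(dek).decrypt(ciphertext)    # then decrypt the chunk

ciphertext, wrapped_dek = encrypt_chunk(b"some object chunk")
assert decrypt_chunk(ciphertext, wrapped_dek) == b"some object chunk"
```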

If you want a deeper look at how this works, see Google Cloud’s documentation on encryption at rest.

By default the KEK is managed by Google, but you can use Cloud KMS, a service that lets you manage the key lifecycle yourself and pick the key flavor that best suits your security needs.

You can protect your data using different types of keys:
- Generated software keys
- Cloud HSM (hardware-backed) keys
- Cloud External Key Manager (externally managed) keys
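As an illustration, this is how a security team could create a key ring and a software-backed symmetric key with the `google-cloud-kms` Python client. The project and resource names are hypothetical:

```python
from google.cloud import kms

client = kms.KeyManagementServiceClient()

# Hypothetical names: a project owned by the security team, in a European location.
parent = "projects/sec-project/locations/europe-west1"

# Create a key ring to group related keys.
key_ring = client.create_key_ring(
    request={"parent": parent, "key_ring_id": "dataflow-keyring", "key_ring": {}}
)

# Create a software-backed symmetric key (the first flavor in the list above).
key = client.create_crypto_key(
    request={
        "parent": key_ring.name,
        "crypto_key_id": "dataflow-key",
        "crypto_key": {
            "purpose": kms.CryptoKey.CryptoKeyPurpose.ENCRYPT_DECRYPT,
        },
    }
)
print(key.name)
# projects/sec-project/locations/europe-west1/keyRings/dataflow-keyring/cryptoKeys/dataflow-key
```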

Once you have identified the right flavor, it is important to decide how to implement it in your organization.

The Cloud KMS resource is usually deployed in a separate project, managed by the security team and shared across all workloads.

CMEK High level architecture - Project separation

Remember: to enable your resources (Dataflow, GCS, BigQuery, …) to use the key, the service identity (service agent) of each service needs the `cryptoKeyEncrypterDecrypter` role on the key.
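Here is a sketch of that grant with the `google-cloud-kms` Python client. The key name and project number are placeholders; the service-agent address patterns shown are the documented formats for the Dataflow, Cloud Storage, and BigQuery service agents:

```python
from google.cloud import kms

client = kms.KeyManagementServiceClient()

key_name = (
    "projects/sec-project/locations/europe-west1/"
    "keyRings/dataflow-keyring/cryptoKeys/dataflow-key"
)

# Service agents of the workload project (replace 123456789 with its project number).
service_agents = [
    "serviceAccount:service-123456789@dataflow-service-producer-prod.iam.gserviceaccount.com",  # Dataflow
    "serviceAccount:service-123456789@gs-project-accounts.iam.gserviceaccount.com",             # Cloud Storage
    "serviceAccount:bq-123456789@bigquery-encryption.iam.gserviceaccount.com",                  # BigQuery
]

# Read-modify-write the key's IAM policy to grant encrypt/decrypt permissions.
policy = client.get_iam_policy(request={"resource": key_name})
policy.bindings.add(
    role="roles/cloudkms.cryptoKeyEncrypterDecrypter",
    members=service_agents,
)
client.set_iam_policy(request={"resource": key_name, "policy": policy})
```

With the bindings in place, the reference pipeline only needs to be pointed at the key. For example, the Beam Python SDK exposes a `dataflow_kms_key` pipeline option that tells Dataflow to use the CMEK for the job’s data at rest:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-data-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
    dataflow_kms_key=key_name,  # key_name as defined above
)
```

On the storage side, the input bucket can be given a default key (`default_kms_key_name` on the bucket), and `WriteToBigQuery` accepts a `kms_key` argument for the tables it creates.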

Conclusion

To deploy an Apache Beam pipeline on GCP, it is important to take several configurations into account to meet your security requirements. In this article we identified what to configure to ensure that all data is encrypted at rest with keys managed by the company security team.

Here you can find other articles on the topic:


Lorenzo Caggioni
Google Cloud - Community

Multiple years of experience working on integration and big data projects. At Google since 2010, leading technical integrations with global customers.