Meeting Security Requirements for Dataflow pipelines — Part 2/3

Lorenzo Caggioni
Google Cloud - Community
4 min read · Sep 7, 2023

Apache Beam is an open-source, unified programming model for large-scale data processing. Pipelines written with Beam can run on a variety of distributed execution engines, such as Google Cloud Dataflow.

Before Dataflow and Beam can touch your organization’s production data, you have to deploy them securely and in compliance with your company’s policies.

The most common requirements are:

  • Internal addressing of tenants must be private.
  • Every tenant must be isolated and dedicated to a specific system of services. (This article)
  • All data must have encryption at-rest with keys managed by the company security team. (coming soon)

In this article we will focus on “Every tenant must be isolated and dedicated to a specific system of services”.

This blog post is part of a series of articles providing an in-depth analysis of the GCP security practices you need to deploy your Apache Beam pipeline on Cloud Dataflow.

Reference use case

Let’s start by describing a simple Apache Beam pipeline to use as a reference throughout the article: you have data stored in Cloud Storage that needs to be processed and written to BigQuery.

High level use case diagram: moving data from Cloud Storage to BigQuery using Dataflow.
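To make the use case concrete, here is a minimal sketch of such a pipeline using the Beam Python SDK. The bucket, table, and schema names are placeholders and the parsing logic is purely illustrative.

```python
# Minimal sketch of the reference pipeline: read files from Cloud Storage,
# transform them, and write the result to BigQuery. All names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_line(line):
    # Hypothetical parsing: one "name,value" pair per CSV line.
    name, value = line.split(",")
    return {"name": name, "value": int(value)}


def run():
    # Runner, project, temp_location, etc. are taken from the command line.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromGCS" >> beam.io.ReadFromText("gs://my-landing-bucket/input/*.csv")
            | "Parse" >> beam.Map(parse_line)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.my_table",
                schema="name:STRING,value:INTEGER",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```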

Every tenant must be isolated and dedicated to a specific system of services

The requirement to address is “Every tenant must be isolated and dedicated to a specific system of services”. To meet it, we will cover two aspects:

  • assign the minimum roles to the service accounts involved;
  • separate resources into different projects.

Let’s start by understanding which service accounts are involved in running a Dataflow pipeline and which roles each of them needs.

There are at least 3 different service accounts involved:

  • The Dataflow Service Agent starts and stops workers; it does not touch any user data. (service-<project_num>@dataflow-service-producer-prod.iam.gserviceaccount.com)
    It needs the `dataflow.serviceAgent` role. If a Shared VPC is involved, it also needs the `compute.networkUser` role.
  • The Worker Service Account is the identity that processes and accesses your data. By default this is the Compute Engine default service account (<project_num>-compute@developer.gserviceaccount.com). It needs the `dataflow.worker` role, plus roles to read and write data on Cloud Storage and BigQuery. A sketch of running the workers with a dedicated service account follows the diagram below.
  • The third service account is the one orchestrating the pipeline. It needs the `dataflow.admin` role to start and monitor Dataflow jobs and the `iam.serviceAccountUser` role on the Worker Service Account to be able to run jobs as that identity.
Diagram highlighting the service accounts involved.
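As an illustration, the sketch below shows how the orchestrating code could launch the job with a dedicated worker service account instead of the Compute Engine default, using the `service_account_email` pipeline option. The project, region, bucket, and service account names are placeholders.

```python
# Sketch: launch the Dataflow job with a dedicated, least-privilege worker
# service account. All names below are placeholders.
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
    StandardOptions,
)

options = PipelineOptions()
options.view_as(StandardOptions).runner = "DataflowRunner"

gcp = options.view_as(GoogleCloudOptions)
gcp.project = "my-processing-project"                 # project running the Dataflow job
gcp.region = "europe-west1"
gcp.temp_location = "gs://my-processing-bucket/tmp"
# Workers run as this identity instead of the Compute Engine default service
# account; it holds `dataflow.worker` plus the data read/write roles.
gcp.service_account_email = "df-worker@my-processing-project.iam.gserviceaccount.com"

# The identity running this launch code (the orchestrator) needs
# `dataflow.admin` and `iam.serviceAccountUser` on the worker service account.
```

The same option can also be passed on the command line as `--service_account_email` when launching the pipeline.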

But it is not just about service accounts. It is also important to think about how to distribute your resources across Google Cloud projects.

It is a common practice to split compute resources from storage resources. For many scenarios it is also relevant to split raw data from curated data.

Here, resources are split across three projects: landing, processing and curated.

To mention all the service accounts involved here as well: there will also be a service account able to drop data into the Cloud Storage bucket (for example, the application producing the data) and another service account able to read data from BigQuery (for example, to visualize the data on a dashboard).
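To make the split concrete, here is a minimal sketch assuming the three-project layout above; all project, bucket, and dataset names are placeholders. The job runs in the processing project, reads from a bucket hosted in the landing project, and writes to a dataset hosted in the curated project, so the worker service account needs the corresponding read and write roles granted in each of those projects (for example, `storage.objectViewer` on the landing bucket and `bigquery.dataEditor` on the curated dataset).

```python
# Sketch of the three-project split. The Dataflow job runs in the processing
# project, while input and output live in the landing and curated projects.
# All names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

options = PipelineOptions()
options.view_as(GoogleCloudOptions).project = "my-processing-project"  # compute lives here

with beam.Pipeline(options=options) as p:
    (
        p
        # Bucket hosted in the landing project; the worker service account
        # needs read access on it (e.g. `storage.objectViewer`).
        | "ReadLanding" >> beam.io.ReadFromText("gs://my-landing-bucket/input/*.csv")
        | "ToRow" >> beam.Map(lambda line: {"raw": line})
        # Dataset hosted in the curated project ("project:dataset.table" spec);
        # the worker service account needs write access on it
        # (e.g. `bigquery.dataEditor`).
        | "WriteCurated" >> beam.io.WriteToBigQuery(
            "my-curated-project:curated_dataset.events",
            schema="raw:STRING",
        )
    )
```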

Conclusion

To deploy an Apache Beam pipeline on GCP, it is important to take several configurations into account to meet your security requirements. In this article we identified what to configure to ensure that every tenant is isolated and dedicated to a specific system of services.

Here you can find other articles on the topic:

  • Internal addressing of tenants must be private.
  • Every tenant must be isolated and dedicated to a specific system of services. (This article)
  • All data must have encryption at-rest with keys managed by the company security team. (coming soon)

