Meeting Security Requirements for Dataflow pipelines — Part 1/3

Lorenzo Caggioni
Google Cloud - Community
4 min read · Sep 5, 2023

Apache Beam is an open-source, unified programming model for large-scale data processing. It provides a high-level model for defining pipelines that can run on a variety of distributed platforms, such as Google Cloud Dataflow.

Before letting Dataflow and Beam touch your production data, you have to deploy them in a secure way, one that complies with your company's policies.

In this article we will focus on one of the most common requirements: internal addressing of tenants must be private.

This blog post is part of a series of articles providing an in-depth analysis of GCP security practices for deploying your Apache Beam pipeline on Cloud Dataflow.

Reference use case

Let's start by describing a simple Apache Beam pipeline to use as a reference throughout the article. You have data stored in Cloud Storage that needs to be processed and stored in BigQuery.

High level use case diagram: moving data from Cloud Storage to BigQuery using Dataflow.
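As a concrete starting point, a pipeline like this can be launched with the Google-provided `GCS_Text_to_BigQuery` Dataflow template; this is just an illustrative sketch, and every project, bucket, and table name below is a placeholder:

```shell
# Launch a Dataflow job that reads text files from Cloud Storage,
# applies a JavaScript UDF, and writes the result to BigQuery.
# All resource names (my-project, my-bucket, my_dataset) are placeholders.
gcloud dataflow jobs run gcs-to-bq-example \
  --project=my-project \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
  --parameters=\
javascriptTextTransformGcsPath=gs://my-bucket/transform.js,\
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://my-bucket/schema.json,\
inputFilePattern=gs://my-bucket/input/*.csv,\
outputTable=my-project:my_dataset.my_table,\
bigQueryLoadingTemporaryDirectory=gs://my-bucket/tmp
```

The same pipeline could equally be written with the Beam SDK and launched from Airflow; the security settings discussed below apply regardless of how the job is submitted.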

Internal addressing of tenants must be private

The requirement to address is "private internal addressing"; achieving it involves several aspects, which we take in turn.

First, we need to ensure that Dataflow workers do not have public IPs. This is easy: when deploying your pipeline (from gcloud, Airflow, or however you launch it), specify the `disable-public-ips` flag.

Turn off public IPs on workers.
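For example, when launching a templated job with gcloud (job, region, and subnet names below are placeholders); when launching directly with the Beam Python SDK, the equivalent pipeline option is `no_use_public_ips`:

```shell
# Run a Dataflow job whose workers get only internal IP addresses.
# Job name, region, and subnetwork are placeholders for your own values.
gcloud dataflow jobs run my-private-job \
  --gcs-location=gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
  --region=us-central1 \
  --disable-public-ips \
  --subnetwork=regions/us-central1/subnetworks/my-subnet
```

Pinning the job to a specific subnetwork matters here, because the next step configures that subnet.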

But when you remove the public IPs from workers, they can no longer reach internet resources, including the Google APIs, which are exactly the APIs needed to access Cloud Storage or BigQuery; your workers won't be able to reach, for example, storage.googleapis.com. We can fix this by enabling Private Google Access on the subnet hosting your workers.

Enable Private Google Access on the VPC subnet.
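Enabling it is a single subnet update (subnet and region names are placeholders):

```shell
# Enable Private Google Access on the subnet hosting the Dataflow workers,
# so VMs without public IPs can still reach Google API endpoints.
gcloud compute networks subnets update my-subnet \
  --region=us-central1 \
  --enable-private-ip-google-access
```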

The feature lets you reach the Google front ends behind the Google APIs over a private path, even if you do not have a public IP. If you also need to access Google APIs over a private path from on-premises, you can configure Private Google Access for on-premises hosts.

Enable Private Google Access for on-premises hosts.
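A typical piece of that on-premises setup, sketched here under the assumption that traffic reaches the VPC over Cloud VPN or Interconnect, is a private Cloud DNS zone that resolves googleapis.com names to the private.googleapis.com VIP range (199.36.153.8/30); zone and network names are placeholders:

```shell
# Create a private DNS zone for googleapis.com, visible to the VPC.
gcloud dns managed-zones create googleapis-private \
  --description="Private Google Access for on-premises" \
  --dns-name=googleapis.com. \
  --visibility=private \
  --networks=my-vpc

# Point private.googleapis.com at its dedicated VIP range.
gcloud dns record-sets create private.googleapis.com. \
  --zone=googleapis-private --type=A --ttl=300 \
  --rrdatas=199.36.153.8,199.36.153.9,199.36.153.10,199.36.153.11

# Alias all other googleapis.com names to private.googleapis.com.
gcloud dns record-sets create "*.googleapis.com." \
  --zone=googleapis-private --type=CNAME --ttl=300 \
  --rrdatas=private.googleapis.com.
```

On-premises resolvers then forward googleapis.com queries to this zone, and routes for 199.36.153.8/30 are advertised over the VPN or Interconnect.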

To make all resources truly private, we also need to take into account that the Google APIs are public endpoints. In some scenarios you may want to control where Google API requests come from. We can address this by enabling VPC Service Controls (VPC-SC). VPC-SC helps you control who can access the Google APIs related to your GCP organization's resources, and from where.

You can think of VPC-SC as a way to create a tenant within the Google APIs scoped only to your own resources. You create a perimeter around your projects: resources within the perimeter can't be accessed from outside it, and, vice versa, resources outside the perimeter can't be accessed from resources within it.

VPC-SC perimeter high level architecture
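As an illustrative sketch (the access policy ID, project number, and perimeter name below are placeholders), a perimeter restricting the services our pipeline uses could be created like this:

```shell
# Create a service perimeter around one project, restricting access to
# the Cloud Storage, BigQuery, and Dataflow APIs from outside it.
# 123456789 (policy) and 1111111111 (project number) are placeholders.
gcloud access-context-manager perimeters create my_perimeter \
  --policy=123456789 \
  --title="Dataflow perimeter" \
  --resources=projects/1111111111 \
  --restricted-services=storage.googleapis.com,bigquery.googleapis.com,dataflow.googleapis.com
```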

Once you have created a perimeter around your resources, you can specify exceptions to it and allow trusted sources.

For example, I can specify that my resources can be accessed only through the corporate VPN.
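One way to sketch such an exception is an access level matching the VPN's egress IP range, attached to the perimeter; the range, names, and policy ID below are placeholders:

```shell
# Define an access level matching the corporate VPN egress range.
# 203.0.113.0/24 is a placeholder documentation range.
cat > vpn_level.yaml <<'EOF'
- ipSubnetworks:
    - 203.0.113.0/24
EOF

gcloud access-context-manager levels create vpn_only \
  --policy=123456789 \
  --title="Corporate VPN" \
  --basic-level-spec=vpn_level.yaml

# Allow requests satisfying the access level into the perimeter.
gcloud access-context-manager perimeters update my_perimeter \
  --policy=123456789 \
  --add-access-levels=vpn_only
```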

Conclusion

When deploying an Apache Beam pipeline on GCP, it is important to take several configurations into account to meet your security requirements. In this article we identified what to configure to ensure that your tenants have private resources.

Here you can find other articles on the topic:


Multiple years of experience working on integration and Big Data projects. Working at Google since 2010, leading technical integration with global customers.