How disabling external IPs helped us cut our Cloud Dataflow costs by over 80%

Harshit Dwivedi
Aug 29, 2019


TL;DR: Set --usePublicIps=false in the execution parameters of your Dataflow pipeline.

For some context: we’re building a real-time data aggregation pipeline that makes your web/app user data available to you in near real time (5–6 seconds).

Think of it like Google Analytics, but on steroids!

Our entire infrastructure is built on Google Cloud Platform, and we use the following products from the GCP family (a minimal code sketch of the flow follows the list):

  1. App Engine : To receive incoming requests from our users’ webpages and apps.
  2. PubSub : A scalable and reliable messaging queue that accepts the messages coming from App Engine.
  3. Dataflow : A data processing pipeline which reads messages coming from Pub/Sub and transforms them according to our needs.
  4. BigQuery : A data warehouse where all the data ingested by Dataflow is saved.
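
To make the flow concrete, here’s a minimal Apache Beam (Java) sketch of steps 2–4. The topic, table name and parsing logic are hypothetical placeholders, not our actual code:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class EventsPipeline {
      public static void main(String[] args) {
        // Picks up --runner, --project etc. from the command line.
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            // 2. Read the raw event messages that App Engine published to Pub/Sub.
            .apply("ReadFromPubSub", PubsubIO.readStrings()
                .fromTopic("projects/my-project/topics/events")) // hypothetical topic
            // 3. Turn each message into a BigQuery row (real parsing omitted).
            .apply("ToTableRow", MapElements
                .into(TypeDescriptor.of(TableRow.class))
                .via(json -> new TableRow().set("payload", json)))
            .setCoder(TableRowJsonCoder.of())
            // 4. Save the rows into the warehouse.
            .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                .to("my-project:analytics.events")); // hypothetical table

        pipeline.run();
      }
    }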

To give you a sense of the scale we’re operating at, here’s a screenshot outlining the average traffic we handle from a single property of our friends over at Z1Media :

Around 4,000 requests per second, which at 86,400 seconds a day works out to just shy of 350 million events per day!

The challenge here is obviously to keep the throughput and reliability of the system as high as possible while keeping the costs to a minimum.

We recently switched to BigQuery file loads instead of streaming data directly into BigQuery, which cut our costs by about 30% per day : loading files into BigQuery is free, and we kept the interval between load jobs to a bare minimum so that the data stays near-realtime in nature.
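
In Beam’s BigQueryIO, this boils down to switching the write method from streaming inserts to FILE_LOADS and choosing how often load jobs fire. A minimal sketch building on the pipeline above, assuming rows is the PCollection<TableRow> from step 3; the five-minute frequency, shard count and bucket name are illustrative assumptions, not our production values:

    import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
    import org.joda.time.Duration;

    // Periodic load jobs staged through GCS instead of streaming inserts.
    rows.apply("WriteToBigQuery", BigQueryIO.writeTableRows()
        .to("my-project:analytics.events") // hypothetical table
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        // How often a load job fires; shorter means fresher data.
        .withTriggeringFrequency(Duration.standardMinutes(5))
        // Required alongside a triggering frequency on an unbounded source.
        .withNumFileShards(32)
        // GCS bucket the files are staged in before BigQuery loads them.
        .withCustomGcsTempLocation(
            StaticValueProvider.of("gs://my-staging-bucket/bq-loads")));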

Enabling BigQuery file loads requires you to write the data to a Google Cloud Storage bucket that BigQuery can load it from.
To see whether this would incur any extra cost, we quickly checked the docs for GCS ingress/egress pricing.

Turns out that ingress/egress to and from Google Cloud Storage is free between products in the same region, and lucky for us, both Cloud Dataflow and BigQuery were in the same region, Tokyo (yay!).

Within a day of having this new setup in place, we were overjoyed by the cost reduction, until we saw a new line item appear in our billing, labeled Carrier Peering, which cost more than streaming inserts used to.

This was strange for us since the docs mentioned that the egress was free!

Googling what Carrier Peering means gave us no meaningful results either; the official docs said that Carrier Peering is meant for accessing G Suite products from within Google Cloud, which is something we weren’t doing.

While searching through questions on Stack Overflow and Reddit, we revisited the Compute Engine docs stating that this ingress/egress was free, and a minute detail caught our attention :

“Egress to Google products (such as YouTube, Maps, Drive), whether from a VM in GCP with an external IP address or an internal IP address”

We had assumed our Dataflow workers were using internal IP addresses, since we never told them to use external ones; and since we didn’t want our data to be accessible to anything other than BigQuery, we didn’t need an external IP in the first place!

Turns out that by default, Dataflow enables external IPs on its workers, and if you want to disable them you need to pass the flag --usePublicIps=false when executing your Dataflow pipeline.
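
If you build your pipeline options in code rather than on the command line, the Java SDK exposes the same setting; a minimal sketch:

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Programmatic equivalent of passing --usePublicIps=false at launch.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setUsePublicIps(false); // workers get internal IPs only

    Pipeline pipeline = Pipeline.create(options);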

But just passing the flag isn’t enough; our pipeline failed immediately when we tried to start it, with the following message :

Workflow failed. Causes: Subnetwork ‘’ on project z1media network ‘default’ in region asia-northeast1 does not have Private Google Access, which is required for usage of private IP addresses by the Dataflow workers.

Turns out that by default, accessing other Google Cloud services via internal IPs is not allowed. To fix this, we went to VPC network in the Cloud Console and selected the subnet in asia-northeast1, since that was where all of our products were located.

From within there, we simply enabled the option to allow “Private Google Access”.

Be sure to save the changes once the edits are done 🤷‍♂️
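
If you prefer the command line, the same change can be made with gcloud; a sketch assuming the “default” network and subnet named in the error message (adjust to your own setup):

    # Enable Private Google Access on the subnet the Dataflow workers use.
    gcloud compute networks subnets update default \
        --region=asia-northeast1 \
        --enable-private-ip-google-access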

And that was it!

We ran our pipeline after making these changes, and everything was back to normal : the Carrier Peering charges were gone, without any effect on our workflow.

If you are working at a high-growth company and want your data made available to you as soon as it’s created, take a look at https://roobits.com/ and we might be what you are looking for!

Thanks for reading! If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment 💬 below.

Have feedback? Let’s connect on Twitter.
