TL;DR: Use file loads with a relatively low triggering frequency instead of streaming inserts with Cloud Dataflow.
This story is a prequel to the one I wrote here (yeah, I'm a Star Wars fan, and sequels come before prequels 🤷‍♂️):
How disabling external IPs helped us cut down over 80% of our Cloud Dataflow costs
If you read the blog post above, you know that we're building a real-time data aggregation pipeline, and to do so we're using the following products from the GCP family:
- App Engine: Receives incoming requests from our users' webpages and apps.
- Pub/Sub: A scalable and reliable messaging queue that accepts the messages coming from App Engine.
- Dataflow: A data processing pipeline that reads messages from Pub/Sub and transforms them according to our needs.
- BigQuery: A data warehouse where all the data ingested by Dataflow is saved.
The challenge here is obviously to keep the throughput and reliability of the system as high as possible while keeping the latency and the costs to a minimum.
As things currently stand, we handle over 300M events per day, and these events are written to BigQuery as soon as they arrive via Cloud Dataflow, ensuring they are available to our customers almost as soon as they are emitted (with < 4 s latency guaranteed).
Interestingly enough, some of our clients didn't actually want their data in near real time; for them, data arriving a little later (say, a few minutes) was just as good, provided they could pay a lower fee for it.
Now, streaming inserts, while crucial to the real-time experience we provide, have a cost associated with them; delivering delayed data meant we could potentially save some of that cost by batching the inserts into BigQuery!
We went back to look at BigQuery's pricing and, to our surprise, found that loading data into BigQuery from an external source was free!
What this meant was: if we could store the incoming events in a Google Cloud Storage bucket and then ask BigQuery to load those saved events, the load itself would cost us nothing!
The only price we had to pay was for storing events in the Google Cloud Storage bucket and moving them in and out, and even that transfer is free when you ingress and egress within the same region (which we did!).
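As a back-of-envelope check, here's roughly how the two approaches compare. The numbers below are illustrative assumptions, not from our actual bill: an average event size of about 1 KB, and the streaming-insert price published at the time ($0.01 per 200 MB of inserted data).

```python
# Back-of-envelope comparison of streaming inserts vs. file loads.
# Assumptions (illustrative, not from our actual bill): ~1 KB average
# event size, and the streaming-insert price published at the time
# ($0.01 per 200 MB of inserted data). Load jobs from GCS are free.
EVENTS_PER_DAY = 300_000_000
BYTES_PER_EVENT = 1_000                # assumed average event size
STREAMING_PRICE_PER_MB = 0.01 / 200   # USD per MB streamed

daily_mb = EVENTS_PER_DAY * BYTES_PER_EVENT / 1_000_000
streaming_cost_per_day = daily_mb * STREAMING_PRICE_PER_MB
load_job_cost_per_day = 0.0           # the load job itself is free

print(f"Streaming inserts: ${streaming_cost_per_day:,.2f}/day")
print(f"File loads:        ${load_job_cost_per_day:,.2f}/day")
```

Under these assumptions, streaming costs add up to a meaningful daily figure while the equivalent load jobs cost nothing for the insert itself.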
Having figured out what to do, the only thing left was to modify our Dataflow pipeline to save data to Cloud Storage and load it into BigQuery, instead of streaming it as it had been doing.
Thanks to the simple APIs provided by Apache Beam, this turned out to be less than five minutes of work.
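For reference, here's a minimal sketch of what that change looks like using Beam's Python SDK. This is a configuration sketch under assumptions: the SDK choice, table name, and staging bucket are placeholders, not details from our actual pipeline.

```python
import apache_beam as beam

def write_events_to_bq(events):
    # Switch the sink from STREAMING_INSERTS to FILE_LOADS: Beam
    # batches incoming rows, stages them as files in the GCS bucket,
    # and issues a BigQuery load job every `triggering_frequency`
    # seconds instead of streaming each row individually.
    return events | "WriteToBQ" >> beam.io.WriteToBigQuery(
        "my-project:analytics.events",        # placeholder table
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        triggering_frequency=90,              # seconds between load jobs
        custom_gcs_temp_location="gs://my-staging-bucket/tmp",  # placeholder
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
```

The switch really is mostly a matter of changing the `method` and supplying a triggering frequency and a staging bucket; the rest of the pipeline is untouched.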
Now you might be wondering why we didn't lower the triggering frequency to an even smaller interval (say, 3 seconds) to mimic the real-time nature of streaming inserts.
That would certainly have been a good choice, but a couple of limitations prevented us from doing so.
- BigQuery API Limits
As of writing this article, the number of load jobs you can run against BigQuery is limited to 1,000 per table per day, which works out to roughly one every 90 seconds.
Note that this is a hard limit and it can't be increased.
- Increased RAM requirements in Dataflow
Dataflow buffers all the incoming messages in memory and, once the triggering frequency is reached, uploads them to Google Cloud Storage. As a result, the new pipeline we deployed ended up needing 2 CPU cores and 13.5 GB of RAM, compared to our old setup of 1 CPU core and 3.75 GB of RAM.
While this isn't something that stops us from implementing near-real-time load jobs, it does add to your bill.
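The quota arithmetic behind the first limitation is worth spelling out; a quick sanity check (the 90-second figure here is the frequency discussed in this article):

```python
import math

# Sanity check on the load-job quota: 1,000 load jobs per table per
# day is a hard limit, so the triggering frequency has a floor.
LOAD_JOBS_PER_TABLE_PER_DAY = 1_000
SECONDS_PER_DAY = 24 * 60 * 60              # 86,400

# Triggering faster than this would exhaust the daily quota.
min_interval_s = SECONDS_PER_DAY / LOAD_JOBS_PER_TABLE_PER_DAY

# At a 90-second frequency, how many load jobs do we issue per day?
jobs_at_90s = math.floor(SECONDS_PER_DAY / 90)

print(f"Minimum safe interval: {min_interval_s} s")  # 86.4 s
print(f"Load jobs per day at 90 s: {jobs_at_90s}")   # 960, under quota
```

So 86.4 seconds is the theoretical floor, and a 90-second frequency leaves a small safety margin under the quota.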
We ended up setting the triggering frequency to 90 seconds, which was enough to keep our clients (the ones who were okay with a delayed latency) happy and helped us cut the unnecessary costs we had been incurring.
Here's a quick screenshot showcasing the costs before and after we implemented file loads.
But it wasn't all smooth sailing; as you can read in this blog, we ran into yet another issue (which we managed to fix after weeks of searching and research).
If you are working at a high-growth company and want your data made available to you as soon as it's created, take a look at https://roobits.com/; we might be just what you are looking for!
Thanks for reading! If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment 💬 below.
Have feedback? Let’s connect on Twitter.