Harshit Dwivedi
Aug 30 · 5 min read

TL;dr : Use file loads with a relatively low frequency instead of Streaming Inserts with Cloud Dataflow

This story is a prequel to the one I wrote here (Yeah, I’m a star wars fan and sequels come before prequels 🤷‍♂️)

If you read the blogpost above, you know that we’re trying to build a real-time data aggregation pipeline and to do so, we’re using the following products within the GCP family :

  1. App Engine : To receive incoming requests from our user’s webpage and app.
  2. PubSub : A scalable and reliant messaging queue that accepts the messages coming to App Engine.
  3. Dataflow : A data processing pipeline which reads messages coming from Pub Sub and transforms them according to our needs.
  4. BigQuery : A data warehouse where all the data ingested by Dataflow is saved to.

The challenge here is obviously to keep the throughput and reliability of the system as high as possible while keeping the latency and the costs to a minimum.

As the current state of things, we handle over 300M events per day and these events are stored into BigQuery as soon as they arrive via Google Cloud Dataflow to ensure that the events are made available to our customers as soon as they are emitted (with < 4 sec latency guaranteed).

Interestingly enough, some of our clients didn’t exactly want their data in near-real time manner; for them the data being available a little bit later (say a few mins) was as good, provided they could pay a lesser fee for it.

Now streaming inserts, while crucial to the real-timeliness provided by us; has a cost associated with it and providing a delayed data would mean that we could potentially save some of the costs by batching the inserts into BigQuery!

We went back to take a look at the pricing plans provided by BigQuery and to our surprise, loading data to BigQuery from an external source was free!

What this meant was, if we could store the incoming events to a Google Cloud Storage Bucket and then ask BigQuery to load those saved events, it’s free!

The only price we had to incur was storing and downloading events to and from Google Cloud Storag bucket; which is again free if you are ingressing and egressing data to and from the same region (which we were!).

Having figured out what to do, the only thing left was modifying our dataflow pipeline to Save and Load data into BigQuery instead of Streaming it which it was doing right now.

Thanks to the simple APIs provided by Apache Beam, this turned out to be a less than 5 mins of task.

Less than 3 lines of code, and you’re done!

Now you might be wondering as to why didn’t we decrease the frequency to even a lesser number (say 3 seconds) mimic the real-time nature of Streaming Inserts.

Surely that would have been a good choice, but there are some limitations which prevent us from doing this.

  1. BigQuery API Limits
    As of writing this article, the number of Load Jobs that you can perform on BigQuery is limited to 1000 per day which roughly once per 90 seconds.
    Note that this is a hard limit and it can’t be increased
  2. Increased RAM requirement by Dataflow
    Dataflow keeps all the incoming messages in its memory and once the triggering frequency is reached, it uploads them to Google Cloud Storage, hence the new pipeline that we deployed ended up needing 2 cores of CPU alongwith 13.5GB RAM as compared to our old setup which was 1 core CPU and 3.75GB RAM.
    While this isn’t something stopping us from implementing near realtime load jobs, it’s something that adds up to your billing costs.

We ended up setting the triggering frequency to 90 seconds, which was enough to keep our clients (who were ok with a delayed latency) happy and helped us reduce unnecessary pricing that we incurred.

Here’s a quick screenshot showcasing the costs that before and after we implemented File Loads.

August 23rd afternoon was when we implemented File Loads

But it wasn’t all bells and whistels; as you can read in this blog, we ran into yet another issue (which we managed to fix after a weeks of searching around and research).

If you are working at a high growth company and want your data made available to you as soon as it’s created; take a look at https://roobits.com/ and we might be what you are looking for!

Thanks for reading! If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment 💬 below.

Have feedback? Let’s connect on Twitter.

📝 Read this story later in Journal.

👩‍💻 Wake up every Sunday morning to the week’s most noteworthy stories in Tech waiting in your inbox. Read the Noteworthy in Tech newsletter.

Google Developers Experts

Experts on various Google products talking tech.

Harshit Dwivedi

Written by

Android Developer, has an *approximate* knowledge of many things. https://harshit.app/

Google Developers Experts

Experts on various Google products talking tech.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade